Data engineering is less inclusive of R.
Most data engineers’ daily work in the industry involves SQL and Python. We occasionally write Scala or Java for Spark and Flink jobs. Data engineers rarely use R in their work. Many people leave R behind as a school project or label it a language for data scientists only.
I regret underestimating R’s power early in my data engineering career. I was encouraged to use R for a work-related project three years ago. I realized it’s much more powerful than I thought for data engineering.
This article is not another entry in the R vs. Python debate for data engineering. I wrote it to bring R to data engineers’ attention and to show where R could benefit the data engineering community.
Embrace the awkwardness of R
R feels awkward to those who learned programming with C-derived languages.
The first thing that could throw anyone off is the assignment operator. The R community prefers an arrow (i.e., <-) over the widely adopted equal sign (i.e., =). You can still use the equal sign, but it is not considered best practice.
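As a small illustration in base R (the variable names are made up for this sketch), both operators assign at the top level, while the equal sign inside a function call names an argument instead:

```r
# Idiomatic R assignment uses the arrow.
row_count <- 42

# The equal sign also assigns at the top level, but most style guides discourage it.
batch_size = 10

# Inside a call, `=` names an argument rather than creating a variable.
avg <- mean(x = c(1, 2, 3))

print(row_count)
print(batch_size)
print(avg)
```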
Indexing is different too: it starts at 1 instead of 0. Mixing up the start index with that of other languages is an easy mistake, especially if you have not programmed in a while.
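A minimal base R sketch of one-based indexing, using a made-up vector of pipeline stage names. Note that index 0 does not raise an error, which is part of why the mix-up is easy to miss:

```r
# Hypothetical pipeline stages, made up for this sketch.
stages <- c("extract", "transform", "load")

# The first element lives at index 1, the last at length(stages).
first_stage <- stages[1]
last_stage <- stages[length(stages)]

# Index 0 silently returns an empty vector instead of failing.
zero_index <- stages[0]

print(first_stage)
print(last_stage)
print(length(zero_index))
```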
Those are language differences. As an analogy, think of R as a wise man who speaks with a strong accent. The language still accomplishes its goal of communication, but we must listen more carefully to uncover the insights. If we fail to listen patiently, we won’t gain knowledge from that wise man.
If you have yet to try R, don’t be scared away by its strangeness. Once you have used it for some time, I am sure you will get used to it and start to like it.
R for Data Engineering
In data engineering projects, you usually interact with data through a tabular structure rather than looping over nested arrays. R comes with native support for data frames, similar to Python’s pandas DataFrame or Spark’s DataFrame.
Working through R’s data frame hides much of the complexity of the language’s other features. If you already know data frames from another library, R’s learning curve is gentler.
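To make the comparison concrete, here is a minimal sketch in base R with no extra packages. The orders table and its column names are hypothetical, made up for this example:

```r
# Hypothetical order data, made up for this sketch.
orders <- data.frame(
  region = c("east", "west", "east", "west"),
  amount = c(120, 80, 200, 50)
)

# Filter rows with a logical condition, much like a SQL WHERE clause.
large_orders <- orders[orders$amount > 75, ]

# Group and aggregate, much like GROUP BY region with SUM(amount).
totals <- aggregate(amount ~ region, data = orders, FUN = sum)

print(large_orders)
print(totals)
```

Anyone who has written a pandas or Spark filter followed by a group-by should recognize the shape of these two steps.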
However, the main question becomes: I am a savvy Python data engineer who does everything the Pythonic way. Why should I learn R, and how does R help me in data engineering? Let me share my views through the following four main reasons:
- The Beauty of the Grammar of Graphics
- Sophisticated Analytics Package
- Communication with Data Scientists
- Process big data
The Beauty of The Grammar of Graphics
“A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.” (Hadley Wickham, A Layered Grammar of Graphics)
Part of data engineering involves data visualization. You might not be involved with the analytics side of the data ecosystem. However, during data pipeline development and the process of deriving business logic, visualization is vital for observing patterns and identifying potential value in the data.
I got lost trying to create good data visualizations in Python because people use too many different libraries. Much Python visualization code is hard to follow. I found myself bouncing among Matplotlib, seaborn, Plotly, Bokeh, and Altair across projects in which I collaborated with other engineers on data visualization.
There are too many comparable options in Python to choose from. I found it hard to get teams to agree on one of those excellent data visualization projects. That can bring chaos to your data engineering project and prevent proper and timely communication.
In R, ggplot2 is widely adopted, and it implements the grammar of graphics. The result is much neater, easier-to-understand code for data visualization.
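A minimal sketch of what that layering looks like, assuming ggplot2 is installed and using the mtcars dataset that ships with R. The choice of columns and labels is mine, for illustration only:

```r
library(ggplot2)

# Each line below adds one component of the graphic.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # data and aesthetic mapping
  geom_point() +                                                   # geometry: one point per car
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +        # statistic: linear fit per group
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)
```

Because every component is a separate layer, changing the chart type or adding a trend line is a one-line edit rather than a rewrite.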
You can learn more about why in one of my articles — Why Is ggplot2 So Good For Data Visualization?
Sophisticated Analytics Package
R has a list of sophisticated packages for data analytics. You can use the following to accomplish complex data engineering tasks:
- dplyr: data wrangling and analysis. Its style is similar to writing Spark DataFrame code, and it reads well because we pipe statements together instead of nesting multiple function calls.
    library(dplyr)

    # starwars is a sample dataset that ships with dplyr
    starwars %>%
      group_by(species) %>%
      summarise(
        n = n(),
        mass = mean(mass, na.rm = TRUE)
      ) %>%
      filter(n > 1, mass > 50)
- tidyr: reshapes data frames into a tidy layout, for example by pivoting between wide and long formats.
- data.table: provides an improved version of data.frame. It is one of the fastest libraries for handling small to medium data on a single machine. If you refer to H2O’s Database-like ops benchmark, data.table is among the fastest libraries for processing data. In contrast, pandas can run out of memory (OOM) on larger inputs, and users must then find alternatives to process data on a single machine.
- Shiny: build a web application for data, similar to Streamlit in Python. It’s straightforward to keep everything in R and deploy. I have written an article about it.
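For comparison with the dplyr pipeline above, here is a minimal data.table sketch, assuming the data.table package is installed. It uses the mtcars dataset that ships with R, and the horsepower threshold is an arbitrary choice for illustration:

```r
library(data.table)

# Convert the built-in mtcars data frame; a data.table is still a data.frame.
cars <- as.data.table(mtcars)

# The general form is DT[i, j, by]: filter rows, compute columns, group.
summary_by_cyl <- cars[
  hp > 100,                          # i: keep cars with more than 100 horsepower
  .(n = .N, avg_mpg = mean(mpg)),    # j: row count and average mpg per group
  by = cyl                           # by: one result row per cylinder count
]

print(summary_by_cyl)
```

The filter, aggregation, and grouping all sit inside one pair of brackets, which is terse but maps closely onto SQL’s WHERE, SELECT, and GROUP BY.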
Communication with Data Scientists
Communication is critical for the data engineering role.
Data engineers support the downstream data scientists who consume the data, help them reason about it, and validate it to ensure high data quality.
Many data scientists love R. Speaking the same language can establish a solid connection and build trust when you work with data scientists. To reach that stage, data engineers must set aside their bias against R and learn how to perform data jobs using R.
Process big data
Data engineering should not be done solely in Python, especially since R is already heavily used in the data field. I hope this article helps data engineers who have overlooked R to see it with fresh eyes and to use it as an alternative for solving problems. It could be more efficient than you think.
I hope my stories are helpful to you.