Why R for Data Engineering is More Powerful Than You Thought

Data engineering has not been very welcoming to R.

Most data engineers’ daily work in the industry involves SQL and Python. We occasionally write Scala or Java for Spark and Flink jobs. Data engineers rarely use R in their work. Many people leave R behind as something from a school project or label it a tool for data scientists only.

I regret underestimating R’s power early in my data engineering career. Three years ago, I was encouraged to use R for a work-related project, and I realized it is much more powerful for data engineering than I had thought.

This article is not another entry in the R vs. Python debate. I wrote it to put R on data engineers’ radar and to show where R can benefit the data engineering community.

Embrace the Awkwardness of R

R feels awkward to those who learned programming with C-family languages.

The first thing that can throw anyone off is the assignment operator. The R community prefers an arrow (i.e., <-) over the widely adopted equal sign (i.e., =). You can still use the equal sign, but it is not considered best practice.

Indexing is different, too. It starts at 1 instead of 0. Mixing up the start index with other languages is an easy mistake to make, especially if you have not programmed in R for a while.
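As a small illustration of the two differences above, here is a minimal sketch in base R. The variable names are made up for this example.

```r
# Both assignment forms work at the top level; `<-` is the community convention
x <- c(10, 20, 30)
y = c(10, 20, 30)
same <- identical(x, y)   # the two vectors are identical

# Indexing starts at 1, so the first element is x[1]
first <- x[1]
last <- x[length(x)]

# Index 0 does not raise an error; it silently returns an empty vector,
# which is why off-by-one habits from other languages are easy to miss
zero <- x[0]
```

The silent empty result for `x[0]` is the part most likely to surprise people coming from 0-based languages.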

Those are language differences. As an analogy, think of R as a wise man who speaks with a strong accent. Language exists to communicate, so we must listen more carefully to uncover his insights. If we fail to listen patiently, we won’t learn anything from him.

If you have yet to try R, don’t be scared away by its strangeness. Once you have used it for some time, I am sure you will get used to it and start to like it.

R for Data Engineering

In data engineering projects, you usually interact with data through a tabular data structure instead of handling nested arrays with loops. R comes with native support for data frames, similar to Python’s pandas or Spark’s DataFrame.

Once you are comfortable with R’s data frame, much of the complexity of the language’s other features stays out of your way. If you already know data frames from another library, the learning curve for R is much gentler.
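To show what working with a native data frame looks like, here is a minimal sketch that uses only base R. The `orders` table and its columns are hypothetical sample data, not something from a real project.

```r
# Hypothetical sample data, for illustration only
orders <- data.frame(
  order_id = 1:5,
  region   = c("east", "west", "east", "north", "west"),
  amount   = c(120, 80, 200, 50, 95)
)

# Filter rows and pick columns without writing an explicit loop
large_orders <- orders[orders$amount > 90, c("order_id", "amount")]

# Aggregate by group, similar to a GROUP BY in SQL
by_region <- aggregate(amount ~ region, data = orders, FUN = sum)
```

If you have used pandas or Spark, the filter and the grouped sum should feel familiar even though the syntax differs.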

However, the main question remains: I am a savvy Python data engineer who does everything the Pythonic way. Why should I learn R, and how does R help me in data engineering? Let me share my views through the following four main reasons:

The Beauty of The Grammar of Graphics
Sophisticated Analytics Package
Communication with Data Scientists
Process big data

The Beauty of The Grammar of Graphics

“A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.” (Hadley Wickham, A Layered Grammar of Graphics)

Data visualization is part of data engineering. It is not only something you do when you get involved with the analytics side of the data ecosystem. During data pipeline development and when deriving business logic, visualization is vital for observing patterns and identifying potential value in the data.

I got lost trying to create good data visualizations in Python because people use too many different libraries. A lot of Python visualization code is hard to comprehend. I found myself bouncing among Matplotlib, seaborn, plotly, bokeh, and Altair across the projects in which I collaborated with other engineers on data visualization.

There are too many comparable options to choose from in Python. I found it hard to get teams to agree on one of those excellent data visualization libraries. That can bring chaos to your data engineering project and get in the way of proper and timely communication.

In R, ggplot2 is widely adopted, and it implements the grammar of graphics. The result is much neater, easier-to-understand code for data visualization.

You can learn more about why in one of my articles: Why Is ggplot2 So Good For Data Visualization?
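To give a feel for the layered style, here is a minimal sketch, assuming the ggplot2 package is installed. It uses the mtcars dataset that ships with R. The choice of columns and labels is only for illustration.

```r
library(ggplot2)

# Start from data plus aesthetic mappings, then add one layer at a time with `+`
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +                            # geometry layer: one point per car
  geom_smooth(method = "lm", se = FALSE) +  # statistical layer: linear fit per group
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()

# In an interactive session, print(p) renders the plot
```

Each line describes one component of the graphic, which is what makes the code easy to read and review within a team.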

Sophisticated Analytics Package

R has a range of sophisticated packages for data analytics. You can use the following to accomplish complex data engineering tasks:

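dplyr: data wrangling and analysis. The style is similar to writing Spark code, and it is much more readable because we pipe statements together instead of nesting multiple function calls.

# Load dplyr, which also provides the starwars example dataset
library(dplyr)

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),                          # number of characters per species
    mass = mean(mass, na.rm = TRUE)   # average mass, ignoring missing values
  ) %>%
  filter(
    n > 1,
    mass > 50
  )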

tidyr: an enhanced way to organize data frames.
data.table: provides an improved version of data.frame. If you refer to H2O’s Database-like ops benchmark, data.table is one of the fastest libraries for processing small to medium data. In contrast, pandas can run out of memory (OOM), and users must find alternatives to process data on a single machine.
Shiny: builds web applications for data, similar to Streamlit in Python. It is straightforward to keep everything in R and deploy it. I have written an article about it:

How to Engage with Users By Storytelling: Show Data Analytics in R and Shiny
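Coming back to data.table from the list above, here is a minimal sketch of its `DT[i, j, by]` syntax, assuming the data.table package is installed. The `events` table is hypothetical sample data.

```r
library(data.table)

# Hypothetical sample data, for illustration only
events <- data.table(
  user_id = c(1, 1, 2, 2, 3),
  action  = c("view", "buy", "view", "view", "buy"),
  value   = c(0, 25, 0, 0, 40)
)

# DT[i, j, by]: filter rows (i), compute columns (j), group (by)
purchases <- events[action == "buy",
                    .(total = sum(value), n = .N),
                    by = user_id]

# `:=` adds a column by reference, without copying the whole table
events[, is_purchase := action == "buy"]
```

Updating by reference is one of the reasons data.table stays memory efficient on a single machine.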

Communication with Data Scientists

Communication is critical for the data engineering role.

Data engineers support the downstream data scientists who consume data, help them reason about it, and validate it to ensure high data quality.

Many data scientists love R. Communicating in the same language can help establish a solid connection and build trust when you work with data scientists. To reach that stage, data engineers must set aside their bias against R and learn how to do data work in R.

Process big data

I don’t think R is limited to building models on someone’s local desktop. You can write SparkR for large-scale distributed data processing, build ETL workflows orchestrated with Mage, and connect to databases to pull large volumes of data. R can work with big data and show its potential in data engineering.
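As one example of pulling data from a database, here is a minimal sketch that uses the DBI package. It assumes DBI and RSQLite are installed, and an in-memory SQLite database stands in for a real warehouse. The table and query are made up for illustration. The point is that the aggregation runs inside the database and only the small result comes back into R.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for a real data source
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load a sample table so the query below has something to read
dbWriteTable(con, "cars", mtcars)

# Push the aggregation down to the database and fetch only the summary
summary_by_cyl <- dbGetQuery(con, "
  SELECT cyl, COUNT(*) AS n, AVG(mpg) AS avg_mpg
  FROM cars
  GROUP BY cyl
  ORDER BY cyl
")

dbDisconnect(con)
```

With a different DBI backend, the same `dbConnect` and `dbGetQuery` calls apply; only the driver and connection arguments change.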

Final Thoughts

Data engineering should not be done solely in Python, especially since R is already heavily used across the data field. I hope this article helps data engineers who have overlooked R see it with fresh eyes and consider it as an alternative for solving problems. It could be more efficient than you think.

About Me

I hope my stories are helpful to you.

For more data engineering posts, you can subscribe to my new articles or become a referred Medium member, which also gives you full access to stories on Medium.

If you have questions or comments, do not hesitate to write in the comments of this story or reach me directly through LinkedIn or Twitter.

Mar08

About The Author

Chengzhi Zhao