Boosting Spark Union Operator Performance: Optimization Tips for Improved Query Speed

The union operator is one of Spark's set operators; it merges two input data frames into one. Union is a convenient operation in Apache Spark for combining rows that share the same column order. A common use case is applying different transformations to subsets of the data and then unioning the results together.

How to use the union operation in Spark is widely discussed. Its performance caveat is discussed far less often. If we don’t understand that caveat, we can fall into the trap of doubling the execution time needed to get the result.

In this story we focus on the Apache Spark DataFrame union operator: we walk through examples, look at the physical query plan, and share techniques for optimization.

Union Operator 101 in Spark

As in relational database (RDBMS) SQL, union is a direct way to combine rows. One important thing to note when working with the union operator is that the rows must follow the same structure:

The number of columns should be identical. Union does not silently succeed or fill the missing columns with NULL when the data frames have different numbers of columns; Spark raises an error instead.
The column data types should match, because union resolves columns by position rather than by name. The column names should follow the same sequence in each data frame, but Spark does not enforce this: the result takes its column names from the first data frame. Mixing the order can therefore produce an undesired result without any error. Spark’s unionByName is intended to resolve this issue.

In Spark, unionAll is an alias of union, and neither removes duplicates. To get SQL-style UNION semantics without duplicates, we need to call distinct after union.

We can also combine multiple data frames to produce a final data frame.

				
					df = df1.union(df2).union(df3)

Performance Bottleneck of Union Operator

One typical pattern with the union operator is to split a single data frame into several, apply different transformations to each, and eventually combine them into a final one.

Here is an example: we have two big tables (fact tables) that need to be joined, and the best way to join them in Spark is a sort-merge join. Once we have the joined data frame, we split it into four subsets. Each subset goes through different transformations, and eventually we combine those four data frames into the final one.

Spark data frames leverage the Catalyst optimizer, which takes your data frame code and runs it through analysis, logical optimization, physical planning, and code generation. Catalyst tries to create an optimal plan that executes your Spark job efficiently.

In recent years, a lot of optimization work has gone into Catalyst to improve the performance of Spark join operations. Joins are used in more scenarios than unions, so less effort has been put into the union operation.

Unless the union is applied to entirely different data sources, it can hit a performance bottleneck: Catalyst isn’t “smart” enough to identify the shared data frames and reuse them.

In this case, Spark treats each data frame as a separate branch and executes everything from the root once per branch. In our example, the join of the two big tables runs four times! It is a huge bottleneck.
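
The effect can be pictured with a toy model in plain Python. This is not Spark code: expensive_join is a hypothetical stand-in for the big join, and lru_cache plays the role of a shared, reusable result.

```python
from functools import lru_cache

join_runs = 0  # how many times the stand-in join has executed


def expensive_join():
    """Stand-in for the big join; counts every execution."""
    global join_runs
    join_runs += 1
    return tuple(range(8))


def union_of_branches(source, branches=4):
    """Filter the source once per branch, then concatenate the branches.

    Each branch calls source() itself, the way each side of a union
    re-evaluates its own lineage.
    """
    result = []
    for remainder in range(branches):
        result.extend(v for v in source() if v % branches == remainder)
    return result


# No reuse: every branch triggers the join again.
uncached_result = union_of_branches(expensive_join)
runs_without_reuse = join_runs

# With reuse: memoize the join so all branches share one result.
join_runs = 0
cached_join = lru_cache(maxsize=None)(expensive_join)
cached_result = union_of_branches(cached_join)
runs_with_reuse = join_runs

print(runs_without_reuse, runs_with_reuse)  # 4 1
```

The output rows are identical in both runs; only the number of times the expensive step executes changes.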

Set up an Example with Union Operator in Spark

It’s straightforward to reproduce a non-optimized physical query plan for the union operator in Spark. We will do the following:

Create two data frames holding the integers 0 to 999999. Let’s call them df1 and df2
Perform an inner join on df1 and df2
Split the joined result into two data frames: one contains only the odd numbers, the other only the even numbers
Add a field called magic_value to each, generated by a different dummy transformation per data frame
Union the odd and even number data frames

				
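from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

## Reuse the active session (spark is predefined in shells and notebooks)
spark = SparkSession.builder.getOrCreate()

## Create two data frames holding the integers 0 to 999999. Let's call them df1 and df2
df1 = spark.createDataFrame(list(range(1000000)), IntegerType())
df2 = spark.createDataFrame(list(range(1000000)), IntegerType())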

## Perform inner join on df1 and df2
df = df1.join(df2, how="inner", on="value")

## Split the joined result into two data frames: one only contains the odd numbers, another one for the even numbers
df_odd = df.filter(df.value % 2 == 1)
df_even = df.filter(df.value % 2 == 0)

## Add a transformation with a field called magic_value which is generated by two dummy transformations.
df_odd = df_odd.withColumn("magic_value", df.value+1)
df_even = df_even.withColumn("magic_value", df.value/2)

## Union the odd and even number data frames
df_odd.union(df_even).count()

Here is a high-level view of what the DAG looks like. Reading the DAG bottom-up, one thing stands out: the join happens twice, and the two upstream branches look almost identical.

This is where Spark leaves the union operator unoptimized: much time is wasted on unnecessary recomputation when the data source could be reused.

Here is the physical plan, which has 50 stages scheduled with AQE enabled. Looking at ids 13 and 27, we can see that Spark performed the join twice, once in each branch, and recomputed everything upstream of it.

How to Improve the Performance of Union Operation

Now we can see this potential bottleneck. How could we resolve it? One option is to double the number of executors to run more concurrent tasks. But there is a better way: hint to Catalyst that it can reuse the joined data frame from memory.

To resolve the performance issue of the union operation, we can explicitly call cache to persist the joined data frame in memory. Catalyst then knows the shortcut to fetch the data instead of going back to the source.

Where should we add cache()? The recommended place is on the joined data frame, after the join is completed and before the filtering.

Let’s see it in action:

				
					# ...........................
## Perform inner join on df1 and df2
df = df1.join(df2, how="inner", on="value")

## add cache here
df.cache()

## Split the joined result into two data frames: one only contains the odd numbers, another one for the even numbers
df_odd = df.filter(df.value % 2 == 1)
# ...........................

Here is the query plan: InMemoryTableScan is present, so the joined data frame is reused instead of being recomputed.

Now the physical plan is reduced to only 32 stages, and ids 1 and 15 both leverage InMemoryTableScan. The saving would be even greater if we split the original data frame into more, smaller datasets before the union.

Final Thoughts

I hope this story provides some insight into why the union operation sometimes becomes a bottleneck for your Spark performance. Because Catalyst lacks optimization for the union operator, users need to be aware of this caveat to develop Spark code more effectively.

Adding cache saves time in our example, but it won’t help if the union is performed on two completely different data sources, where there is no shared data frame to cache.

This story is inspired by Kazuaki Ishizaki’s talk, Goodbye Hell of Unions in Spark SQL, and by my experience handling a similar issue in my projects.

1 Comment

cst

3 years ago

As of today, I think the bottleneck in the union operator is partially fixed by Catalyst. Instead of redoing the join between the two fact tables, Spark only does it once but somehow re-reads the result from memory multiple times, each time to produce a smaller dataset, and in the end joins them all together. This is not as optimal as using cache: for example, after my initial join my DF has 200 partitions, and from that DF I create 4 smaller DFs, so the union stage still has to process 800 partitions. My dataset is pretty big (70 GB deserialized), so cache is not really an option for me. So the solution proposed by Spark Catalyst is not bad.