
Developing with Apache Spark can be frustrating when you run into its hidden behaviors: the facts that few people talk about and that rarely make it into online courses or books. Usually you only discover them the day you get an unexpected result and have to dig into the Apache Spark source code.
I want to share 5 hidden facts about Apache Spark that I learned throughout my career. They can save you some time reading the Apache Spark source code.
No-op Operation
Working on the Apache Spark data frame is like working with a table. We perform operations to extract more valuable information by transforming the data.
In 95% of the cases, Apache Spark will scream at you and fail your job if the column name is incorrectly provided.
However, that doesn’t mean Spark will always catch your error. These are the cases where Spark performs a no-op operation.
What’s a no-op operation?
In computer science, a NOP, no-op, or NOOP (pronounced “no op”; short for no operation) is a machine language instruction and its assembly language mnemonic, programming language statement, or computer protocol command that does nothing. — Wikipedia
What’s a no-op operation in Spark? Simply put, Spark checks whether a given column exists. If it doesn’t, Spark skips your instruction and silently does nothing.
There are two operations you need to be aware of:
- withColumnsRenamed
- drop
If you look them up in the Dataset API, the behavior is explicitly documented in the Apache Spark source code:
This is a no-op if schema doesn’t contain existingName.
If we take a closer look at the source code of withColumnsRenamed, for example, we see that if the resolver can find the name, it performs an alias operation; otherwise, the non-existing column is ignored by the shouldRename check.
val shouldRename = output.exists(f => resolver(f.name, existingName))
if (shouldRename) {
  val columns = output.map { col =>
    if (resolver(col.name, existingName)) {
      Column(col).as(newName)
    } else {
      Column(col)
    }
  }
  select(columns : _*)
} else {
  // Column not found: return the DataFrame unchanged (the silent no-op).
  toDF()
}
If you accidentally use an incorrect column name in a Spark job, the job silently keeps executing until some later operation fails because the column is not found. These bugs are hard to identify unless you already know the column name was wrong in the first place and that some Spark operations are no-ops.
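Here is a minimal sketch of the behavior (the DataFrame and column names are made up for illustration, and I use the singular withColumnRenamed, whose documentation carries the same no-op note): a typo in drop or a rename passes silently, while the same typo in a select fails right away.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("noop-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

// Typo in the column name: no exception, the schema comes back unchanged.
df.drop("lable").printSchema()                      // still "id" and "label"
df.withColumnRenamed("lable", "tag").printSchema()  // still "id" and "label"

// The same typo in a select fails immediately with an AnalysisException.
// df.select("lable")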
Coalesce(1) and Repartition(1)
How often have you heard people recommend coalesce when you want fewer partitions than you currently have? You probably also know that coalesce is a narrow transformation: it won’t trigger a shuffle, while repartition will. And we know narrow transformations are faster.
The statement above is true in about 99% of cases, but there is a caveat you need to be aware of.
- coalesce: Returns a new Dataset that has exactly numPartitions partitions, when the fewer partitions are requested.
- repartition: Returns a new Dataset that has exactly numPartitions partitions.
If we look at the Spark source code, repartition is just a friendly name that calls coalesce directly with shuffle explicitly set to true.
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
Many data scientists would like to have a giant CSV file loaded into R to work with. Have you ever written reasonably large data into a single-partition CSV file by calling coalesce(1)? It takes forever.
Why is coalesce(1) slow? It is supposed to coalesce everything without a shuffle and be faster.
Coalescing to a single partition is what the Spark source code documents as a drastic coalesce.
However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is). — Spark Documentation
What happens here may not be obvious until you see an actual result. I have written an article describing this behavior before; to save you the time of reading it, the neglected fact is this: to make coalesce(1) avoid shuffling, Spark explicitly shrinks the upstream RDD to the same number of partitions. In this case it is much slower, because the upstream computation runs with far fewer partitions (a single one).
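Here is a rough sketch of the difference (the output paths and the synthetic DataFrame are only illustrative): with coalesce(1) the whole upstream computation collapses onto a single task, while repartition(1) keeps the upstream parallelism and adds one shuffle at the end.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("coalesce-vs-repartition").getOrCreate()

// A wide-ish synthetic dataset so upstream parallelism matters.
val largeDf = spark.range(0, 50000000, 1, numPartitions = 200).toDF("value")

// coalesce(1): no shuffle, but the entire pipeline above runs as one task.
largeDf.coalesce(1)
  .write.mode("overwrite").option("header", "true").csv("/tmp/output_coalesce")

// repartition(1): adds a shuffle, but the upstream 200 partitions run in parallel.
largeDf.repartition(1)
  .write.mode("overwrite").option("header", "true").csv("/tmp/output_repartition")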
To learn more about this behavior, please refer to my previous post:
Uncovering the Truth About Apache Spark Performance: coalesce(1) vs. repartition(1)
We will discuss a neglected part of Apache Spark Performance between coalesce(1) and repartition(1), and it could be one of the things to be attentive to when you check the Spark job performance.
Magic "2001" Partitions
Tuning spark.sql.shuffle.partitions is like working on a piece of art in Spark; it is vital to Spark performance tuning. Although we have had adaptive query execution since Spark 3.0, it is still preferable to give Spark a hint about the default number of shuffle partitions to anticipate.
Are you getting OOM? Try to increase the partitions to a magic number — 2001.
Well, this isn’t dark magic. If you have more than 2000 partitions, Spark uses HighlyCompressedMapStatus, which can help with OOM issues. Interestingly, this threshold is still hard-coded in Spark.
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  if (uncompressedSizes.length > 2000) {
    // More than 2000 map output blocks: track only approximate, highly compressed sizes.
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}
This was first publicly discussed by Mark Grover and Ted Malaska in their tech talk about the top 5 mistakes when writing Spark applications.
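As a small sketch of how you might apply this (the app name is arbitrary, and 2001 is simply the value that crosses the threshold; tune it for your own workload), the setting can be passed when the session is built or adjusted at runtime:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")
  // Cross the 2000 threshold so shuffle map statuses use HighlyCompressedMapStatus.
  .config("spark.sql.shuffle.partitions", "2001")
  .getOrCreate()

// Or change it at runtime before a heavy join/aggregation:
spark.conf.set("spark.sql.shuffle.partitions", "2001")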
Join on Null Value/Hot Keys
Joining is one of the most discussed areas of Spark performance bottlenecks because it often involves shuffling.
Spark and AQE are smart enough to figure out when a broadcast join should be used. However, you cannot always dodge a sort-merge join between two large datasets.
When two large datasets are joined, an overlooked factor is the distribution of the keys. You are lucky if your dataset is perfectly evenly distributed, with no duplicates or null values. However, real-life cases are often more brutal than the ideal one.
If there is data skew on your keys, you usually discover the problem by checking the Spark UI and finding the last stage stuck at (n-1)/n forever.
Another neglected fact is that NULL values are treated as the same key by default. If you have a large percentage of NULL values, they all end up in a single partition during the shuffle. In many cases, from my experience, the performance regression is not caused by a hot business key but by a dataset with too many “HOT” NULL values, which skews the data on the join.
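One common mitigation, shown as a sketch below (this is separate from the salting approach in the referenced post, and the tables and column names are hypothetical), is to split the NULL-key rows out before the join: since NULL never matches in an equi-join anyway, those rows can bypass the shuffle-heavy join and be unioned back afterwards.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("null-key-join").getOrCreate()
import spark.implicits._

// Hypothetical tables: orders with a nullable customer_id, and a customer dimension.
val ordersDf    = Seq((Some(1), 10.0), (None, 5.0), (None, 7.5)).toDF("customer_id", "amount")
val customersDf = Seq((1, "Alice")).toDF("customer_id", "name")

val withKey    = ordersDf.filter(col("customer_id").isNotNull)
val withoutKey = ordersDf.filter(col("customer_id").isNull)

// Join only the rows that can actually match ...
val joined = withKey.join(customersDf, Seq("customer_id"), "left")

// ... then add the NULL-key rows back; their missing customer columns become NULL.
val result = joined.unionByName(withoutKey, allowMissingColumns = true)
result.show()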
I have written a story on handling Spark data skew in more detail and discussed different techniques for managing data skew, including salting. Please refer to Deep Dive into Handling Apache Spark Data Skew.
Deep Dive into Handling Apache Spark Data Skew
“Why is my Spark job running slow?” is an inevitable question. We will cover how to identify Spark data skew and how to handle it with different options, including key salting.
Map Spark UI Stage to Your Code
Spark UI is convenient for understanding overall Spark job statistics and bottlenecks. However, due to the data frame’s declarative nature and code generation, it is not always evident from the Spark UI Stage tab which part of your code is executing.
On the Spark Web UI page, it is documented as follows:
Notably, Whole Stage Code Generation operations are also annotated with the code generation id. For stages belonging to Spark DataFrame or SQL execution, this allows to cross-reference Stage execution details to the relevant details in the Web-UI SQL Tab page where SQL plan graphs and execution plans are reported.
Below is an example of using this cross-reference and hovering on the SQL tab to see which code path is executed. It becomes convenient to narrow the performance bottleneck down to a specific part of the code.

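If you want to check the same mapping outside the UI, here is a small complementary sketch (not part of the screenshot above; the query is made up): printing the physical plan locally reports the same kind of WholeStageCodegen subtree ids that annotate the stages in the web UI.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("codegen-ids").getOrCreate()
import spark.implicits._

val df = spark.range(1000).groupBy(($"id" % 10).as("bucket")).count()

df.explain("formatted")  // operator-level physical plan, similar to the SQL tab graph
df.explain("codegen")    // dumps the generated code per WholeStageCodegen subtree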
Final Thoughts
Apache Spark has less-talked-about behaviors that can be hard to identify and debug, and it takes some hands-on experience with Spark to become aware of them. I hope this story provides insight into these five hidden facts about Apache Spark and helps you debug and develop your next job.
Please let me know if you have any comments or thoughts about this story. Thanks for taking the time to read it!
About Me
I hope my stories are helpful to you.
For data engineering posts, you can subscribe to my new articles or become a referred Medium member, which also gives you full access to stories on Medium.
More Articles
Five Tips for Saving Money At Target
How can you maximize your savings when you shop at Target? I will unveil 5 ways of saving money at Target.
Airflow Schedule Interval 101
The Airflow schedule interval can be a challenging concept to comprehend; even developers who have worked with Airflow for a while find it difficult to grasp. A confusing question that arises every once in a while on StackOverflow is “Why is my DAG not running as expected?”. This problem usually indicates a misunderstanding of the Airflow schedule interval.
Boosting Spark Union Operator Performance: Optimization Tips for Improved Query Speed
We will focus on the Apache Spark Union Operator Performance with examples, show you the physical query plan, and share techniques for optimization in this story.