
Developing with Apache Spark can be frustrating when you run into its hidden behaviors: the facts that few people talk about and that rarely make it into online courses or books. Usually you only discover them the day you get an unexpected result and have to dig into the Apache Spark source code.
I want to share 5 hidden facts about Apache Spark that I learned throughout my career. They can save you some time reading the Apache Spark source code.
No-op Operation
Working on the Apache Spark data frame is like working with a table. We perform operations to extract more valuable information by transforming the data.
In 95% of the cases, Apache Spark will scream at you and fail your job if the column name is incorrectly provided.
However, that doesn’t mean Spark will always catch your error. These are the cases where Spark performs a no-op operation.
What’s a no-op operation?
In computer science, a NOP, no-op, or NOOP (pronounced “no op”; short for no operation) is a machine language instruction and its assembly language mnemonic, programming language statement, or computer protocol command that does nothing. — Wikipedia
What’s a no-op operation in Spark? Simply put, Spark checks whether a given column exists. If it doesn’t, Spark skips your instruction and silently does nothing.
There are two operations you need to be aware of:
- withColumnsRenamed
- drop
If you look them up in the Dataset API, the behavior is explicitly documented in the Apache Spark source code:
This is a no-op if schema doesn’t contain existingName.
If we take a closer look at the source code of withColumnsRenamed, for example, we see that if the resolver can find the name, it performs an alias operation; otherwise, the non-existing column is ignored by the shouldRename check.
val shouldRename = output.exists(f => resolver(f.name, existingName))
if (shouldRename) {
  val columns = output.map { col =>
    if (resolver(col.name, existingName)) {
      Column(col).as(newName)
    } else {
      Column(col)
    }
  }
  select(columns : _*)
} else {
  // Column not found: return the DataFrame unchanged (the silent no-op).
  toDF()
}
If you accidentally use an incorrect column name in a Spark job, the job silently keeps executing until some later operation fails because the column is not found. These bugs are hard to identify unless you already know the column name was wrong in the first place and that some Spark operations are no-ops.
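Here is a minimal sketch of the behavior (the DataFrame and column names are made up for illustration, and I use the singular withColumnRenamed, whose documentation carries the same no-op note): a typo in drop or a rename passes silently, while the same typo in a select fails right away.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("noop-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

// Typo in the column name: no exception, the schema comes back unchanged.
df.drop("lable").printSchema()                      // still "id" and "label"
df.withColumnRenamed("lable", "tag").printSchema()  // still "id" and "label"

// The same typo in a select fails immediately with an AnalysisException.
// df.select("lable")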
Coalesce(1) and Repartition(1)
How often have you heard people recommend coalesce when you want fewer partitions than you currently have? You probably also know that coalesce is a narrow transformation: it won’t trigger a shuffle, while repartition will. And we know narrow transformations are faster.
The statement above is true in about 99% of cases, but there is a caveat you need to be aware of.
- coalesce: Returns a new Dataset that has exactly numPartitions partitions, when the fewer partitions are requested.
- repartition: Returns a new Dataset that has exactly numPartitions partitions.
If we look at the Spark source code, repartition is just a friendly name that calls coalesce directly with shuffle explicitly set to true.
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
Many data scientists would like to have a giant CSV file loaded into R to work with. Have you ever written reasonably large data into a single-partition CSV file by calling coalesce(1)? It takes forever.
Why is coalesce(1) slow? It is supposed to coalesce everything without a shuffle and be faster.
Coalescing to a single partition is what the Spark source code documents as a drastic coalesce.
However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is). — Spark Documentation
What happens here may not be obvious until you see an actual result. I have written an article describing this behavior before; to save you the time of reading it, the neglected fact is this: to make coalesce(1) avoid shuffling, Spark explicitly shrinks the upstream RDD to the same number of partitions. In this case it is much slower, because the upstream computation runs with far fewer partitions (a single one).
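Here is a rough sketch of the difference (the output paths and the synthetic DataFrame are only illustrative): with coalesce(1) the whole upstream computation collapses onto a single task, while repartition(1) keeps the upstream parallelism and adds one shuffle at the end.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("coalesce-vs-repartition").getOrCreate()

// A wide-ish synthetic dataset so upstream parallelism matters.
val largeDf = spark.range(0, 50000000, 1, numPartitions = 200).toDF("value")

// coalesce(1): no shuffle, but the entire pipeline above runs as one task.
largeDf.coalesce(1)
  .write.mode("overwrite").option("header", "true").csv("/tmp/output_coalesce")

// repartition(1): adds a shuffle, but the upstream 200 partitions run in parallel.
largeDf.repartition(1)
  .write.mode("overwrite").option("header", "true").csv("/tmp/output_repartition")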
To learn more about this behavior, please refer to my previous post:
Uncovering the Truth About Apache Spark Performance: coalesce(1) vs. repartition(1)
We will discuss a neglected part of Apache Spark Performance between coalesce(1) and repartition(1), and it could be one of the things to be attentive to when you check the Spark job performance.
Magic "2001" Partitions
Tuning spark.sql.shuffle.partitions is like working on a piece of art in Spark; it is vital to Spark performance tuning. Although we have had adaptive query execution since Spark 3.0, it is still preferable to give Spark a hint about the default number of shuffle partitions to anticipate.
Are you getting OOM? Try to increase the partitions to a magic number — 2001.
Well, this isn’t dark magic. If you have more than 2000 partitions, Spark uses HighlyCompressedMapStatus, which can help with OOM issues. Interestingly, this threshold is still hard-coded in Spark.
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  if (uncompressedSizes.length > 2000) {
    // More than 2000 map output blocks: track only approximate, highly compressed sizes.
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}
This was first publicly discussed by Mark Grover and Ted Malaska in their tech talk about the top 5 mistakes when writing Spark applications.
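As a small sketch of how you might apply this (the app name is arbitrary, and 2001 is simply the value that crosses the threshold; tune it for your own workload), the setting can be passed when the session is built or adjusted at runtime:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")
  // Cross the 2000 threshold so shuffle map statuses use HighlyCompressedMapStatus.
  .config("spark.sql.shuffle.partitions", "2001")
  .getOrCreate()

// Or change it at runtime before a heavy join/aggregation:
spark.conf.set("spark.sql.shuffle.partitions", "2001")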
Join on Null Value/Hot Keys
Joining is one of the most discussed areas of Spark performance bottlenecks because it often involves shuffling.
Spark and AQE are smart enough to figure out when a broadcast join should be used. However, you cannot always dodge a sort-merge join between two large datasets.
When two large datasets are joined, an overlooked factor is the distribution of the keys. You are lucky if your dataset is perfectly evenly distributed, with no duplicates or null values. However, real-life cases are often more brutal than the ideal one.
If there is data skew on your keys, you usually discover the problem by checking the Spark UI and finding the last stage stuck at (n-1)/n forever.
Another neglected fact is that NULL values are treated as the same key by default. If you have a large percentage of NULL values, they all end up in a single partition during the shuffle. In many cases, from my experience, the performance regression is not caused by a hot business key but by a dataset with too many “HOT” NULL values, which skews the data on the join.
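One common mitigation, shown as a sketch below (this is separate from the salting approach in the referenced post, and the tables and column names are hypothetical), is to split the NULL-key rows out before the join: since NULL never matches in an equi-join anyway, those rows can bypass the shuffle-heavy join and be unioned back afterwards.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("null-key-join").getOrCreate()
import spark.implicits._

// Hypothetical tables: orders with a nullable customer_id, and a customer dimension.
val ordersDf    = Seq((Some(1), 10.0), (None, 5.0), (None, 7.5)).toDF("customer_id", "amount")
val customersDf = Seq((1, "Alice")).toDF("customer_id", "name")

val withKey    = ordersDf.filter(col("customer_id").isNotNull)
val withoutKey = ordersDf.filter(col("customer_id").isNull)

// Join only the rows that can actually match ...
val joined = withKey.join(customersDf, Seq("customer_id"), "left")

// ... then add the NULL-key rows back; their missing customer columns become NULL.
val result = joined.unionByName(withoutKey, allowMissingColumns = true)
result.show()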
I have written a story on handling Spark data skew in more detail and discussed different techniques for managing data skew, including salting. Please refer to Deep Dive into Handling Apache Spark Data Skew.
Deep Dive into Handling Apache Spark Data Skew
“Why is my Spark job running slow?” is an inevitable question. We will cover how to identify Spark data skew and how to handle it with different options, including key salting.
Map Spark UI Stage to Your Code
Spark UI is convenient for understanding overall Spark job statistics and bottlenecks. However, due to the data frame’s declarative nature and code generation, it is not always evident from the Spark UI Stage tab which part of your code is executing.
On the Spark Web UI page, it is documented as follows:
Notably, Whole Stage Code Generation operations are also annotated with the code generation id. For stages belonging to Spark DataFrame or SQL execution, this allows to cross-reference Stage execution details to the relevant details in the Web-UI SQL Tab page where SQL plan graphs and execution plans are reported.
Below is an example of using this cross-reference and hovering on the SQL tab to see which code path is executed. It becomes convenient to narrow the performance bottleneck down to a specific part of the code.

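If you want to check the same mapping outside the UI, here is a small complementary sketch (not part of the screenshot above; the query is made up): printing the physical plan locally reports the same kind of WholeStageCodegen subtree ids that annotate the stages in the web UI.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("codegen-ids").getOrCreate()
import spark.implicits._

val df = spark.range(1000).groupBy(($"id" % 10).as("bucket")).count()

df.explain("formatted")  // operator-level physical plan, similar to the SQL tab graph
df.explain("codegen")    // dumps the generated code per WholeStageCodegen subtree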
Final Thoughts
Apache Spark has less-talked-about behaviors that can be hard to identify and debug, and it takes some hands-on experience with Spark to become aware of them. I hope this story provides insight into these five hidden facts about Apache Spark and helps you debug and develop your next job.
Please let me know if you have any comments or thoughts about this story. Thanks for taking the time to read it!
About Me
I hope my stories are helpful to you.
For data engineering posts, you can subscribe to my new articles or become a referred Medium member, which also gives you full access to stories on Medium.
More Articles
Five Tips for Saving Money At Target
How can you maximize your savings when you shop at Target? I will unveil 5 ways of saving money at Target.
Airflow Schedule Interval 101
The Airflow schedule interval can be a challenging concept to comprehend; even developers who have worked with Airflow for a while find it difficult to grasp. A confusing question that arises every once in a while on StackOverflow is “Why is my DAG not running as expected?”. This problem usually indicates a misunderstanding of the Airflow schedule interval.
Boosting Spark Union Operator Performance: Optimization Tips for Improved Query Speed
We will focus on the Apache Spark Union Operator Performance with examples, show you the physical query plan, and share techniques for optimization in this story.