Spark Performance – Chengzhi Zhao


Apache Spark 4.1 is Here: The Next Chapter in Unified Analytics

Blog, Data Engineering / By Chengzhi Zhao / January 11, 2026

Apache Spark 4.1 is here. Discover how Real-Time Mode (RTM), Declarative Pipelines, and Arrow-Native UDFs are transforming data engineering and PySpark performance.


The Ultimate Apache Spark Guide: Performance Tuning, PySpark Examples, and New 4.0 Features

Data Engineering / By Chengzhi Zhao / June 30, 2025

The ultimate guide to Apache Spark. Learn performance tuning with PySpark examples, fix common issues like data skew, and explore new Spark 4.0 features.


Boosting Spark Union Operator Performance: Optimization Tips for Improved Query Speed

Data Engineering / By Chengzhi Zhao / April 20, 2023

This story focuses on the performance of the Apache Spark union operator: we walk through examples, show the physical query plan, and share optimization techniques.


5 Hidden Apache Spark Facts That Fewer People Talk About

Data Engineering / By Chengzhi Zhao / April 4, 2023

I want to share 5 hidden facts about Apache Spark that I learned throughout my career. They may save you some time reading the Apache Spark source code.


Uncovering the Truth About Apache Spark Performance: coalesce(1) vs. repartition(1)

Data Engineering / By Chengzhi Zhao / April 4, 2023

We will discuss an often-neglected performance difference between coalesce(1) and repartition(1) in Apache Spark. It is worth keeping in mind when you check Spark job performance.


Deep Dive into Handling Apache Spark Data Skew

Data Engineering / By Chengzhi Zhao / December 30, 2022

“Why is my Spark job running slow?” is an inevitable question. We will cover how to identify Spark data skew and how to handle it with different options, including key salting.
