Apache Spark – Chengzhi Zhao

Apache Spark 4.1 features banner showing Python and Streaming | Image By Author

Apache Spark 4.1 is Here: The Next Chapter in Unified Analytics

Blog, Data Engineering / By Chengzhi Zhao / January 11, 2026

Apache Spark 4.1 is here. Discover how Real-Time Mode (RTM), Declarative Pipelines, and Arrow-Native UDFs are transforming data engineering and PySpark performance.

The Ultimate Apache Spark Guide: Performance Tuning, PySpark Examples, and New 4.0 Features

Data Engineering / By Chengzhi Zhao / June 30, 2025

The ultimate guide to Apache Spark. Learn performance tuning with PySpark examples, fix common issues like data skew, and explore new Spark 4.0 features.

Data Engineering Heats Up in June 2025: A Look at the Latest Developments

Data Engineering / By Chengzhi Zhao / June 16, 2025

Stay current with the essential data engineering news from June 2025. This monthly roundup covers the biggest announcements from Databricks’ Data + AI Summit, new Snowflake features, Apache Flink updates, and the growing role of AI and Apache Iceberg in the data landscape.

Don’t Get Tripped Up! 10 Common Data Engineering Pitfalls

Data Engineering / By Chengzhi Zhao / June 1, 2025

Learn how to avoid 10 common data engineering pitfalls, including Spark data skew, Airflow retry chaos, schema drift, and more, with practical solutions.

Data Engineering in 2025: A Practical Guide for New Grads Entering the AI-First Era

AI, Data Engineering / By Chengzhi Zhao / May 6, 2025

Explore how AI in data engineering is shaping the future. This 2025 guide helps new grads build the skills, tools, and mindset to thrive in a cloud-driven, AI-first world.

DeepSeek SmallPond: A Game-Changer for Data Engineers Seeking Lightweight Solutions

Data Engineering / By Chengzhi Zhao / March 8, 2025

DeepSeek SmallPond is here to shake up data engineering. See how this lightweight open-source framework offers a fresh alternative to Apache Spark and Flink for batch and streaming processes.

Boosting Spark Union Operator Performance: Optimization Tips for Improved Query Speed

Data Engineering / By Chengzhi Zhao / April 20, 2023

This story focuses on the performance of the Apache Spark union operator: we walk through examples, show you the physical query plan, and share techniques for optimization.

5 Hidden Apache Spark Facts That Fewer People Talk About

Data Engineering / By Chengzhi Zhao / April 4, 2023

I want to share 5 hidden facts about Apache Spark that I learned throughout my career. They can save you some time reading the Apache Spark source code.

Uncovering the Truth About Apache Spark Performance: coalesce(1) vs. repartition(1)

Data Engineering / By Chengzhi Zhao / April 4, 2023

We will discuss an often-neglected aspect of Apache Spark performance: the difference between coalesce(1) and repartition(1). It is one of the things to pay attention to when you check a Spark job's performance.

The Essential Reading List for Data Engineers: 10 Classic Books You Can’t Miss

Data Engineering / By Chengzhi Zhao / February 20, 2023

Discover the essential reading list for data engineers. While many free online resources are available, they often lack the depth and context needed to truly master the field. In this article, I will share ten classic books that cover everything from fundamental technical skills like Python and SQL to more advanced topics like Apache Spark, Apache Flink, Apache Beam, Apache Airflow, Kubernetes, distributed systems, and dimensional modeling.
