Becoming a data engineer requires in-depth knowledge and expertise, and while there are many free online resources available, they often lack the depth and context needed to master the field. That’s why reading books authored by domain experts is a more promising way to gain the knowledge you need.
If you’re looking for suggestions on what books to read for data engineering in 2023, look no further. As a successful data engineer, I’ve compiled a list of ten classic books that cover everything from fundamental technical skills like Python and SQL to design principles and communication. In the 2021 Data/AI Salary Survey, Python and SQL ranked as the most fundamental technical skills; you should be fluent in at least those two languages to be successful. I have recommended two intermediate-level books to expand your knowledge of Python and SQL.
However, knowing those two alone only gets you to the door of data engineering. Many companies expect you to have experience with batch systems like Apache Spark, streaming systems like Apache Flink or Apache Beam, and workflow orchestration tools like Apache Airflow and Kubernetes. I have four recommendations covering those topics.
Design principles are what data engineers need to reach the next level of their careers. On distributed systems and dimensional modeling (OLAP) in particular, I urge you to read two books to learn more about these subjects.
Finally, data engineers get a bonus from having some data analytics skills as well as data communication skills. I found two more books that will sharpen those skills.
Let’s start with my ten recommended classic books for data engineering.
1. The Master Book On Distributed Systems: Designing Data-Intensive Applications
My Short Comment: This is my top go-to selection if I have only one book to recommend. The book covers broad topics and has many references. You can use it as a table of contents to accumulate in-depth knowledge from the works it cites.
This book stitches together numerous data-related technologies around three main properties: reliability, scalability, and maintainability. It covers extensively how distributed systems work. Martin Kleppmann shares deep knowledge and experience throughout, and he cites many books and papers for further reading; those citations alone form another definitive reading list for data and software engineers. From his book, here are some topics to highlight:
- You’ll learn classic system design interview questions like “How would you design Twitter?”.
- How can data be scaled out when it doesn’t fit on a single machine? How do leaders and followers work together to ensure high availability?
- A clear ACID explanation (atomicity, consistency, isolation, durability).
- A better understanding of at-least-once, at-most-once, and exactly-once semantics, which is vital in streaming applications.
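To give a taste of that last point, here is a toy sketch of my own (not from the book) of why exactly-once processing is often built from at-least-once delivery plus idempotent, keyed writes. The broker and consumers below are hypothetical stand-ins, not any real system:

```python
# Toy illustration: "at-least-once delivery + idempotent writes" gives an
# effectively exactly-once result, while a naive counter double-counts.

def deliver_at_least_once(messages):
    """Simulate a broker that redelivers message id=2 once (e.g. after a retry)."""
    for msg in messages:
        yield msg
        if msg["id"] == 2:
            yield msg  # duplicate delivery

def naive_consumer(stream):
    """Counts every delivery, so duplicates inflate the result."""
    total = 0
    for msg in stream:
        total += msg["value"]
    return total

def idempotent_consumer(stream):
    """Keyed, idempotent writes: reprocessing a duplicate has no effect."""
    store = {}
    for msg in stream:
        store[msg["id"]] = msg["value"]  # overwriting the same key is safe
    return sum(store.values())

messages = [{"id": i, "value": 10} for i in range(1, 4)]
print(naive_consumer(deliver_at_least_once(messages)))       # 40 (duplicate counted)
print(idempotent_consumer(deliver_at_least_once(messages)))  # 30 (effectively exactly-once)
```

Real systems achieve this with transactional sinks or deduplication state, but the core intuition is the same.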
The only downside is that the book was published in March 2017; with many new technologies evolving since then, some libraries have faded and a few sections aren’t fresh anymore. However, the core concepts of data system design, the philosophies any data system needs to achieve reliability, scalability, and maintainability, are still applicable today. It is also often bought as a reference book for studying system design. If you are looking for a fresher book to study system design or prepare for an interview, I also recommend the two System Design Interview books from Alex Xu.
2. The SQL Fundamental Book: T-SQL Querying (Developer Reference)
My Short Comment: Ben-Gan explains SQL from the ANSI/ISO standard perspective and teaches you to think about SQL from a procedural programming mindset. After reading his book, writing complex queries and debugging SQL query-related issues becomes much more accessible for a data engineer.
Ben-Gan has written several classic books on T-SQL. This one summarizes his latest expertise in teaching people how to think first and then write SQL queries. Don’t skip this book even if T-SQL is not your primary SQL dialect. The core idea isn’t to teach you fancy SQL syntax; instead, it does a fantastic job of explaining how to think about SQL from a procedural programming perspective. After understanding the core concepts, you’ll feel more comfortable writing SQL logic. Moreover, you’ll become an expert at debugging SQL statements since you know how they work internally. The material paves the road to a rock-solid foundation for expert SQL writing, which you must have as a data engineer.
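As one small illustration of that mindset (my own example, not from the book): SQL clauses are processed logically in the order FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY, which explains, for instance, why WHERE filters rows before aggregation while HAVING filters groups after it. A self-contained sketch using Python’s built-in sqlite3 and a made-up orders table:

```python
import sqlite3

# A tiny demo of reading a query in logical processing order:
# FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 50), ("alice", 70), ("bob", 20), ("bob", 10), ("carol", 200)],
)

rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 15          -- row filter: runs before GROUP BY
    GROUP BY customer
    HAVING SUM(amount) >= 100  -- group filter: runs after aggregation
    ORDER BY total DESC
    """
).fetchall()
print(rows)  # [('carol', 200), ('alice', 120)]
```

Once you read queries in this order, many confusing behaviors (such as why a SELECT alias can’t be referenced in WHERE) become obvious.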
3. OLAP/Data Warehouse Must Read: The Data Warehouse Toolkit (The Definitive Guide to Dimensional Modeling, 3rd Edition)
My Short Comment: A classic book that defines dimensional modeling and the OLAP system. As a data engineer, dimensional modeling and OLAP are topics that take real work to comprehend. Understanding them makes your daily job smoother and gives you a much better chance of doing well in job interviews on data warehouse topics.
Rather than inventing fake schemas, this book draws on a different industry in each chapter and shows the corresponding data models as examples. It doesn’t jump into the technical part of schema design right away. It begins by asking why you need a data warehouse and how to collaborate with end-users to deliver a model that will ultimately succeed. The book teaches you to use the “Enterprise Data Warehouse Bus Matrix” to bridge the gap between technical design and use cases. Then it covers the four critical steps of dimensional modeling, setting the foundation for the rest of the book.
As a data engineer, you’ll need to build ETL pipelines and store data in a format suited for analytics. Dimensional modeling isn’t out of date; many industries still rely on it heavily to develop their OLAP systems.
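To make the idea concrete, here is a minimal star-schema sketch of my own (a hypothetical coffee-shop example, not taken from the book): one fact table of sales referencing a product dimension, followed by a typical rollup query. It uses sqlite3 purely for illustration:

```python
import sqlite3

# A minimal star schema: dim_product (dimension) + fact_sales (fact).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    revenue     REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "espresso", "coffee"), (2, "latte", "coffee"), (3, "scone", "bakery")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 2, 6.0), (2, 1, 4.5), (3, 3, 9.0), (1, 1, 3.0)])

# Analytical queries join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue) AS total_revenue
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY total_revenue DESC
""").fetchall()
print(rows)  # [('coffee', 13.5), ('bakery', 9.0)]
```

The facts stay narrow and numeric; the descriptive attributes live in the dimensions, which is what keeps these rollup queries simple.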
4. In-depth on Python: Fluent Python: Clear, Concise, and Effective Programming 2nd Edition
My Short Comment: An in-depth look into Python. The book demonstrates how to write effective Python as a developer, explores Python’s unique features and how they differ from other languages, and reveals how Python works behind the scenes, making it an excellent option for polishing your Python skills.
This book is decidedly not a Python 101 book. It requires an intermediate level of understanding of Python and targets readers who can write Python but don’t know whether their code is effective or follows Pythonic patterns. This intermediate-to-advanced level of knowledge is the gap many data engineers miss.
Python gives every user a happy path to onboard, but writing Python effectively, in a way that leverages its best features, is not always a given. Python has been broadly adopted by data engineers and is likely your go-to programming language. This book can genuinely make you rethink how you write Python code.
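To give one flavor of what “Pythonic” means here, consider a simplified example in the spirit of the book (my own toy class, not an excerpt): implementing the data model’s dunder methods lets a custom class plug into `len()`, indexing, slicing, membership tests, and iteration for free.

```python
# Implementing __len__ and __getitem__ makes a custom class behave
# like a built-in sequence, with no extra code for slicing or iteration.
class EventLog:
    def __init__(self, events):
        self._events = list(events)

    def __len__(self):
        return len(self._events)

    def __getitem__(self, index):
        return self._events[index]

log = EventLog(["start", "load", "transform", "write", "done"])
print(len(log))       # 5
print(log[0])         # 'start'
print(log[-2:])       # ['write', 'done']  -- slicing works for free
print("load" in log)  # True -- membership falls back to __getitem__
```

That “make your objects behave like the built-ins” instinct is exactly the kind of habit the book builds.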
5. Learning From the Creator: Spark: The Definitive Guide: Big Data Processing Made Simple
My Short Comment: Learning Spark from Matei Zaharia (creator of Apache Spark) makes this book a unique reading experience.
Apache Spark and its ecosystem should be no strangers to any data engineer these days. This book brings you the creator’s perspective on Apache Spark. It walks through the history of Spark and digs deeper into the RDD and DataFrame APIs in both Scala and Python. It also details the Spark ecosystem, covering batch, streaming, ML, and graph processing, with a focus on deployment and Spark job tuning.
Although Spark 3.0 has since been released, the core concepts haven’t changed much. This definitive guide is still a good candidate for gaining deeper knowledge of Apache Spark.
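One of those core concepts is lazy evaluation: transformations only record work, and nothing runs until an action is called. Here is a toy illustration of that model in plain Python (no Spark required; `ToyRDD` is my own invention, not Spark’s API):

```python
# A toy sketch of the idea behind RDDs/DataFrames: transformations are
# lazy and the pipeline only executes when an action (collect) is called.
class ToyRDD:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []

    def map(self, fn):
        # Transformation: records the operation, computes nothing yet.
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, fn):
        return ToyRDD(self._data, self._ops + [("filter", fn)])

    def collect(self):
        # Action: the whole pipeline executes here, in one streaming pass.
        out = iter(self._data)
        for kind, fn in self._ops:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; the action below triggers evaluation.
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Spark’s real engine adds partitioning, shuffles, and an optimizer on top, but this lazy build-then-execute shape is the mental model the book develops.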
6. Workflow Management Tool: Data Pipelines with Apache Airflow
My Short Comment: A step-by-step guide to Apache Airflow, from creating a simple DAG to deploying Airflow to production.
Apache Airflow is becoming the go-to open-source package for workflow management. At first the book feels like an enhanced version of the online Apache Airflow documentation, but you’ll soon realize the authors add extra context from their industrial experience deploying Airflow. In particular, where Airflow’s original design doesn’t fit certain cases, the book offers sound tips on how to work around them.
Airflow fits specific use cases; it isn’t designed for use cases like streaming data, and it takes some time to understand its core concepts. I have written a few blog posts about Airflow that were well received in the community; feel free to read them as well.
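The central abstraction to internalize is the DAG itself: tasks plus dependencies, executed in topological order. That model can be sketched in a few lines of plain Python (a toy stand-in using the standard library’s `graphlib`, not Airflow’s actual API, and a hypothetical ETL pipeline):

```python
# A toy illustration of the DAG model behind Airflow: tasks mapped to
# their upstream dependencies, resolved into an execution order.
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on (a tiny hypothetical pipeline)
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```

Airflow layers scheduling, retries, backfills, and operators on top, but every DAG file you write ultimately describes a dependency graph like this one.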
7. All You Need to Know About Streaming Foundation: Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing
My Short Comment: Written by the Google engineer who built Google’s original streaming system, MillWheel, this book systematically explains his paper and the concepts of streaming. It is the go-to book on the subject and sets the tone for modern, popular streaming systems like Beam, Flink, and Spark Streaming.
As a data engineer, beyond traditional batch data processing, you’ll find streaming systems becoming more widespread as the eagerness to get data faster grows. A streaming system processes data continuously, 24/7. The company that did this well earliest is Google.
Google built an internal streaming system called MillWheel but has not open-sourced it. A few Google engineers published a paper, “MillWheel: Fault-Tolerant Stream Processing at Internet Scale,” to explain MillWheel as a streaming platform in detail; Tyler is the paper’s first author. The paper inspired a few graduate students in Germany, who later built Apache Flink as an open-source project.
The book covers why we need streaming systems and lays out the fundamental ideas behind them. I’d highly recommend his book and the MillWheel paper, as he has real authority in the streaming area. The digital version of the book includes some cool animations that demonstrate streaming ideas; I highly recommend checking those out as well.
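One of the book’s core ideas, windowing in event time, can be sketched in a few lines. This is my own toy example, not code from the book: events carry their own timestamps and may arrive out of order, yet each lands in the correct fixed-size (tumbling) window.

```python
# A toy sketch of event-time tumbling windows: events are grouped by the
# time they occurred, independent of the order in which they arrive.
from collections import defaultdict

def tumbling_windows(events, size_s):
    """Assign each (event_time_s, value) pair to a fixed-size window."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // size_s) * size_s
        windows[window_start].append(value)
    return dict(windows)

# Events arrive out of order -- a common reality in streaming systems.
events = [(3, "a"), (12, "b"), (7, "c"), (15, "d"), (1, "e")]
print(tumbling_windows(events, size_s=10))
# {0: ['a', 'c', 'e'], 10: ['b', 'd']}
```

Real streaming systems must additionally decide *when* a window’s result is complete enough to emit, which is where the book’s treatment of watermarks and triggers comes in.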
8. Understand Data Science: R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
My Short Comment: Data engineers should work on more than just ETL data pipelines. One group of data engineers’ users is data scientists. Understanding basic data science ideas makes communication easier and helps data engineers gain more business sense from the data to provide better solutions.
Mr. Wickham is also the author of one of the most popular data visualization packages in R: ggplot2. If you want to learn why ggplot2 is so fantastic and worth spending time on even if you are not developing in R, check my article, Why Is ggplot2 So Good For Data Visualization?
This book is not all about ggplot2, though. It walks through the data science workflow: how to perform exploratory analysis and how to wrangle data effectively for any data project. Although getting started with R has a learning curve, this book makes it less painful.
9. Data Communication: Storytelling with Data: A Data Visualization Guide for Business Professionals
My Short Comment: “The technology doesn’t make the deal. A good story does.” No matter how much time and effort you put into analyzing data and building the pipeline, poor communication can still lead to undesirable results. This book shows you the do’s and don’ts that help you engage with your audience.
Building a dashboard is simple, but developing a self-explanatory one takes more thought and effort. The author has years of experience coaching data visualization at Google. This is another learn-from-the-domain-expert book.
The book doesn’t just leave you with excellent graphics; it guides you through good practices for communicating data across different channels, like a meeting or documentation. It also teaches you to conduct in-person discussions properly and grab your audience’s attention within three minutes of showing the data. I found this particularly practical for any data engineer who drives conversations toward better outcomes.
10. Gain Knowledge on Cloud Infra: Kubernetes in Action
My Short Comment: As more data infrastructure moves to the cloud, data engineers should understand how their data flows under the hood. With so much infrastructure containerized, learning Kubernetes becomes necessary for data engineers, especially those working across multiple cloud providers.
This book serves well as a first Kubernetes book. It introduces Docker and Kubernetes, has you build a local cluster, and then adds more features to it step by step. Along the way, you also get familiar with kubectl commands. It is an excellent, handy intro book for debugging data engineering issues on the cloud within the Kubernetes ecosystem.
The books I chose cover various data engineering topics, including essential programming languages, popular frameworks for batch and streaming, data communication, and the cloud. There are many other great books I haven’t shown here. I wanted to present the most definitive list I think a data engineer should go through. I hope more fabulous books on data engineering keep emerging, and that you have a great time reading these!
How do you gain all those skills as a data engineer? The answer is to observe what other data engineers do, read the classic data engineering books, and practice data engineering concepts and skills. It will take time to gain the knowledge and experience, but once you commit to it, it will ultimately reward you.
Please comment on your thoughts about the list and how you enjoy reading it. Cheers!
Disclaimer: The original post was a members-only story on Medium.com; I am reposting it here for free on my personal blog.
I hope my stories are helpful to you.