You might find writing SQL challenging due to its declarative nature. Especially for engineers familiar with imperative languages like Python, Java, or C, SQL is gear-switching and mind shifts to many people. Thinking in SQL is different than any imperative language and should not be learned and developed the same way.
When working with SQL, do you write in the top to bottom approach? Do you start developing in SQL with the “SELECT” clause first in SQL? In this article, let’s talk about SQL logical query processing order to help you understand why you might want to change that way you write SQL top to bottom. It can also help you think in SQL clearly and develop your query more effectively.
SQL Is a Declarative Language
The core concept of the declarative language is fantastic: humans convey high-level structure and logical instructions without exact flows. As a developer, that means instead of writing step-by-step guide on how to accomplish a goal. We’d write what goal to achieve so we can focus more on the result and let the underlying query parser and engine figure out how to execute the instruction.
When writing in SQL, you focus more on input and output, and execution is a magic black box behind the scene. For detailed execution, user provide preference on plan and optimization. But the execution is still handled by the query engine at run time.
However, it doesn’t mean we can wave a magic wand, and everything will be executed as desired. High-level instructions are still required that provides the following information.
- Where the data sources (FROM)
- What condition to apply (WHERE/HAVING)
- Whether the row level or aggregation (GROUP BY)
- How to show the result (SELECT)
- How to sort the data (ORDER BY)
- How many rows to return (LIMIT)
Unlike imperative language to get data, we don’t need to write complex logic or loop through each row. The SQL query is like a blueprint, and the SQL engine is the builder. As users, we need to wait for the output without worrying about how it gets the results.
SQL Logical Query Processing Order
If you have worked with other data processing frameworks in a imperative language, or even for users of Spark or Pandas, we always need a dataset to work with first. The places to get the data source is usually the entry point for any data processing framework, and the same applies for SQL.
To understand how SQL logical query processing order works, let’s take a SQL query at the syntax order compare it with logical processing order.
- FROM: the entry point for SQL is the FROM clause. It is usually defined right after the “SELECT” statement, and executed as the preparation stage to navigate the query engine which tables to pull the data
- ON: on condition is evaluated next from each table to check which key(s) to join
- JOIN: once knowing which keys to join, the SQL engine will check which types of join (Inner, Outer, Cross join, etc.) needs to be applied here
- WHERE: apply row-level filters
- GROUP BY: indicate the keys to aggregate on and change the view of the original row-level table. After this stage, everything handled forward will be aggregated level instead of original row level. If you use cube or roll-up for aggregation, it also happened at this stage.
- HAVING: apply aggregated-level filters. We can also write a nested query or CTE (common table expression) to perform filter at the aggregated level. Still, it is much more convenient next to the GROUP BY clause and potentially good to SQL engine optimization.
- SELECT: Choose the fields to show to the users. For the derived fields that requires complex logic like window function (rank(), row_number(), etc.) or case statement, or aggregated function, they all happened at this time. As you can see, SELECT is much late in SQL logical query processing order. If you start with SELECT clause first, it becomes hard to anticipate what you will write for the other 6 statements mentioned above, thus likely causing unexpected result or error in SELECT. Let’s talk about more in the next section.
- ORDER BY: sorting order of your final result. At this stage, we can resolve the alias name you defined in the SELECT clause.
- LIMIT: return the number of result, or if you’d like to skip a few top rows combined with ORDER BY
Let’s reshuffle the SQL statement from syntax order into the SQL logical query processing order, and it will become much clear.
Microsoft has excellent documentation on this topic on Logical Processing Order of the SELECT statement.
"SELECT" Shouldn't Be the First Word When Writing SQL
“SELECT” is usually the first clause when reading a SQL statement. Our brain works well with a top-to-bottom approach while reading or writing. Having “SELECT” first defines the result we’d like to show. Writing “SELECT” as the first word follows that pattern. Additionally, putting “SELECT” down is syntax correct to compile and run the entire SQL script.
How many times when you write a SQL query for data analysis, you have the following written down first:
SELECT col1, col2, CASE (…), ROW_NUMER(…)
IT WORKS! An SQL compiled and pulled the result correctly.
However, as you can see above, SELECT is evaluated much late in SQL logical query processing order. Here are some reasons why avoid writing everything in SELECT first:
- It’s hard to anticipate what you will write for the statement before SELECT in the logical query processing order. Do you have all the table names? Do you have the alias defined? Whether you are processing row-level or aggregated-level data? I know someone can prevision everything SQL and all table and alias very well. Unfortunately, I am not good at foreseeing and memorization. Writing down SQL in the logical query processing order gives me references for what will be written next. It also keep a hint for me if I got interrupted when I was coding, and it’s much easier to resume from the breakpoint.
- The logical query processing order fit our brain better. If you write SQL, you probably familiar with ETL concept (Extraction, Transformation, Loading) data. The ETL concept follows the logical order: you take datasets, manipulate the datasets, place the reformed datasets somewhere else. Writing in SQL should follow the same logical order, “SELECT” fits more as transformation or loading stage, and we should get the dataset preparation first to feed “SELECT.”
- It helps debug and reason the error. By following the logical query processing order, some errors becomes obvious. Like this StackOverflow question: “SQL not recognizing column alias in where clause,” From this article, we can answer that quickly: the SELECT alias hasn’t been evaluated, so SQL engine doesn’t have the context of the alias name you gave, so it failed. (some modern SQL vendors do some optimization to avoid this issue, but it does not apply to all SQL providers. We still should be aware of this).
Tips On Writing SQL In Logical Query Processing Order
- We can still write “SELECT” first, but only those 6 letters; use it as a placeholder to remind us to fill in the remaining content when it’s ready.
- Think in the ETL manner, the data preparation is crucial to set the foundation for the rest of work. So the first thing to focus is getting FROM, ON, and JOIN statements right.
- If you encounter errors, follow the logical query processing order to debug.
SELECT *is terrible in production, but we can still use it for debugging. We can further prevent ourselves pulling too much data by LIMIT clause.
If you write SQL top to bottom and this approach gives you some trouble in thinking SQL clearly, you should consider comprehending the SQL logical query processing order and practice this way.
I learned this approach early when I prepare for Microsoft Exam 70–461, and Itzik Ben-Gan has a great book Querying Microsoft SQL Server 2012 (MCSA). What is great about this book is that he explained very well on fundamentals of SQL compared to other books are collection of fancy syntaxes.
Thinking in SQL is critical. You can reason why things happen like this and quickly resolve the errors. Additional, it helps you better structure your code for solving the complex queries in a more organized way.
I hope my stories are helpful to you.
The airflow schedule interval could be a challenging concept to comprehend, even for developers work on Airflow for a while find difficult to grasp. A confusing question arises every once a while on StackOverflow is “Why my DAG is not running as expected?”. This problem usually indicates a misunderstanding among the Airflow schedule interval.
Exploratory Data Analysis (EDA) is a methodology in data science as the initial approach to gain insights by visualizing and summarizing data. We will use some exploratory data analysis technics to find the reason behind the bidding war on the housing market.
When it comes to visualizing data with a histogram and dealing with multiple groups, it can be quite challenging. I have recently come across a useful ggplot2 extension called ggridges that has been helpful for my data exploratory tasks.