Think In SQL – Window Function Demisify

Photo by Katerina Pavlyuchkova on Unsplash
Photo by Katerina Pavlyuchkova on Unsplash

Many SQL tutorials mark the window function as the advanced topic. Numerous job interviews on SQL often involve at least one question on window function. Understanding window function is critical for anyone that writes SQL daily (as well as SQL like languages like Spark Dataframe and Pandas). In this story, let’s think in SQL and demystify the window function with examples and diagrams.

What is Window Function?

Many versions of window function definitions are described by the community or the official DB vendors (MySQL, Postgresql). To be clear, I describe the window function in the following way: 

Window Function enables current row to oversee a range of rows, applies function on those rows, and returns result back to the current row. 

However, reading the definition will only resonate on how to think and use in SQL. Let’s put the definition aside. We’d need concrete examples and diagrams to explain how the window function works. 

Window Function Example Setup

To explain how window function works, we don’t need fancy joins or complex conditional filters. All we need to do is a few rows that allow us to slide up and down 😃

I will use DuckDb for the illustration since DuckDb doesn’t require many dependencies, and you can get a SQL interface immediately. 

We can create a dummy table with three fields: key, value, and date. Then export the table into a CSV file, then read the table by DuckDb CLI.

					import pandas as pd
data = {"key": [1, 1, 1, 2, 2, 3, 3, 4],
        "value": [100, 200, 300, 100, 200, 200, 300, 200],
        "date": [1, 2, 3, 1, 2, 2, 1, 1]
window_function_demo_df = pd.DataFrame.from_dict(data)

Then launch DuckDb CLI then read the table using the magic. read_csv_auto 

--once in the DuckDb CLI, create a table first
CREATE TABLE window_function_demo AS SELECT * FROM read_csv_auto ('~/Downloads/window_function_demo.csv');
--select all columns from the newly created table
SELECT * FROM window_function_demo;
Initial Setup | Image By Author
Initial Setup | Image By Author

The Components of the Window Function

Window Function Syntax | Image from DuckDb
Window Function Syntax | Image from DuckDb

A window function can be broken down into three main components: 

  • Partition By
  • Order By
  • Range of the Window Function

Partition By

The “Partition By” clause is for logical grouping. Those keys segregate current rows from the other rows in a logical partition, thus further limiting the window function’s range. 

In our example, we have four distinct values under the key. When we apply window function, each value is going to be treated separately. The partition key provides the logical boundary avoid taking values from same window range but different keys. When the cursor reaches the end of a given partition, it will stop instead of jumping into the next partition. 

Window Function Partition By Diagram | Image By Author
Window Function Partition By Diagram | Image By Author

If we have the following window function to sum the field value partition by field key 

					SELECT key, value, date, SUM(value) Over (PARTITION BY key) AS total
FROM window_function_demo;

The following result is expected that assign each row with the sum of value for a given partition. Key 1 gets 100+200+300=600, Key 2 gets 100+200=300, Key 3 gets 200 + 300 = 500, and Key 4 gets 200. 

SUM Window Function Partition By Diagram | Image By Author
SUM Window Function Partition By Diagram | Image By Author

Order By

Some window functions are sensitive to the sorting order; for example, ROW_NUMBER() and RANK(). Other aggregation functions are less concerned about the exact ordering; for instance, COUNT() and SUM() (if ROW or Range is provided, it will change and will cover in the next section. ).

If the window function relies on the ordering, it’s essential to write the sorting order down explicitly; Otherwise, the returned result isn’t idempotent (every time you rerun the query, you achieve the same result), and it leads to undesired effects.

In our example, if we’d like to return the row number for each partition and order by date in ascending order, for key = ‘3’ we’d need to reorder to get the row number returned correctly. 

Window Function Order By Diagram | Image By Author
Window Function Order By Diagram | Image By Author

To put SQL down, we have 

					SELECT key, value, date, ROW_NUMBER() Over (PARTITION BY key ORDER BY "Date") AS row_number
FROM window_function_demo;
ROW_NUMBER Window Function Partition By Diagram | Image By Author
ROW_NUMBER Window Function Partition By Diagram | Image By Author

If we look closer to key 3, the date is reordered, and row_number for each row assigned is correct by ascending order.

Range of the Window Function

Partition key provides the logical separation by each key. If we also want to change the range of the window for partition, we can achieve this by specifying PRECEDING or FOLLOWING .

  • PRECEDING defines the number of rows before the current row to included in the window function. If the supplied number is less than the specified one, it takes the maximum it can get. For example, if we have 1 PRECEDING , it includes one row before. If the current row is the first one without preceding row, the preceding row will be ignored (if we have value from the start to end, the sum will treat those missing value as 0 instead of NULL). 
					SELECT key, value, date, 
FROM window_function_demo;

Let’s take a look closer to key = 1. Step 1. since we don’t have the preceding 1 row, it treat the missing row as 0 and assign the current row to 100 (0+100). Step 2 and 3 are similar, and we now have the preceding 1 row. We can take that value and add it to the current value. 

Window Frame PRECEDING | Image By Author
Window Frame PRECEDING | Image By Author
Window Frame PRECEDING Result | Image By Author
Window Frame PRECEDING Result | Image By Author
  • FOLLOWING is the same idea as PRECEDING, but look for the rows after the current row to include in the window function. 

Types of Window Functions

Generally, window functions can be classified into two types: aggregate types, and position-based type.

Aggregate Types

The core idea of the window function is to enhance the current row from aggregated rows. All the aggregate functions can be used as part of the window function. However, not every DBMS have equality adoption to the aggregate function. You can still accomplish this by getting some aggregate function by creating a aggregated temporary table or view, then joining with each row. But it has the limitation that window range is only applicable to all unbounded PRECEDING or FOLLOWING rows

Position-Based Type

Those functions describe the position information of the current row within a given partition (i.e., rank, ntile) or get the value for a given position (lag, lead). Those functions are handy to learn about the relevant position of the current row. Many hard SQL interview questions require you to be savvy on various the position-based type window function. 

Frequent Use Cases for Window Function

  • Remove duplication. I found many times distinct isn’t practical to perform deduplicate since it works only on all selected rows. If you want to remove duplication by a set of keys, we can write a CTE (common table expression) and use theROW_NUMBER function. We can deduplicate PARTITION BY with the desired deduplication order, then filter out the row number that isn’t equal to 1.
					WITH CTE AS (
FROM my_dataset
WHERE rn = 1
  • Get Top N for each group. We are often interested in only the top few rows within a partition. For example, “Give a list of top 3 salespeople for each state”
					WITH CTE AS (
FROM sales
SELECT sales_person, state
WHERE rn <= 3

Final Thoughts

Being Proficiency in SQL window function is a critical step to becoming fluent in SQL. This story is helpful to demystify SQL window function, and provide you a step-by-step guide and sample code on how to think window function clearly.

Thinking in SQL is critical. You can reason why things happen like this and quickly resolve the errors. Additionally, it helps you better structure your code for solving the complex queries in a more organized way.

About Me

I hope my stories are helpful to you. 

For data engineering post, you can also subscribe to my new articles or becomes a referred Medium member that also gets full access to stories on Medium.

In case of questions/comments, do not hesitate to write in the comments of this story or reach me directly through Linkedin or Twitter.

More Articles

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link