Many SQL tutorials mark the window function as the advanced topic. Numerous job interviews on SQL often involve at least one question on window function. Understanding window function is critical for anyone that writes SQL daily (as well as SQL like languages like Spark Dataframe and Pandas). In this story, let’s think in SQL and demystify the window function with examples and diagrams.

## What is Window Function?

Many versions of window function definitions are described by the community or the official DB vendors (MySQL, Postgresql). To be clear, I describe the window function in the following way:

Window Function enables current row to oversee a range of rows, applies function on those rows, and returns result back to the current row.

However, reading the definition will only resonate on how to think and use in SQL. Let’s put the definition aside. We’d need concrete examples and diagrams to explain how the window function works.

## Window Function Example Setup

To explain how window function works, we don’t need fancy joins or complex conditional filters. All we need to do is a few rows that allow us to slide up and down 😃

I will use DuckDb for the illustration since DuckDb doesn’t require many dependencies, and you can get a SQL interface immediately.

We can create a dummy table with three fields: key, value, and date. Then export the table into a CSV file, then read the table by DuckDb CLI.

` ````
```import pandas as pd
data = {"key": [1, 1, 1, 2, 2, 3, 3, 4],
"value": [100, 200, 300, 100, 200, 200, 300, 200],
"date": [1, 2, 3, 1, 2, 2, 1, 1]
}
window_function_demo_df = pd.DataFrame.from_dict(data)
window_function_demo_df.to_csv('~/Downloads/window_function_demo.csv')

Then launch DuckDb CLI then read the table using the magic. `read_csv_auto`

` ````
```./duckdb
--once in the DuckDb CLI, create a table first
CREATE TABLE window_function_demo AS SELECT * FROM read_csv_auto ('~/Downloads/window_function_demo.csv');
--select all columns from the newly created table
SELECT * FROM window_function_demo;

## The Components of the Window Function

A window function can be broken down into three main components:

**Partition By****Order By****Range of the Window Function**

#### Partition By

The “Partition By” clause is for logical grouping. Those keys segregate current rows from the other rows in a logical partition, thus further limiting the window function’s range.

In our example, we have four distinct values under the key. When we apply window function, each value is going to be treated separately. The partition key provides the logical boundary avoid taking values from same window range but different keys. When the cursor reaches the end of a given partition, it will stop instead of jumping into the next partition.

If we have the following window function to sum the field `value`

partition by field `key`

` ````
```SELECT key, value, date, SUM(value) Over (PARTITION BY key) AS total
FROM window_function_demo;

The following result is expected that assign each row with the sum of `value`

for a given partition. Key 1 gets 100+200+300=600, Key 2 gets 100+200=300, Key 3 gets 200 + 300 = 500, and Key 4 gets 200.

#### Order By

Some window functions are sensitive to the sorting order; for example, `ROW_NUMBER()`

and `RANK()`

. Other aggregation functions are less concerned about the exact ordering; for instance, `COUNT()`

and `SUM()`

(if ROW or Range is provided, it will change and will cover in the next section. ).

If the window function relies on the ordering, it’s essential to write the sorting order down explicitly; Otherwise, the returned result isn’t idempotent (*every time you rerun the query, you achieve the same result*), and it leads to undesired effects.

In our example, if we’d like to return the row number for each partition and order by date in ascending order, for `key = ‘3’`

we’d need to reorder to get the row number returned correctly.

To put SQL down, we have

` ````
```SELECT key, value, date, ROW_NUMBER() Over (PARTITION BY key ORDER BY "Date") AS row_number
FROM window_function_demo;

If we look closer to key 3, the date is reordered, and row_number for each row assigned is correct by ascending order.

**Range of the Window Function**

Partition key provides the logical separation by each key. If we also want to change the range of the window for partition, we can achieve this by specifying `PRECEDING`

or `FOLLOWING`

.

`PRECEDING`

defines the number of rows before the current row to included in the window function. If the supplied number is less than the specified one, it takes the maximum it can get. For example, if we have`1 PRECEDING`

, it includes one row before. If the current row is the first one without preceding row, the preceding row will be ignored (if we have value from the start to end, the sum will treat those missing value as 0 instead of NULL).

` ````
```SELECT key, value, date,
SUM(value) Over (PARTITION BY key ORDER BY "date" DESC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS row_number
FROM window_function_demo;

Let’s take a look closer to key = 1. Step 1. since we don’t have the preceding 1 row, it treat the missing row as 0 and assign the current row to 100 (0+100). Step 2 and 3 are similar, and we now have the preceding 1 row. We can take that value and add it to the current value.

`FOLLOWING`

is the same idea as`PRECEDING`

, but look for the rows after the current row to include in the window function.

## Types of Window Functions

Generally, window functions can be classified into two types: aggregate types, and position-based type.

#### Aggregate Types

The core idea of the window function is to enhance the current row from aggregated rows. All the aggregate functions can be used as part of the window function. However, not every DBMS have equality adoption to the aggregate function. You can still accomplish this by getting some aggregate function by creating a aggregated temporary table or view, then joining with each row. But it has the limitation that window range is only applicable to all unbounded `PRECEDING`

or `FOLLOWING`

rows

**Position-Based Type**

Those functions describe the position information of the current row within a given partition (i.e., rank, ntile) or get the value for a given position (lag, lead). Those functions are handy to learn about the relevant position of the current row. Many hard SQL interview questions require you to be savvy on various the position-based type window function.

## Frequent Use Cases for Window Function

**Remove duplication**. I found many times`distinct`

isn’t practical to perform deduplicate since it works only on all selected rows. If you want to remove duplication by a set of keys, we can write a CTE (common table expression) and use the`ROW_NUMBER`

function. We can deduplicate`PARTITION BY`

with the desired deduplication order, then filter out the row number that isn’t equal to 1.

` ````
```WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY key Order BY "date") rn
FROM my_dataset
)
SELECT *
FROM CTE
WHERE rn = 1

**Get Top N for each group**. We are often interested in only the top few rows within a partition. For example, “Give a list of top 3 salespeople for each state”

` ````
```WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY state Order BY revenue DESC) rn
FROM sales
)
SELECT sales_person, state
FROM CTE
WHERE rn <= 3

## Final Thoughts

Being Proficiency in SQL window function is a critical step to becoming fluent in SQL. This story is helpful to demystify SQL window function, and provide you a step-by-step guide and sample code on how to think window function clearly.

Thinking in SQL is critical. You can reason why things happen like this and quickly resolve the errors. Additionally, it helps you better structure your code for solving the complex queries in a more organized way.

#### About Me

I hope my stories are helpful to you.

For data engineering post, you can also **subscribe to my new articles** or becomes a **referred Medium member** that also gets full access to stories on Medium.

#### More Articles

### Here Is What I Learned Using Apache Airflow over 6 Years

Apache Airflow is undoubtedly the most popular open-source project for data engineering for years. It gains popularity at the right time with The Rise Of Data Engineer. Today, I want to share my journey with Airflow and what I learned over 6 years.

### My Life With Depersonalization Derealization Disorder (DPDR)

One of the challenges for me is identifying this blurry feeling. It took me years to find out the name of my issue.

Five years ago, I randomly searched for my symptoms. I found Depersonalization Derealization Disorder

### Bidding War on Housing Market? Let’s Use R For Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a methodology in data science as the initial approach to gain insights by visualizing and summarizing data. We will use some exploratory data analysis technics to find the reason behind the bidding war on the housing market.