Demystifying Null in SQL: A Comprehensive Guide for Data Professionals

Photo by Sunder Muthukumaran on Unsplash
Photo by Sunder Muthukumaran on Unsplash

Sometimes writing SQL can be frustrating, especially when encountering NULL values. It’s common to feel frustrated and think, “This isn’t what I want.” I hope this article can help you better understand these tricky NULL in SQL.

If you write any SQL, you must have experience working with NULL. NULL was created by E. F. Codd, the relational database creator, to address the “missing information” case.

Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database. — Wikipedia

Suppose you come across an online questionnaire that asks you to choose an occupation from a list of options. However, at that moment, you may not have a preferred occupation. In such a situation, you can simply fill in “N/A” as your answer, which will be recorded as NULL in the database to indicate that you did not provide an answer or that your answer was not listed. If you need to query people who did not provide a response, you can use the well-known syntax.

But, ignoring NULL values can have a negative impact on the accuracy of your data analysis. To demonstrate this, you can use http://sqlfiddle.com, a free online SQL editor, which supports MySQL 5.6. Please note that the syntax may vary depending on the database engine used.

Build a Schema To Demonstrate NULL

				
					CREATE TABLE IF NOT EXISTS form (
  id int(6) NULL,
  occupation varchar(200) NULL
);
INSERT INTO form (id, occupation) VALUES ('1', 'Accountant');
INSERT INTO form (id, occupation) VALUES ('2', 'Software Engineer');
INSERT INTO form (id, occupation) VALUES ('3', NULL);
INSERT INTO form (id, occupation) VALUES ('4', NULL);
INSERT INTO form (id, occupation) VALUES (NULL, NULL);
				
			

To simplify everything, in this example, we will have a table called form with only two columns: id and occupation , and occupation It is a nullable column. We will insert four records into this table, the last two with occupation as null to represent unknown or missing values.

SQL NULL in Group By

Let’s run the following script and observe what happens with null

				
					SELECT occupation, COUNT(1) AS cnt
FROM form
GROUP BY occupation
				
			
NULL in Group By
NULL in Group By

It is not uncommon to see NULL treated as a distinct category and displayed as (null). SQL can assist in grouping everything into this category, but we are still working to determine their values.

SQL NULL in Order By

				
					SELECT occupation, COUNT(1) AS cnt
FROM form
GROUP BY occupation
ORDER BY occupation
				
			
Null in Order By
Null in Order By

Nothing changed. Since we chose MySQL, are NULLs considered lower than any non-NULL value? But the different implementations of DB can change this behavior. You can refer to Ordering result sets, which have good documentation on the expected behavior. Standard SQL can also use NULLS Last , which is also implemented in DB like Oracle.

SQL NULL in Filtering

NULL in filtering conditions is error-prone. It is important to use caution when working with NULL values in where/having clause.

Case 1: find out who has an occupation in DB

				
					SELECT id, occupation
FROM form
WHERE occupation IS NOT NULL
				
			
find out who has an occupation in DB
find out who has an occupation in DB

This is the real use case for null you’d need to use IS to filter the nulls in this case. In the next case will, something could go south.

Case 2: find out which IDs are not Software Engineer as an occupation

				
					SELECT id, occupation
FROM form
WHERE occupation <> 'Software Engineer'
				
			
find out which IDs are not Software Engineer as an occupation.
find out which IDs are not Software Engineer as an occupation.

In this case, the only accountant is returned as it is not a software engineer. But what about IDs 3 and 4? They are not in the list either as SQL use IS operator for nulls, so it will be ignored whether in case of = or <> . You can try to change to occupation = ‘Software Engineer’ see the result.

Should you want to return the nulls or not is decided by your business logic. It is imperative always to be aware of nulls in your dataset while performing data analysis. If you keep nulls, you can add a new sense to show it.

				
					WHERE occupation <> 'Software Engineer' or occupation is NULL
				
			

or convert nulls to non-nulls

				
					WHERE COALESCE(occupation,'') <> 'Software Engineer'
				
			

SQL NULL in Concatenation

Concatenation is also a syntax that is likely to be error-prone with nulls. Take a look at this example

				
					SELECT id, CONCAT(occupation, NULL) AS occupation
FROM form
				
			
SQL Null in Concatenation
SQL Null in Concatenation

Everything is NULL in the end. SQL cannot determine what you want to do when you concatenate a value with a null, so it’s safe to return null, as SQL has no idea how to handle it. A lot of us might want to think of null as an empty string '' If you want that behavior, use COALESCE or similar to handling those scenarios.

SQL NULL in Calculation

1+2+3+4+null would be invalid in this case since we don’t know what value is invalid. However, if you’d want to aggregate value with null, most DB has different behavior of nulls with calculation changes.

				
					SELECT SUM(id)
FROM form
				
			

We got 10 as the final result, so nulls are just ignored. If you apply AVG , the result is 2.5, which is 10/4, and you can see nulls have been missed again. Whether you want to do SUM(id)/Count(1) or AVG(id) would be a different result, so you’d need to ensure your goal is the same as expected.

SQL NULL in Joins

				
					SELECT *
FROM form a
INNER JOIN form b on a.id = b.id
				
			
SQL NULL in Joins​
SQL NULL in Joins​

With a similar idea with dinner, you won’t see id 5 in this case, as SQL has no idea what to do with nulls. However, if you do left join, as SQL will keep the left records, it would show up as id 5.

Final Thought

In my opinion, null values can sometimes be seen as a shortcut – a way to avoid dealing with specific data points by using a catch-all term. For example, in Java programming, null values are commonly used, but this can lead to the risk of Null Pointer Exceptions. However, null values are a reality in SQL, and it is important to acknowledge and handle them appropriately when working with datasets.

About Me

I hope my stories are helpful to you. 

For data engineering post, you can also subscribe to my new articles or becomes a referred Medium member that also gets full access to stories on Medium.

In case of questions/comments, do not hesitate to write in the comments of this story or reach me directly through Linkedin or Twitter.

More Articles

Photo by Donald Tran on Unsplash

Unlocking the Secrets of Slowly Changing Dimension (SCD): A Comprehensive View of 8 Types

Slowly Changing Dimension (SCD) is critical to dimensional modeling. We will discuss the eight types of SCDs. By the end, you will clearly understand each type and be able to differentiate between SCDs in dimensional modeling.

Read More →
Photo by Daria Nepriakhina 🇺🇦 on Unsplash

5 Fantastic Data Pipeline Orchestration Tools For R

Many modern data orchestration projects like Apache Airflow and Luigi are Python-based. Let’s explore the popular data pipeline orchestration options for R.

Read More →
Photo by Stephen Leonardi on Unsplash

Streaming Data Is Exciting: What You Need to Know Before Taking the Plunge

Is streaming data necessary for this particular use case? Rather than blindly diving in, it’s essential first to acknowledge the realities of working with streaming data.

Read More →

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link