Sometimes writing SQL can be frustrating, especially when encountering NULL values. It’s common to feel frustrated and think, “This isn’t what I want.” I hope this article can help you better understand these tricky NULL in SQL.
If you write any SQL, you must have experience working with NULL. NULL was created by E. F. Codd, the relational database creator, to address the “missing information” case.
Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database. — Wikipedia
Suppose you come across an online questionnaire that asks you to choose an occupation from a list of options. However, at that moment, you may not have a preferred occupation. In such a situation, you can simply fill in “N/A” as your answer, which will be recorded as NULL in the database to indicate that you did not provide an answer or that your answer was not listed. If you need to query people who did not provide a response, you can use the well-known syntax.
But, ignoring NULL values can have a negative impact on the accuracy of your data analysis. To demonstrate this, you can use http://sqlfiddle.com, a free online SQL editor, which supports MySQL 5.6. Please note that the syntax may vary depending on the database engine used.
Build a Schema To Demonstrate NULL
CREATE TABLE IF NOT EXISTS form ( id int(6) NULL, occupation varchar(200) NULL ); INSERT INTO form (id, occupation) VALUES ('1', 'Accountant'); INSERT INTO form (id, occupation) VALUES ('2', 'Software Engineer'); INSERT INTO form (id, occupation) VALUES ('3', NULL); INSERT INTO form (id, occupation) VALUES ('4', NULL); INSERT INTO form (id, occupation) VALUES (NULL, NULL);
To simplify everything, in this example, we will have a table called
form with only two columns:
occupation , and
occupation It is a nullable column. We will insert four records into this table, the last two with occupation as null to represent unknown or missing values.
SQL NULL in Group By
Let’s run the following script and observe what happens with null
SELECT occupation, COUNT(1) AS cnt FROM form GROUP BY occupation
It is not uncommon to see NULL treated as a distinct category and displayed as (null). SQL can assist in grouping everything into this category, but we are still working to determine their values.
SQL NULL in Order By
SELECT occupation, COUNT(1) AS cnt FROM form GROUP BY occupation ORDER BY occupation
Nothing changed. Since we chose MySQL, are NULLs considered lower than any non-NULL value? But the different implementations of DB can change this behavior. You can refer to Ordering result sets, which have good documentation on the expected behavior. Standard SQL can also use
NULLS Last , which is also implemented in DB like Oracle.
SQL NULL in Filtering
NULL in filtering conditions is error-prone. It is important to use caution when working with NULL values in where/having clause.
Case 1: find out who has an occupation in DB
SELECT id, occupation FROM form WHERE occupation IS NOT NULL
This is the real use case for null you’d need to use
IS to filter the nulls in this case. In the next case will, something could go south.
Case 2: find out which IDs are not Software Engineer as an occupation
SELECT id, occupation FROM form WHERE occupation <> 'Software Engineer'
In this case, the only accountant is returned as it is not a software engineer. But what about IDs 3 and 4? They are not in the list either as SQL use
IS operator for nulls, so it will be ignored whether in case of
<> . You can try to change to
occupation = ‘Software Engineer’ see the result.
Should you want to return the nulls or not is decided by your business logic. It is imperative always to be aware of nulls in your dataset while performing data analysis. If you keep nulls, you can add a new sense to show it.
WHERE occupation <> 'Software Engineer' or occupation is NULL
or convert nulls to non-nulls
WHERE COALESCE(occupation,'') <> 'Software Engineer'
SQL NULL in Concatenation
Concatenation is also a syntax that is likely to be error-prone with nulls. Take a look at this example
SELECT id, CONCAT(occupation, NULL) AS occupation FROM form
Everything is NULL in the end. SQL cannot determine what you want to do when you concatenate a value with a null, so it’s safe to return null, as SQL has no idea how to handle it. A lot of us might want to think of null as an empty string
'' If you want that behavior, use
COALESCE or similar to handling those scenarios.
SQL NULL in Calculation
1+2+3+4+null would be invalid in this case since we don’t know what value is invalid. However, if you’d want to aggregate value with null, most DB has different behavior of nulls with calculation changes.
SELECT SUM(id) FROM form
We got 10 as the final result, so nulls are just ignored. If you apply
AVG , the result is 2.5, which is 10/4, and you can see nulls have been missed again. Whether you want to do
AVG(id) would be a different result, so you’d need to ensure your goal is the same as expected.
SQL NULL in Joins
When joining, it’s possible that null values may be the end result. While it’s important to apply business logic after the join, it’s also crucial to consider how nulls will affect downstream cases. Whether or not to keep nulls will depend on your specific use cases. If you need to join keys that include null values, this should be taken into account.
SELECT * FROM form a INNER JOIN form b on a.id = b.id
With a similar idea with dinner, you won’t see id 5 in this case, as SQL has no idea what to do with nulls. However, if you do left join, as SQL will keep the left records, it would show up as id 5.
In my opinion, null values can sometimes be seen as a shortcut – a way to avoid dealing with specific data points by using a catch-all term. For example, in Java programming, null values are commonly used, but this can lead to the risk of Null Pointer Exceptions. However, null values are a reality in SQL, and it is important to acknowledge and handle them appropriately when working with datasets.
I hope my stories are helpful to you.
Slowly Changing Dimension (SCD) is critical to dimensional modeling. We will discuss the eight types of SCDs. By the end, you will clearly understand each type and be able to differentiate between SCDs in dimensional modeling.
Many modern data orchestration projects like Apache Airflow and Luigi are Python-based. Let’s explore the popular data pipeline orchestration options for R.
Is streaming data necessary for this particular use case? Rather than blindly diving in, it’s essential first to acknowledge the realities of working with streaming data.