The Foundation of Data Validation


Although it may not be the most glamorous aspect of data work, data validation is crucial to any data-related task. It can also be tedious. When you think of data validation, what is the first thing that comes to mind? Endless spreadsheets? SQL with multiple layers of subqueries?

I wish I could say data validation is fun. Unfortunately, it is not a paved road. If you are reading this post, you have probably faced the challenge of data validation before, or you may be struggling with it right now. My goal in this post is to share my experience with data validation, including the principles and methods I have found helpful.

What is Data Validation?

Data validation is key to unlocking the potential of any system. It gives you evidence that the data your system produces meets expectations, so you can be confident the system is performing well and delivering results you can rely on.

Data validation is an essential aspect of our daily lives. It is everywhere, from the information we consume to the decisions we make.

  • I allow my child to watch 20 minutes of TV per day. When the time is up, my child checks the clock to confirm the duration. That is data validation.
  • You order five dishes from Uber Eats and then verify that you received all five. That is data validation, too.

Now let's look at a more complex one: suppose you sell T-shirts on Amazon, and your daily sales total $1,000, with 100 T-shirts sold at $10 each. Will you actually receive $1,000 from Amazon?

Have you considered Amazon's storage fees, returned and damaged items, or coupons that reduce the price for some customers? You probably need a system to track each order; when Amazon pays a lump sum, you check Amazon's transactions against your system to ensure every payment is accounted for. In accounting, this is usually called reconciliation. It is data validation again!
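As an illustration only, a minimal reconciliation sketch in SQL might compare what you expect against what was paid, assuming hypothetical tables my_orders and amazon_settlement that share an order_id:

-- Hypothetical tables: my_orders holds what you expect to be paid per order,
-- amazon_settlement holds what Amazon actually paid.
SELECT
  o.order_id,
  o.expected_amount,
  COALESCE(s.paid_amount, 0) AS paid_amount,
  o.expected_amount - COALESCE(s.paid_amount, 0) AS difference
FROM my_orders o
LEFT JOIN amazon_settlement s ON o.order_id = s.order_id
WHERE o.expected_amount <> COALESCE(s.paid_amount, 0)

Every row returned is a candidate for an explanation: a fee, a return, or a coupon.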

On the engineering side, system migration is another pain point for data validation! 

Why is Data Validation Hard?

When we reach for data validation, it is usually because we feel less than confident in the system we are looking at. We want validation to give us signals that the output quality is high.

Data validation typically aims for a high match rate, and the happy path is usually already covered by the system itself. That means the starting point of data validation is often "Hard Mode": what remains to be explained are the edge cases.

The second challenge is identifying and fixing the gaps. Much of the data is domain-specific, and domain experts are needed to dig out the potential gaps and close them.

Data Validation Process

Domain-specific knowledge does not generalize, but the overall data validation process does.

So when you need to perform data validation, what does the process usually look like?

Step 1: Set your goal

What is the purpose of your validation? The goal is critical to the success of data validation; setting it up front is also good practice to avoid falling into the rabbit hole of chasing a 100% match rate.

Are you doing a unit-test style validation (checking input against expected output), or comparing the values between two datasets?

Step 2: Set up your workspace folder and track the process log

Light data validation can be accomplished by writing a few lines of SQL on an ad-hoc basis. However, to dive deep into a complex validation problem, setting up a dedicated environment for your workspace and a log journal to track the process is a good idea.

Step 3: Set up the PRIMARY key for data validation

Any entry into the dataset requires a signature: a primary key. There are a couple of benefits of having a primary key: 

  • Remove duplication. To perform data validation efficiently, you need a strategy for handling duplicates. Instead of checking whether an entire row is duplicated, it is common to deduplicate on the primary key. I usually use the window function ROW_NUMBER and keep only the first/last row per key, as in the snippet below.

WITH cte AS (
  SELECT my_columns, ROW_NUMBER() OVER (PARTITION BY my_primary_keys ORDER BY my_sort_condition) AS n
  FROM my_dataset
  WHERE date = DATE '2024-04-28'
)
SELECT my_columns
FROM cte
WHERE n = 1
  • Perform sanity checks after joining on the primary key. A common issue in data validation is losing track of the data after a join, which leads to errors, especially when the datasets have a one-to-many or many-to-many relationship. A good strategy is to use the primary key after the join to run a few quick statistics, such as row count, duplicate key count, or the sum of key columns before and after; errors usually surface from these checks. See the sketch below.
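As a rough sketch, the before-and-after checks could look like this, with left_dataset, joined_result, my_primary_key, and amount as placeholder names:

-- Row count and distinct key count on one input
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT my_primary_key) AS distinct_keys
FROM left_dataset;

-- The same statistics plus a column sum on the joined result
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT my_primary_key) AS distinct_keys,
       SUM(amount) AS total_amount
FROM joined_result;
-- If row_count grows much faster than distinct_keys, the join has fanned out.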

Step 4: Set the comparison rule and classify the results

A perfect classification would put 100% of the rows into a single matched bucket. In real-life data validation, that is rarely achievable.

To identify the unmatched items, we need to categorize them. Once we have the categories, we can break down the unmatched items into smaller subcategories. This will help us understand how far we are from reaching 100% and what areas to focus on next.
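One way to encode a comparison rule and its categories is a CASE expression. This is only a sketch; source_dataset, target_dataset, and amount are hypothetical names:

SELECT
  a.my_primary_key,
  CASE
    WHEN b.my_primary_key IS NULL THEN 'missing_in_target'
    WHEN a.amount = b.amount THEN 'matched'
    WHEN ABS(a.amount - b.amount) < 0.01 THEN 'rounding_difference'
    ELSE 'unexplained_mismatch'
  END AS match_category
FROM source_dataset a
LEFT JOIN target_dataset b ON a.my_primary_key = b.my_primary_key

The matched bucket is one category; the rest are unmatched subcategories you can drill into.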

Step 5: Generate the data validation result in a repeatable manner

This step is the actual work of data validation: develop a repeatable process that generates the validation result using the comparison rule and the classification. The underlying tool matters less; you could write SQL or Python, or even build a spreadsheet. Repeatability is what is critical, because it makes your data validation process shareable and ready for the next iteration after each bug fix.
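Assuming the classified rows from the previous sketch are saved to a table called validation_result (a hypothetical name), the repeatable summary can be a single query:

SELECT
  match_category,
  COUNT(*) AS row_count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM validation_result
GROUP BY match_category
ORDER BY row_count DESC

Re-running this report after each fix shows whether the matched bucket is moving toward the goal you set in Step 1.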

Step 6: Fix the issue

The data validation result and unmatched classification give domain experts a hint on the issue. Now, it’s time to make a fix with the guidance from the data validation result.

Step 7: Repeat steps 5 and 6 until you reach the goal you set in Step 1

Data Validation Method

There are a couple of data validation methods I recommend using:

EDA and Visualization: Typically, when I face a new dataset, my first step is to perform exploratory data analysis (EDA) to get a sense of the data profile. Multivariate graphical techniques help me quickly identify significant differences or areas that require further attention.

EDA is not overkill for data validation tasks. Since data validation requires a deep understanding of the data’s shape, having EDA as part of our toolbox is essential.
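Even a plain SQL profile query can serve as lightweight EDA before any plotting; the column names below are placeholders:

SELECT
  COUNT(*) AS row_count,
  COUNT(DISTINCT my_primary_key) AS distinct_keys,
  SUM(CASE WHEN my_numeric_column IS NULL THEN 1 ELSE 0 END) AS null_count,
  MIN(my_numeric_column) AS min_value,
  MAX(my_numeric_column) AS max_value,
  AVG(my_numeric_column) AS avg_value
FROM my_dataset

Running the same profile on both datasets and comparing the numbers is often the fastest way to spot a large discrepancy.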


Row-by-row comparison

Row-by-row comparison works best when you can obtain the detailed rows of each dataset. After properly joining the datasets, we can compare each column to see whether the values match. This gives a lot of visibility and flexibility for data validation.

Note that there are a couple of cases that require extra attention as they could affect your final result:

  • NULL and Empty String
  • Double / Float precision
  • Letter case

Because row-by-row comparison requires a join, it can be resource-intensive for large datasets. One strategy is to apply the same sampling strategy on both sides first, then perform the final validation on the samples.
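A minimal row-by-row sketch using a FULL OUTER JOIN might look like this, with source_dataset, target_dataset, status, and amount as hypothetical names; it also normalizes the cases listed above:

SELECT
  COALESCE(a.my_primary_key, b.my_primary_key) AS my_primary_key,
  a.status AS source_status,
  b.status AS target_status,
  a.amount AS source_amount,
  b.amount AS target_amount
FROM source_dataset a
FULL OUTER JOIN target_dataset b ON a.my_primary_key = b.my_primary_key
WHERE COALESCE(LOWER(a.status), '') <> COALESCE(LOWER(b.status), '')     -- letter case, NULL vs empty string
   OR COALESCE(ROUND(a.amount, 2), 0) <> COALESCE(ROUND(b.amount, 2), 0) -- float precision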

Aggregation comparison

Aggregation comparison checks that the overall shape of the dataset and its key metrics do not change dramatically. There may be a few outliers, but the aggregation comparison gives us a sense of whether the key metrics are impacted.

To do this, group the data by specific keys and then apply standard aggregation functions such as SUM, COUNT, AVG, PERCENTILE, etc. This approach is particularly useful for large datasets, where the aggregates are unlikely to be moved by a handful of rows. For smaller groups, however, it is essential to look closely and confirm they still meet your expectations.
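As an illustration only, an aggregation comparison could group both sides by a date key (order_date and amount here are placeholders) and compare counts and sums:

SELECT
  COALESCE(a.order_date, b.order_date) AS order_date,
  a.order_count AS source_orders,
  b.order_count AS target_orders,
  a.total_amount AS source_amount,
  b.total_amount AS target_amount
FROM (
  SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
  FROM source_dataset
  GROUP BY order_date
) a
FULL OUTER JOIN (
  SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
  FROM target_dataset
  GROUP BY order_date
) b ON a.order_date = b.order_date
WHERE COALESCE(a.order_count, 0) <> COALESCE(b.order_count, 0)
   OR COALESCE(a.total_amount, 0) <> COALESCE(b.total_amount, 0)

Any date that shows up here is a place where the two datasets disagree at the aggregate level.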

Final Thought

I hope this article provides a clear understanding of the basics of data validation and its methodology. Data validation can be challenging, but I enjoy it like a puzzle. Although it can be discouraging when I get stuck, the feeling of satisfaction after I figure it out makes it all worth it.

About Me

I hope my stories are helpful to you. 

For data engineering posts, you can also subscribe to my new articles or become a referred Medium member, which also gives you full access to stories on Medium.

In case of questions/comments, do not hesitate to write in the comments of this story or reach me directly through Linkedin or Twitter.

