
Data Testing: The Data Docks

Author: Mats Van Audenaeren

Date: 28/07/2025


Every day, data arrives at our systems like ships coming into a busy harbor. Some carry containers, all labeled and packed just right. Others? Not so much. Think mismatched crates, missing tags, and mystery boxes that no one remembers ordering. That’s where data testing comes in.

 

The job is to check that every piece of data is clean, correctly labeled, and ready for its journey, whether it’s getting transformed, moved, or handed off to another system. In short: We test the data. But this isn’t about writing fancy algorithms or building dashboards with shiny graphs. This is about something much simpler, but just as important. It’s about making sure that what goes in is the same as what comes out.

If you’ve ever read a report and thought “Huh, that doesn’t look right,” chances are, something went wrong in the data pipeline. Maybe a column got dropped, a value was transformed incorrectly, or data types didn’t match. These tiny things can cause big problems: making bad decisions based on bad data. That’s why we put checkpoints in place. Little inspections to make sure every step of the data pipeline is working as it should.

In this blog, I’ll walk you through the main types of data tests we run using SQL (and sometimes a bit of Python). We’ll keep it light, use simple examples, and follow our data like a shipping container: from its arrival, through its transformations, to its final destination.

 

Name Tags: Do the Tables Look Right?

Before we even peek inside the data, we need to make sure the tables and columns are named properly, and in the right order. It might sound boring, but trust me, this is one of the easiest ways things go wrong. Data is picky. Even small differences in naming or column order can break pipelines and confuse analysts. These tests are simple but powerful. By catching issues early, we avoid bigger problems later when transformations, reports, or migrations are already underway. It’s like checking the label before you open the container. Basic, but essential.

Let’s start by testing the container labels, or in technical terms: table names. Imagine unloading a ship and seeing boxes labeled:

- Cust_info_2022

- customer_information

- custinfo

- ci

They might all contain customer data… or not. That’s why one of the first tests is to check table names. We want to make sure they match the naming conventions defined in the analysis and are consistent between systems (especially after migrations). We can test this with a simple SQL query. If there’s no result, something is wrong:

 

SELECT table_name

FROM information_schema.tables

WHERE table_name = 'your_expected_table_name';
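You can also flip this around and check a whole expected inventory in one go. Here’s a Postgres-flavored sketch (the table names in the VALUES list are just placeholders, swap in your own): any expected table that’s missing shows up in the result:

-- Expected tables that are nowhere to be found
SELECT expected.table_name
FROM (VALUES ('customers'), ('orders'), ('products')) AS expected(table_name)
LEFT JOIN information_schema.tables t
  ON t.table_name = expected.table_name
WHERE t.table_name IS NULL;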

Once we’ve confirmed the table names, we open up the containers and check the columns. Imagine you’re looking for a column called email, and instead you find: e_mail, EmailAddress or mail. Or worse, all the columns are in a different order. That might not break things right away, but when someone joins two tables or runs a script expecting column #3 to be email, things fall apart. So we test that.

 

The tests are simple:

1. Are the column names correct?

2. Are the columns in the expected order?

 

How? Even simpler: select everything from your table and check whether the results differ from what’s expected:

 

SELECT *

FROM your_table;
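Eyeballing a SELECT * works, but you don’t have to: the same information_schema we used for table names also lists every column together with its position. Compare the result to the expected list from the analysis:

-- Column names and their order, straight from the metadata
SELECT column_name, ordinal_position
FROM information_schema.columns
WHERE table_name = 'your_table'
ORDER BY ordinal_position;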

 

Type Check: Are the Labels Correct?

 

Alright, we’ve confirmed the boxes are labeled right and the contents are in the right order. Time to open one up and see what’s inside. Because sometimes a box labeled “10kg bananas” is actually full of mangoes…

 

You should always be aware of data impersonation. In the data world, this happens when a column looks like it should hold numbers, but it’s actually stored as text. Or maybe it’s supposed to be a date, but it’s just a string that looks like one. So we test a very important aspect: Do the column data types make sense? And are they what the analysis expects them to be?

 

Imagine we have a column signup_date. Analysis expects it to be a date type. You fetch it with a simple query:

 

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'your_table_name'
AND column_name = 'signup_date';

 

You take a look at the results and you see that the column signup_date is actually a VARCHAR. Big trouble! Flag it. Always check that data types match what’s expected. Don’t assume that because a column looks right, it is right. Catching type mismatches early avoids subtle bugs later.
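If you want to go one step further, you can hunt for the values that would actually break a conversion. A Postgres-flavored sketch, assuming the dates are expected to look like YYYY-MM-DD:

-- Values in the text column that don't even look like dates
SELECT signup_date
FROM your_table_name
WHERE signup_date !~ '^\d{4}-\d{2}-\d{2}$';

Anything this returns would fail the moment someone tries to cast the column to a real date.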

 

ID Tags: Who’s Who in the Data?

 

Every good shipment has tracking numbers. Without them, things get lost. Two packages with the same number? Chaos. One package with no number? Untraceable. In the world of data, keys do this job. They’re how we keep track of who's who.

 

Primary keys are the one tag to rule them all. A primary key makes sure every row in a table is unique. Think of it like a passport: no two people should share one. If two customers in your database have the same id, you’ve got a problem.

 

First, fetch the primary key and check that it’s placed on the correct column(s). (This query follows MySQL’s convention, where the primary key constraint is always named 'PRIMARY'; other databases name their constraints differently.)

 

SELECT column_name
FROM information_schema.key_column_usage
WHERE constraint_name = 'PRIMARY'
AND table_name = 'your_table';

 

Compare this result to what’s expected from the analysis, and flag it if they don’t match. You can also test this by counting duplicates on the column(s) that are expected to form the primary key:

 

SELECT id, COUNT(*)

FROM customers

GROUP BY id

HAVING COUNT(*) > 1;

 

If this query returns anything at all… Yikes. You’ve got duplicate IDs.
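
Duplicates aren’t the only failure mode, though. A column that’s supposed to be a primary key should never be empty either; a package with no tracking number is just as lost as two packages sharing one:

-- Rows with no ID at all
SELECT COUNT(*)
FROM customers
WHERE id IS NULL;

If that count is anything but zero, flag it too.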

Did the Transformation Do Its Job?

 

Okay, the cargo has passed the basics: names are right, labels match, keys are in place. Now comes the real action. In most data pipelines, raw data doesn’t stay raw. It gets cleaned, shaped, and transformed. Turned into something more useful. Think: chopping veggies before a meal. We should definitely test that!

 

Let’s say we have this raw data in a users_raw table:

user_id | birth_date
1       | 1990-01-01
2       | 2000-06-30

 

And in our transformed table users_clean, we expect:

user_id | age
1       | 35
2       | 25

 

Looks like simple logic: subtract birth year from current year. But maybe someone wrote the transformation slightly wrong and used today’s day of the year, or forgot leap years. How can we test this?

 

SELECT

 r.user_id,

 EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM r.birth_date) AS expected_age,

 c.age AS actual_age

FROM users_raw r

JOIN users_clean c ON r.user_id = c.user_id

WHERE (EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM r.birth_date)) <> c.age;

 

This test shows you exactly where transformed values don’t match what you expect. Don’t just re-run the same transformation code and compare the output; it’ll always match! Instead, write logic that independently checks what the result should be, then compare. Try to write your logic without looking at the implementation. Transformations are where a lot of sneaky bugs happen.
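
One caveat: subtracting years, like the query above does, is itself a simplification; it ignores whether someone’s birthday has already passed this year. If the transformation is supposed to compute an exact age, the independent check needs to be exact too. In PostgreSQL, for example, AGE() takes care of that (a sketch, assuming the same tables):

-- PostgreSQL-flavored: AGE() returns an interval, so extracting
-- the year gives completed years, birthdays and leap years included
SELECT
 r.user_id,
 EXTRACT(YEAR FROM AGE(CURRENT_DATE, r.birth_date)) AS expected_age,
 c.age AS actual_age
FROM users_raw r
JOIN users_clean c ON r.user_id = c.user_id
WHERE EXTRACT(YEAR FROM AGE(CURRENT_DATE, r.birth_date)) <> c.age;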

 

Cross-Dock Migration: Did the Shipment Survive the Trip?

 

Alright, our data’s packed, labeled, and transformed. Now it’s time for one of the most nerve-wracking parts of the journey: moving the data from one system to another. AKA: data migration. This is like moving containers from one ship to another at a busy port: fast, precise, and nothing can go missing. But it does. A lot. So, we test it. No matter how the data moves, we check one thing: did all the data make it, and is it the same?

 

Now, where do we start? Row counts! Your first line of defense:

 

-- Source
SELECT COUNT(*) FROM your_source_table;

-- Target
SELECT COUNT(*) FROM your_target_table;

 

If the numbers don’t match, don’t even bother checking the rest. Something is already broken. But even if they do match, that’s no proof that all the data is still the same. Next up: compare the actual data. You’d be surprised how many times the row count is fine, but the values are off.

If I can, I use a full SQL EXCEPT or MINUS between the tables (EXCEPT in standard SQL and PostgreSQL, MINUS in Oracle):

-- Show rows in source not in target

SELECT * FROM source_db.customers

MINUS

SELECT * FROM target_db.customers;

 

-- And vice versa

SELECT * FROM target_db.customers

MINUS

SELECT * FROM source_db.customers;

 

When that’s not possible (like when the systems run on different tech), I export the data and compare the files in a simple tool like Notepad++. If the data is way too big for that, load both exports into Python and compare there.
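
For that Python route, here’s a minimal sketch using pandas, assuming both systems exported CSV files with the same columns (the file names are placeholders):

import pandas as pd

# Load both exports (placeholder file names)
source = pd.read_csv("source_customers.csv")
target = pd.read_csv("target_customers.csv")

# Outer merge on all shared columns with an indicator column:
# 'left_only' rows exist only in the source, 'right_only' only in the target
diff = source.merge(target, how="outer", indicator=True)
print(diff[diff["_merge"] != "both"])

Note that, like MINUS, this is a set comparison: pair it with the row-count test, because duplicated rows can hide from a merge like this.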

 

Bad migrations are silent killers. They might not crash anything, but they’ll slowly eat away at trust. That’s why we treat every migration like an audit: nothing leaves the dock without being checked.

 

Why Data Testing Matters

 

Data testing isn’t about perfection, it’s about trust. A small mistake today can snowball into broken dashboards, bad decisions, or lost credibility tomorrow. By putting checks in place before, during, and after each step, we keep the data journey safe, reliable, and traceable.

It’s the foundation everything else rests on. Like a harbor master who inspects every ship before it sails, we make sure our data is ready for what comes next. And when we do it right, no one notices, because everything just works.

So, next time a dashboard looks off, or a migration is coming up, think like the data docks: Check the names. Check the labels. Count the crates. Test the transformations. And above all, never let a mystery box sneak through.

 

Keep your data clean.