3 Hard-Learned Data Lessons That Changed How I Build Pipelines
Sometimes you have to learn through experience...
1. Clean Your Data as Early as Possible
This seems obvious until you're debugging why your executive dashboard shows different numbers than last week's report.
The mistake I made: I wasn't consistent about where transformations were applied. Between timing constraints and inefficient modeling, transformations got bolted on ad hoc wherever they were needed.
The problem: Some transformations happened in silver, others in gold. Case sensitivity was handled differently across layers. Data types were a mess.
When you go back to make a change, those downstream transformations mean that whatever you do in one model has to be replicated consistently across the others, which is inefficient and prone to bugs.
What I do now: Apply all cleaning and standardization at the staging layer, right after ingestion. Every downstream model inherits the same clean, consistent data. The staging layer is 1-1 with source, and should be responsible for light transformations and cleaning.
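As a sketch, a staging model following this pattern might look like the below (the model and column names are hypothetical, and the source definition assumes a `raw` source configured in dbt):

```sql
-- models/staging/stg_customers.sql (hypothetical example of a 1-1 staging model)
SELECT
    CAST(customer_id AS INTEGER)   AS customer_id,    -- standardise types once, here
    UPPER(TRIM(customer_name))     AS customer_name,  -- case and whitespace handled once
    LOWER(TRIM(email))             AS email,
    CAST(created_at AS TIMESTAMP)  AS created_at
FROM {{ source('raw', 'customers') }}
```

Every silver and gold model builds on top of this, so the cleaning logic lives in exactly one place.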
2. The Simplest Solution Usually Wins - Make the most of your toolbox.
If the tools you are using already have good functionality and features, use them.
It can be tempting to over-architect and build solutions from scratch. In reality, this takes time away from the more important work of impacting business decisions, and the result will be less reliable than battle-tested features shipped by organisations with tens or hundreds of engineers behind them.
DBT Macros are a great example of this.
Reality check: Your future self (and your teammates) will thank you for boring, predictable solutions over architectural masterpieces in the long run.
3. Don't Repeat Yourself (Especially in Data Transformations)
Every time you copy-paste transformation logic, you're creating a future bug.
Different teams will modify their copies differently. Data definitions will drift.
This can be better illustrated with a quick example.
Route 1: Strings are cleaned downstream by individual teams.
Route 2: Strings are standardised and cleaned upstream with a shared macro.
With route 1, any discrepancy between teams' implementations will cause models to break, and it will be difficult to debug why. On top of that, there is more code than necessary.
Stick with route 2.
This is where DBT macros become your best friend.
A Macro That Saves Me Hours Every Week
Here's a string cleaning macro I use across my staging models.
{% macro clean_string(column_name, nullify_empty=true) %}
    {% if nullify_empty %}
        CASE
            WHEN {{ column_name }} IS NULL OR TRIM({{ column_name }}) = ''
                THEN NULL
            ELSE UPPER(TRIM({{ column_name }}))
        END
    {% else %}
        UPPER(TRIM({{ column_name }}))
    {% endif %}
{% endmacro %}
Why this works
Handles nulls and empty strings consistently
Standardizes case and removes whitespace
One definition, used everywhere
No more "why is this customer showing up twice?" debugging sessions
Usage Example
SELECT
    {{ clean_string('customer_name') }} AS customer_name,
    {{ clean_string('product_category', false) }} AS category
FROM raw_sales
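For reference, with the default `nullify_empty=true`, the first call compiles to something like this (column name taken from the usage example above):

```sql
CASE
    WHEN customer_name IS NULL OR TRIM(customer_name) = ''
        THEN NULL
    ELSE UPPER(TRIM(customer_name))
END AS customer_name
```

If that definition ever needs to change, you change the macro once and every model picks it up on the next run.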
What lessons have you learned the hard way? I'm always curious about other engineers' war stories.
P.S. - If you're building data systems for logistics or e-commerce, I'd love to connect. Drop me a line at nevada@elio.agency or find me on LinkedIn.

