ETL: The Central Abstraction

Behind every dashboard you've ever seen, every "daily KPI" email an executive receives, every training dataset that mysteriously appears in your team's S3 bucket — there is a pipeline. Almost always, that pipeline follows the same three-step pattern.

ETL stands for Extract, Transform, Load.

Extract → Transform → Load
Step 1
Extract
Pull data from its sources — no cleaning, just movement.
The source could be a Postgres database, a SaaS API (Salesforce, Stripe, HubSpot), a flat file dropped into S3, or a real-time event stream from Kafka. Extract doesn't transform anything; it just moves data from where it lives to where the pipeline can reach it.
Examples
Postgres · APIs · S3 · Kafka
Step 2
Transform
Clean, validate, deduplicate, join, reshape. Business logic lives here.
Convert the messy reality of your sources into a consistent, queryable format. "Orders in euros need to be converted to USD." "Null zip codes get the state's default." "Duplicate user records from before the 2022 migration need to be merged." This is where the real engineering work happens.
Examples
Clean · Validate · Join · Reshape
Step 3
Load
Write the result into the target — usually a warehouse or feature store.
The load step is often incremental: you append new records rather than rewriting everything. The target could be a data warehouse like Snowflake or BigQuery, a feature store for ML, or a downstream database serving a product.
Examples
Snowflake · BigQuery · Feature stores

Every data pipeline — from a nightly dashboard refresh to 10 billion sensor events per day — is a version of ETL. The core operations stay the same; the tools change when the scale changes.

ELT: The Warehouse-First Variant

As data warehouses became more powerful (and cheaper per query), a variation emerged: ELT — Extract, Load, then Transform. You dump raw data into the warehouse first and run transformations inside the warehouse using SQL. Snowflake and BigQuery both encourage this pattern. Tools like dbt (data build tool) have made it enormously popular.


The trade-off: ELT is simpler to operate — you're writing SQL, not Python pipeline code — but harder to keep clean. The "raw" layer can sprawl if nobody enforces standards. In practice, most modern teams run ELT for structured sources and ETL for heavier transformations on unstructured data.

Checkpoint

A team decides to use ELT instead of ETL for a new analytics pipeline. They load raw JSON from their API into BigQuery, then run SQL transformations to clean and aggregate it. What is the most significant trade-off they've made?