ETL: The Central Abstraction

Behind every dashboard you've ever seen, every "daily KPI" email an executive receives, every training dataset that mysteriously appears in your team's S3 bucket — there is a pipeline. Almost always, that pipeline follows the same three-step pattern.

ETL stands for Extract, Transform, Load.

Extract → Transform → Load

Step 1

Extract

Pull data from its sources — no cleaning, just movement.

The source could be a Postgres database, a SaaS API (Salesforce, Stripe, HubSpot), a flat file dropped into S3, or a real-time event stream from Kafka. Extract doesn't transform anything; it just moves data from where it lives to where the pipeline can reach it.

Examples

Postgres · APIs · S3 · Kafka

Step 2

Transform

Clean, validate, deduplicate, join, reshape. Business logic lives here.

Convert the messy reality of your sources into a consistent, queryable format. "Orders in euros need to be converted to USD." "Null zip codes get the state's default." "Duplicate user records from before the 2022 migration need to be merged." This is where the real engineering work happens.

Examples

Clean · Validate · Join · Reshape

Step 3

Load

Write the result into the target — usually a warehouse or feature store.

The load step is often incremental: you append new records rather than rewriting everything. The target could be a data warehouse like Snowflake or BigQuery, a feature store for ML, or a downstream database serving a product.

Examples

Snowflake · BigQuery · Feature stores

Every data pipeline — from a nightly dashboard refresh to 10 billion sensor events per day — is a version of ETL. The core operations stay the same; the tools change when the scale changes.

ℹ

ELT: The Warehouse-First Variant

As data warehouses became more powerful (and cheaper per query), a variation emerged: ELT — Extract, Load, then Transform. You dump raw data into the warehouse first and run transformations inside the warehouse using SQL. Snowflake and BigQuery both encourage this pattern. Tools like dbt (data build tool) have made it enormously popular.

The trade-off: ELT is simpler to operate — you're writing SQL, not Python pipeline code — but harder to keep clean. The "raw" layer can sprawl if nobody enforces standards. In practice, most modern teams run ELT for structured sources and ETL for heavier transformations on unstructured data.

Checkpoint

A team decides to use ELT instead of ETL for a new analytics pipeline. They load raw JSON from their API into BigQuery, then run SQL transformations to clean and aggregate it. What is the most significant trade-off they've made?

←PreviousWhat is Data Engineering?Building Data Pipelines Next→Orchestration and DAGsBuilding Data Pipelines