Designing a Pipeline

◆

The Brief: E-Commerce Sales Dashboard

An e-commerce company wants a daily dashboard of sales by region. The data lives in three places: orders in a Postgres database, web events in Kafka, and marketing spend in the Salesforce API. The dashboard needs to be ready by 7 a.m. every day, reflecting the previous day's data.

Source layer — Postgres (orders), Kafka (web events), Salesforce API (marketing spend). Three different systems, three different access patterns.
Ingestion — An Airflow DAG that runs at 2 a.m. every night: extracts yesterday's orders via SQL, the day's events from Kafka, and the marketing data via REST API call.
Staging — Raw data lands in S3, partitioned by date, in Parquet format. Nothing is transformed yet. This is the insurance policy: if a transformation breaks, you can rerun it without re-extracting from source.
Transformation — A Spark job joins orders, events, and spend; computes per-region totals; writes a clean fact table back to S3.
Load — The fact table is loaded into Snowflake, ready for query.
Serving — A BI tool (Tableau, Looker, Metabase) queries Snowflake to render the dashboard.
Monitoring — Airflow alerts if any step fails. Data quality checks (row counts within expected ranges, no nulls in key fields) run after the transform step.

DAG Builder

Wave 1

○

Extract: Postgres (orders)

extract

○

Extract: Kafka (events)

extract

○

Extract: Salesforce API

extract

Wave 2

○

Join: orders + events

transform

depends on: Extract, Extract

Wave 3

○

Join: marketing spend

transform

depends on: Join, Extract

Wave 4

○

Quality check: null rates

quality

depends on: Join

Wave 5

○

Load → Snowflake

load

depends on: Quality check

extracttransformloadqualityTasks in same wave run in parallel

Tasks in the same wave run in parallel; each wave waits for the previous to complete.

⚠

The Unsexy 20%: Quality, Lineage, Governance, Cost

Getting a pipeline to run for the first time is 80% of the work. Keeping it running when the company grows (and proving to auditors that it runs correctly!) is the other 20%.

Data quality — schema checks, null-rate monitoring, distribution drift detection. Tools: Great Expectations, dbt tests, Soda. Without automated quality checks, you'll find out about bad data when the CEO asks why the dashboard shows $0 in sales for a region that shipped 10,000 units.
Lineage — given a number on a dashboard, can you trace it back to the source rows that produced it? Lineage tools (DataHub, OpenLineage) make this possible. Regulators and auditors increasingly require it.
Governance — who is allowed to see what, and how do you prove it? Access control isn't a nice-to-have; it's a HIPAA requirement, a GDPR requirement, and a SOC 2 requirement depending on your industry.
Cost — pipelines that worked fine at 100 GB/day can bankrupt you at 10 TB/day. Watch query costs. BigQuery and Snowflake both have pricing models where a single poorly-optimized query can be shockingly expensive.

Checkpoint

At 6:58 a.m., the head of sales reports that the regional dashboard is empty — it shows no data for any region. You're the on-call engineer. In what order do you investigate?

←PreviousThe Cloud WarehousesBuilding Data Pipelines Next→From Notebook to the WorldBuilding ML Pipelines