Designing a Pipeline

The Brief: E-Commerce Sales Dashboard

An e-commerce company wants a daily dashboard of sales by region. The data lives in three places: orders in a Postgres database, web events in Kafka, and marketing spend in the Salesforce API. The dashboard needs to be ready by 7 a.m. every day, reflecting the previous day's data.


  1. Source layer — Postgres (orders), Kafka (web events), Salesforce API (marketing spend). Three different systems, three different access patterns.
  2. Ingestion — An Airflow DAG that runs at 2 a.m. every night: extracts yesterday's orders via SQL, the day's events from Kafka, and the marketing data via REST API call.
  3. Staging — Raw data lands in S3, partitioned by date, in Parquet format. Nothing is transformed yet. This is the insurance policy: if a transformation breaks, you can rerun it without re-extracting from source.
  4. Transformation — A Spark job joins orders, events, and spend; computes per-region totals; writes a clean fact table back to S3.
  5. Load — The fact table is loaded into Snowflake, ready for query.
  6. Serving — A BI tool (Tableau, Looker, Metabase) queries Snowflake to render the dashboard.
  7. Monitoring — Airflow alerts if any step fails. Data quality checks (row counts within expected ranges, no nulls in key fields) run after the transform step.
DAG Builder
Wave 1
Extract: Postgres (orders)
extract
Extract: Kafka (events)
extract
Extract: Salesforce API
extract
Wave 2
Join: orders + events
transform
depends on: Extract, Extract
Wave 3
Join: marketing spend
transform
depends on: Join, Extract
Wave 4
Quality check: null rates
quality
depends on: Join
Wave 5
Load → Snowflake
load
depends on: Quality check
extracttransformloadqualityTasks in same wave run in parallel
Tasks in the same wave run in parallel; each wave waits for the previous to complete.

The Unsexy 20%: Quality, Lineage, Governance, Cost

Getting a pipeline to run for the first time is 80% of the work. Keeping it running when the company grows (and proving to auditors that it runs correctly!) is the other 20%.


  • Data quality — schema checks, null-rate monitoring, distribution drift detection. Tools: Great Expectations, dbt tests, Soda. Without automated quality checks, you'll find out about bad data when the CEO asks why the dashboard shows $0 in sales for a region that shipped 10,000 units.
  • Lineage — given a number on a dashboard, can you trace it back to the source rows that produced it? Lineage tools (DataHub, OpenLineage) make this possible. Regulators and auditors increasingly require it.
  • Governance — who is allowed to see what, and how do you prove it? Access control isn't a nice-to-have; it's a HIPAA requirement, a GDPR requirement, and a SOC 2 requirement depending on your industry.
  • Cost — pipelines that worked fine at 100 GB/day can bankrupt you at 10 TB/day. Watch query costs. BigQuery and Snowflake both have pricing models where a single poorly-optimized query can be shockingly expensive.
Checkpoint

At 6:58 a.m., the head of sales reports that the regional dashboard is empty — it shows no data for any region. You're the on-call engineer. In what order do you investigate?