Orchestration and DAGs

Writing pipeline code is the easy part. Getting it to run reliably, on schedule, in the right order, with alerts when something breaks — that's orchestration.

Think about what a production pipeline actually needs. It needs to run at 2 a.m. every night without anyone pressing a button. It needs to know that Step 4 can't start until Steps 2 and 3 have both succeeded. It needs to retry Step 2 automatically if the source database was briefly unavailable. It needs to alert someone if it fails after three retries. And it needs to show you a history of every run so you can debug what went wrong on Tuesday.

That is the job of an orchestration tool. And the central concept underneath all of them is the DAG.

Directed Acyclic Graph (DAG)

A directed acyclic graph is a graph where edges have a direction (A → B means "A must finish before B starts") and there are no cycles (you can't loop back to a node you've already visited).

In a pipeline DAG:

Nodes are tasks — "extract orders from Postgres," "transform and join," "load to Snowflake"
Edges are dependencies — "this task depends on that one"
Acyclic means pipelines run forward in time, not in loops

When you define a pipeline as a DAG, the orchestrator can figure out which tasks can run in parallel (those with no dependency on each other) and which must run sequentially. This is not something you calculate manually — you express the graph, and the tool does the scheduling.

Airflow DAG — Nightly Sales Pipeline

idlerunningdonefaileddependency arrows show execution order

Each node is a task; arrows are dependencies. Tasks with no shared dependency can run in parallel — the orchestrator schedules this automatically.

An Airflow DAG visualizes the dependency structure of a pipeline. Tasks with no dependency on each other can run in parallel; the orchestrator handles the scheduling automatically.

ℹ

Apache Airflow

Apache Airflow is the de facto open-source standard for pipeline orchestration. You write pipelines as Python code — each task is a Python function or operator, and you express the dependencies between them using Python. Airflow's scheduler interprets the DAG and executes tasks in order, retries failures, and surfaces everything in a web UI.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG("nightly_sales", start_date=datetime(2024, 1, 1), schedule="0 2 * * *") as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform", python_callable=transform_orders)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_snowflake)

    extract >> transform >> load  # dependency chain

The >> operator defines the edge: extract must finish before transform, transform must finish before load.

✦

Other Orchestration Tools

Talend — open-source data integration with a graphical drag-and-drop interface. Easier onboarding for non-engineers; less flexible than code-based approaches. Still common in organizations with mixed technical backgrounds.
Informatica — the heavyweight enterprise option. Comprehensive, expensive, common in large organizations with strict governance and compliance requirements.
Prefect and Dagster — newer Python-native orchestrators that address some of Airflow's rough edges (local testing, type safety, observability). Worth knowing if you're evaluating tools for a greenfield project.

Checkpoint

A pipeline has four tasks: (A) extract customer records, (B) extract order records, (C) join customers to orders, (D) load the joined result to the warehouse. Which tasks can run in parallel, and why?

←PreviousETL: The Central AbstractionBuilding Data Pipelines Next→Big Data: Hadoop, Spark, and HiveBuilding Data Pipelines