What is Data Engineering?

There's a fantasy version of the data scientist role where clean datasets materialize each morning, perfectly formatted and ready to model. In reality, someone built the pipeline that produced that data. Someone decided how to partition it. Someone set the schedule. Someone gets paged at 2 a.m. when it breaks.

That person is the data engineer — and on smaller teams, that person is you.

Data engineering is the practice of designing, building, and maintaining the infrastructure that collects, stores, and serves data at scale.

In practice, the line between data engineering and ML engineering is porous. At a startup, you might own the entire pipeline from raw source to deployed model. At a large company, a specialized data engineering team will own the pipeline up to the feature store, and your job starts where theirs ends. Either way, you'll be more effective — and more trusted — if you can speak the language of data engineering.
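
As a rough illustration of what "collects, stores, and serves" can look like at the smallest possible scale, the sketch below cleans a few raw records, loads them into an in-memory SQLite database, and answers a question with a join. The sample records, table names, and function names are all hypothetical, and SQLite only stands in for a real warehouse. A production pipeline would add scheduling, monitoring, and error handling on top of this.

```python
import sqlite3

# Hypothetical raw records, standing in for an upstream source such as an API or log file.
RAW_ORDERS = [
    {"order_id": "1", "customer_id": "10", "amount": " 19.99 "},
    {"order_id": "2", "customer_id": "11", "amount": "5.00"},
    {"order_id": "3", "customer_id": "10", "amount": None},  # malformed: no amount
]
RAW_CUSTOMERS = [
    {"customer_id": "10", "name": "Ada"},
    {"customer_id": "11", "name": "Grace"},
]


def clean_orders(rows):
    """Collect: drop rows with a missing amount and cast strings to real types."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue
        cleaned.append(
            (int(row["order_id"]), int(row["customer_id"]), float(row["amount"].strip()))
        )
    return cleaned


def load(conn, orders, customers):
    """Store: create the tables and write the cleaned rows into them."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(int(c["customer_id"]), c["name"]) for c in customers],
    )
    conn.commit()


def spend_per_customer(conn):
    """Serve: join orders to customers and total the spend per customer."""
    query = """
        SELECT c.name, ROUND(SUM(o.amount), 2) AS total
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


def run_pipeline():
    """Run the three steps end to end against a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        load(conn, clean_orders(RAW_ORDERS), RAW_CUSTOMERS)
        return spend_per_customer(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for name, total in run_pipeline():
        print(name, total)
```

The malformed order is dropped during cleaning, so it never reaches the stored table or the totals. Much of the day-to-day work is deciding where checks like that belong and what should happen when they fail.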

The Data Engineer's Core Toolkit

  • Languages — Python and SQL are non-negotiable. Scala or Java still appear around Spark and Kafka.
  • A cloud platform — AWS, Azure, or GCP. Usually your employer decides.
  • A warehouse — Snowflake, Redshift, or BigQuery.
  • An ETL/orchestration tool — Apache Airflow leads the open-source pack.
  • Version control and containers — Git, Docker, often Kubernetes.

Checkpoint

A data scientist at a startup says: 'I spend 70% of my time getting data into a usable format — cleaning it, joining it, scheduling jobs to refresh it.' What is the most accurate description of what they're doing?