Big Data: Hadoop, Spark, and Hive
There's a moment in every data team's life when the data outgrows a single machine. Maybe you're joining 50 GB tables and your laptop starts sweating. Maybe a query that ran in ten minutes last year now takes three hours. Maybe you're trying to process a week's worth of clickstream data and it simply doesn't fit in memory.
When that moment arrives, you need a distributed processing framework — software that splits a computation across many machines and coordinates the results. Three names define this space, and they often appear in job descriptions, architecture docs, and technical interviews.
Apache Hadoop
Hadoop introduced two foundational ideas in the mid-2000s:
- HDFS (Hadoop Distributed File System): store data across many machines, with automatic replication for fault tolerance. With the default replication factor of 3, if one node fails, copies of its blocks still exist on two others.
- MapReduce: process distributed data in parallel using a two-phase programming model: a map step that processes each chunk independently, and a reduce step that aggregates the results.
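To make the two phases concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the "chunks" are just strings standing in for blocks of a distributed file. In a real cluster each chunk would be mapped on a different machine and the framework would perform the grouping step.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently; nothing here depends on another chunk.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Three strings standing in for three blocks of a large file.
chunks = ["spark hive hadoop", "hadoop spark", "spark"]
print(word_count(chunks))
```

Because the map step never looks at more than one chunk, it can run on as many machines as there are chunks; only the grouping step needs to move data between them.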
MapReduce is elegant in theory and painful in practice. Every intermediate result gets written to disk before the next step reads it — which makes it slow. Writing complex transformations in MapReduce requires thinking in terms of map/reduce phases, which is not how most people think about data manipulation.
Hadoop is still in production at large enterprises, but most new processing workloads run on Spark rather than MapReduce, often while still reading from HDFS.
Apache Spark
Spark is the current default for large-scale data processing, and the improvement over Hadoop MapReduce is substantial. The key insight: instead of writing intermediate results to disk between steps, Spark keeps them in memory where it can, spilling to disk only when the data doesn't fit. On iterative workloads (like machine learning training, which makes many passes over the data), the Spark project has cited speedups of up to 100×, though the actual gain depends heavily on the job and the cluster.
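The difference between the two hand-off styles can be illustrated with a toy model in plain Python. This is neither Spark nor Hadoop code: run_with_disk_handoff and run_in_memory are hypothetical names, JSON files in a temporary directory stand in for HDFS, and a Python list stands in for a cached dataset. The sketch only counts how often intermediate state is written to disk.

```python
import json
import tempfile
from pathlib import Path


def run_with_disk_handoff(data, steps, workdir):
    """MapReduce-style: each stage's output is written to disk, then re-read."""
    disk_writes = 0
    path = Path(workdir) / "stage_0.json"
    path.write_text(json.dumps(data))
    disk_writes += 1
    for i, step in enumerate(steps, start=1):
        current = json.loads(path.read_text())   # read the previous stage from disk
        result = [step(x) for x in current]
        path = Path(workdir) / f"stage_{i}.json"
        path.write_text(json.dumps(result))      # persist before the next stage runs
        disk_writes += 1
    return json.loads(path.read_text()), disk_writes


def run_in_memory(data, steps):
    """Spark-style (idealized): intermediate results stay in memory."""
    current = list(data)
    for step in steps:
        current = [step(x) for x in current]
    return current, 0


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
with tempfile.TemporaryDirectory() as workdir:
    disk_result, writes = run_with_disk_handoff([1, 2, 3], steps, workdir)
memory_result, _ = run_in_memory([1, 2, 3], steps)
print(disk_result == memory_result, writes)
```

Both versions compute the same answer; the disk-backed one pays one write and one read per stage. An iterative algorithm with many passes over the data multiplies that cost, which is why the in-memory approach helps most there.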
Spark supports four workloads in a single framework:
- Batch processing: process a large chunk of data on a schedule
- Stream processing: process events as they arrive (Structured Streaming, which supersedes the older Spark Streaming API)
- Machine learning: via the built-in MLlib library
- Interactive SQL: via Spark SQL, which accepts standard SQL syntax
The PySpark API lets you write Spark jobs in Python, which is a major reason Spark has become the default for data science teams that also need distributed compute.
Apache Hive
Hive sits on top of Hadoop and lets you query distributed data using HiveQL, a SQL dialect. Under the hood, Hive translates your query into jobs for an execution engine such as MapReduce, Tez, or Spark. Hive made big data queryable by analysts who couldn't write MapReduce programs.
A common architecture you'll encounter in Hadoop-heritage environments:
sources → HDFS (storage) → Spark (processing) → Hive (querying)
Hive is less prominent in greenfield projects — modern cloud warehouses provide SQL over distributed data without the operational complexity of a Hadoop cluster — but it's ubiquitous in enterprises that built their data infrastructure in the 2010s.
You Don't Need a Hadoop Cluster for Most Problems
Hadoop and Spark are solutions to genuinely big data problems. A 10 GB dataset that fits in a Pandas DataFrame or a Postgres table does not need Spark. A common mistake is reaching for distributed tools before you've outgrown single-machine tools — and then spending weeks on infrastructure instead of on the analysis.
The rule of thumb: try Pandas first. Try DuckDB (a fast in-process analytical database) second. Add distributed compute only when you've verified that the data genuinely doesn't fit single-machine processing.
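As a rough illustration of that rule of thumb, the sketch below turns it into a small decision helper. Everything in it is an assumption made for the example: the function name suggest_tool, the memory_overhead factor of 3 (in-memory DataFrames often need noticeably more RAM than the raw data size, but the right multiplier depends on the data), and the idea that a dataset fitting on local disk is a candidate for DuckDB, which can work on data larger than memory. Treat the output as a starting point for measurement, not a verdict.

```python
GB = 1024 ** 3


def suggest_tool(dataset_bytes, ram_bytes, local_disk_bytes, memory_overhead=3.0):
    """Suggest the simplest tool worth trying first (illustrative heuristic only)."""
    # Assumption: loading into a DataFrame costs roughly memory_overhead x raw size.
    if dataset_bytes * memory_overhead <= ram_bytes:
        return "pandas"
    # Assumption: data that fits on local disk can be tried in DuckDB next.
    if dataset_bytes <= local_disk_bytes:
        return "duckdb"
    # Only now does distributed compute look justified.
    return "distributed"


# A hypothetical workstation with 64 GB of RAM and 1 TB of local disk.
for size_gb in (10, 50, 5000):
    print(size_gb, "GB ->", suggest_tool(size_gb * GB, 64 * GB, 1024 * GB))
```

The ordering matters more than the exact thresholds: each branch is only reached once the cheaper option has been ruled out.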