The Cloud Warehouses

Modern data engineering rarely runs on raw Hadoop clusters anymore. The dominant pattern at most companies is a managed cloud data warehouse — a service where the underlying distributed infrastructure is someone else's problem, and you focus on SQL and pipeline design.

Three platforms define this space. You typically don't pick the warehouse — your employer already has. But understanding the trade-offs makes you useful in architecture conversations, and it can come up in interviews.

ℹ

Snowflake

Snowflake is cloud-native and multi-cloud — it runs on AWS, Azure, and GCP, and you can query data across all three from the same interface. Its signature architectural decision is separating storage from compute: you scale them independently, which means you can spin up a large compute cluster for a complex query and then scale back down, paying only for what you use.

Snowflake is popular with organizations that operate across cloud providers or want to avoid cloud lock-in. It's also the choice when you're buying the warehouse as a service rather than building it yourself — Snowflake handles performance tuning, indexing, and maintenance.

ℹ

Amazon Redshift

Redshift is AWS's managed warehouse. It uses columnar storage and massively parallel query execution. It's tightly integrated with the rest of the AWS ecosystem — S3, Glue, Lambda, SageMaker — which makes it the natural choice for teams already deep in AWS.

Redshift works well for read-heavy analytical workloads at significant scale. Its Redshift Spectrum feature lets you query data directly in S3 without loading it into the warehouse — a useful bridge between lake and warehouse patterns.

ℹ

Google BigQuery

BigQuery is serverless — you don't provision clusters or think about nodes. You write SQL, submit queries, pay per byte scanned. This makes it the fastest path from "zero" to "querying petabytes of data" of any warehouse in the market.

The billing model is also different from Redshift: you pay per query rather than per cluster-hour, which is cheaper for intermittent workloads but can become expensive if you run large scans frequently. BigQuery has become the default recommendation for learning and prototyping precisely because the operational overhead is near-zero.

Pipeline Cost SimulatorEstimated monthly cost

Your workload

Data stored500 GB

10 GB9.8 TB

Queries per day50 queries

1 queries1000 queries

Compute hours / day (Snowflake)4h

0.5h24h

Snowflake billed per compute-hour (warehouse on/off). BigQuery billed per TB scanned. Redshift billed per node-hour (always on). Prices approximate 2024 list rates.

Monthly estimate

Snowflake

$1932

Compute: $1920Storage: $12

BigQuery

Cheapest$690

Compute: $680Storage: $10

Redshift

$2359

Compute: $2347Storage: $12

Tip: BigQuery's per-scan billing favors intermittent workloads. Redshift's always-on nodes favor high, sustained query volume. Snowflake's auto-suspend favors bursty compute needs.

◆

Which Would You Choose?

A healthcare analytics team is moving from an on-premises data warehouse to the cloud. They need strong compliance controls (HIPAA), SQL-based transformations, and an easy bridge to their existing AWS infrastructure. Their team is deep in AWS and doesn't want to manage multiple cloud accounts.

Best fit: Redshift. Native AWS integration, HIPAA-eligible (with proper configuration), columnar storage for the analytical workloads they're running. Snowflake would also work but introduces multi-cloud complexity they don't need. BigQuery would require migrating their AWS infrastructure to GCP or managing cross-cloud access.

💭Reflection

You're joining a small startup that has been storing all of its data in flat CSV files in S3. They've asked you to recommend a warehouse. What questions would you ask before making a recommendation?

←PreviousBig Data: Hadoop, Spark, and HiveBuilding Data Pipelines Next→Designing a PipelineBuilding Data Pipelines