Warehouses, Lakes, and How to Choose
Data Warehouse
A data warehouse is a centralized repository of integrated, historical data from across an organization, optimized for analytical queries rather than transactional ones. The defining traits:
- Subject-oriented — organized around business topics (sales, customers, products), not application workflows
- Integrated — data from multiple sources is cleaned and unified into a consistent schema
- Non-volatile — you don't overwrite history; you append
- Time-variant — you keep a historical record, not just current state
The big names: Snowflake, Amazon Redshift, Google BigQuery. They all use columnar storage — storing data column-by-column rather than row-by-row — which makes analytical queries (aggregate this column, filter by that column) dramatically faster.
Data Lake
A data lake is a centralized repository that stores raw data — structured, semi-structured, and unstructured — in its native format, with no upfront schema requirement. Think of it as a staging area for everything: raw logs, JSON API responses, images, audio, CSV exports. Schema is applied at read time rather than write time.
The big names: AWS S3, Azure Data Lake Storage, Google Cloud Storage. Cheap per-byte, theoretically infinite, accepts anything.
A data lake without discipline becomes a data swamp. Dump enough unlabeled, unmanaged, unversioned data into S3 over a few years and you end up with tens of thousands of objects nobody can describe, query performance that has degraded to unusable, and no audit trail for compliance. If you've ever inherited a data/ bucket with 40,000 files and no documentation, you know the feeling.
Governance is necessary for a data lake: naming conventions, partitioning schemes, metadata catalogs, and access controls from day one. A lake without governance is just a filesystem with better marketing.
Common lakes
Common warehouses
Click any row to expand the reasoning behind each trade-off — not just what differs, but why it matters.
Data Lakehouse
The lakehouse is the 2019-era attempt to get the best of both: cheap, flexible lake storage with warehouse-quality schema enforcement, governance, and SQL performance. Databricks coined the term. Delta Lake, Apache Iceberg, and Apache Hudi are the open table formats that power it. Whether your team uses one depends on your vendor relationships and scale — but knowing the concept is useful for design conversations.
Finally, a word on the common file formats you'll see inside lakes:
Common File Formats
- Parquet — columnar, excellent compression, great for analytical queries. The default choice for ML training data.
- ORC — also columnar, optimized for Hive and Presto workloads. More common in Hadoop-heritage environments.
- Avro — row-based, supports schema evolution, well-suited for streaming pipelines.
Parquet has become the de facto standard for data science workflows. If you're storing a training dataset and don't have a reason to do otherwise, use Parquet.
Five Questions to Pick the Right Storage
When a stakeholder asks "what should we use to store this?", walk through five questions:
- 1. What's the data? Structured, semi-structured, unstructured?
- 2. How big, how fast, how varied? (The four V's.)
- 3. What are the access patterns? Heavy reads? Heavy writes? Complex joins? Similarity search?
- 4. What scale do we need? Today, and in two years?
- 5. What's the team's existing ecosystem? A "perfect" choice nobody can operate is worse than a "good" choice that fits the team's skills.
There is rarely one right answer. There is almost always one wrong answer you can rule out by asking those five questions.
1What is the structure of your data?
2What is the data volume and velocity?
3What is the primary access pattern?
4Do you need to scale across multiple machines?
Answer questions about your data's V's and access patterns. The tool surfaces the storage type that fits — with its reasoning.