What Kind of Data?

Before you can choose where to store something, you need to know what it is. Data comes in three structural forms, and each one has a different relationship with the databases designed to hold it.

Three Structural Forms of Data

Click a card to learn more. The type determines what kind of database can hold it — and what kind of pipeline you'll need to make it useful.

Knowing the type gets you halfway there. The other half is the four V's — a framework for sizing up any storage problem.

The Four V's of Data
How Much?
Volume
The total size of data to store and query.
A few gigabytes fits on a laptop. A few petabytes requires a distributed system. The gap between those two scenarios is most of what drives database architecture decisions.
Key distinction
GB vs. TB vs. PB
How Fast?
Velocity
The rate at which data arrives.
A monthly data export is trivially handled with batch processes. A real-time event stream from a million devices is an entirely different engineering problem.
Key distinction
Batch vs. streaming
How Many Types?
Variety
The number of distinct data shapes in the system.
A single schema of structured records is the easy case. A system ingesting clickstreams, images, user text, and transaction records simultaneously requires a more thoughtful architecture.
Key distinction
Structured vs. mixed
How Trustworthy?
Veracity
The reliability and accuracy of the data.
Data from controlled sensors is high-veracity. Data scraped from the open web is not. Low veracity upstream means more validation and cleaning work, which affects pipeline design all the way to the model.
Key distinction
Controlled vs. scraped

Answer all four for any storage problem and the right tool usually surfaces. Skip them and you'll spend months migrating away from a database that was never right for the job.

Checkpoint

A hospital system is adding a new feature: continuous vitals monitoring for ICU patients. Every patient generates one data point per second across six vital signs. The data must be queryable by time window (e.g., 'give me everything from the last 6 hours') and retained for two years. Which V drives the storage choice most strongly here?