Representation Is Compression

Every time you represent something — in a drawing, a word, a number, a file — you are making a decision about what to keep and what to throw away. That is compression.

The original thing (a landscape, a face, a song) contains far more detail than any representation can hold. Every representation is finite, so something always gets left out. The only question is which details survive the translation.

Lossy vs. Lossless

In computing, compression is either lossless (the original can be perfectly reconstructed — like a ZIP file) or lossy (some information is permanently discarded — like a JPEG or an MP3). Human memory is almost always lossy. So is most real-world data collection.
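The distinction can be sketched in a few lines of Python. This is only an illustration: `zlib` from the standard library stands in for ZIP-style lossless compression, and rounding stands in for a lossy scheme. The sample text and the `readings` values are made up for the example.

```python
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 20

# Lossless: compress, decompress, and the original comes back byte for byte.
packed = zlib.compress(text)
restored = zlib.decompress(packed)
lossless_ok = restored == text  # True

# Lossy: rounding to the nearest 10 keeps the gist but discards detail.
readings = [12.7, 18.2, 23.9, 31.4]  # hypothetical measurements
rounded = [round(r, -1) for r in readings]
print(rounded)  # [10.0, 20.0, 20.0, 30.0]
```

After rounding, 18.2 and 23.9 are both stored as 20.0. Nothing in the rounded list says which was which, and that is what "permanently discarded" means in practice.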

This isn't a flaw. It's the point. A map that contains every detail of a city at 1:1 scale is useless — it's just the city again. Compression is what makes information actionable. The data scientist's job is to choose compressions that preserve what matters for the task at hand.

Here's a concrete demonstration. You've seen the Starbucks logo hundreds of times. But how much of it actually made it into your memory?

Branded in Memory

Draw from Memory

Without looking it up, draw the Starbucks logo as accurately as you can, and take your time. Then compare your compressed representation to the original, and to everyone else's.

Notice what happened. You and hundreds of other people encoded the same logo, but each representation was different. Everyone kept the dominant features (green circle, mermaid figure) and dropped the fine details. That's compression in action: high-frequency visual information (the exact crown shape, the star count, the precise arm position) got filtered out, while low-frequency structure (color, rough shape, general subject) was retained.

This is roughly what a JPEG does to a photograph. It discards much of the high-frequency detail, which is expensive to store and which most viewers won't notice, and it keeps the broad strokes. Your visual memory makes a similar trade.
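To make "discarding high-frequency detail" concrete, here is a minimal sketch using a one-dimensional discrete cosine transform (DCT) on a single row of eight pixel brightness values. It is a simplification, not the JPEG algorithm itself: real JPEG works on two-dimensional 8×8 blocks and quantizes coefficients rather than zeroing them outright. The `pixels` values and the helper names (`dct`, `idct`, `keep_low`) are assumptions made for illustration.

```python
import math

def dct(signal):
    """Unnormalized type-II DCT: rewrite the signal as a sum of cosines."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse of dct() above."""
    n = len(coeffs)
    return [
        (2 / n) * (
            coeffs[0] / 2
            + sum(coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(1, n))
        )
        for i in range(n)
    ]

def keep_low(coeffs, count):
    """Keep the first `count` (lowest-frequency) coefficients, zero the rest."""
    return coeffs[:count] + [0.0] * (len(coeffs) - count)

pixels = [52, 55, 61, 66, 70, 61, 64, 73]  # hypothetical row of brightness values
coeffs = dct(pixels)

exact = idct(coeffs)                # all 8 coefficients: original recovered
approx = idct(keep_low(coeffs, 3))  # only 3 coefficients: smooth approximation

max_error = max(abs(a - p) for a, p in zip(approx, pixels))
print([round(v, 1) for v in approx], round(max_error, 2))
```

With all eight coefficients the row is recovered up to floating-point error. With three, the average brightness and the overall trend survive, but the pixel-to-pixel wiggles are gone, much like the crown shape in a logo drawn from memory.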

Compression in Every Data Type

Text: A summary compresses a document. A word compresses a concept. An emoji compresses an emotion.
Images: JPEG compression discards high-frequency pixel variation. Downsampling throws away resolution.
Tabular data: Binning a continuous variable (age → "20–30") compresses by quantizing. Averaging discards variance.
Models: A trained model is itself a compression — it encodes statistical patterns from millions of training examples into a fixed number of parameters.
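The tabular case is the easiest to try by hand. The sketch below uses made-up data: `bin_age` is a hypothetical helper that assigns each age to a decade-wide, non-overlapping bin, and the two groups are chosen so that their averages match while their spreads do not.

```python
from statistics import mean, pvariance

ages = [19, 23, 27, 31, 38, 44, 52, 67]  # hypothetical sample

def bin_age(age, width=10):
    """Map an exact age to a non-overlapping label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Binning (quantizing): 23 and 27 become indistinguishable.
binned = [bin_age(a) for a in ages]
print(binned)
# ['10-19', '20-29', '20-29', '30-39', '30-39', '40-49', '50-59', '60-69']

# Averaging: two very different groups collapse to the same number.
group_a = [40, 40, 40, 40]
group_b = [10, 30, 50, 70]
print(mean(group_a), mean(group_b))            # 40 40
print(pvariance(group_a), pvariance(group_b))  # 0 500
```

Eight distinct ages become six labels, and two groups with variances of 0 and 500 become the same mean. Whether that loss matters depends entirely on the question the data is meant to answer.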

When you choose a representation for your data, you are choosing a compression scheme. The decision is never neutral. A feature you don't include can't influence your model. A resolution you discard can't be recovered. A category boundary you draw will shape every downstream result.

Understanding representation as compression reframes the question. It's not "how do I store this data?" It's "what do I need to preserve — and what am I willing to lose?"

Checkpoint

A data scientist bins a continuous "age" column into ranges like 0–18, 19–35, 36–60, 61+. What kind of compression is this?
