The Questions That Open a Dataset

The temptation when you get a new dataset is to load it into a notebook, call df.head(), and start fitting models. Don't. Start with questions. EDA begins before you fit a single ML model.

Four Categories of Opening Questions

Context questions. What is the source of the data? Who collected it, how, and when? What are the known biases and limitations? How does this data relate to the problem you're solving? This is the connective tissue between sourcing (Chapter 2) and everything that comes next.

Sampling questions. Is the dataset representative of the population of interest? Will you need train/validation/test splits? How will you split? Are there subgroups that need separate analysis? If your dataset is 95% from one country but you'll deploy globally, you need to know that now — not after the model is in production.
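As a rough illustration of the sampling questions, the sketch below measures how much of a dataset each subgroup contributes and then draws a split that preserves those proportions. The toy data and the column names `country` and `value` are assumptions made for this example, and sampling within each group is only one of several reasonable ways to split.

```python
import pandas as pd

# Hypothetical toy data; "country" and "value" are assumed column names.
df = pd.DataFrame({
    "country": ["US"] * 95 + ["BR"] * 5,
    "value": range(100),
})

# Share of rows per subgroup. A heavily skewed share is a sampling red flag
# if the model will be deployed to a broader population.
shares = df["country"].value_counts(normalize=True)
print(shares)

# One possible stratified split: draw 20% of the rows within each subgroup
# so that small groups still appear in the held-out set.
test = df.groupby("country").sample(frac=0.2, random_state=0)
train = df.drop(test.index)
print(len(train), len(test))
```

With only five rows from the minority group, the held-out set contains a single example of it, which is itself a warning that any per-subgroup evaluation will be unreliable.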

Structure questions. What are the dimensions? What are the data types of each variable — numerical, categorical, text, datetime? Are there missing values, and how are they represented?
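The structure questions map onto a handful of pandas attributes. The following is a minimal sketch on a made-up DataFrame; the column names and values are placeholders, not data from this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric, one text, and one datetime column.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29],
    "city": ["Lagos", "Lima", None, "Oslo"],
    "signup": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-03-02", None]),
})

# Dimensions: number of rows and columns.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Data type of each variable as pandas inferred it.
print(df.dtypes)

# Count of values pandas already recognizes as missing (NaN, None, NaT).
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Note that `isna()` only counts values pandas itself treats as missing. Sentinel codes such as -9999 or an empty string will not show up in this count.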

Quality questions. Are there duplicates? Inconsistent values? Columns that look numeric but contain strings? Outliers visible even before any statistics are computed?
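Two of the quality questions can be checked in a few lines. This sketch counts fully duplicated rows and finds the entries that keep a numeric-looking column from parsing as numbers. The `id` and `income` columns are invented for the example.

```python
import pandas as pd

# Hypothetical toy data: "income" looks numeric but is stored as text.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "income": ["52000", "48000", "48000", "unknown"],
})

# Rows that repeat an earlier row exactly, across all columns.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Try to parse the column as numbers; entries that fail become NaN.
parsed = pd.to_numeric(df["income"], errors="coerce")

# Values that were present in the raw data but could not be parsed.
bad_values = df.loc[parsed.isna() & df["income"].notna(), "income"].unique().tolist()
print(bad_values)
```

Inspecting `bad_values` before converting the column is the point: it tells you whether the stray strings are typos, units, or a disguised missing-value code.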

Missing values are encoded in more ways than you expect: blank cells, NaN, 0, -9999, the empty string "", or a special sentinel the original collector chose. Figuring out what counts as "missing" in your particular dataset is non-trivial — and worth checking before anything else. The wrong assumption here corrupts every downstream step.
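One way to make the "what counts as missing" decision explicit is to list the sentinel codes per column and convert them in a single step. The sketch below assumes the codes are already known, for example from a data dictionary. The `SENTINELS` mapping and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a numeric sentinel and two text sentinels.
df = pd.DataFrame({
    "temperature": [21.5, -9999, 19.0, np.nan],
    "notes": ["ok", "", "n/a", "ok"],
})

# Assumed sentinel codes for this example. In real data these should come
# from the documentation or the original collector, not from guesswork.
SENTINELS = {"temperature": [-9999], "notes": ["", "n/a"]}

before = df.isna().sum()

cleaned = df.copy()
for col, codes in SENTINELS.items():
    # Blank out every entry that matches one of the column's sentinel codes.
    cleaned[col] = cleaned[col].mask(cleaned[col].isin(codes))

after = cleaned.isna().sum()
print(before)
print(after)
```

Comparing `before` and `after` shows how much missingness was hidden behind sentinels. A value like 0 is deliberately left out of the mapping here, because it is often a legitimate measurement and needs a per-column judgment.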

The Healthcare Null That Wasn't Missing

I once worked on a healthcare project where what I assumed were missing values in a column were actually patients who hadn't yet had a particular test. The data wasn't missing; the test simply hadn't been performed yet. Treating those nulls as missing data and imputing them would have been an enormous mistake, because it would have invented results for tests that never happened. The context question (why is this null?) is the one that catches this. In industry, experienced data scientists spend the most time on context questions and the least time fitting models. Junior data scientists often do the opposite. Be the experienced one.
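A hedged sketch of the alternative to imputation: record the fact that the test was not performed as its own variable, and compute statistics only over patients who were actually tested. The data and the column names `patient_id`, `test_result`, and `test_performed` are invented for illustration and are not from the project described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN here means "test not yet performed".
patients = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],
    "test_result": [5.4, np.nan, 6.1, np.nan],
})

# Keep the reason for the null as an explicit indicator column.
patients["test_performed"] = patients["test_result"].notna()

# Summaries are computed over tested patients only; the nulls stay null.
mean_tested = patients.loc[patients["test_performed"], "test_result"].mean()
print(mean_tested)
print(patients)
```

The indicator column can itself be informative, since whether a test was ordered often depends on the patient's condition.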

Checkpoint

You receive a dataset and immediately run df.describe() to get summary statistics. What critical information might you be missing?
