Population vs. Sample

A population is the entire group of entities you care about — all the users of your product, all the patients with a condition, all possible outputs of a process. A sample is a subset you actually have data on.

In data science, you almost never have the population. Even when your dataset feels huge, it's a slice in time, drawn from a particular source, filtered by whatever pipeline collected it. A model trained on it will be deployed in a world that differs, at least slightly, from the one that produced its training data. Treating your data as the whole truth is one of the most common ways smart people make bad decisions.

The Dataset Is Never the Whole World

A model trained on last year's transactions doesn't know about next year's user behavior. A model trained on data from one hospital may not generalize to patients at another. Even "all users" is a sample — it's every user who signed up, used your product in this particular way, and was captured by your logging pipeline. The selection process is always baked in.
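
To make the selection effect concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the population is synthetic (weekly sessions per user drawn from an exponential distribution), and the "logging pipeline" is a hypothetical rule that only captures users with at least 3 sessions. It compares a simple random sample against a sample taken only from the logged users.

```python
import random
import statistics

def simulate_selection_bias(seed=0, population_size=10_000, sample_size=500):
    """Compare a simple random sample with a sample filtered by a logging rule."""
    rng = random.Random(seed)

    # Hypothetical population: weekly sessions per user (synthetic, mean about 5).
    population = [rng.expovariate(1 / 5.0) for _ in range(population_size)]

    # Simple random sample: every user is equally likely to be included.
    random_sample = rng.sample(population, sample_size)

    # Assumed pipeline rule: only users with at least 3 sessions get logged.
    logged_users = [x for x in population if x >= 3.0]
    logged_sample = rng.sample(logged_users, sample_size)

    return {
        "population_mean": statistics.fmean(population),
        "random_sample_mean": statistics.fmean(random_sample),
        "logged_sample_mean": statistics.fmean(logged_sample),
    }

result = simulate_selection_bias()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the random sample's mean should land near the population mean, while the logged sample's mean sits well above it. Collecting more logged users would not close that gap, because the bias comes from who gets selected, not from how many.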

This is why everything we do in this unit is inferential statistics: methods designed to make sound claims about a population when all you have is a sample. Population parameters are the unknowns we're trying to estimate. Sample statistics are our estimates of them.

A few definitions:

  • Population mean ($\mu$): The true average over the whole population. Unknown in practice.
  • Sample mean ($\bar{x}$): The average from your data. Your best estimate of $\mu$.
  • Population standard deviation ($\sigma$): The true spread. Almost always unknown.
  • Sample standard deviation ($s$): Computed from your data, with a correction factor (dividing by $n - 1$ rather than $n$). Your estimate of $\sigma$.
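
As a rough sketch of how the two sample statistics are computed, the snippet below works through a small, made-up list of exam scores (the values are hypothetical, not taken from any dataset in this unit). It uses only the Python standard library and shows the correction factor as the $n - 1$ divisor.

```python
import math

# Hypothetical sample of exam scores (illustrative values only).
sample = [68, 75, 71, 80, 64, 77, 73, 69]

n = len(sample)
x_bar = sum(sample) / n  # sample mean: the estimate of mu

# Sum of squared deviations from the sample mean.
squared_deviations = sum((x - x_bar) ** 2 for x in sample)

# Sample standard deviation s: divide by n - 1 (the correction factor).
s = math.sqrt(squared_deviations / (n - 1))

# Dividing by n instead treats the data as if it were the whole population.
sigma_if_population = math.sqrt(squared_deviations / n)

print(f"x_bar = {x_bar:.3f}")
print(f"s (n - 1 divisor) = {s:.3f}")
print(f"divide-by-n version = {sigma_if_population:.3f}")
```

The standard library's `statistics.stdev` and `statistics.pstdev` compute the same two quantities. The $n - 1$ version is always slightly larger, which compensates for measuring spread around $\bar{x}$ instead of the unknown $\mu$.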

Population & Sample Explorer

Interactive demo: a population of 200 exam scores with true mean $\mu = 72.6$. In practice, $\mu$ is unknown and we only see samples.

Draw repeated samples from a population of 200 exam scores and watch how the distribution of sample means ($\bar{x}$) clusters around the true population mean ($\mu$). Increase $n$ to see the spread shrink.
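
The same experiment can be sketched offline. The version below is a rough approximation under stated assumptions: the 200 scores are synthetic (normally distributed around 72.6 with an assumed spread of 12, clipped to 0 to 100), so the simulated $\mu$ will be close to, but not exactly, 72.6. The helper `sample_means` is a hypothetical name introduced here.

```python
import random
import statistics

def sample_means(population, n, draws, rng):
    """Draw `draws` samples of size n (without replacement); return their means."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(draws)]

rng = random.Random(42)

# Synthetic stand-in for the 200 exam scores; the real values are not given.
population = [min(100.0, max(0.0, rng.gauss(72.6, 12.0))) for _ in range(200)]
mu = statistics.fmean(population)  # known here only because we built the population

spread_by_n = {}
for n in (5, 20, 80):
    means = sample_means(population, n, draws=2000, rng=rng)
    # Standard deviation of the sample means: how far x_bar typically strays.
    spread_by_n[n] = statistics.stdev(means)
    print(
        f"n={n:>2}  mean of sample means={statistics.fmean(means):.2f}  "
        f"spread={spread_by_n[n]:.2f}  (mu={mu:.2f})"
    )
```

The sample means should center near $\mu$ for every $n$, while their spread drops as $n$ grows. This is the same pattern the explorer shows.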

💭 Reflection

Think about a dataset you've worked with. What population was it trying to represent? What selection processes might have biased the sample? What conclusions might be wrong because of that gap?