The "How Much Data" Question
If you spend any time around data, you will be asked some version of this question approximately forever: How much data do I need? To predict churn. To detect this disease. To prove this campaign worked.
Well, the answer is: it depends.
It depends on the complexity of your problem. The model you plan to use. The expected variability in your data. The quality of that data. The dimensionality of your feature space. The class imbalance you anticipate. The effect size you're hoping to detect. The significance level you've chosen. And, sometimes most decisively, your time and budget constraints.
You'll leave this unit with the language to give a better answer than "it depends" — though fair warning, you'll still say "it depends" a lot. The difference is you'll be able to enumerate exactly what it depends on, and you'll know how to get an answer.
The factors that determine how much data you need, in rough order of importance:
- Effect size you're trying to detect. Small effects require much more data than large ones.
- Variability in your outcome. Noisy outcomes require more data to find signal in.
- Statistical power you want. Wanting to miss fewer real effects (higher power) means more data.
- Significance threshold (α). Stricter thresholds require larger samples.
- Model complexity. More parameters generally mean more training examples are needed.
- Class imbalance. A dataset that's 99% one class effectively has very few examples of what you care about.
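The first four factors combine in the standard normal-approximation formula for comparing two group means: n per group ≈ 2(z_{1−α/2} + z_{power})² / d², where d is the raw difference divided by the outcome's standard deviation. The sketch below is a minimal illustration rather than a full power analysis. The helper name `n_per_group` and the example numbers are made up for illustration, and it assumes a two-sided test with equal group sizes and a common standard deviation. An exact t-test calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(raw_diff, sd, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Hypothetical helper: normal approximation, equal group sizes,
    common standard deviation `sd`.
    """
    d = raw_diff / sd                              # Cohen's d (standardized effect)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # from the significance threshold
    z_power = NormalDist().inv_cdf(power)          # from the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Illustrative scenarios (all numbers are assumptions, not real data).
baseline = n_per_group(raw_diff=5, sd=10)                 # d = 0.50
smaller_effect = n_per_group(raw_diff=2.5, sd=10)         # d = 0.25
noisier = n_per_group(raw_diff=5, sd=20)                  # d = 0.25 again
more_power = n_per_group(raw_diff=5, sd=10, power=0.90)
stricter = n_per_group(raw_diff=5, sd=10, alpha=0.01)

for label, n in [
    ("baseline", baseline),
    ("half the effect", smaller_effect),
    ("double the noise", noisier),
    ("90% power", more_power),
    ("alpha = 0.01", stricter),
]:
    print(f"{label:>18}: {n} per group")
```

Halving the raw effect and doubling the noise both halve d, so they cost the same amount of extra data. Raising power or tightening α also raises n, but less dramatically in this example.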
Sample size for a two-sample design at the settings below: 63 per group, 126 total across 2 groups (a small study).
- Effect size (Cohen's d): 0.50. The standardized difference: how large is the effect relative to the natural variation in the outcome. Small effects demand far larger samples.
- Statistical power (1 − β): 80%. Probability of detecting a real effect when it exists. 80% is the conventional standard, chosen to balance study cost against the risk of missing real findings.
- Significance threshold (α): 0.05. How much false-positive risk you accept. Stricter thresholds require larger samples. α = 0.05 is the default in most fields.
Required n vs. effect size (at the current power and α)
The required sample size scales with 1/d², so halving the effect size roughly quadruples it.
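To see that scaling in numbers, the sketch below tabulates the approximate per-group sample size across a few effect sizes at 80% power and α = 0.05. It uses the normal approximation n ≈ 2(z_{1−α/2} + z_{power})² / d² for a two-sided, two-sample comparison. The helper name `required_n` and the grid of d values are illustrative assumptions.

```python
from statistics import NormalDist


def required_n(d, power=0.80, alpha=0.05):
    """Unrounded per-group n under the normal approximation (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * z ** 2 / d ** 2


# Arbitrary grid of standardized effect sizes, from large to small.
effect_sizes = [0.8, 0.5, 0.4, 0.25, 0.2, 0.1]
table = {d: required_n(d) for d in effect_sizes}

for d, n in table.items():
    print(f"d = {d:.2f}  ->  about {n:,.0f} per group")

# Halving d should multiply n by 4, because n is proportional to 1/d^2.
ratio = required_n(0.25) / required_n(0.50)
print(f"halving d from 0.50 to 0.25 multiplies n by {ratio:.1f}")
```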
Adjust effect size, desired power, and significance threshold to see how required sample size changes. The curve makes viscerally clear why detecting small effects is so expensive. We will dive into this in great detail in Chapter 4!
You're asked to evaluate whether a new recommendation algorithm increases click-through rate from 2.1% to 2.3%. How would you think about how much data you need? What would make this easier or harder?
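One way to start on this question is to plug the two rates into the normal-approximation sample size formula for comparing two proportions. The sketch below assumes a two-sided test at α = 0.05, 80% power, equal-sized groups, and independent impressions. That last assumption is a simplification, since real click data is often correlated within users. The function name `n_per_arm` and the alternative 2.5% rate are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm(p1, p2, power=0.80, alpha=0.05):
    """Approximate impressions per arm to detect a change from rate p1 to p2.

    Hypothetical helper: normal approximation, two-sided test, equal arms.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # sum of the two Bernoulli variances
    return math.ceil(z ** 2 * variance / (p2 - p1) ** 2)


n_small_lift = n_per_arm(0.021, 0.023)  # the lift described in the question
n_large_lift = n_per_arm(0.021, 0.025)  # assumed larger lift, for comparison

print(f"2.1% -> 2.3%: about {n_small_lift:,} impressions per arm")
print(f"2.1% -> 2.5%: about {n_large_lift:,} impressions per arm")
```

Under these assumptions the 0.2-point lift needs tens of thousands of impressions per arm, and doubling the lift cuts the requirement to roughly a quarter. That is the effect-size factor at work: a bigger expected difference makes the problem easier, while a smaller one, a stricter α, or a higher power target makes it harder.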