The Multiple Comparisons Problem
When you run multiple hypothesis tests, the risk of at least one false positive grows fast. A single test at α = 0.05 has a 5% chance of a false positive when the null hypothesis is true. Run 20 independent tests with no real effects anywhere and the chance of at least one false positive is 1 − 0.95^20 ≈ 64%.
This is the multiple comparisons problem, and you'll bump into it constantly:
- Comparing each variant in a 10-arm A/B test against the control.
- Testing whether each of 50 features is significantly associated with churn.
- Comparing model performance across many cross-validation folds.
- Running the same experiment in multiple geographic segments.
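To see how quickly the risk grows with the number of tests, here is a minimal Python sketch of 1 − (1 − α)^m. It assumes the tests are independent and that every null hypothesis is true; the function name is illustrative, not from any library.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


# Print the rate for a few test counts, e.g. 9 comparisons in a
# 10-arm A/B test or 50 feature-level tests.
for m in (1, 5, 9, 20, 50):
    print(f"m = {m:>2}: {family_wise_error_rate(0.05, m):.1%}")
```

Real tests are often correlated, in which case the true rate is usually lower than this formula suggests, but the direction of the effect is the same.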
Bonferroni Correction
The simplest and most conservative fix. Divide your significance threshold by the number of tests:
$$\alpha_{\text{per test}} = \frac{\alpha}{m}$$
where m is the number of hypotheses being tested. If you're running 5 tests with α = 0.05, each individual test's p-value must fall below 0.05/5 = 0.01.
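As a rough illustration, the correction can be sketched in a few lines of Python. The function names and the p-values are hypothetical, made up only to show the mechanics.

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance threshold for m tests."""
    return alpha / m


def bonferroni_reject(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five tests (illustrative only).
p_values = [0.003, 0.012, 0.020, 0.041, 0.300]

print(bonferroni_threshold(0.05, len(p_values)))
print(bonferroni_reject(p_values))
```

With these made-up values, four of the five p-values fall below the uncorrected 0.05, but only the first one passes the corrected threshold.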
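Trade-off: Bonferroni reduces power. By raising the bar for every test, you'll miss more real effects. It's conservative by design. For very large numbers of tests, consider the Benjamini-Hochberg procedure instead, which controls the false discovery rate (the expected share of rejected hypotheses that are false positives) rather than the family-wise error rate (the probability of at least one false positive across all tests).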
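For comparison, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. It sorts the p-values, finds the largest rank k whose p-value is at most (k/m)·q, and rejects every hypothesis up to that rank. The function name, the target rate q, and the p-values are assumptions for illustration; in practice you would more likely use a maintained statistics library.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a reject flag per hypothesis, in the original input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank k (1-based) with p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k.
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from five tests (illustrative only).
p_values = [0.003, 0.012, 0.020, 0.041, 0.300]
print(benjamini_hochberg(p_values))
```

On these made-up values the procedure rejects the three smallest p-values, whereas a Bonferroni threshold of 0.05/5 = 0.01 would reject only the smallest. The price is a weaker guarantee: some of the rejections are allowed to be false positives.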
You're running a six-arm test: five variants, each compared against a single control. You want to control your family-wise error rate at α = 0.05. What threshold should each individual variant-versus-control comparison use with Bonferroni correction?