The Paired Samples t-Test
Question: Is there a significant difference between the means of two related measurements?
Use it when: The same subjects are measured under two conditions (before/after, or treatment/control on the same subjects). Pairing controls for individual variability, which is where paired tests get their statistical power advantage over independent tests.
Test statistic:
$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$$
where $\bar{d}$ is the mean of paired differences, $\mu_d$ is 0 under the null, $s_d$ is the SD of the differences, and $n$ is the number of pairs.
In Python: scipy.stats.ttest_rel(before, after)
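As a minimal sketch of how the formula and the SciPy call relate, the snippet below computes the t-statistic by hand and compares it with `scipy.stats.ttest_rel`. The `before` and `after` values are made-up illustrative numbers, not data from any study.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for 6 subjects (illustrative values only).
before = np.array([72.0, 68.0, 75.0, 80.0, 66.0, 71.0])
after = np.array([70.0, 65.0, 74.0, 76.0, 65.0, 69.0])

# Paired differences; ttest_rel(before, after) tests mean(before - after).
d = before - after
n = len(d)

# t = (mean of differences - 0) / (SD of differences / sqrt(n))
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t-distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_rel(before, after)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

The two lines of output should agree. Note that the sign of the statistic depends on argument order: swapping `before` and `after` flips the sign of t but leaves the two-sided p-value unchanged.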
The paired t-test shows up constantly in ML contexts, beyond the obvious before-and-after scenarios:
- Model version comparison: Compare predictions from baseline vs. optimized model on the same test examples. Pair the errors per example, then test if the mean difference is significant.
- Algorithm comparison across datasets: Models A and B are evaluated on the same 10 benchmark datasets. Pair the performance by dataset.
- Feature engineering evaluation: The same model is trained with and without a feature and evaluated on the same data splits. In cross-validation, pair the scores by fold.
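The model version comparison above could be sketched roughly as follows. Everything here is simulated: `y_true`, `pred_baseline`, and `pred_optimized` are hypothetical stand-ins for a real test set and two models' predictions, and absolute error is just one possible per-example metric.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-ins for a real test set and two models' predictions.
n_examples = 200
y_true = rng.normal(size=n_examples)
pred_baseline = y_true + rng.normal(scale=1.0, size=n_examples)
pred_optimized = y_true + rng.normal(scale=0.8, size=n_examples)

# Per-example absolute errors, aligned by index so that element i of each
# array refers to the same test example.
err_baseline = np.abs(y_true - pred_baseline)
err_optimized = np.abs(y_true - pred_optimized)

# Paired test on the per-example error differences.
result = stats.ttest_rel(err_baseline, err_optimized)
mean_diff = np.mean(err_baseline - err_optimized)
print(f"mean error difference (baseline - optimized): {mean_diff:.4f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

The same pattern applies to the other two scenarios: replace the per-example error arrays with per-dataset or per-fold scores, keeping the two arrays in the same order so that each position is a genuine pair.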
Pairing removes between-subject variability from the error term. If different test folds vary a lot in difficulty, the paired test adjusts for that. An independent test wouldn't — it would treat all that fold variability as error, reducing power.
Figure: Each row is one subject. The gray dot is the before value, the colored dot is the after value, and lines connect paired values.
The paired test computes differences within each pair, eliminating between-subject variability from the error term. When subjects vary a lot from each other (high individual variation), this shrinks the standard error and increases the t-statistic relative to an independent test. When subjects are nearly identical, pairing provides little advantage.
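To illustrate the effect on the standard error, here is a small simulation under assumed conditions: 20 hypothetical subjects whose baselines vary widely (SD of 10), with a small, consistent shift of about 2 units between conditions. The numbers are chosen only to make the contrast visible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_subjects = 20
# Large between-subject spread: baselines differ a lot across subjects.
baseline = rng.normal(loc=50.0, scale=10.0, size=n_subjects)
# Small, consistent within-subject shift of about 2 units.
after = baseline + 2.0 + rng.normal(scale=0.5, size=n_subjects)

# Same data, analyzed two ways.
paired = stats.ttest_rel(after, baseline)
independent = stats.ttest_ind(after, baseline)

# Standard error each test uses for the mean difference.
d = after - baseline
se_paired = d.std(ddof=1) / np.sqrt(n_subjects)
# With equal group sizes this equals the pooled SE used by ttest_ind.
se_independent = np.sqrt(
    after.var(ddof=1) / n_subjects + baseline.var(ddof=1) / n_subjects
)

print(f"paired:      SE = {se_paired:.3f}, t = {paired.statistic:.2f}, "
      f"p = {paired.pvalue:.4g}")
print(f"independent: SE = {se_independent:.3f}, t = {independent.statistic:.2f}, "
      f"p = {independent.pvalue:.4g}")
```

Both tests see the same mean difference. Only the denominator changes: the independent test's SE absorbs the spread between subjects, while the paired SE reflects only the spread of the within-subject differences. Setting the baseline `scale` close to 0 in this sketch should make the two analyses much more similar.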
Analyzing the same dataset both ways makes this concrete: the paired test has more power than the independent test when individual variation is large.
You compare two NLP models using 5-fold cross-validation on the same dataset. For each fold, you record Model A's F1 score and Model B's F1 score. Which test should you use to determine if one model is significantly better?