The Paired Samples t-Test

Question: Is there a significant difference between the means of two related measurements?

Use it when: The same subjects are measured under two conditions (before/after, treatment/control on same subjects). Pairing controls for individual variability — this is where paired tests get their statistical power advantage over independent tests.

Test statistic:

$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$$

where $\bar{d}$ is the mean of the paired differences, $\mu_d$ is the hypothesized mean difference (0 under the null), $s_d$ is the standard deviation of the differences, and $n$ is the number of pairs.

In Python: scipy.stats.ttest_rel(before, after)
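To connect the formula to the library call, here is a minimal sketch that computes the statistic by hand and compares it with scipy.stats.ttest_rel. The before/after values are made-up illustration data, and the variable names are assumptions rather than part of any fixed API.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for 8 subjects measured under two conditions.
before = np.array([72.0, 65.0, 80.0, 58.0, 90.0, 77.0, 69.0, 84.0])
after = np.array([75.0, 70.0, 79.0, 64.0, 93.0, 80.0, 75.0, 85.0])

# Manual computation from the formula, with mu_d = 0 under the null.
d = before - after                 # paired differences (same order as the call below)
n = d.size                         # number of pairs
d_bar = d.mean()                   # mean of the differences
s_d = d.std(ddof=1)                # sample SD of the differences
t_manual = d_bar / (s_d / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-sided p-value

# Library call; the sign of t follows the argument order (before - after).
result = stats.ttest_rel(before, after)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

The two lines should agree up to floating-point error. Swapping the arguments flips the sign of t but leaves the two-sided p-value unchanged.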

The paired t-test shows up constantly in ML contexts, beyond the obvious before-and-after scenarios:

  • Model version comparison: Compare predictions from baseline vs. optimized model on the same test examples. Pair the errors per example, then test if the mean difference is significant.
  • Algorithm comparison across datasets: Model A and B evaluated on the same 10 benchmark datasets. Pair the performance by dataset.
  • Feature engineering evaluation: Same model trained with and without a feature, evaluated on the same data splits. With cross-validation, pair the scores by fold.
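The first scenario in the list, pairing errors per test example, might look like the sketch below. All data are simulated, and the names (pred_baseline, pred_optimized, example_noise) are hypothetical; in practice the predictions would come from your two model versions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_examples = 200

# Simulated regression targets for a shared test set.
y_true = rng.normal(50.0, 10.0, size=n_examples)

# Some examples are harder for both models: a shared per-example error component.
example_noise = rng.normal(0.0, 4.0, size=n_examples)

# Each model adds its own error on top (assumed smaller for the optimized model).
pred_baseline = y_true + example_noise + rng.normal(0.0, 3.0, size=n_examples)
pred_optimized = y_true + example_noise + rng.normal(0.0, 2.0, size=n_examples)

# Per-example absolute errors, aligned by test example.
err_baseline = np.abs(pred_baseline - y_true)
err_optimized = np.abs(pred_optimized - y_true)

# Paired test on the per-example error differences.
diff = err_baseline - err_optimized
result = stats.ttest_rel(err_baseline, err_optimized)

print(f"mean error reduction = {diff.mean():.3f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

The same pattern applies to the other two scenarios: replace the per-example errors with one score per dataset or per fold, keeping the two arrays in the same order.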

Pairing removes between-subject variability from the error term. If different test folds vary a lot in difficulty, the paired test adjusts for that. An independent test wouldn't — it would treat all that fold variability as error, reducing power.
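A small simulation can illustrate this. The sketch below assumes 10 folds whose difficulty varies widely, with a small, consistent gap between two models; every number is invented for illustration. It runs both tests on the same scores and compares their standard errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_folds = 10

# Fold difficulty varies a lot and is shared by both models on a given fold.
fold_difficulty = rng.normal(loc=0.80, scale=0.08, size=n_folds)

# Model B is assumed better by 0.02 on average, with small model-specific noise.
score_a = fold_difficulty + rng.normal(0.0, 0.01, size=n_folds)
score_b = fold_difficulty + 0.02 + rng.normal(0.0, 0.01, size=n_folds)

# Same data, two analyses.
paired = stats.ttest_rel(score_b, score_a)
welch = stats.ttest_ind(score_b, score_a, equal_var=False)

# Standard errors behind each statistic.
se_paired = (score_b - score_a).std(ddof=1) / np.sqrt(n_folds)
se_welch = np.sqrt(score_a.var(ddof=1) / n_folds + score_b.var(ddof=1) / n_folds)

print(f"paired: SE = {se_paired:.4f}, t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"Welch:  SE = {se_welch:.4f}, t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

Both tests share the same numerator (the mean score difference), so any gap between the two t-statistics comes entirely from the standard error in the denominator.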

Paired vs. Independent Explorer (interactive)

Each row is one subject. Gray dot = before, colored dot = after. Lines connect paired values.

Test results on the same data:

  • Paired t-test (more power): SE = 0.0000, t-statistic = Infinity, df = 9, p-value < 0.001. Significant (p < 0.05).
  • Independent t-test (Welch's): SE = 3.5431, t-statistic = -1.693, df = 18.0, p-value = 0.1076. Not significant (p ≥ 0.05).

Why the difference?

The paired test computes differences within each pair, eliminating between-subject variability from the error term. When subjects vary a lot from each other (high individual variation), this deflates the SE and amplifies the t-statistic. When subjects are nearly identical, pairing provides little advantage.
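The variance of a difference makes this explicit. As a sketch, assume the two measurements have standard deviations $\sigma_1$ and $\sigma_2$ and correlation $\rho$ across subjects (symbols introduced here for illustration; they do not appear elsewhere in this section):

```latex
\operatorname{Var}(d) = \operatorname{Var}(x_1 - x_2)
                      = \sigma_1^2 + \sigma_2^2 - 2\rho\,\sigma_1\sigma_2

\mathrm{SE}_{\text{paired}} = \sqrt{\frac{\sigma_1^2 + \sigma_2^2 - 2\rho\,\sigma_1\sigma_2}{n}}
\qquad
\mathrm{SE}_{\text{independent}} = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{n}}
```

Large differences between subjects push $\rho$ toward 1, so the subtracted term removes most of the variance and the paired SE shrinks. When $\rho$ is near 0 the two standard errors are about equal, and the paired test also has fewer degrees of freedom ($n - 1$), so pairing buys little.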

Data summary: mean before = 67.90, mean after = 73.90, mean difference (d̄) = 6.000.

Compare the same dataset analyzed as paired vs. independent. See how the paired test has more power when individual variation is large.

Checkpoint

You compare two NLP models using 5-fold cross-validation on the same dataset. For each fold, you record Model A's F1 score and Model B's F1 score. Which test should you use to determine if one model is significantly better?