The Paired Samples t-Test

Question: Is there a significant difference between the means of two related measurements?

Use it when: The same subjects are measured under two conditions (before/after, treatment/control on same subjects). Pairing controls for individual variability — this is where paired tests get their statistical power advantage over independent tests.

Test statistic:

$t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}}$

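where $\bar{d}$ is the mean of the paired differences, $\mu_d$ is the hypothesized mean difference (0 under the null), $s_d$ is the sample SD of the differences, and $n$ is the number of pairs. Under the null, $t$ follows a t-distribution with $n - 1$ degrees of freedom.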

In Python: scipy.stats.ttest_rel(before, after)
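
As a minimal sketch of how the formula and the SciPy call relate, the example below computes the statistic by hand and compares it with `scipy.stats.ttest_rel`. The `before` and `after` arrays are made-up illustrative values, not data from this lesson. Note that `ttest_rel(a, b)` tests the mean of `a - b`, so the argument order only flips the sign of $t$; the two-sided p-value is unchanged.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for 8 subjects (illustrative values only).
before = np.array([72.0, 65.0, 80.0, 58.0, 91.0, 77.0, 69.0, 84.0])
after = np.array([75.0, 70.0, 79.0, 66.0, 95.0, 78.0, 75.0, 88.0])

# Paired differences, one per subject.
d = after - before
n = d.size

d_bar = d.mean()          # mean of the differences
s_d = d.std(ddof=1)       # sample SD of the differences (n - 1 in the denominator)
mu_d = 0.0                # hypothesized mean difference under the null

# t = (d_bar - mu_d) / (s_d / sqrt(n)), with n - 1 degrees of freedom.
t_manual = (d_bar - mu_d) / (s_d / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

# Pass `after` first so SciPy's sign convention matches d = after - before.
result = stats.ttest_rel(after, before)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

Calling `stats.ttest_rel(before, after)` instead, as written above, gives the same p-value with the sign of the statistic reversed.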

The paired t-test shows up constantly in ML contexts, beyond the obvious before-and-after scenarios:

Model version comparison: Compare predictions from baseline vs. optimized model on the same test examples. Pair the errors per example, then test if the mean difference is significant.
Algorithm comparison across datasets: Models A and B are evaluated on the same 10 benchmark datasets. Pair the performance scores by dataset.
Feature engineering evaluation: The same model is trained with and without a feature and evaluated on the same held-out data. When using cross-validation, pair the scores by fold.

Pairing removes between-subject variability from the error term. If different test folds vary a lot in difficulty, the paired test adjusts for that. An independent test wouldn't — it would treat all that fold variability as error, reducing power.
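
The effect of fold difficulty can be sketched with simulated data. Everything in the example below is an assumption made for illustration: the number of folds, the per-fold baseline scores, the size of Model B's improvement, and the noise levels are invented, and the names `score_a` and `score_b` are hypothetical. It runs both tests on the same simulated scores so the two results can be compared.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_folds = 10

# Simulated per-fold difficulty: folds differ a lot from one another.
fold_baseline = rng.normal(loc=0.80, scale=0.05, size=n_folds)

# Both models share each fold's baseline; Model B adds a small, consistent gain.
score_a = fold_baseline + rng.normal(0.0, 0.005, size=n_folds)
score_b = fold_baseline + 0.02 + rng.normal(0.0, 0.005, size=n_folds)

# Paired test: works on the per-fold differences, so fold difficulty cancels out.
paired = stats.ttest_rel(score_b, score_a)

# Welch's independent test: treats fold-to-fold variation as noise.
welch = stats.ttest_ind(score_b, score_a, equal_var=False)

print(f"paired: t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

Both tests see the same mean difference. Only the standard error in the denominator changes, which is why the paired statistic is larger in a setup like this one.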

Paired vs. Independent Explorer

[Interactive figure: each row is one subject. Gray dot = before, colored dot = after; lines connect paired values.]

Test results on the same data

Paired t-test (more power): SE = 0.0000, t-statistic = Infinity, df = 9.0, p-value < 0.001. Significant (p < 0.05).
Independent t-test (Welch's): SE = 3.5431, t-statistic = -1.693, df = 18.0, p-value = 0.1076. Not significant (p ≥ 0.05).

An SE of 0.0000 means the paired differences are essentially identical across subjects in this scenario, so the paired t-statistic is reported as Infinity.

Why the difference?

The paired test computes differences within each pair, eliminating between-subject variability from the error term. When subjects vary a lot from each other (high individual variation), this deflates the SE and amplifies the t-statistic. When subjects are nearly identical, pairing provides little advantage.
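
One way to make this concrete, as a sketch using sample statistics: write $s_1$ and $s_2$ for the sample SDs of the before and after values and $r$ for their sample correlation (symbols introduced here for illustration). With $n$ pairs, the two standard errors are

$SE_{\text{paired}} = \frac{s_d}{\sqrt{n}} = \sqrt{\frac{s_1^2 + s_2^2 - 2 r s_1 s_2}{n}}$

$SE_{\text{Welch}} = \sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{n}}$

They differ only by the $-2 r s_1 s_2$ term. When subjects differ a lot from each other but respond similarly, $r$ is close to 1 and the paired SE shrinks. When $r$ is near 0 the two SEs are about equal, and the paired test has fewer degrees of freedom ($n - 1$ versus at most $2n - 2$), so pairing gains nothing.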

Mean before = 67.90, mean after = 73.90, mean difference (d̄) = 6.000.

Compare the same dataset analyzed as paired vs. independent. See how the paired test has more power when individual variation is large.

Checkpoint

You compare two NLP models using 5-fold cross-validation on the same dataset. For each fold, you record Model A's F1 score and Model B's F1 score. Which test should you use to determine if one model is significantly better?
