Post Hoc Tests

When ANOVA says "some group differs," post hoc tests tell you which groups differ — while controlling the family-wise error rate across all pairwise comparisons.

Tukey's HSD (Honestly Significant Difference)

The most commonly used post hoc test. Compares all possible pairs of group means while controlling the family-wise Type I error rate. Assumes equal variances across groups and approximately normal data within each group.

How it works: Computes an HSD value from the critical q value of the studentized range distribution. If the absolute difference between two group means exceeds HSD, those two groups are significantly different.

In Python: statsmodels.stats.multicomp.pairwise_tukeyhsd(values, groups)
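As a minimal sketch of that call, the example below runs the test on a small made-up dataset: three hypothetical groups of 10 observations each, built so the sample means are exactly 40, 50, and 60. The data, group labels, and variable names are illustrative assumptions, and the sketch assumes numpy and statsmodels are installed.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical data: fixed offsets around each group mean. The offsets sum
# to zero, so the sample means are exactly 40, 50, and 60.
offsets = np.array([-6, -4, -3, -1, 0, 0, 1, 3, 4, 6], dtype=float)
group_means = {"A": 40.0, "B": 50.0, "C": 60.0}

values = np.concatenate([mean + offsets for mean in group_means.values()])
groups = np.repeat(list(group_means.keys()), len(offsets))

# Compare every pair of groups at a family-wise alpha of 0.05.
result = pairwise_tukeyhsd(values, groups, alpha=0.05)

print(result.summary())  # table of pairwise differences, intervals, and decisions
print(result.reject)     # one boolean per pair: True means "significantly different"
```

The `values` and `groups` arrays must have the same length, with `groups[i]` labeling the group that `values[i]` belongs to.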

Default choice for post hoc testing after a significant ANOVA.

Tukey's HSD Explorer

[Interactive figure: group distributions for Group A, Group B, and Group C. Vertical bar = group mean. Each dot is one observation.]

Default settings:

  • Group A: mean 40, spread 5
  • Group B: mean 50, spread 5
  • Group C: mean 60, spread 5
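HSD Formula

HSD = q(α, k, df_W) × √(MSE / n)

Here k is the number of groups, df_W is the within-group degrees of freedom, MSE is the within-group mean squared error from the ANOVA, and n is the number of observations per group.

With q = 3.510, MSE = 11.32, n = 10, and df_W = 27: HSD = 3.73.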

Any pairwise |Δ mean| > 3.73 is significant at α = 0.05.
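The arithmetic behind that threshold can be checked with a short sketch. It uses the numbers quoted above (q = 3.510, MSE = 11.32, n = 10) and the group means 40, 50, and 60. The function and variable names are made up for illustration, and q is taken as given rather than computed from the studentized range distribution.

```python
import math
from itertools import combinations


def tukey_hsd_threshold(q_crit: float, mse: float, n_per_group: int) -> float:
    """HSD = q * sqrt(MSE / n), assuming equal group sizes."""
    return q_crit * math.sqrt(mse / n_per_group)


# Values quoted in the text; q is read off as given, not computed here.
q_crit = 3.510
mse = 11.32
n_per_group = 10
group_means = {"Group A": 40.0, "Group B": 50.0, "Group C": 60.0}

hsd = tukey_hsd_threshold(q_crit, mse, n_per_group)
print(f"HSD = {hsd:.2f}")

# A pair is significant when its absolute mean difference exceeds HSD.
decisions = {}
for (name_a, mean_a), (name_b, mean_b) in combinations(group_means.items(), 2):
    diff = abs(mean_a - mean_b)
    decisions[(name_a, name_b)] = diff > hsd
    label = "significant" if diff > hsd else "not significant"
    print(f"{name_a} vs {name_b}: |diff| = {diff:.2f} -> {label}")
```

Because the threshold grows with MSE and shrinks with n, noisier groups or smaller samples raise the bar a mean difference has to clear.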

Pairwise Comparisons (3 / 3 significant)

  • Group A vs. Group B: |Δ mean| = 10.00, HSD = 3.73 (significant)
  • Group A vs. Group C: |Δ mean| = 20.00, HSD = 3.73 (significant)
  • Group B vs. Group C: |Δ mean| = 10.00, HSD = 3.73 (significant)

Adjust group means and spread to see how Tukey's HSD threshold responds. The bar chart shows each pairwise mean difference against the HSD cutoff — when the bar exceeds the marker, that pair is significant at α = 0.05.

Other Options

  • Bonferroni-adjusted pairwise comparisons: Conduct pairwise t-tests and adjust the significance threshold to α/m, where m is the number of comparisons. Controls the family-wise error rate and is simple to apply, though it can be conservative when many comparisons are performed.
  • Scheffé's test: The most conservative option. Controls the family-wise error rate across all possible contrasts (not just pairwise comparisons), including complex ones. Use it when you want to test comparisons you didn't pre-specify.
  • Duncan's new multiple range test: Less conservative than Tukey's HSD, and therefore more prone to Type I errors. Less commonly used today.
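
The Bonferroni adjustment in the first bullet is simple enough to sketch directly. For k groups there are m = k(k − 1)/2 pairwise comparisons, and each one is tested against α/m. The p-values below are invented purely for illustration, and the function name is hypothetical.

```python
from math import comb


def bonferroni_threshold(alpha: float, n_groups: int):
    """Return (m, alpha / m), where m is the number of pairwise comparisons."""
    m = comb(n_groups, 2)
    return m, alpha / m


m, threshold = bonferroni_threshold(alpha=0.05, n_groups=3)
print(f"m = {m}, per-comparison threshold = {threshold:.4f}")

# Hypothetical unadjusted p-values from three pairwise t-tests (made up).
raw_p_values = {("A", "B"): 0.030, ("A", "C"): 0.004, ("B", "C"): 0.200}

# A comparison is significant only if its raw p-value falls below alpha / m.
significant = {pair: p < threshold for pair, p in raw_p_values.items()}
print(significant)
```

Note that the A vs. B comparison (p = 0.030) would pass an unadjusted 0.05 cutoff but fails the adjusted one, which is the conservatism the bullet describes.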

Example: Three Recommendation Algorithms

You're comparing user ratings for collaborative filtering, content-based filtering, and matrix factorization. ANOVA returns p = 0.012 — at least one algorithm differs. You run Tukey's HSD and find:

  • Collaborative filtering vs. content-based: p = 0.021 (significant)
  • Collaborative filtering vs. matrix factorization: p = 0.008 (significant)
  • Content-based vs. matrix factorization: p = 0.42 (not significant)

Conclusion: matrix factorization and content-based filtering are comparable to each other, but collaborative filtering differs from both.
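
The step from the three p-values to that conclusion can be written out as a small sketch. It uses only the p-values reported in this example and α = 0.05. The variable names and the "stands apart" rule (an algorithm that differs significantly from every other one) are illustrative assumptions.

```python
alpha = 0.05

# Tukey-adjusted p-values reported in the example.
tukey_p_values = {
    ("collaborative filtering", "content-based"): 0.021,
    ("collaborative filtering", "matrix factorization"): 0.008,
    ("content-based", "matrix factorization"): 0.42,
}

# Pairs whose adjusted p-value falls below alpha.
significant_pairs = [pair for pair, p in tukey_p_values.items() if p < alpha]

algorithms = sorted({name for pair in tukey_p_values for name in pair})

# An algorithm "stands apart" if it differs significantly from every other one.
stands_apart = [
    algo
    for algo in algorithms
    if all(
        any(algo in pair and other in pair for pair in significant_pairs)
        for other in algorithms
        if other != algo
    )
]

print(significant_pairs)
print(stands_apart)
```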

💭 Reflection

You're comparing five ML models on 10 benchmark datasets. You want to know if any models perform significantly differently. Walk through the full analysis plan: what test do you start with, what do you do if it's significant, and what corrections apply?