Post Hoc Tests
When ANOVA says "some group differs," post hoc tests tell you which groups differ — while controlling the family-wise error rate across all pairwise comparisons.
Tukey's HSD (Honestly Significant Difference)
The most commonly used post hoc test. Compares all possible pairs of group means while controlling family-wise Type 1 error. Assumes independent observations, equal variances across groups, and approximately normal data within each group.
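How it works: Computes a single threshold, HSD = q × sqrt(MS_within / n), where q is the critical value of the studentized range distribution for k groups and the error degrees of freedom, MS_within is the within-group mean square from the ANOVA, and n is the number of observations per group. If the absolute difference between two group means exceeds HSD, those groups are significantly different.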
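The threshold calculation can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes equal group sizes, uses `scipy.stats.studentized_range` (available in SciPy 1.7 and later) for the critical q value, and the function name `tukey_hsd_threshold` and the data are made up for the example.

```python
import numpy as np
from scipy.stats import studentized_range


def tukey_hsd_threshold(groups, alpha=0.05):
    """Return the HSD cutoff for equally sized groups (balanced design)."""
    k = len(groups)                      # number of groups
    n = len(groups[0])                   # observations per group
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    df_within = k * (n - 1)              # error degrees of freedom, N - k
    # Pooled within-group variance (the ANOVA mean square error).
    ss_within = sum(
        ((np.asarray(g, dtype=float) - np.mean(g)) ** 2).sum() for g in groups
    )
    mse = ss_within / df_within
    # Critical value of the studentized range distribution.
    q_crit = studentized_range.ppf(1 - alpha, k, df_within)
    return q_crit * np.sqrt(mse / n)


# Made-up data: three groups of five observations each.
example_groups = [
    [9, 10, 10, 10, 11],
    [11, 12, 12, 12, 13],
    [14, 15, 15, 15, 16],
]
hsd = tukey_hsd_threshold(example_groups)
means = [np.mean(g) for g in example_groups]
for i in range(len(means)):
    for j in range(i + 1, len(means)):
        diff = abs(means[i] - means[j])
        print(f"group {i} vs group {j}: |diff| = {diff:.2f}, significant = {diff > hsd}")
```

For unequal group sizes the usual adjustment is the Tukey-Kramer variant, which library implementations handle for you.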
In Python: statsmodels.stats.multicomp.pairwise_tukeyhsd(values, groups)
Default choice for post hoc testing after a significant ANOVA.
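A short usage sketch of the statsmodels call, assuming statsmodels and NumPy are installed. The observations and the group labels A, B, and C are invented for illustration; `values` holds one number per observation and `groups` holds the matching group label.

```python
from itertools import combinations

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up observations for three groups (illustration only).
values = np.array(
    [10, 11, 12, 11, 10,    # group A
     10, 12, 11, 10, 11,    # group B
     20, 21, 19, 20, 22],   # group C
    dtype=float,
)
groups = np.array(["A"] * 5 + ["B"] * 5 + ["C"] * 5)

result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result.summary())  # one table row per pair of groups

# The same information as arrays: pairs follow the order of the sorted labels.
pairs = list(combinations(result.groupsunique, 2))
for (g1, g2), diff, reject in zip(pairs, result.meandiffs, result.reject):
    print(f"{g1} vs {g2}: mean difference = {diff:.2f}, reject H0 = {reject}")
```

Here `meandiffs` is the second group's mean minus the first's, and `reject` flags the pairs that differ at the chosen α.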
Interactive figure: Tukey's HSD threshold as group means and spread change. Each dot is one observation and the vertical bar marks the group mean. A bar chart compares each pairwise mean difference against the HSD cutoff; when a bar exceeds the marker, that pair is significant at α = 0.05. In the state captured here, the cutoff is 3.73, so any pairwise |Δ mean| > 3.73 is significant.
Other Options
- Bonferroni-adjusted pairwise comparisons: Conduct pairwise t-tests and adjust the significance threshold to α/m, where m is the number of comparisons. Controls the family-wise error rate and is simple to apply, though it can be conservative when many comparisons are performed.
- Scheffé's test: Most conservative. Controls for all possible comparisons (not just pairwise), including complex contrasts. Use when you want to make comparisons you didn't pre-specify.
- Duncan's new multiple range test: Less conservative than Tukey's HSD, so it has more power but does not hold the family-wise error rate at α and is more prone to Type 1 errors. Less commonly used today.
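As a small illustration of the Bonferroni option above, the sketch below applies the α/m threshold to a set of unadjusted pairwise p-values. The p-values, the pair labels, and the helper name `bonferroni_decisions` are hypothetical; in practice the raw p-values would come from pairwise t-tests.

```python
def bonferroni_decisions(raw_p_values, alpha=0.05):
    """Compare each raw p-value against alpha / m, where m is the number of comparisons."""
    m = len(raw_p_values)
    threshold = alpha / m
    decisions = {pair: p < threshold for pair, p in raw_p_values.items()}
    return decisions, threshold


# Hypothetical unadjusted p-values from three pairwise t-tests.
raw_p = {
    ("A", "B"): 0.030,
    ("A", "C"): 0.004,
    ("B", "C"): 0.250,
}
decisions, threshold = bonferroni_decisions(raw_p)
print(f"per-comparison threshold: {threshold:.4f}")
for pair, significant in decisions.items():
    print(pair, "significant" if significant else "not significant")
```

Note that 0.030 would pass an uncorrected 0.05 threshold but fails the corrected one, which is the conservatism the bullet describes.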
Example: Three Recommendation Algorithms
You're comparing user ratings for collaborative filtering, content-based filtering, and matrix factorization. ANOVA returns p = 0.012 — at least one algorithm differs. You run Tukey's HSD and find:
- Collaborative filtering vs. content-based: p = 0.021 (significant)
- Collaborative filtering vs. matrix factorization: p = 0.008 (significant)
- Content-based vs. matrix factorization: p = 0.42 (not significant)
Conclusion: matrix factorization and content-based filtering are comparable to each other, but collaborative filtering differs from both.
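The two-step workflow in this example (omnibus ANOVA first, post hoc test only if it is significant) might look like the sketch below. The ratings are synthetic and chosen only to produce the same pattern of conclusions; they will not reproduce the p-values quoted above. It assumes SciPy 1.8 or later for `scipy.stats.tukey_hsd`.

```python
from scipy.stats import f_oneway, tukey_hsd

# Synthetic user ratings (illustration only, not real data).
ratings = {
    "collaborative": [4.5, 4.7, 4.6, 4.8, 4.4, 4.6],
    "content_based": [3.9, 4.0, 4.1, 3.8, 4.0, 4.2],
    "matrix_factorization": [4.0, 3.9, 4.1, 4.0, 3.8, 4.2],
}
names = list(ratings)
samples = [ratings[name] for name in names]

alpha = 0.05
anova = f_oneway(*samples)
print(f"ANOVA p-value: {anova.pvalue:.4g}")

if anova.pvalue < alpha:
    # Only look at individual pairs once the omnibus test is significant.
    posthoc = tukey_hsd(*samples)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            p = posthoc.pvalue[i, j]
            verdict = "significant" if p < alpha else "not significant"
            print(f"{names[i]} vs {names[j]}: p = {p:.4g} ({verdict})")
```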
Practice question: You're comparing five ML models on 10 benchmark datasets. You want to know if any models perform significantly differently. Walk through the full analysis plan: what test do you start with, what do you do if it's significant, and what corrections apply?