Post Hoc Tests

When ANOVA says "some group differs," post hoc tests tell you which groups differ — while controlling the family-wise error rate across all pairwise comparisons.

Tukey's HSD (Honestly Significant Difference)

The most commonly used post hoc test. It compares all possible pairs of group means while controlling the family-wise Type 1 error rate. It assumes equal variances across groups and approximately normally distributed data.

How it works: The test computes a single threshold, the HSD value, from the critical q value of the studentized range distribution, the within-group mean square, and the group size. If the absolute difference between two group means exceeds HSD, those groups are significantly different.

In Python: statsmodels.stats.multicomp.pairwise_tukeyhsd(values, groups)

Default choice for post hoc testing after a significant ANOVA.
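As a rough illustration of the statsmodels call named above, the sketch below runs Tukey's HSD on three groups of ten observations. The data are hypothetical values made up for this example (group means of 40, 50, and 60), and the variable names are assumptions, not part of the original material.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical observations: 10 per group, with group means of 40, 50, and 60.
base = np.array([38, 42, 35, 44, 40, 41, 37, 43, 39, 41], dtype=float)
values = np.concatenate([base, base + 10, base + 20])

# One group label per observation, in the same order as `values`.
groups = np.repeat(["A", "B", "C"], len(base))

# Compare every pair of groups at a family-wise alpha of 0.05.
result = pairwise_tukeyhsd(values, groups, alpha=0.05)

# The summary table lists each pair with its mean difference,
# adjusted p-value, confidence interval, and reject decision.
print(result.summary())

# The per-pair decisions and mean differences are also available as arrays.
print(result.reject)
print(result.meandiffs)
```

In practice this step would follow a significant one-way ANOVA on the same data; the ANOVA itself is left out here to keep the sketch short.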

Tukey's HSD Explorer

Group distributions

Vertical bar = group mean. Each dot is one observation.

Group A: mean 40, spread 5
Group B: mean 50, spread 5
Group C: mean 60, spread 5

HSD Formula

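HSD = q_{α,k,df_W} × √(MSE / n)

where q_{α,k,df_W} is the critical value of the studentized range distribution for k groups and df_W within-group degrees of freedom, MSE is the within-group mean square from the ANOVA, and n is the number of observations per group.

q = 3.510, MSE = 11.32, n = 10, df_W = 27, HSD = 3.73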

Any pairwise |Δ mean| > 3.73 is significant at α = 0.05.
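The arithmetic behind this threshold can be checked in a few lines. This is a minimal sketch that takes the critical q value as given (3.510, as reported above) instead of computing it from the studentized range distribution. The function name tukey_hsd_threshold is hypothetical, and the formula as written assumes equal group sizes.

```python
import math


def tukey_hsd_threshold(q_crit, mse, n_per_group):
    """Return HSD = q * sqrt(MSE / n) for groups of equal size."""
    return q_crit * math.sqrt(mse / n_per_group)


# Values reported in the explorer (alpha = 0.05, k = 3 groups, df_W = 27).
hsd = tukey_hsd_threshold(q_crit=3.510, mse=11.32, n_per_group=10)

# Group means from the explorer settings.
group_means = {"Group A": 40.0, "Group B": 50.0, "Group C": 60.0}

# Compare each absolute pairwise mean difference against the threshold.
names = list(group_means)
comparisons = {}
for i, first in enumerate(names):
    for second in names[i + 1:]:
        diff = abs(group_means[first] - group_means[second])
        comparisons[(first, second)] = (diff, diff > hsd)

print(f"HSD = {hsd:.2f}")
for (first, second), (diff, is_significant) in comparisons.items():
    label = "significant" if is_significant else "not significant"
    print(f"{first} vs {second}: |diff| = {diff:.2f} ({label})")
```

Because MSE sits under the square root and n divides it, a smaller within-group spread or a larger group size lowers the threshold and makes the same mean difference easier to detect.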

Pairwise Comparisons: 3 / 3 significant

Group A vs Group B: |Δ mean| = 10.00, HSD = 3.73, significant
Group A vs Group C: |Δ mean| = 20.00, HSD = 3.73, significant
Group B vs Group C: |Δ mean| = 10.00, HSD = 3.73, significant

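In the interactive explorer, changing the spread changes MSE and therefore the HSD threshold, while changing the group means changes the pairwise differences that are compared against it. The bar chart shows each pairwise mean difference against the HSD cutoff: when the bar exceeds the marker, that pair is significant at α = 0.05.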

Other Options

Bonferroni-adjusted pairwise comparisons: Conduct pairwise t-tests and adjust the significance threshold to α/m, where m is the number of comparisons. Controls the family-wise error rate and is simple to apply, though it can be conservative when many comparisons are performed.
Scheffé's test: Most conservative. Controls for all possible comparisons (not just pairwise), including complex contrasts. Use when you want to make comparisons you didn't pre-specify.
Duncan's new multiple range test: Less conservative than Tukey's. More prone to Type 1 errors. Less commonly used today.
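To make the Bonferroni adjustment concrete, the sketch below computes the number of pairwise comparisons m for k groups and the resulting per-comparison threshold α/m. The function name bonferroni_threshold is hypothetical, and the sketch assumes that every pair of groups is compared.

```python
import math


def bonferroni_threshold(alpha, n_groups):
    """Return (m, alpha / m), where m is the number of pairwise comparisons."""
    m = math.comb(n_groups, 2)  # k * (k - 1) / 2 pairs
    return m, alpha / m


# Each raw pairwise p-value must fall below alpha / m to count as significant.
for k in (3, 5, 10):
    m, threshold = bonferroni_threshold(alpha=0.05, n_groups=k)
    print(f"{k} groups: m = {m} comparisons, threshold = {threshold:.5f}")
```

The threshold shrinks quickly as groups are added, which is why the adjustment becomes conservative when many comparisons are performed.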

◆

Example: Three Recommendation Algorithms

You're comparing user ratings for collaborative filtering, content-based filtering, and matrix factorization. ANOVA returns p = 0.012 — at least one algorithm differs. You run Tukey's HSD and find:

Collaborative filtering vs. content-based: p = 0.021 (significant)
Collaborative filtering vs. matrix factorization: p = 0.008 (significant)
Content-based vs. matrix factorization: p = 0.42 (not significant)

Conclusion: collaborative filtering differs from both other algorithms, while the data show no significant difference between content-based filtering and matrix factorization.
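The reasoning in this example can be written out as a short sketch. It uses the three p-values quoted above and assumes α = 0.05. Tukey's p-values are already adjusted for multiple comparisons, so they are compared to α directly. The variable names are hypothetical.

```python
ALPHA = 0.05

# Tukey-adjusted p-values from the example.
tukey_p = {
    ("collaborative", "content-based"): 0.021,
    ("collaborative", "matrix factorization"): 0.008,
    ("content-based", "matrix factorization"): 0.42,
}

# A pair differs significantly when its adjusted p-value is below alpha.
significant = {pair: p < ALPHA for pair, p in tukey_p.items()}

# Find any algorithm that differs significantly from every other one.
algorithms = sorted({name for pair in tukey_p for name in pair})
differs_from_all = [
    name
    for name in algorithms
    if all(sig for pair, sig in significant.items() if name in pair)
]

for pair, sig in significant.items():
    label = "significant" if sig else "not significant"
    print(f"{pair[0]} vs {pair[1]}: {label}")
print("Differs from all others:", differs_from_all)
```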

💭 Reflection

You're comparing five ML models on 10 benchmark datasets. You want to know if any models perform significantly differently. Walk through the full analysis plan: what test do you start with, what do you do if it's significant, and what corrections apply?
