P-Hacking and Why α Comes First

Here's a problem you have to actively guard against. Suppose you collect your data, look at it, decide you'd love the result to be significant, and start shopping for a threshold. You try α = 0.05, but your p of 0.07 doesn't make the cut. You think, "0.10 is sometimes used in exploratory research," and switch to α = 0.10. Now you have significance.

This is p-hacking: bending the rules after the fact to get the result you want. It's not always this brazen. Sometimes it's running a dozen statistical tests and only reporting the one that came out significant. Sometimes it's slicing the data by every demographic until one slice shows an effect. Sometimes it's stopping data collection the moment a result crosses the significance threshold.
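To see why "run a dozen tests and report the winner" is a problem, here is a minimal simulation sketch. It assumes every null hypothesis is true, the tests are independent, and each one is a two-sided z-test with known standard deviation 1. The function names, sample sizes, and seed are illustrative choices, not part of any particular library.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p(sample):
    """Two-sided z-test p-value for H0: mean = 0, assuming known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def any_significant_rate(n_tests=12, n_per_test=30, alpha=0.05,
                         n_sims=2000, seed=0):
    """Fraction of simulated studies where at least one null test has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Every sample is pure noise: the true mean is exactly 0.
        p_values = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n_per_test)])
            for _ in range(n_tests)
        ]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_sims


simulated = any_significant_rate()
analytic = 1 - (1 - 0.05) ** 12  # chance of at least one false positive
print(f"simulated: {simulated:.3f}, analytic: {analytic:.3f}")
```

Under these assumptions, the chance of at least one "significant" result among 12 tests is 1 - 0.95^12, or about 46%, even though no real effect exists anywhere. Reporting only the test that crossed the threshold hides that inflation from the reader.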

The Defense: Discipline Before Data

  • Choose α before you look at the data.
  • Decide on your test before you collect.
  • Pre-register your analysis plan — even informally, by writing it down before running anything.
  • If you explore the data and try multiple things, that's fine, but be honest about it. Treat exploratory results as hypothesis-generating, not confirmatory, and correct for multiple comparisons (for example, with a Bonferroni or Holm adjustment) whenever you run more than one test.
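As one way to act on the last point, the sketch below implements two standard multiple-comparison corrections, Bonferroni and Holm, in plain Python. The p-values are made up for illustration, and the function names are hypothetical helpers rather than a library API.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / m, with m the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: test the smallest p first, relaxing the threshold each step."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


p_values = [0.001, 0.013, 0.04, 0.20]  # made-up results from four tests
print(bonferroni(p_values))  # [True, False, False, False]
print(holm(p_values))        # [True, True, False, False]
```

Three of these four p-values fall below an uncorrected α of 0.05, but only one survives Bonferroni and two survive Holm. Both methods control the family-wise error rate; Holm is never less powerful than Bonferroni, which is why it is often a reasonable default when the number of tests is small.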

In industry, this pressure is real. Your manager wants a result. Your stakeholder wants justification for a decision they've already made. The temptation to "find" significance is constant. The antidote is to set your statistical plan before your data arrives — and to document that you did.
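One lightweight way to document the plan is to write it down as structured data and record a fingerprint of it before any data arrives. The sketch below is only an illustration: the field names and values in `plan` are hypothetical, and `freeze_plan` and `matches_plan` are made-up helpers, not an established tool.

```python
import hashlib
import json
from datetime import datetime, timezone


def _fingerprint(plan):
    """SHA-256 of the plan serialized in a canonical (sorted-key) form."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def freeze_plan(plan):
    """Return a timestamped record of the analysis plan and its fingerprint."""
    return {
        "plan": plan,
        "sha256": _fingerprint(plan),
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def matches_plan(record, plan):
    """True if `plan` is identical to the one that was frozen in `record`."""
    return _fingerprint(plan) == record["sha256"]


# Hypothetical plan for an A/B test; every value here is an example.
plan = {
    "hypothesis": "new ranking model changes click-through rate",
    "primary_metric": "click_through_rate",
    "test": "two-sided Welch t-test",
    "alpha": 0.05,
    "sample_size_per_arm": 5000,
    "stopping_rule": "analyze once, after the full sample is collected",
}

record = freeze_plan(plan)
print(json.dumps(record, indent=2))
```

A locally generated timestamp proves little on its own, so the record is most useful when it is stored somewhere with an independent history, such as a commit in version control or an email to a colleague. Later, `matches_plan(record, plan)` returns False if α or any other field was quietly changed after the fact.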

💭Reflection

Describe a scenario where p-hacking could be tempting in an industry ML context. What pressures would create the temptation, and what safeguards would you put in place?