P-Values, Carefully

The p-value is the probability of obtaining test results at least as extreme as your observed results, assuming the null hypothesis is true. It ranges from 0 to 1. Smaller p-values indicate stronger evidence against the null.
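As a minimal sketch of what "at least as extreme, assuming the null is true" means in practice, the block below computes an exact two-sided p-value for a hypothetical coin-flip experiment. The scenario (60 heads in 100 flips, with a fair coin as the null) and the function name `two_sided_binomial_p` are assumptions made for illustration; they are not part of the lesson.

```python
from math import comb


def two_sided_binomial_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value under the null that the coin is fair.

    "At least as extreme" is taken to mean: a head count at least as far
    from the expected count (flips / 2) as the observed one, in either
    direction.
    """
    expected = flips / 2
    observed_distance = abs(heads - expected)
    # Under the null, every sequence of flips is equally likely, so the
    # probability of k heads is comb(flips, k) / 2**flips.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )
    return extreme_sequences / 2 ** flips


# Hypothetical observation: 60 heads in 100 flips.
p = two_sided_binomial_p(heads=60, flips=100)
print(f"p = {p:.4f}")
```

The result is a statement about how often a fair coin would produce a count this far from 50. It is not the probability that the coin is fair.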

That definition is precise and commonly misread. Let's slow down on it.

Common Misinterpretations of the p-value

  • Wrong: "p = 0.03 means there's a 3% chance the null is true."
    Right: "If the null were true, we'd see results this extreme about 3% of the time." The p-value is about the data given the null — not about the null given the data.
  • Wrong: "p < 0.05 means the effect is real and important."
    Right: Statistical significance and practical significance are different. A tiny effect can have a tiny p-value if you have enough data.
  • Wrong: "p > 0.05 proves there's no effect."
    Right: Failing to reject the null means you lack sufficient evidence for an effect, not that no effect exists. The effect might be real, but your sample might be too small to detect it.
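The second and third points above can be sketched numerically. The block below holds a small difference in rates fixed and changes only the sample size, using a pooled two-proportion z-test with a normal approximation. The rates (10.0% vs. 10.1%), the group sizes, and the function name `two_proportion_p_value` are hypothetical values chosen for illustration.

```python
from math import erfc, sqrt


def two_proportion_p_value(rate_a: float, rate_b: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in proportions.

    Assumes equal group sizes, a pooled standard error, and a normal
    approximation to the sampling distribution.
    """
    pooled = (rate_a + rate_b) / 2.0  # pooled rate when groups are equal in size
    standard_error = sqrt(pooled * (1.0 - pooled) * 2.0 / n_per_group)
    z = (rate_b - rate_a) / standard_error
    # P(|Z| >= |z|) for a standard normal Z.
    return erfc(abs(z) / sqrt(2.0))


# Same observed rates every time; only the sample size changes.
results = {
    n: two_proportion_p_value(0.100, 0.101, n)
    for n in (1_000, 100_000, 10_000_000)
}
for n, p in results.items():
    print(f"n per group = {n:>10,}  p = {p:.3g}")
```

With the difference held at 0.1 percentage points, the p-value is large for the small samples and very small for the largest one. The effect size never changes, which is why a p-value alone says nothing about practical importance, and why a large p-value from a small sample does not show the effect is absent.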

Common significance thresholds (α): 0.05 is the conventional default in many fields; 0.01 is used when false positives are particularly costly; 0.10 is sometimes used in exploratory work. Choose α before looking at the data, so the threshold cannot be adjusted to fit the result.

Use p-values as one piece of evidence for decisions, not as proof. They are tools, not verdicts.

P-Value Visualizer

The shaded area is the p-value, the probability of observing a test statistic at least this extreme under the null hypothesis (standard normal). Drag the slider to see how the p-value changes.

[Interactive visualizer: standard normal curve from −4 to +4 with the p-value shown as the shaded area. Controls: test statistic z (0.00 to 4.00), tail, and significance level α. Default state: two-tailed, z = 1.96, p-value P(|Z| ≥ 1.96) = 0.0500, statistically significant at α = 0.05, reject H₀.]

Reminder: statistical significance does not imply practical importance. Always consider effect size alongside the p-value.

Drag the test statistic to see the p-value (shaded area) update in real time. Toggle between one- and two-tailed tests and change α to see how the rejection decision changes.
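For readers without access to the interactive version, the same calculation can be sketched in a few lines using only the Python standard library. The function names `p_value_from_z` and `decide`, the tail labels, and the convention of rejecting when p ≤ α are assumptions of this sketch, not a description of how the visualizer is implemented.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def p_value_from_z(z: float, tail: str = "two-sided") -> float:
    """p-value for a z statistic under a standard normal null distribution."""
    if tail == "two-sided":
        # Both tails: P(|Z| >= |z|), computed via symmetry.
        return 2.0 * STANDARD_NORMAL.cdf(-abs(z))
    if tail == "right":
        # Upper tail: P(Z >= z) equals P(Z <= -z) by symmetry.
        return STANDARD_NORMAL.cdf(-z)
    if tail == "left":
        # Lower tail: P(Z <= z).
        return STANDARD_NORMAL.cdf(z)
    raise ValueError("tail must be 'two-sided', 'right', or 'left'")


def decide(p: float, alpha: float = 0.05) -> str:
    """Rejection decision at significance level alpha (rejects when p <= alpha)."""
    return "reject H0" if p <= alpha else "fail to reject H0"


z = 1.96
for tail in ("two-sided", "right"):
    p = p_value_from_z(z, tail)
    print(f"{tail:>9}: p = {p:.4f}, "
          f"alpha=0.05 -> {decide(p, 0.05)}, alpha=0.01 -> {decide(p, 0.01)}")
```

Changing `z`, `tail`, or `alpha` here plays the role of the slider and toggles: the same test statistic can lead to different decisions depending on the tail and threshold chosen, which is another reason to fix both before looking at the data.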

Checkpoint

An A/B test shows that users who saw the new homepage design clicked "sign up" at a rate of 4.21% vs. 4.20% for the old design, with p = 0.003. What is the most appropriate conclusion?