Overview of Hypothesis Testing

Here's something that gets lost when we focus on shipping code: as a data scientist or ML engineer, you are doing science. You're forming hypotheses. You're designing experiments. You're drawing conclusions from evidence. Every time you tune hyperparameters, compare algorithms, or evaluate whether a new feature improved your model, you're doing hypothesis testing — whether you call it that or not.

Done well, hypothesis testing looks like this — six steps, in order, with no shortcuts.

Step 1: Formulate Hypotheses

You need two: a null hypothesis ($H_0$) and an alternative hypothesis ($H_1$).

The null is the default assumption of no difference or effect, and it is what you're trying to produce evidence against. The alternative is the claim you're trying to support.

Example: Testing a new image recognition algorithm against a baseline with 85% accuracy.

  • $H_0$: The new algorithm's average accuracy equals 85%.
  • $H_1$: The new algorithm's average accuracy is greater than 85%.
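
Written in symbols, the same pair might look like the following. Here $\mu$ stands for the new algorithm's true average accuracy; the symbol is introduced only for illustration and does not appear in the original example.

```latex
H_0 : \mu = 0.85 \qquad \text{versus} \qquad H_1 : \mu > 0.85
```

Because the alternative only looks in one direction (greater than), this is a one-sided test.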

Step 2: Select Your Statistical Test

Choose your test before looking at the data. Different tests suit different kinds of data and study designs: one sample versus two, paired versus independent groups, continuous versus categorical outcomes. Later in this unit, we will talk about choosing the appropriate test.

Step 3: Collect Your Data

Notice this is step three, not step one. Hypotheses and test choice must come before data collection. This is what separates a real experiment from a fishing expedition. The moment you look at data before forming a hypothesis, you're at risk of p-hacking (more on that shortly).
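
To see why the order matters, here is a small simulation sketch. It assumes a hypothetical setup: each "experiment" records 20 unrelated metrics with no real effect in any of them, and the analyst tests every metric and reports whichever looks best. The function name, the 20-metric setup, the sample size, and the 5% threshold are all illustrative assumptions.

```python
import random
from statistics import NormalDist, mean


def min_p_value_under_null(rng, n_metrics=20, n_samples=30):
    """Smallest two-sided z-test p-value across metrics with no true effect."""
    p_values = []
    for _ in range(n_metrics):
        # Each metric is pure noise: true mean 0, known standard deviation 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        # z = sample mean / (sigma / sqrt(n)) with sigma = 1.
        z = mean(sample) * n_samples ** 0.5
        p_values.append(2.0 * (1.0 - NormalDist().cdf(abs(z))))
    return min(p_values)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_experiments = 500
false_alarms = sum(
    min_p_value_under_null(rng) <= 0.05 for _ in range(n_experiments)
)
false_alarm_rate = false_alarms / n_experiments
print(f"Null experiments with a 'significant' finding: {false_alarm_rate:.2f}")
```

A single pre-specified test would raise a false alarm about 5% of the time. Picking the best of 20 independent tests raises that to roughly 1 - 0.95^20, or about 64%, even though nothing real is going on.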

Step 4: Calculate the Test Statistic

Calculate the test statistic — a number computed from your sample that summarizes how far the data deviates from what the null predicts.
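
As a minimal sketch, suppose the new algorithm from the running example is scored on a held-out set and we count correct predictions. One possible choice is a one-sample z-test for a proportion. That choice is an assumption made here for illustration, since test selection is covered later in the unit. The function name and the counts are hypothetical.

```python
import math


def proportion_z_statistic(correct: int, total: int, p0: float) -> float:
    """z statistic for a one-sample proportion test against baseline p0."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    # Standard error of the sample proportion, assuming the null is true.
    standard_error = math.sqrt(p0 * (1 - p0) / total)
    return (p_hat - p0) / standard_error


# Hypothetical result: 880 of 1000 test images classified correctly.
z = proportion_z_statistic(correct=880, total=1000, p0=0.85)
print(f"z = {z:.2f}")
```

The statistic measures the gap between observed accuracy and the 85% baseline in units of standard error, so a larger value means the data sits further from what the null predicts.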

Step 5: Calculate the P-Value

Calculate the p-value — the probability of seeing a test statistic at least as extreme as yours, assuming the null is true.
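
For a right-tailed test like the running example, "at least as extreme" means "at least as large". A minimal sketch, assuming the test statistic is approximately standard normal under the null; the z value below is hypothetical.

```python
from statistics import NormalDist


def one_sided_p_value(z: float) -> float:
    """Upper-tail probability P(Z >= z) under the standard normal."""
    return 1.0 - NormalDist().cdf(z)


# Hypothetical z statistic from a right-tailed test (H1: accuracy > 85%).
z = 2.657
p_value = one_sided_p_value(z)
print(f"p = {p_value:.4f}")
```

A two-sided alternative would count both tails instead, doubling the tail probability of the absolute value of z.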

Step 6: Make Your Decision

If $p \leq \alpha$ (where $\alpha$ is your pre-chosen significance level), reject the null. Otherwise, fail to reject.
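
The decision rule can be sketched in a few lines. The function name is hypothetical, and the default significance level of 0.05 is a common convention assumed here, not something the procedure requires.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 when p <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.004))  # p is below alpha
print(decide(0.20))   # p is above alpha
```

Note that alpha is an argument fixed before the p-value is known; choosing it after seeing the result would defeat the purpose of the rule.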

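Critical language note: you fail to reject the null; you never "accept" it. A p-value above your significance level means the data did not provide enough evidence against the null, not that the null has been shown to be true.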

Checkpoint

A researcher collects data, looks at it, notices an interesting pattern, forms a hypothesis based on what she saw, then runs a statistical test on the same data. What is wrong with this approach?