Missing Features: Two Cautionary Tales

Before we talk about missing values, let's talk about missing features: the variables you didn't measure that turn out to be necessary for a correct answer. A missing feature is more dangerous than a missing value because it is harder to detect. A gap in a column is visible; a column that was never collected is not.

Figure: Correlation between ice cream sales and shark attacks over the summer months.

The Ice Cream and Shark Attack Problem

Ice cream sales and shark attacks are highly correlated. Both rise in summer. Does eating ice cream cause shark attacks? Of course not. The missing feature is temperature. Warm weather leads to more ice cream sales and to more people in the ocean, and more swimmers means more shark encounters. Without temperature in the model, the correlation looks causal. It isn't. This is a confounder: a variable that influences both the feature you're measuring and the outcome you're trying to predict.
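A small simulation can make the confounder concrete. The sketch below is an illustration on synthetic data, not a real dataset: the coefficients, noise levels, and variable names are all assumptions chosen for the example. Temperature drives both series, and neither series affects the other. The code compares the raw correlation with the correlation that remains after the linear effect of temperature is removed from both series (a partial correlation).

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data (assumed numbers): temperature causes both series.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [20 * t + rng.gauss(0, 60) for t in temperature]
shark_encounters = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

# Correlation with temperature left out of the picture.
raw_r = pearson(ice_cream, shark_encounters)

# Correlation after controlling for temperature in both series.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(shark_encounters, temperature))

print(f"raw correlation: {raw_r:.2f}")
print(f"after controlling for temperature: {partial_r:.2f}")
```

With these assumed settings the raw correlation comes out strongly positive, while the partial correlation sits close to zero. Nothing about the relationship between ice cream and sharks changed; only the presence of temperature in the analysis did.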

When Missing Features Invert the Conclusion

A famous UK study found that smoking by pregnant mothers appeared to reduce the rate of Down syndrome. The study was used to inform decisions. The problem: the model didn't include the mother's age, the factor most strongly associated with Down syndrome risk. At the time, younger women smoked at higher rates than older women, so smokers as a group were younger and had lower Down syndrome rates. When you control for age, smoking actually increases the risk. The missing feature didn't just introduce noise. It inverted the conclusion entirely.
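The reversal is easier to believe once you see the arithmetic. The sketch below uses hypothetical counts invented purely for illustration; they are not the study's data, and the age cutoff and group names are assumptions. The counts are built so that smokers are mostly younger mothers, then the code compares the pooled rates with the rates inside each age group.

```python
# Hypothetical counts, invented for illustration only; NOT data from the study.
# Key: (age_group, is_smoker) -> (births, cases)
counts = {
    ("under_35", True): (8000, 16),
    ("under_35", False): (4000, 4),
    ("35_plus", True): (2000, 30),
    ("35_plus", False): (6000, 60),
}

def pooled_rate(is_smoker):
    """Case rate across all ages, i.e. with age missing from the analysis."""
    births = sum(b for (_, s), (b, _c) in counts.items() if s == is_smoker)
    cases = sum(c for (_, s), (_b, c) in counts.items() if s == is_smoker)
    return cases / births

def stratum_rate(age_group, is_smoker):
    """Case rate inside a single age group, i.e. controlling for age."""
    births, cases = counts[(age_group, is_smoker)]
    return cases / births

print("Pooled (age ignored):")
print(f"  smokers     {pooled_rate(True):.2%}")
print(f"  non-smokers {pooled_rate(False):.2%}")

for group in ("under_35", "35_plus"):
    print(f"Age group {group}:")
    print(f"  smokers     {stratum_rate(group, True):.2%}")
    print(f"  non-smokers {stratum_rate(group, False):.2%}")
```

In these made-up numbers, smokers look safer when age is ignored (0.46% versus 0.64%), yet within each age group smokers have the higher rate (0.20% versus 0.10%, and 1.50% versus 1.00%). The pooled comparison is really comparing a mostly young group with a mostly older one.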

Be conscientious about which features are in your model and which ones aren't. A model with the wrong feature set can confidently produce a wrong answer. In domains where causal inference matters — medicine, economics, public policy — the question of what's not in the model is taken as seriously as what is. Develop the habit of asking: what am I not measuring that could explain the pattern I'm seeing?

Checkpoint

A model predicts employee performance and finds that employees who drink more coffee perform better. You're about to recommend a coffee subsidy. What should you do first?