Data Leakage: The Silent Killer
Data leakage is when information from outside your training data, such as test-set statistics or future observations, sneaks into the training process. It is common, and it is usually invisible until your model fails in deployment, after you have already shipped it.
The five most common forms:
- Test set used for hyperparameter tuning. The canonical form, covered in §5.1.
- Normalization computed on the full dataset. When you compute mean and standard deviation to standardize features, those statistics are parameters of your pipeline. If you compute them using the test data, your model has effectively seen the test set. Fix: compute scaling parameters on training data only, then apply them to validation and test.
- EDA leaks. Insights you derive from looking at the full dataset can influence modeling decisions in subtle ways. Best practice: set the test set aside before you do EDA, and explore only the training data.
- Feature engineering leaks. A feature like "average sales for this customer" leaks future information into training if it is computed across all data, including future transactions. Fix: compute such aggregates only from data that was available before the moment being predicted.
- Time-aware leaks. For time series, your test set should come strictly after the training period, not from random samples across time. Otherwise the model "knows" things that it could not know in production.
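Two of the leaks above, full-dataset normalization and all-data customer aggregates, can be avoided along the lines sketched here. This is a minimal illustration in plain Python, not a production pipeline: the helper names `fit_scaler`, `apply_scaler`, and `past_only_customer_mean`, and all of the toy data, are hypothetical. scikit-learn's `StandardScaler` supports the same pattern through its `fit` and `transform` methods.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn scaling parameters from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against zero variance
    return mu, sigma

def apply_scaler(values, params):
    """Standardize any split with parameters learned on the training split."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

def past_only_customer_mean(transactions):
    """Average each customer's earlier amounts, one value per transaction.

    `transactions` holds (customer_id, timestamp, amount) tuples.
    Results are in chronological order; None means no prior history.
    """
    totals, counts, features = {}, {}, []
    for customer, _ts, amount in sorted(transactions, key=lambda t: t[1]):
        n = counts.get(customer, 0)
        # Use only what was known before this transaction happened.
        features.append(totals[customer] / n if n else None)
        totals[customer] = totals.get(customer, 0.0) + amount
        counts[customer] = n + 1
    return features

# Toy data, invented for illustration.
train = [10.0, 12.0, 14.0, 16.0]
test = [30.0, 50.0]

params = fit_scaler(train)                 # statistics come from train only
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)   # test is transformed, never fitted

transactions = [("a", 1, 100.0), ("a", 2, 200.0), ("b", 3, 50.0), ("a", 4, 300.0)]
history_feature = past_only_customer_mean(transactions)
```

The key design choice is that the test split never passes through a "fit" step: it is only transformed with parameters that already exist. The same idea applies to the customer feature, where each row sees only its own past.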
A Good Rule of Thumb
Pretend the test set doesn't exist until the very last step. Run all preprocessing, EDA, feature engineering, model selection, and hyperparameter tuning as if the test set will arrive tomorrow. Then evaluate exactly once. If you don't like the result, you don't get to go back and tune. That's the deal.
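One way to make this rule concrete is to carve off the test set first and let only the training and validation splits drive every decision. The sketch below uses a deliberately tiny 1-D nearest-neighbor classifier so it runs with the standard library alone; the data, the helper names (`three_way_split`, `knn_predict`, `accuracy`), and the candidate values of `k` are all hypothetical.

```python
import random

def three_way_split(rows, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle once, then carve off test and validation splits.

    Random shuffling assumes independent rows; sequential data needs
    a chronological split instead.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

def knn_predict(train, x, k):
    """Majority vote among the k training rows closest to x."""
    neighbors = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return votes * 2 > k

def accuracy(train, rows, k):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(knn_predict(train, x, k) == label for x, label in rows) / len(rows)

# Toy data: the label is True when x >= 50.
data = [(x, x >= 50) for x in range(100)]
train, val, test = three_way_split(data)

# Hyperparameter tuning sees train and validation only.
candidates = [1, 3, 5, 7]
best_k = max(candidates, key=lambda k: accuracy(train, val, k))

# The test set is touched exactly once, after every choice is frozen.
final_score = accuracy(train, test, best_k)
```

Whatever `final_score` turns out to be, it is the number you report. Re-running the tuning loop after looking at it would turn the test set into a second validation set.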
The Pricing Model That Lost Money
A pricing model performed beautifully in backtests and lost money the moment it went live. The reason: the backtests randomly sampled across years instead of training on the past and testing on the future. The model had been "predicting" past prices using future data, so the leakage never showed up in the backtest metrics and only became visible in production. Time-aware splits are non-negotiable in any sequential domain.
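A time-aware split of the kind this story calls for can be sketched as follows. This is an illustrative assumption about how such a backtest might be organized, not the setup from the case above: `chronological_split`, the cutoff year, and the price values are all hypothetical.

```python
def chronological_split(rows, cutoff, time_key):
    """Train on rows strictly before `cutoff`, test on rows at or after it."""
    ordered = sorted(rows, key=time_key)
    train = [r for r in ordered if time_key(r) < cutoff]
    test = [r for r in ordered if time_key(r) >= cutoff]
    return train, test

# Toy price history as (year, price); values are invented.
prices = [(2019, 101.0), (2016, 90.0), (2018, 97.5), (2020, 110.0), (2017, 93.0)]

train, test = chronological_split(prices, cutoff=2019, time_key=lambda r: r[0])

# Sanity check: every training row predates every test row.
latest_train = max(year for year, _ in train)
earliest_test = min(year for year, _ in test)
no_overlap = latest_train < earliest_test
```

A random shuffle of the same rows could place 2020 in training and 2017 in testing, which is exactly the situation where a model appears to "predict" the past from the future.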
Placeholder: interactive leakage checker. Walk through a sample ML pipeline, identify which steps introduce data leakage, and work out how to fix each one.
You standardize all features using sklearn's StandardScaler, fitting it on the entire dataset before splitting into train/test. What is wrong with this?