Why Three Splits, Not Two
Most tutorials show this: partition your data into a training set and a test set, train on training, evaluate on test, report the result. That's what most courses teach. It's also wrong for most real work.
Here's what happens in practice — and I've seen this at companies, not just in student projects:
- You train a model on training data.
- You evaluate it on test data.
- You don't love the results.
- You change the architecture, tune hyperparameters, or try a different feature set.
- You evaluate on test data again. Better. You iterate.
- By the time you're done, your "test" set has effectively become a training signal — you've been making decisions based on it.
That's data leakage. Your reported test performance is now a lie.
The Three-Split Solution
- Train — for training the model.
- Validation — for tuning: hyperparameters, model architecture, feature decisions.
- Test — held out, untouched, used exactly once at the end to report final performance.
If you have an external test set — data from a completely separate collection or time period — use that instead of a third split. If you don't, always split three ways.
Common split ratios:
Drag the handles or pick a preset to set your train / validation / test proportions. Then step through the workflow.
Which operations touch which set?
Drag the handles to set split proportions, then step through the workflow to see which operations touch each set.
Think About What Kind of Generalization You Need
Random shuffling is rarely the right split strategy for sequential or hierarchical data. A medical imaging model that splits randomly across patients — so the same patient's images appear in both train and test — is measuring memorization, not generalization. Think about what real-world deployment looks like: test on new patients, test on future time periods, test on new geographic regions. Make your split reflect the generalization you actually need.
99% AUC → 70% in Production
A medical imaging model scored 99% AUC during evaluation and dropped to 70% in production. The cause: random train/test split across patients, so the same patient's images appeared in both sets. The model had memorized patients, not learned the disease. A proper patient-level split would have caught it before deployment.
You've trained a model, evaluated on your test set, and got 78% accuracy. You decide to try a different feature set and re-evaluate on the same test set — getting 82%. Which number should you report?