Why Three Splits, Not Two

Most tutorials show this: partition your data into a training set and a test set, train on training, evaluate on test, report the result. That's what most courses teach. It's also wrong for most real work.

Here's what happens in practice — and I've seen this at companies, not just in student projects:

You train a model on training data.
You evaluate it on test data.
You don't love the results.
You change the architecture, tune hyperparameters, or try a different feature set.
You evaluate on test data again. Better. You iterate.
By the time you're done, your "test" set has effectively become a training signal — you've been making decisions based on it.

That's data leakage. Your reported test performance is now a lie.

ℹ

The Three-Split Solution

Train — for training the model.
Validation — for tuning: hyperparameters, model architecture, feature decisions.
Test — held out, untouched, used exactly once at the end to report final performance.

If you have an external test set — data from a completely separate collection or time period — use that instead of a third split. If you don't, always split three ways.

Three-Split Workflow

Common split ratios:

70%

15%

TrainValidationTest

Set your split

Drag the handles or pick a preset to set your train / validation / test proportions. Then step through the workflow.

○ Train (70%)○ Validation (15%)○ Test (15%)

Which operations touch which set?

Set

Size

Used for

Train set

70%

fit model weights

Validation set

15%

select hyperparameters & architecture

Test set

15%

report final performance (once)

Drag the handles to set split proportions, then step through the workflow to see which operations touch each set.

⚠

Think About What Kind of Generalization You Need

Random shuffling is rarely the right split strategy for sequential or hierarchical data. A medical imaging model that splits randomly across patients — so the same patient's images appear in both train and test — is measuring memorization, not generalization. Think about what real-world deployment looks like: test on new patients, test on future time periods, test on new geographic regions. Make your split reflect the generalization you actually need.

◆

99% AUC → 70% in Production

A medical imaging model scored 99% AUC during evaluation and dropped to 70% in production. The cause: random train/test split across patients, so the same patient's images appeared in both sets. The model had memorized patients, not learned the disease. A proper patient-level split would have caught it before deployment.

Checkpoint

You've trained a model, evaluated on your test set, and got 78% accuracy. You decide to try a different feature set and re-evaluate on the same test set — getting 82%. Which number should you report?

←PreviousFrom Plot to NarrativeTelling Stories with Data Next→Cross-ValidationPreparing Data for Modeling