Stratified Sampling for Imbalanced ML

The most concrete place sampling concepts hit your ML workflow is class imbalance. If your dataset has 95% one class and 5% another, a naive random train/val/test split can produce a test set with very few minority-class examples — making your evaluation metrics noisy and your model's behavior on the minority class unstable.
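To put a rough number on that risk, here is a minimal sketch that computes the chance of a purely random split leaving few or no minority examples in the test set. The sizes (100 samples, 5 minority, 20-sample test set) are assumptions chosen for illustration, and the helper name prob_minority_in_test is hypothetical. It uses the hypergeometric distribution, which describes drawing a test set without replacement.

```python
from math import comb

def prob_minority_in_test(k, n_total=100, n_minority=5, n_test=20):
    """P(exactly k minority examples land in a random test set).

    Hypergeometric probability: choose k of the minority examples and
    n_test - k of the majority examples, out of all possible test sets.
    The default sizes are illustrative assumptions, not fixed by the text.
    """
    n_majority = n_total - n_minority
    return comb(n_minority, k) * comb(n_majority, n_test - k) / comb(n_total, n_test)

p_zero = prob_minority_in_test(0)
p_at_most_one = p_zero + prob_minority_in_test(1)

print(f"P(no minority examples in test) = {p_zero:.3f}")
print(f"P(at most one minority example) = {p_at_most_one:.3f}")
```

Under these assumed sizes, the probability of a test set with no minority examples at all comes out to roughly 0.32, so an unstratified split fails in this way about one time in three.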

The Two-Line Fix

from sklearn.model_selection import train_test_split, StratifiedKFold

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    ...

Pass stratify=y to train_test_split and both splits keep the class proportions of y, as closely as the split sizes allow. StratifiedKFold does the same for every fold in cross-validation. These are the simplest, highest-leverage habits to build for imbalanced classification.
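One way to confirm the effect is to check the class proportions after splitting. The sketch below assumes a synthetic dataset of 1,000 samples with a 5% minority class (label 1); the data and variable names are made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# Hypothetical synthetic data: 950 majority (0) and 50 minority (1) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

# Stratified train/test split, same call as in the snippet above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
train_rate = y_train.mean()  # fraction of minority labels in train
test_rate = y_test.mean()    # fraction of minority labels in test
print(f"train minority rate: {train_rate:.3f}")
print(f"test minority rate:  {test_rate:.3f}")

# Stratified cross-validation: count minority examples in each validation fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_minority_counts = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print("minority examples per validation fold:", fold_minority_counts)
```

Because the class counts here divide evenly into the split sizes, both rates should match the overall 5%. With counts that do not divide evenly, the proportions are matched as closely as integer counts permit.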

Note: stratified sampling preserves the imbalance — it doesn't fix it. If your dataset is 95/5, your training and test sets will both be 95/5. The goal of stratification is representativeness, not balance. For actually addressing the imbalance, see the next chapter on class balancing techniques.

[Interactive demo: Stratified Split Explorer. Dataset: 100 samples, 95 majority, 5 minority, 20% test.]
Displayed split:
Train: 80 samples (76 majority, 4 minority), minority rate 5%
Test: 20 samples (19 majority, 1 minority), minority rate 5% (true rate: 5%)

Compare random vs. stratified train/test splits on an imbalanced dataset. Notice how random splits can produce test sets with very few — or zero — minority examples, while stratified splits preserve the true class ratio in both sets.
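The same comparison can be approximated offline. This is a rough sketch, assuming the same sizes as the demo (100 samples, 5 minority, 20% test) and repeating each kind of split over 200 seeds; the helper name minority_counts_in_test and the trial count are illustrative choices.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset matching the demo: 95 majority (0), 5 minority (1).
y = np.array([1] * 5 + [0] * 95)
X = np.arange(100).reshape(-1, 1)  # placeholder features

def minority_counts_in_test(stratified, n_trials=200):
    """Count minority examples in the test set across repeated splits."""
    counts = []
    for seed in range(n_trials):
        _, _, _, y_test = train_test_split(
            X, y, test_size=0.2,
            stratify=y if stratified else None,
            random_state=seed,
        )
        counts.append(int(y_test.sum()))
    return counts

random_counts = minority_counts_in_test(stratified=False)
stratified_counts = minority_counts_in_test(stratified=True)

# Tally how often each minority count occurred in the 20-sample test set.
print("random:    ", sorted(Counter(random_counts).items()))
print("stratified:", sorted(Counter(stratified_counts).items()))
```

The stratified tally should contain a single entry, one minority example in every test set, while the random tally spreads across several counts, including zero.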

Checkpoint

You apply SMOTE to your dataset to oversample the minority class, then do a stratified train/test split. What's wrong with this order of operations?