Stratified Sampling for Imbalanced ML
The most concrete place sampling concepts hit your ML workflow is class imbalance. If your dataset is 95% one class and 5% another, a naive random train/val/test split can produce a test set with very few minority-class examples. Metrics computed on so few examples are noisy, so your estimate of the model's behavior on the minority class becomes unreliable.
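To put a rough number on that risk, the sketch below computes the exact chance that a purely random test split ends up short on minority examples. The dataset size (1,000 rows, 50 of them minority) and the 20% test fraction are illustrative assumptions, not figures from a real dataset.

```python
from math import comb

# Illustrative assumptions: 1,000 rows, 5% minority, 20% held out for testing.
n_total, n_minority, n_test = 1000, 50, 200

def prob_minority_count(k):
    """Hypergeometric probability that a random test split of n_test rows
    contains exactly k minority rows (sampling without replacement)."""
    return (
        comb(n_minority, k) * comb(n_total - n_minority, n_test - k)
        / comb(n_total, n_test)
    )

# Probability of every possible minority count in the test set.
pmf = [prob_minority_count(k) for k in range(n_minority + 1)]
expected = sum(k * p for k, p in enumerate(pmf))
p_at_most_5 = sum(pmf[:6])  # half the expected count or fewer

print(f"expected minority rows in test set: {expected:.1f}")
print(f"P(5 or fewer minority rows):        {p_at_most_5:.3f}")
```

Shrinking n_total or n_test in this sketch shows how quickly the tail risk grows on smaller datasets.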
The Two-Line Fix
from sklearn.model_selection import train_test_split, StratifiedKFold

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    ...
Pass stratify=y to train_test_split and the class distribution is preserved in both splits, as closely as whole-sample counts allow. StratifiedKFold does the same for every cross-validation fold. These are the simplest, highest-leverage habits to build for imbalanced classification.
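One way to check this for yourself is to print the minority fraction in each split and fold. The sketch below uses a hypothetical 95/5 label array and a placeholder feature column; substitute your own X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# Hypothetical 95/5 dataset: 950 majority (0) and 50 minority (1) labels.
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# With 0/1 labels, the mean of y is the minority fraction.
train_frac = y_train.mean()
test_frac = y_test.mean()
print(f"minority fraction  train: {train_frac:.3f}  test: {test_frac:.3f}")

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_fracs = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
print("minority fraction per validation fold:", fold_fracs)
```

Every printed fraction should sit at or very near the 5% of the full dataset, which is the sense in which stratification preserves the distribution.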
Note: stratified sampling preserves the imbalance — it doesn't fix it. If your dataset is 95/5, your training and test sets will both be 95/5. The goal of stratification is representativeness, not balance. For actually addressing the imbalance, see the next chapter on class balancing techniques.
Compare random vs. stratified train/test splits on an imbalanced dataset. Notice how random splits can produce test sets with very few — or zero — minority examples, while stratified splits preserve the true class ratio in both sets.
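If you want to reproduce the comparison offline, a minimal sketch follows. It assumes a small hypothetical dataset (190 majority and 10 minority labels) and repeats the split over 200 seeds; the helper name minority_in_test is made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical small dataset: 190 majority (0) and 10 minority (1) labels.
y = np.array([0] * 190 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

def minority_in_test(seed, stratified):
    """Count minority rows in a 20% test split drawn with the given seed."""
    _, _, _, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed,
        stratify=y if stratified else None,
    )
    return int(y_test.sum())

seeds = range(200)
random_counts = [minority_in_test(s, stratified=False) for s in seeds]
strat_counts = [minority_in_test(s, stratified=True) for s in seeds]

print("random     min/max minority in test:", min(random_counts), max(random_counts))
print("random     splits with zero minority:", random_counts.count(0))
print("stratified min/max minority in test:", min(strat_counts), max(strat_counts))
```

The random counts vary from seed to seed, while the stratified count stays fixed at the value implied by the class ratio.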
You apply SMOTE to your dataset to oversample the minority class, then do a stratified train/test split. What's wrong with this order of operations?