Undersampling: Random and Tomek Links
Random Undersampling
Randomly discard majority-class examples until the majority class is the same size as the minority class.
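from imblearn.under_sampling import RandomUnderSampler
# 'majority' resamples only the majority class, down to the minority-class count
rus = RandomUnderSampler(sampling_strategy='majority', random_state=42)
# X, y: feature matrix and labels, assumed to be defined earlier
X_resampled, y_resampled = rus.fit_resample(X, y)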
The cost: You're throwing away real data. Sometimes that's fine — if you have a million majority examples and a thousand minority examples, you can afford to discard most of the majority. However, every discarded example is information you'll never recover.
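To make the cost concrete, here is a minimal pure-Python sketch of the idea behind random undersampling. It is not imblearn's implementation: the helper `random_undersample` and the toy data are hypothetical, and the sketch shrinks every class to the size of the smallest one, which matches the 'majority' strategy only in the two-class case.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Hypothetical helper: shrink every class to the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        # Sample without replacement, so no example is duplicated
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data (an assumption for illustration): 50 minority, 950 majority examples
X = [[float(i)] for i in range(1000)]
y = [1 if i < 50 else 0 for i in range(1000)]

X_res, y_res = random_undersample(X, y)
counts = Counter(y_res)            # 50 examples per class
discarded = len(y) - len(y_res)    # 900 majority examples are gone for good
```

In this toy case, balancing the classes means dropping 900 of the 950 majority examples, which is the trade-off described above.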
Tomek Links
A smarter form of undersampling. A Tomek link exists between two examples of different classes that are each other's nearest neighbors, which means the pair sits right at the decision boundary. The Tomek links method removes the majority-class member of each such pair.
The intuition: these boundary cases are likely to be noisy or genuinely ambiguous observations. Removing them can help the model find a cleaner decision boundary.
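from imblearn.under_sampling import TomekLinks
# With the default sampling_strategy, the minority class is left untouched
tl = TomekLinks()
# X, y: feature matrix and labels, assumed to be defined earlier
X_resampled, y_resampled = tl.fit_resample(X, y)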
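The definition of a Tomek link can be sketched in a few lines of plain Python. This is a brute-force illustration under stated assumptions, not how imblearn computes it: the helpers `nearest_neighbor`, `tomek_links`, and `remove_majority` are hypothetical, distances are Euclidean, and the data is a toy one-dimensional set.

```python
import math

def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself (brute force)."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j) that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def remove_majority(X, y, links, majority_label):
    """Drop the majority-class member of each link (two-class case)."""
    drop = {i if y[i] == majority_label else j for i, j in links}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]

# Toy data: class 0 is the majority; 4.9 and 5.0 straddle the boundary
X = [[0.0], [1.0], [2.5], [4.9], [5.0], [7.0]]
y = [0, 0, 0, 0, 1, 1]

links = tomek_links(X, y)   # only the pair at indices 3 and 4 qualifies
X_clean, y_clean = remove_majority(X, y, links, majority_label=0)
```

Points 0 and 1 are also mutual nearest neighbors, but they share a class, so they do not form a link. Only the majority point at 4.9 is removed.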
Good use: As a complement to SMOTE — apply SMOTE first to balance, then Tomek links to clean up the noisy boundary that may have formed.
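The order of operations can be sketched as follows. This is a deliberately simplified stand-in, not real SMOTE: the hypothetical `smote_like` helper interpolates toward the single nearest minority neighbor, whereas SMOTE proper picks among several nearest neighbors. The toy data and the helper names are assumptions. In imblearn itself, the `SMOTETomek` class in `imblearn.combine` packages this combination.

```python
import math
import random
from collections import Counter

def nearest(i, X, candidates):
    """Index in `candidates` (excluding i) that is closest to X[i]."""
    return min((j for j in candidates if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def smote_like(X, y, minority, n_new, seed=0):
    """Simplified SMOTE: new points lie between a minority point and its
    single nearest minority neighbor (an assumption for brevity)."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        i = rng.choice(minority_idx)
        j = nearest(i, X, minority_idx)
        t = rng.random()
        X_out.append([a + t * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(minority)
    return X_out, y_out

def drop_majority_tomek(X, y, majority):
    """Remove the majority member of every Tomek link (two-class case)."""
    idx = range(len(X))
    nn = [nearest(i, X, idx) for i in idx]
    drop = {i if y[i] == majority else j
            for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j]}
    keep = [k for k in idx if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]

# Toy data: 8 majority points (class 0), 3 minority points (class 1)
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [6.9],
     [7.0], [8.0], [9.0]]
y = [0] * 8 + [1] * 3

# Step 1: oversample the minority class until the classes are balanced
X_bal, y_bal = smote_like(X, y, minority=1, n_new=5)
after_smote = Counter(y_bal)

# Step 2: clean majority points that ended up in Tomek links
X_final, y_final = drop_majority_tomek(X_bal, y_bal, majority=0)
after_clean = Counter(y_final)
```

The cleaning step only ever removes majority points, so the minority count produced by the oversampling step is preserved.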
The Golden Rule: Resample Training Data Only
Your test set should always reflect the real-world distribution. If you resample your full dataset and then split into train and test, your test set is no longer representative of production, and your evaluation will be overly optimistic.
Always: split first, resample the training set only. This is one of the most commonly violated rules in ML work involving imbalanced data.
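The split-then-resample order can be sketched in plain Python. The helpers `stratified_split` and `undersample_indices` are hypothetical, the 80/20 split is an assumption, and the data is a toy set with a 5% minority class. With scikit-learn and imblearn the same order applies: split first, then call `fit_resample` on the training portion only.

```python
import random
from collections import Counter

def stratified_split(y, test_frac=0.2, seed=0):
    """Hypothetical helper: per-class shuffle, then split into index lists."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(y)):
        idx = [i for i, l in enumerate(y) if l == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

def undersample_indices(idx, y, seed=0):
    """Hypothetical helper: shrink every class in `idx` to the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for i in idx:
        by_class.setdefault(y[i], []).append(i)
    n_min = min(len(v) for v in by_class.values())
    return [i for v in by_class.values() for i in rng.sample(v, n_min)]

# Toy labels: 50 minority (1) and 950 majority (0) examples
y = [1 if i < 50 else 0 for i in range(1000)]

# 1. Split first, so the test set keeps the real-world class ratio
train_idx, test_idx = stratified_split(y)

# 2. Resample the training indices only; test_idx is never touched
train_idx_balanced = undersample_indices(train_idx, y)

train_counts = Counter(y[i] for i in train_idx_balanced)  # 40 per class
test_counts = Counter(y[i] for i in test_idx)             # 190 vs. 10
```

The training set ends up balanced, while the test set still has the original 5% minority share.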
The same imbalanced dataset, four resampling strategies. Compare how each technique changes the composition and distribution of training points.