Undersampling: Random and Tomek Links

Random Undersampling

Randomly discard majority-class examples until the majority class is the same size as the minority class.

from imblearn.under_sampling import RandomUnderSampler

# 'majority' resamples only the majority class, down to the minority-class size
rus = RandomUnderSampler(sampling_strategy='majority', random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

The cost: You're throwing away real data. Sometimes that's fine — if you have a million majority examples and a thousand minority examples, you can afford to discard most of the majority. However, every discarded example is information you'll never recover.
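To make that cost concrete, here is a minimal from-scratch sketch of the idea, using only the standard library. It is not how imbalanced-learn implements it; the helper name `random_undersample` and the toy data are assumptions made for illustration, and the helper shrinks every class to the size of the smallest one.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())  # size of the minority class

    kept = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement; the minority class keeps every row.
        kept.extend(rng.sample(idx, target))

    kept.sort()  # preserve the original row order
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data (hypothetical): 1,000 majority rows (label 0), 10 minority rows (label 1).
X = [[float(i)] for i in range(1010)]
y = [0] * 1000 + [1] * 10

X_res, y_res = random_undersample(X, y)
discarded = len(y) - len(y_res)
print(Counter(y_res), "discarded:", discarded)
```

On this toy set, 990 of the 1,000 majority rows are dropped to reach a 1:1 ratio, which is the trade-off described above.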

Tomek Links

A smarter form of undersampling. A Tomek link is a pair of examples from different classes that are each other's nearest neighbors, which means they sit right on the decision boundary. The Tomek links method removes the majority-class member of each such pair.

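The intuition: these boundary cases are likely to be noisy or genuinely ambiguous observations. Removing them helps the model find a cleaner decision boundary. Because only boundary points are removed, the class ratio usually changes very little, so Tomek links clean the data rather than balance it.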

from imblearn.under_sampling import TomekLinks

# With the default settings, only the majority-class member of each link is removed
tl = TomekLinks()
X_resampled, y_resampled = tl.fit_resample(X, y)
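The definition of a Tomek link can be sketched from scratch in a few lines. This is a brute-force illustration under stated assumptions (Euclidean distance, a tiny hand-made dataset, hypothetical helper names), not the library's implementation; it needs Python 3.8+ for `math.dist`.

```python
import math


def nearest_neighbor(i, X):
    """Index of the point closest to X[i] (Euclidean distance), excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def find_tomek_links(X, y):
    """Return (i, j) index pairs that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Toy data (hypothetical): four majority points (0) and three minority points (1).
# Point 3 is a majority example sitting right next to minority point 4.
X = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.6), (2.9, 3.0),
     (3.0, 3.0), (4.0, 4.0), (4.0, 4.3)]
y = [0, 0, 0, 0, 1, 1, 1]

links = find_tomek_links(X, y)

# Drop only the majority-class member of each link.
majority = max(set(y), key=y.count)
drop = {k for pair in links for k in pair if y[k] == majority}
X_clean = [x for k, x in enumerate(X) if k not in drop]
y_clean = [lab for k, lab in enumerate(y) if k not in drop]

print("links:", links, "dropped:", sorted(drop))
```

The brute-force neighbor search is quadratic in the number of rows, so it only suits small examples; the point is to show that a link requires both conditions at once: mutual nearest neighbors and different labels.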

Good use: As a complement to SMOTE — apply SMOTE first to balance, then Tomek links to clean up the noisy boundary that may have formed.


The Golden Rule: Resample Training Data Only

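Your test set should always reflect the real-world distribution. If you resample your full dataset and then split into train and test, your test set is no longer representative of production, so your metrics will not describe how the model behaves on the real class mix. With oversampling it is worse: duplicated or synthetic copies of the same minority examples can land in both train and test, which leaks information and makes the evaluation optimistic.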

Always: split first, resample the training set only. This is one of the most commonly violated rules in ML work involving imbalanced data.

Resampling Strategy Comparison

The same imbalanced dataset, four resampling strategies. Compare how each technique changes the composition and distribution of training points.

[Figure: four panels, one per strategy. Legend: majority class, minority class, synthetic (SMOTE), removed (ghost).]

No Resampling: 42 majority vs 8 minority, 5:1 ratio
SMOTE: synthetic minority added, +26 synthetic
Random Undersampling: majority discarded, 1:1 ratio
Tomek Links: boundary stragglers removed, 5:1 ratio

