Oversampling: Random and SMOTE
Before getting to oversampling and undersampling, a quick hierarchy of options in order of preference:
- Get more real minority-class data. Collecting genuine examples is almost always better than synthesizing them.
- Augment the data (for images, audio, or text). Transform existing minority-class examples into new ones that preserve class membership.
- Oversample or undersample the existing data.
The order matters. Resampling only reuses or recombines examples you already have, so it cannot add information the way genuinely new minority-class data can. Synthetic methods are clever, but they're synthetic.
Random Oversampling
Duplicate randomly chosen minority-class examples (sampling with replacement) until the minority class is the same size as the majority class.
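from imblearn.over_sampling import RandomOverSampler
# X, y are assumed to be the training features and labels
ros = RandomOverSampler(sampling_strategy='minority', random_state=42)  # fixed seed for reproducibility
X_resampled, y_resampled = ros.fit_resample(X, y)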
Catch: the added rows are exact copies. The model sees identical examples multiple times, which can cause overfitting: it memorizes those specific examples rather than learning the pattern behind them.
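To make the duplication concrete, here is a minimal sketch of random oversampling using only the Python standard library. The function name `random_oversample` and the toy data are made up for illustration; this is not how imbalanced-learn implements it, just the same idea in a few lines.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Illustrative sketch: copy random minority rows until the minority
    class is as large as the biggest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority_label = min(counts, key=counts.get)
    target = max(counts.values())
    minority_rows = [x for x, label in zip(X, y) if label == minority_label]
    # Sampling with replacement: the same row can be copied more than once.
    extra = rng.choices(minority_rows, k=target - counts[minority_label])
    return list(X) + extra, list(y) + [minority_label] * len(extra)

# Toy data (hypothetical): four majority rows, two minority rows.
X = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.4, 0.8], [2.0, 3.0], [2.1, 3.2]]
y = [0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y), "->", Counter(y_res))
```

Every row in `X_res` already exists in `X`, which is exactly why the overfitting risk arises: no new feature values are introduced.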
SMOTE (Synthetic Minority Over-sampling Technique)
Instead of duplicating, SMOTE creates synthetic minority examples by interpolating between existing ones:
- Pick a minority instance.
- Find its k nearest neighbors within the minority class (typically k=5).
- Randomly choose one of those neighbors.
- Create a new synthetic point at a random position on the line segment between the instance and the chosen neighbor.
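from imblearn.over_sampling import SMOTE
# X, y are assumed to be the training features and labels (continuous features only)
smote = SMOTE(k_neighbors=5, random_state=42)  # fixed seed for reproducibility
X_resampled, y_resampled = smote.fit_resample(X, y)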
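The four steps listed above can also be sketched from scratch, which makes the interpolation explicit. This is a simplified illustration in plain Python, not the library's implementation: `smote_sketch` and the toy points are hypothetical, it uses a brute-force neighbor search, and it assumes every feature is continuous.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Illustrative SMOTE-style interpolation over minority-class points."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a minority instance.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Step 2: find its k nearest minority neighbors (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Step 3: randomly choose one neighbor.
        neighbor = rng.choice(neighbors)
        # Step 4: place a new point at a random position between the two.
        gap = rng.random()  # in [0, 1)
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbor)])
    return synthetic

# Toy minority-class points in 2D (hypothetical values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7], [1.4, 1.0]]
new_points = smote_sketch(minority, n_new=4, k=3)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because each synthetic point is a weighted average of two real minority points, it always falls inside the region the minority class already occupies. That is the source of both the added diversity and the boundary-blurring risk when classes overlap.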
Advantages: Less prone to overfitting than random oversampling, because the new points are not exact copies. Introduces diversity into the minority class.
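Caveats: Can amplify noise if the minority class is itself noisy. Can blur class boundaries if the classes overlap. Plain SMOTE only works on continuous features, because it interpolates numeric values. For datasets that mix continuous and categorical features, use SMOTE-NC (SMOTENC in imbalanced-learn); for purely categorical data, use SMOTEN.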
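A small worked example shows why interpolation breaks on categorical features. The integer encoding of blood type, the two patient records, and the neighbor list below are all hypothetical. The second half only mimics the general idea behind SMOTE-NC (interpolate the continuous features, take the most frequent category among the nearest neighbors); it is a sketch of the idea, not the library's code.

```python
from collections import Counter

# Hypothetical integer encoding of a categorical feature.
BLOOD_TYPES = {0: "A", 1: "B", 2: "AB", 3: "O"}

base = {"age": 40.0, "blood": 0}      # blood type A
neighbor = {"age": 50.0, "blood": 3}  # blood type O
gap = 0.5                             # interpolation position between the two

# Plain SMOTE would treat the category code as a number and interpolate it.
naive_blood = base["blood"] + gap * (neighbor["blood"] - base["blood"])
print(naive_blood)                 # 1.5
print(naive_blood in BLOOD_TYPES)  # False: 1.5 is not a blood type

# SMOTE-NC-style handling: interpolate the continuous feature, and take the
# most common category among the nearest neighbors for the categorical one.
neighbor_bloods = [3, 3, 0, 1, 3]  # hypothetical codes of the k=5 neighbors
synthetic = {
    "age": base["age"] + gap * (neighbor["age"] - base["age"]),
    "blood": Counter(neighbor_bloods).most_common(1)[0][0],
}
print(synthetic, BLOOD_TYPES[synthetic["blood"]])
```

The interpolated code 1.5 corresponds to no real category, and rounding it would silently invent a blood type B patient from an A and an O. The mode-based version always yields a category that actually occurs in the neighborhood.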
The minority class (violet) is heavily outnumbered. A classifier trained on this data will tend to be biased toward the majority class (indigo).
SMOTE in 2D: the minority class (violet) is oversampled by interpolating synthetic points between real minority examples and their k nearest neighbors.
You apply SMOTE to a medical dataset with a mix of continuous and categorical features (like blood type and diagnosis codes). What should you be aware of?