Oversampling: Random and SMOTE

Before turning to oversampling and undersampling, here is a quick hierarchy of options, in order of preference:

  1. Get more real minority-class data. Collecting genuine examples is almost always better than synthesizing them.
  2. Data augmentation (for images, audio, text). Transforms existing minority-class examples to produce new ones that preserve class membership.
  3. Oversampling or undersampling the existing data.

The ordering matters because resampling can only reuse or recombine the examples you already have; it adds no new information about the minority class. Synthetic methods are clever, but they're synthetic.

Random Oversampling

Duplicate randomly chosen minority-class examples (sampling with replacement) until the minority class is the same size as the majority class.

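from imblearn.over_sampling import RandomOverSampler

# X: feature matrix, y: class labels (assumed to be loaded already)
ros = RandomOverSampler(sampling_strategy='minority', random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)  # minority duplicated up to majority size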


Catch: You're duplicating. The model sees identical examples multiple times, which can cause overfitting — it learns those specific examples rather than the pattern behind them.
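
To make the duplication concrete, here is a minimal from-scratch sketch of what random oversampling does. It is not how imbalanced-learn implements it; the helper name random_oversample, the toy data, and the two-class assumption are all made up for illustration.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size.

    Hypothetical helper for illustration; assumes exactly two classes.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    # Sample with replacement, so every added row is an exact copy of an existing one.
    extra = [rng.choice(minority_rows) for _ in range(n_major - n_minor)]
    return X + extra, y + [minority] * len(extra)

# Toy data: 6 majority points (label 0) and 2 minority points (label 1).
X = [(0.0, 0.1), (0.2, 0.3), (0.4, 0.1), (0.5, 0.6), (0.7, 0.2), (0.9, 0.4),
     (3.0, 3.1), (3.4, 2.9)]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
class_counts = Counter(y_res)
distinct_minority = {x for x, label in zip(X_res, y_res) if label == 1}
print(class_counts)            # both classes now have 6 rows
print(len(distinct_minority))  # still only 2 distinct minority points
```

The classes are balanced by count, but the minority class still contains only the original distinct points, which is why a model can end up memorizing them.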

SMOTE — Synthetic Minority Oversampling Technique

Instead of duplicating, SMOTE creates synthetic minority examples by interpolating between existing ones:

  1. Pick a minority instance.
  2. Find its k nearest neighbors (typically k=5).
  3. Randomly choose one neighbor.
  4. Create a new synthetic point at a random position on the line segment between them.

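from imblearn.over_sampling import SMOTE

# k_neighbors must be smaller than the number of minority samples
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)  # X must be numeric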
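
The four steps listed above can also be sketched from scratch. This is a simplified illustration under stated assumptions, not imbalanced-learn's internal implementation: the helper name smote_sketch is hypothetical, the neighbor search is brute force, and the toy points are made up.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Hypothetical helper for illustration; brute-force neighbor search,
    continuous features only, needs at least two minority points.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a minority instance.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Step 2: find its k nearest minority neighbors (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Step 3: randomly choose one neighbor.
        neighbor = rng.choice(neighbors)
        # Step 4: interpolate at a random position on the segment between them.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Toy minority class with four 2D points.
minority = [(3.0, 3.1), (3.4, 2.9), (2.8, 3.5), (3.2, 3.3)]
new_points = smote_sketch(minority, n_new=6, k=3)
print(new_points)
```

Because every synthetic point is a blend of two real minority points, it always falls inside the region the minority class already occupies; SMOTE fills in that region rather than extending it.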


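Advantages: Tends to be less prone to overfitting than random oversampling, because the synthetic points are new rather than exact copies of existing ones.
Caveats: Can amplify noise if the minority class is itself noisy, since noisy points get interpolated too. Can blur class boundaries if classes overlap. Plain SMOTE only works on continuous features, because interpolating between category codes is meaningless. For datasets that mix continuous and categorical features, use SMOTE-NC (SMOTENC in imbalanced-learn); for purely categorical data, use SMOTEN.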
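
A small sketch of why interpolation breaks on categorical features. The integer encoding of blood type, the age feature, and the interpolate helper are assumptions made up for this illustration.

```python
# Hypothetical integer encoding of a categorical feature.
BLOOD_TYPE_CODES = {"A": 0, "B": 1, "AB": 2, "O": 3}

def interpolate(a, b, gap):
    """SMOTE-style interpolation between two feature vectors."""
    return [x + gap * (y - x) for x, y in zip(a, b)]

# Two minority patients: [age, encoded blood type].
patient_1 = [40.0, BLOOD_TYPE_CODES["A"]]
patient_2 = [60.0, BLOOD_TYPE_CODES["O"]]

synthetic = interpolate(patient_1, patient_2, gap=0.5)
age, blood_code = synthetic
valid_code = blood_code in BLOOD_TYPE_CODES.values()
print(synthetic, valid_code)  # [50.0, 1.5] False
```

An age of 50 is a plausible value, but a blood-type code of 1.5 corresponds to no category at all. Variants built for categorical data avoid this by not interpolating those columns numerically.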

SMOTE Visualizer: 40 majority · 8 minority · k=3 (imbalance ratio 5:1, 17% minority)

The minority class (violet) is heavily outnumbered. A classifier trained here will bias toward the majority class (indigo).

Legend: majority class, minority class, synthetic (SMOTE).

SMOTE in 2D: the minority class (violet) is oversampled by interpolating synthetic points between real minority examples and their k nearest neighbors.

Checkpoint

You apply SMOTE to a medical dataset with a mix of continuous and categorical features (like blood type and diagnosis codes). What should you be aware of?