A Decision Framework
When you encounter imbalanced data, walk through these questions in roughly this order:
Step 1: Is the Metric the Problem?
If you're optimizing for accuracy on an imbalanced dataset, the metric is part of the problem. Look at precision, recall, F1, and AUC-PR (precision-recall AUC) for the minority class. Sometimes "fixing" the metric is the right answer, not resampling the data.
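To make the accuracy trap concrete, here is a minimal sketch in plain Python. The helper name `minority_class_report`, the toy 99/1 label counts, and the convention of returning 0.0 when precision or recall is undefined are all assumptions chosen for illustration, not part of any particular library.

```python
def minority_class_report(y_true, y_pred, positive=1):
    """Accuracy plus precision/recall/F1 for the minority (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Assumption: report 0.0 when a ratio is undefined (no predicted/actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical dataset: 10 minority examples, 990 majority examples.
y_true = [1] * 10 + [0] * 990

# Model A: always predicts the majority class.
always_majority = [0] * 1000
report_a = minority_class_report(y_true, always_majority)

# Model B: catches 8 of 10 minority cases at the cost of 20 false alarms.
some_detection = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970
report_b = minority_class_report(y_true, some_detection)

print(report_a)  # accuracy is 0.99, but minority recall is 0.0
print(report_b)  # lower accuracy, yet recall is 0.8
```

In this toy setup the model that never finds a minority example wins on accuracy, which is why the minority-class metrics are the ones worth reading first.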
Step 2: Can You Get More Minority Data?
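Ask this before reaching for any resampling technique. If the answer is yes, collect more. Real minority examples are generally worth more than synthetic ones, because they carry information that cannot be derived from the points you already have.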
Step 3: Is the Imbalance Severe Enough to Warrant Synthetic Methods?
A 60/40 split usually isn't worth resampling. A 99/1 split almost always is. There's no hard cutoff, so use domain judgment about how much the minority class matters to your task.
Step 4: If Resampling, How?
Try SMOTE before random oversampling: it interpolates new minority points instead of duplicating existing ones, so it is less likely to cause overfitting. If the classes overlap significantly, combine SMOTE with Tomek links to clean up the boundary. Consider algorithm-level fixes too: class weights in your loss function (e.g., class_weight='balanced' in scikit-learn) can be effective without touching the data at all.
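The two ideas above can be sketched in plain Python. The first function reproduces the heuristic that scikit-learn documents for class_weight='balanced', n_samples / (n_classes * class_count). The second shows only the interpolation step at the heart of SMOTE. It is a simplified illustration, not the imbalanced-learn implementation, and the function names, toy points, and parameters `k` and `seed` are assumptions.

```python
import math
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbors (SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        t = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical 99/1 split: the minority class gets a weight of 50.0.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)

# Toy 2-D minority points, invented for illustration.
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_samples(minority_points, n_new=6)
```

Because each synthetic point lies between two real minority points, this approach adds variety that plain duplication does not. It also explains why overlapping classes are a problem: interpolated points can land inside majority territory, which is the situation Tomek-link cleaning is meant to address.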
You're working on a fraud detection model with 99.9% legitimate transactions and 0.1% fraud. Your manager wants you to 'balance the classes.' Walk through the decision framework: what questions would you ask first, and what approach would you recommend?