A Decision Framework

When you encounter imbalanced data, walk through these questions in roughly this order:

Step 1: Is the Metric the Problem?

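If you're optimizing for accuracy on an imbalanced dataset, the metric is part of the problem: a model that always predicts the majority class scores high accuracy while finding no minority cases at all. Look at precision, recall, F1, and AUC-PR (area under the precision-recall curve) for the minority class instead. Sometimes changing the metric is the right answer, not resampling the data.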
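To make the point concrete, here is a minimal sketch in plain Python. The label counts, the two sets of predictions, and the helper name minority_metrics are all made up for illustration; in practice you would likely use your library's metric functions.

```python
def minority_metrics(y_true, y_pred, positive=1):
    """Return accuracy plus precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical data: 990 majority (0) and 10 minority (1) examples.
y_true = [0] * 990 + [1] * 10

# Baseline that always predicts the majority class.
always_majority = [0] * 1000

# Hypothetical model: 20 false alarms on the majority class,
# and 8 of the 10 minority examples caught.
some_model = [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2

baseline = minority_metrics(y_true, always_majority)
model = minority_metrics(y_true, some_model)

print("always-majority:", baseline)
print("some model:     ", model)
```

The always-majority baseline wins on accuracy (0.99 versus 0.978) while its minority recall is zero, which is exactly the failure that accuracy hides.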

Step 2: Can You Get More Minority Data?

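Ask this before reaching for synthetic methods. If the answer is yes, collect it. Real minority examples are generally more valuable than synthetic ones, which can only recombine information already present in your data.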

Step 3: Is the Imbalance Severe Enough to Warrant Synthetic Methods?

A 60/40 split usually isn't worth resampling. A 99/1 split almost always is. There's no hard cutoff — use domain judgment about how much the minority class matters to your task.

Step 4: If Resampling, How?

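Try SMOTE before random oversampling: because it interpolates new minority points rather than duplicating existing ones, it is usually less prone to overfitting. If classes overlap significantly, combine SMOTE with Tomek links to clean the boundary. Consider algorithm-level fixes too: class weights in your loss function (e.g., class_weight='balanced' in sklearn) can be effective without touching the data at all. Whichever route you take, resample only the training split, so that validation and test sets keep the real class distribution.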
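As a rough illustration of what class weighting does, the sketch below computes weights with the heuristic scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * count of each class). The helper name balanced_class_weights and the label counts are assumptions for this example, not part of any library.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training labels: 990 majority (0) and 10 minority (1) examples.
labels = [0] * 990 + [1] * 10

weights = balanced_class_weights(labels)
print(weights)

# With these weights, each class contributes the same total weight to the loss.
total_weight_per_class = {
    cls: weights[cls] * count for cls, count in Counter(labels).items()
}
print(total_weight_per_class)
```

Each minority example ends up counting roughly as much as 99 majority examples, so the loss sees the two classes as equally important without any rows being added or removed.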

💭Reflection

You're working on a fraud detection model with 99.9% legitimate transactions and 0.1% fraud. Your manager wants you to 'balance the classes.' Walk through the decision framework: what questions would you ask first, and what approach would you recommend?