Handling Outliers and Transforming Data

Once you've detected outliers, you need to decide what to do. And regardless of outlier decisions, you'll almost always need to transform your features before modeling. The two tasks are related: handle outliers before applying scaling, since outliers distort both normalization and standardization.

For response variable (Y) outliers: don't remove them automatically. They may signal model deficiencies, incorrect assumptions, or missing features. Investigate first.

For predictor (X) outliers: first determine whether they're influential (use Cook's distance or a similar diagnostic). If they significantly steer the model, investigate the root cause. If they're real, natural observations, leave them in. If they're errors, remove or correct them.
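To make "influential" concrete, here is a minimal sketch of Cook's distance for a simple one-predictor linear regression, written with the standard library only. The toy data, the function name cooks_distance, and the 4/n cutoff are assumptions for illustration; 4/n is a common rule of thumb rather than a formal test, and a statistics library would normally do this for you.

```python
from statistics import mean

def cooks_distance(x, y):
    """Cook's distance for each point in a simple regression y ~ a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage h_ii: how far each x sits from the mean of x.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]

# Hypothetical toy data: the last point has an extreme x and falls off the trend.
x = [1, 2, 3, 4, 5, 6, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 15.0]

distances = cooks_distance(x, y)
threshold = 4 / len(x)  # rule-of-thumb cutoff (assumption), not a strict test
flagged = [i for i, d in enumerate(distances) if d > threshold]
print(flagged)
```

A flagged point is a prompt to investigate, not a verdict: the same checks from the paragraph above (real observation or error?) still decide what happens to it.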

Common Transforms

Min-Max Scaling (Normalization). Rescales each feature to [0, 1]: $x_{\text{scaled}} = (x - x_{\min}) / (x_{\max} - x_{\min})$. Use when data is bounded or binary, or for computer vision (pixel values). Watch out: sensitive to outliers, since a single extreme value compresses everything else into a narrow range.
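The outlier sensitivity is easy to see on a small example. The sketch below applies the min-max formula to a hypothetical list of values, once as-is and once with a single extreme value appended. The helper name min_max_scale and the data are made up for illustration.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to rescale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

typical = [20, 35, 50, 65, 80]
with_outlier = typical + [1000]  # one extreme value added

scaled_typical = min_max_scale(typical)
scaled_outlier = min_max_scale(with_outlier)

print(scaled_typical)      # evenly spread across [0, 1]
print(scaled_outlier[:5])  # the same five values, now squeezed near 0
```

With the outlier present, the original five values all land below 0.07, so the feature carries almost no usable contrast between them.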

Z-Score Standardization. Rescales to zero mean and unit variance: $x_{\text{standardized}} = (x - \mu) / \sigma$. Use when data is approximately Gaussian and unbounded. Helpful for clustering, PCA, and neural networks. Doesn't produce bounded output.
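As a quick check of the formula, this sketch standardizes a hypothetical list of values and confirms the result has mean 0 and standard deviation 1. It assumes the population standard deviation for sigma; using the sample standard deviation instead is an equally common choice and changes the numbers slightly.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize values with (x - mu) / sigma, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant feature: nothing to standardize.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

heights = [150, 160, 170, 180, 190]  # hypothetical data
z = z_score(heights)

print([round(v, 3) for v in z])
print(round(mean(z), 6), round(pstdev(z), 6))
```

Note that the largest and smallest standardized values fall outside [-1, 1], which is what "doesn't produce bounded output" means in practice.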

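Log Transformation. For right-skewed data: $x_{\text{log}} = \log(x + 1)$. Compresses large values and pulls the distribution toward normality. Often helpful before linear regression, ANOVA, and other methods that behave better on roughly symmetric data; these methods assume normally distributed residuals rather than normal features, but reducing skew frequently helps meet that assumption. Classic candidates: income, population, prices, word counts, file sizes.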
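A small sketch of the effect, assuming the natural logarithm (the formula above doesn't fix a base) and a made-up list of incomes. It uses the mean-to-median ratio as a rough skew indicator: a ratio well above 1 suggests a long right tail.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed incomes: one large value stretches the tail.
incomes = [18_000, 25_000, 32_000, 40_000, 55_000, 90_000, 350_000]

# log1p(x) computes log(x + 1) and is defined for x > -1.
logged = [math.log1p(v) for v in incomes]

raw_ratio = mean(incomes) / median(incomes)
log_ratio = mean(logged) / median(logged)
print(round(raw_ratio, 2), round(log_ratio, 2))

# The transform is invertible: expm1 undoes log1p.
recovered = [math.expm1(v) for v in logged]
```

After the transform the mean sits close to the median, and expm1 maps model outputs back to the original units when you need to report them.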

Transform Explorer

[Interactive chart: original values. Right-skewed income data, n = 80; a few large values stretch the tail. Range 3,836 to 343,841. Mean 53,808, median 36,319, std dev 55,013, min 3,836, skewness 2.68.]

Apply min-max, z-score, or log to a sample distribution and see the effect on shape and outlier behavior.

Checkpoint

You're building a KNN classifier. One feature is age (range 20–80) and another is annual income (range $20,000–$2,000,000). You apply no scaling. What happens?