Handling Outliers and Transforming Data
Once you've detected outliers, you need to decide what to do. And regardless of outlier decisions, you'll almost always need to transform your features before modeling. The two tasks are related: handle outliers before applying scaling, since outliers distort both normalization and standardization.
For response variable (Y) outliers: don't remove them automatically. They may signal model deficiencies, incorrect assumptions, or missing features. Investigate first.
For predictor (X) outliers: first determine whether they're influential, for example with Cook's distance, which measures how much the fitted model changes when a single observation is left out. If they significantly steer the model, investigate the root cause. If they're real, natural observations, leave them in. If they're errors, remove or correct them.
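To make the influence check concrete, here is a minimal sketch of Cook's distance for a simple one-predictor linear regression, written with the Python standard library only. The data, the function name `cooks_distance`, and the 4/n cutoff are illustrative assumptions; 4/n is a common rule of thumb, not a formal test.

```python
from statistics import mean


def cooks_distance(x, y):
    """Cook's distance for each point in a simple regression y ~ a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    # Ordinary least squares fit
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)

    # Leverage: how far each x value sits from the mean of x
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Cook's distance combines residual size with leverage
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


# Hypothetical data: y is roughly 2*x, except for the last observation
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 18.0, 15.0]

distances = cooks_distance(x, y)
threshold = 4 / len(x)  # rule-of-thumb cutoff (an assumption, not a strict test)
flagged = [i for i, d in enumerate(distances) if d > threshold]
print("Influential points (by index):", flagged)
```

A flagged point is a candidate for investigation, not automatic removal: the next step is still to find out whether it is an error or a genuine observation.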
Common Transforms
Min-Max Scaling (Normalization). Rescales each feature to [0, 1]: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$. Use when data is bounded or binary, or for computer vision (pixel values). Watch out: it is sensitive to outliers, because a single extreme value compresses everything else into a narrow range.
Z-Score Standardization. Rescales to zero mean and unit variance: $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the feature's mean and $\sigma$ its standard deviation. Use when data is approximately Gaussian and unbounded. Helpful for clustering, PCA, and neural networks. The output is not bounded.
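Log Transformation. For right-skewed data: $x' = \log(x)$, defined only for positive values. Compresses large values and pulls the distribution toward normality. Often helpful before linear regression, ANOVA, and other methods whose assumptions (such as approximately normal residuals) are easier to satisfy once skew is reduced. Classic candidates: income, population, prices, word counts, file sizes.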
Apply min-max, z-score, or log to a sample distribution and see the effect on shape and outlier behavior.
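One way to run that experiment without any plotting tools is the sketch below, which uses only the Python standard library. The sample values are hypothetical and chosen to include one extreme value. The helper names `min_max`, `z_score`, and `log_transform` are illustrative, not taken from any library.

```python
import math
from statistics import mean, pstdev


def min_max(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """Natural log of each value; every value must be positive."""
    return [math.log(v) for v in values]


# Hypothetical sample: seven ordinary values and one extreme value
sample = [20, 25, 30, 35, 40, 45, 50, 2000]

scaled = min_max(sample)
standardized = z_score(sample)
logged = log_transform(sample)

for name, result in [("min-max", scaled), ("z-score", standardized), ("log", logged)]:
    print(name, [round(v, 3) for v in result])
```

With this sample, min-max scaling squeezes the seven ordinary values below 0.02 while the extreme value sits at 1. The log transform keeps the ordinary values visibly spread out relative to the extreme one.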
You're building a KNN classifier. One feature is age (range 20–80) and another is annual income (range $20,000–$2,000,000). You apply no scaling. What happens?
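To check your answer, the sketch below compares Euclidean distances before and after min-max scaling. The three people are hypothetical; the feature ranges are the ones given in the question.

```python
import math

# Feature ranges taken from the question: (min, max)
AGE_RANGE = (20, 80)
INCOME_RANGE = (20_000, 2_000_000)


def scale(point):
    """Min-max scale an (age, income) pair using the ranges above."""
    age, income = point
    return (
        (age - AGE_RANGE[0]) / (AGE_RANGE[1] - AGE_RANGE[0]),
        (income - INCOME_RANGE[0]) / (INCOME_RANGE[1] - INCOME_RANGE[0]),
    )


# Hypothetical people as (age, annual income)
query = (30, 50_000)
similar_age = (32, 60_000)     # 2 years older, $10,000 more income
similar_income = (75, 51_000)  # 45 years older, $1,000 more income

raw_to_similar_age = math.dist(query, similar_age)
raw_to_similar_income = math.dist(query, similar_income)

scaled_to_similar_age = math.dist(scale(query), scale(similar_age))
scaled_to_similar_income = math.dist(scale(query), scale(similar_income))

print("raw distances:   ", raw_to_similar_age, raw_to_similar_income)
print("scaled distances:", scaled_to_similar_age, scaled_to_similar_income)
```

On the raw values, the person 45 years older counts as the nearer neighbor because the income difference in dollars swamps the age difference in years. After scaling, both features contribute on comparable terms and the person of similar age becomes the nearer one.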