Detecting Outliers

Outliers are points far from the rest. They can be measurement errors, data entry errors, processing artifacts — or real, important observations that just happen to be extreme. The data alone usually can't tell you which. That's why domain knowledge is the essential second step after detection.

Five sources of outliers to keep in mind:

Measurement error — faulty or miscalibrated instrument.
Data entry error — a human typed the wrong thing.
Experimental error — something went wrong during collection.
Data processing error — a pipeline bug introduced an artifact.
True outliers — real, natural observations that just happen to be extreme.

Discovering Outliers

Visual. Scatter plots and box plots. Anything outside the box plot's whiskers is conventionally flagged.

Z-score method. Assumes approximate normality. Compute $z = (x - \mu) / \sigma$ for each point. Flag any point where $|z| > k$ . Common choice: $k = 3$ . Here $k$ is a threshold multiplier — higher $k$ means only more extreme points are flagged; lower $k$ flags more aggressively. Sensitive to the normality assumption — if your data isn't approximately normal, the z-score method will mis-flag.

IQR method. Does not assume normality. Upper threshold: $Q3 + k \times IQR$ ; lower threshold: $Q1 - k \times IQR$ , where $k = 1.5$ is conventional (Tukey's rule). $k$ controls the fence width: larger $k$ widens the fences and flags fewer points; smaller $k$ narrows them and flags more. Any point outside this range is flagged.

Cook's distance. For regression: measures how much the regression parameters change when a specific point is removed. Useful for identifying influential observations — points that disproportionately steer the model.

⚠

Don't Assume Outliers Are Noise

In fraud detection, the outliers are the point — you're hunting them, not removing them. In sensor monitoring, outliers might be equipment failures you need to flag immediately. In customer analytics, outliers might be your most valuable customers. Always investigate before deciding what to do with an outlier.

Outlier Detector

IQR · k = 1.5Q1 − k·IQR / Q3 + k·IQR

lo: -38.70 · hi: 140.1

Z-Score · k = 3μ ± k·σ

lo: -152.84 · hi: 283.0

n = 63 values (right-skewed income data, $k)

IQR hi

Z hi

8.00480.0

IQR only2 points·Z only0 points·Both2 points

IQR flags more on the right tail — it uses the median-based spread rather than the mean and std, which are pulled upward by the extreme values, making z-score thresholds looser than you might expect.

Flagged (IQR)

Flagged (Z)

Only IQR

Only Z

Adjust the threshold multiplier for each method and see how IQR and z-score flag different points on right-skewed data.

Checkpoint

Your feature is right-skewed (long right tail). You want to detect outliers. Which method is more appropriate — z-score or IQR?

←PreviousWhen Missingness Is a FeaturePreprocessing Next→Handling Outliers and Transforming DataPreprocessing