Detecting Outliers
Outliers are points far from the rest. They can be measurement errors, data entry errors, processing artifacts — or real, important observations that just happen to be extreme. The data alone usually can't tell you which. That's why domain knowledge is the essential second step after detection.
Five sources of outliers to keep in mind:
- Measurement error — faulty or miscalibrated instrument.
- Data entry error — a human typed the wrong thing.
- Experimental error — something went wrong during collection.
- Data processing error — a pipeline bug introduced an artifact.
- True outliers — real, natural observations that just happen to be extreme.
Discovering Outliers
Visual. Scatter plots and box plots. Anything outside the box plot's whiskers is conventionally flagged.
Z-score method. Assumes approximate normality. Compute $z = (x - \mu) / \sigma$ for each point, where $\mu$ is the mean and $\sigma$ the standard deviation. Flag any point where $|z| > k$. Common choice: $k = 3$. Here $k$ is a threshold multiplier: higher means only more extreme points are flagged; lower flags more aggressively. Sensitive to the normality assumption: if your data isn't approximately normal, the z-score method will mis-flag.
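A minimal sketch of the z-score rule, using only the Python standard library. The function name `zscore_outliers`, the parameter `k`, and the `readings` data are hypothetical, chosen for illustration; the sample standard deviation is assumed here, though the population version is equally common.

```python
from statistics import mean, stdev

def zscore_outliers(values, k=3.0):
    """Return the values whose absolute z-score exceeds k."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (an assumption of this sketch)
    return [x for x in values if abs((x - mu) / sigma) > k]

# Hypothetical data: 20 readings near 10 plus one extreme value.
readings = [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [25.0]
print(zscore_outliers(readings))         # default threshold k = 3
print(zscore_outliers(readings, k=5.0))  # a stricter threshold flags fewer points
```

Note that the extreme value itself inflates the mean and standard deviation it is measured against, which is one reason the method can under-flag on small or skewed samples.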
IQR method. Does not assume normality. Upper threshold: $Q_3 + k \cdot \mathrm{IQR}$; lower threshold: $Q_1 - k \cdot \mathrm{IQR}$, where $\mathrm{IQR} = Q_3 - Q_1$ and $k = 1.5$ is conventional (Tukey's rule). $k$ controls the fence width: larger $k$ widens the fences and flags fewer points; smaller $k$ narrows them and flags more. Any point outside this range is flagged.
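The fences can be sketched as follows, again with the standard library only. The name `iqr_outliers` and the sample `data` are hypothetical. Quartile conventions differ between tools; this sketch assumes the "inclusive" (linear interpolation) method of `statistics.quantiles`, so fences computed elsewhere may differ slightly.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles() sorts internally; n=4 yields the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical data with one extreme value on the right.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(iqr_outliers(data))       # conventional multiplier 1.5
print(iqr_outliers(data, k=5))  # wider fences flag fewer points
```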
Cook's distance. For regression: measures how much the regression parameters change when a specific point is removed. Useful for identifying influential observations — points that disproportionately steer the model.
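For intuition, here is a sketch of Cook's distance for the simplest case: ordinary least squares with one predictor and an intercept. It uses the standard closed form $D_i = \frac{e_i^2}{p \, s^2} \cdot \frac{h_i}{(1 - h_i)^2}$, which gives the same value as refitting without point $i$. The function name `cooks_distance` and the data are hypothetical; for real models a statistics library would normally compute this for you.

```python
def cooks_distance(xs, ys):
    """Cook's distance for each point of a simple linear regression y = a + b*x."""
    n = len(xs)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)  # residual variance estimate s^2
    # Leverage of each point (diagonal of the hat matrix for one predictor).
    leverages = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]

# Hypothetical data: a near-linear trend with one point far off the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 20.0]
distances = cooks_distance(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: distances[i])
print(most_influential, round(distances[most_influential], 2))
```

The last point combines a large residual with high leverage, so it dominates the fit. A common rule of thumb treats values above 1 as worth a closer look, but that cutoff is a convention, not a test.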
Don't Assume Outliers Are Noise
In fraud detection, the outliers are the point — you're hunting them, not removing them. In sensor monitoring, outliers might be equipment failures you need to flag immediately. In customer analytics, outliers might be your most valuable customers. Always investigate before deciding what to do with an outlier.
Your feature is right-skewed (long right tail) and you want to detect outliers. Which method is more appropriate, z-score or IQR?
IQR. On right-skewed data it flags more points in the right tail, because its fences are built from quartiles rather than from the mean and standard deviation. Extreme values pull the mean and standard deviation upward, which makes z-score thresholds looser than you might expect.
To see this, adjust the threshold multiplier for each method and compare which points IQR and z-score flag on right-skewed data.
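The comparison between the two methods on right-skewed data can be tried with a small sketch like the one below. The data are synthetic (a lognormal sample with a fixed seed), and the helper names `zscore_flags` and `iqr_flags` are hypothetical; exact counts depend on the sample, so none are quoted here.

```python
import random
from statistics import mean, stdev, quantiles

# Synthetic right-skewed sample; the seed only makes the run repeatable.
rng = random.Random(0)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]

def zscore_flags(values, k):
    """Points more than k standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs(x - mu) > k * sigma]

def iqr_flags(values, k):
    """Points outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return [x for x in values if x < q1 - k * spread or x > q3 + k * spread]

z_count = len(zscore_flags(sample, k=3.0))
iqr_count = len(iqr_flags(sample, k=1.5))
print(f"z-score (k=3.0) flags {z_count} points")
print(f"IQR     (k=1.5) flags {iqr_count} points")
```

Changing the two `k` values shows how each multiplier trades off sensitivity: the long right tail inflates the mean and standard deviation, while the quartiles barely move.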