Descriptive Statistics
Descriptive statistics give you a compact numeric summary of each variable. Three kinds matter — and each one has downstream modeling implications, not just reporting value.
Central Tendency
Mean, median, mode. What's "typical"?
- The mean is the arithmetic average — sensitive to outliers.
- The median is the middle value when data is sorted — robust to outliers.
- The mode is the most common value — most useful for categorical data.
When the mean and median are close, the distribution is probably roughly symmetric. When they're far apart, the distribution is skewed: a single outlier can pull the mean far from where most of the data lives, while the median barely moves. If you report only the mean on skewed data, you're misleading yourself and your audience.
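To make the outlier effect concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures are hypothetical, invented purely for illustration.

```python
from statistics import mean, median, mode

# Hypothetical salaries in dollars; the last value is a deliberate outlier.
salaries = [48_000, 50_000, 52_000, 52_000, 55_000, 400_000]

avg = mean(salaries)     # arithmetic average, pulled upward by the outlier
mid = median(salaries)   # middle of the sorted values, barely affected
common = mode(salaries)  # most frequent value

print(f"mean={avg:,.0f}  median={mid:,.0f}  mode={common:,}")
```

With this sample the mean lands at more than twice the median, even though five of the six values sit within a few thousand dollars of each other.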
Dispersion
Dispersion tells you how spread out the data is. A high standard deviation, relative to the feature's scale, means values vary widely around the mean; a low one means they're tightly clustered. A feature with zero variance gives the model nothing to work with: it's a constant and should be dropped. The IQR (Q3 − Q1) spans the middle 50% of observations and is robust to extreme values, making it a useful complement to standard deviation for skewed data.
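The sketch below computes both spread measures and filters out constant features, again with the standard library only. The feature names and values are hypothetical, and the `iqr` helper is an illustrative function rather than part of any library. Note that quartile conventions differ between tools, so IQR values can vary slightly depending on the method chosen.

```python
from statistics import pstdev, quantiles

def iqr(values):
    """Interquartile range (Q3 - Q1) using the 'inclusive' quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical feature columns.
features = {
    "age": [23, 35, 41, 29, 52, 38],
    "country_code": [1, 1, 1, 1, 1, 1],  # constant: zero variance
    "income": [48_000, 50_000, 52_000, 52_000, 55_000, 400_000],
}

for name, col in features.items():
    print(f"{name}: std={pstdev(col):,.1f}  iqr={iqr(col):,.1f}")

# Keep only features whose values actually vary.
kept = {name: col for name, col in features.items() if pstdev(col) > 0}
print("kept:", list(kept))
```

For the `income` column, the single outlier inflates the standard deviation enormously while the IQR stays small, which is why the two are worth reporting together on skewed data.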
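Distribution Shape
Distribution shape is captured by skewness and kurtosis. Positive skewness means a long right tail (think income distributions); negative means a long left tail. Kurtosis measures tailedness: kurtosis above 3, the value for a normal distribution, means extreme values are more common than a normal distribution would predict. These matter for modeling: methods that assume normally distributed errors, such as classical linear regression, and methods that rely on distances can be distorted by heavy skew and extreme values. Heavy right skew in a strictly positive feature often signals that a log transform is worth trying before fitting.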
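As a rough illustration, the following sketch computes population skewness and excess kurtosis (kurtosis minus 3, so a normal distribution scores 0) from their moment definitions, then checks the effect of a log transform. The income values are hypothetical, and `skewness` and `excess_kurtosis` are illustrative helpers. Libraries may apply bias corrections, so their results can differ slightly. A plain log requires strictly positive values; `math.log1p` is a common alternative when zeros are present.

```python
import math
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mu = fmean(values)
    return fmean([(x - mu) ** k for x in values])

def skewness(values):
    """Population skewness: third central moment over std cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Population kurtosis minus 3, so a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Hypothetical right-skewed incomes in dollars.
income = [20_000, 30_000, 35_000, 40_000, 45_000,
          50_000, 60_000, 80_000, 150_000, 500_000]
log_income = [math.log(x) for x in income]  # valid because all values > 0

print(f"raw: skew={skewness(income):.2f}  "
      f"excess kurtosis={excess_kurtosis(income):.2f}")
print(f"log: skew={skewness(log_income):.2f}  "
      f"excess kurtosis={excess_kurtosis(log_income):.2f}")
```

On this sample the log transform reduces the skewness noticeably without eliminating it, which is typical: the transform compresses the right tail but is not guaranteed to produce a symmetric distribution.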
A feature has mean = $85,000 and median = $52,000. What does this tell you — and what should you do before using this feature in a model?