Residual Analysis and Confidence Intervals
Residual Analysis
After fitting a regression, examining patterns in the residuals is how you check whether the model's assumptions actually hold.
- Residuals vs. Fitted: Plot residuals on the y-axis against fitted values on the x-axis. You want a random, structureless cloud centered on zero. Watch for:
  - Curvature → the relationship is nonlinear, so a straight-line model is misspecified.
  - Fan or funnel shape → heteroscedasticity (non-constant variance).
  - Large individual outliers → points that may unduly influence the fit.
- Normal Q-Q: Plots the sorted residuals against the quantiles expected from a normal distribution. Points that hug the diagonal are consistent with normally distributed residuals. The Shapiro-Wilk test formalizes this as a hypothesis test (null: the residuals are normally distributed).
- Scale-Location: Plots $\sqrt{|\text{standardized residuals}|}$ vs. fitted values. A flat, horizontal band of points is consistent with homoscedasticity; an upward slope signals that variance grows with the fitted value (heteroscedasticity).
- Homoscedasticity test: the Breusch-Pagan test. Null hypothesis: residual variance is constant. A significant p-value → evidence of heteroscedasticity. Possible fixes: log-transform the outcome, use weighted least squares, or switch to a model that accommodates non-constant variance.
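The Breusch-Pagan check described above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: it uses the studentized (Koenker) form of the statistic, LM = n·R² from regressing the squared residuals on the predictor, and it assumes a single predictor so that the chi-square reference distribution has one degree of freedom. The simulated data and the helper names `ols_fit` and `breusch_pagan_1d` are hypothetical.

```python
import math
import numpy as np

def ols_fit(x, y):
    """Fit y = b0 + b1*x by least squares; return (coefficients, fitted, residuals)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return beta, fitted, y - fitted

def breusch_pagan_1d(x, residuals):
    """Studentized (Koenker) Breusch-Pagan test for one predictor.

    Regress the squared residuals on x. Under the null of constant
    variance, LM = n * R^2 is approximately chi-square with 1 df.
    """
    u2 = residuals ** 2
    _, _, u2_resid = ols_fit(x, u2)
    r2 = 1.0 - np.sum(u2_resid ** 2) / np.sum((u2 - u2.mean()) ** 2)
    lm = max(len(x) * r2, 0.0)  # guard against tiny negative rounding error
    # Chi-square(1) survival function: P(Z^2 > lm) = erfc(sqrt(lm / 2))
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
x = rng.uniform(1.0, 10.0, size=500)

# Simulated (hypothetical) data: same mean structure, different noise
y_const = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=x.size)   # constant variance
y_het = 2.0 + 3.0 * x + rng.normal(0.0, 0.5 * x, size=x.size)  # variance grows with x

for label, y in [("constant variance", y_const), ("growing variance", y_het)]:
    _, _, resid = ols_fit(x, y)
    lm, p = breusch_pagan_1d(x, resid)
    print(f"{label}: LM = {lm:.2f}, p = {p:.4f}")
```

With noise that grows this strongly with x, the second dataset should produce a very small p-value. For models with several predictors, a library routine such as the one in statsmodels is likely the better choice.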
Summary statistics to watch:
- Mean residual: should be near zero. A mean far from zero indicates systematic bias: the model is consistently over- or under-predicting. For OLS with an intercept, the training residuals average exactly zero by construction, so this check is informative mainly on held-out data.
- SD of residuals: the typical size of a prediction error. Smaller is better, but what counts as "small" depends on the scale of your outcome.
- SSE (Sum of Squared Errors): $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. The raw total of all squared residuals. It shrinks as the model fits better and is the quantity OLS regression directly minimizes.
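The three summary statistics above can be computed directly from the residuals of a fitted line. The sketch below uses a small set of made-up observations (the `x` and `y` values are hypothetical) and a least-squares fit via NumPy. The divisor `n - 2` in the residual standard deviation is the usual choice for a model with two estimated coefficients.

```python
import numpy as np

# Hypothetical observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Least-squares fit of y = b0 + b1 * x
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

n = len(y)
mean_resid = residuals.mean()        # ~0 on training data when an intercept is fitted
sse = np.sum(residuals ** 2)         # sum of squared errors
sd_resid = np.sqrt(sse / (n - 2))    # residual SD; 2 coefficients were estimated

print(f"mean residual = {mean_resid:.2e}")
print(f"SD of residuals = {sd_resid:.4f}")
print(f"SSE = {sse:.4f}")
```

Nudging either coefficient away from the fitted values and recomputing `sse` is a quick way to see that the least-squares solution gives the smallest SSE for this model form.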
- ✓ Residuals scatter randomly around zero, with no pattern.
- ✓ Spread is roughly constant across fitted values (homoscedastic).
- ✓ Q-Q plot points follow the diagonal, consistent with normality.
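The normality check in this checklist can be sketched numerically, without drawing the plot. The example below assumes SciPy is available and uses `scipy.stats.probplot` to get the Q-Q coordinates and the correlation of the points with the fitted diagonal, plus `scipy.stats.shapiro` for the Shapiro-Wilk test. The two residual vectors are simulated stand-ins, and `normality_summary` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Simulated stand-ins for residuals (hypothetical data)
normal_resid = rng.normal(0.0, 1.0, size=200)          # well-behaved
skewed_resid = rng.exponential(1.0, size=200) - 1.0    # right-skewed, mean near 0

def normality_summary(residuals):
    """Return (Q-Q correlation, Shapiro-Wilk p-value) for a residual vector."""
    # probplot returns the Q-Q coordinates and a least-squares line through them;
    # r close to 1 means the points lie close to a straight line.
    (theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
        residuals, dist="norm"
    )
    w_stat, p_value = stats.shapiro(residuals)
    return r, p_value

for label, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    r, p = normality_summary(resid)
    print(f"{label}: Q-Q correlation = {r:.4f}, Shapiro-Wilk p = {p:.4g}")
```

The skewed residuals should show a visibly lower Q-Q correlation and a very small Shapiro-Wilk p-value. With large samples the test can flag departures too small to matter in practice, so it is usually read alongside the plot rather than instead of it.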
Confidence Intervals
A confidence interval gives a range of plausible values for the true population parameter, computed from a sample. If the sampling procedure were repeated many times, about 95% of the resulting 95% CIs would contain the true value.
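Z-score method (population standard deviation $\sigma$ known):
$$\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$
T-score method (population standard deviation unknown; use this one in practice):
$$\bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}}$$
Here $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $\alpha = 0.05$ for a 95% interval.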
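The z-score and t-score methods above can be sketched for a 95% interval on a mean. The sample values and the "known" `sigma` below are hypothetical, chosen only for illustration. The z quantile comes from the standard library's `statistics.NormalDist`, and the t quantile from `scipy.stats.t.ppf`, so SciPy is assumed to be installed.

```python
import math
from statistics import NormalDist, mean, stdev
from scipy import stats  # used only for the t quantile

# Hypothetical sample, for illustration only
sample = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7, 5.2, 5.5]
n = len(sample)
x_bar = mean(sample)

confidence = 0.95
upper_tail = 1 - (1 - confidence) / 2  # 0.975 for a two-sided 95% interval

# Z-score method: x_bar +/- z * sigma / sqrt(n), with sigma treated as known
sigma = 0.3  # assumed population standard deviation (hypothetical)
z_crit = NormalDist().inv_cdf(upper_tail)
z_half = z_crit * sigma / math.sqrt(n)
z_ci = (x_bar - z_half, x_bar + z_half)

# T-score method: x_bar +/- t * s / sqrt(n), with s estimated from the sample
s = stdev(sample)  # sample standard deviation (n - 1 divisor)
t_crit = stats.t.ppf(upper_tail, df=n - 1)
t_half = t_crit * s / math.sqrt(n)
t_ci = (x_bar - t_half, x_bar + t_half)

print(f"z critical = {z_crit:.3f}, 95% CI = ({z_ci[0]:.3f}, {z_ci[1]:.3f})")
print(f"t critical = {t_crit:.3f}, 95% CI = ({t_ci[0]:.3f}, {t_ci[1]:.3f})")
```

For the same standard deviation, the t critical value is larger than the z critical value at small n, which widens the interval to account for the extra uncertainty in estimating the spread. The two converge as n grows.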
Common Misconception
A 95% CI does NOT mean "there's a 95% probability the true value is in this interval." That reading describes a Bayesian credible interval. In the frequentist interpretation the true parameter is fixed and the interval is what varies from sample to sample: if we repeated the procedure many times, about 95% of the computed intervals would contain the true parameter.
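The repeated-sampling interpretation can be checked with a small simulation. The sketch below draws many samples from a normal population whose mean is known to the simulation (the population parameters, sample size, and repeat count are all hypothetical choices), builds a 95% t-interval from each, and counts how often the interval contains the true mean. SciPy is assumed for the t quantile.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)     # fixed seed for reproducibility
true_mean, true_sd = 10.0, 3.0      # hypothetical population parameters
n, n_repeats = 20, 5000             # sample size and number of repeated samples

# Each row is one independent sample of size n
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))
means = samples.mean(axis=1)
std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)

lower = means - t_crit * std_errs
upper = means + t_crit * std_errs

# Any single interval either contains the true mean or it does not;
# the 95% figure describes the long-run fraction that do.
covered = (lower <= true_mean) & (true_mean <= upper)
coverage = covered.mean()
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95, with some run-to-run variation if the seed is changed.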