Residual Analysis and Confidence Intervals

Residual Analysis

After fitting a regression, examining patterns in the residuals is how you check whether the model's assumptions actually hold.

Residuals vs. Fitted: Plot residuals on the y-axis vs. fitted values on the x-axis. You want a random, structureless cloud centered on zero. Watch for:
- Curvature → the relationship is nonlinear, so a straight-line model is misspecified.
- Fan or funnel shape → heteroscedasticity (non-constant variance).
- Large individual outliers → may unduly influence the model.
Normal Q-Q: Plots the sorted residuals against the quantiles you'd expect from a normal distribution. Points hugging the diagonal are consistent with normally distributed residuals; systematic bends at the ends indicate skew or heavy tails. The Shapiro-Wilk test formalizes this as a hypothesis test (null: the residuals are normally distributed).
Scale-Location: Plots $\sqrt{|\text{standardized residual}|}$ vs. fitted values. A flat, horizontal band of points is consistent with homoscedasticity; an upward slope signals that variance is growing with the fitted value (heteroscedasticity).
Homoscedasticity test: Breusch-Pagan test. Null: residual variance is constant. A small p-value (e.g., below 0.05) is evidence of heteroscedasticity. Fix: log-transform the outcome, use weighted least squares, or switch to a model that accommodates non-constant variance.
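
To make the Breusch-Pagan idea concrete, here is a minimal sketch using only the Python standard library. It assumes a single predictor and uses Koenker's studentized variant of the statistic ($n \cdot R^2$ from regressing the squared residuals on the predictor, compared with a chi-square distribution with 1 degree of freedom). The helper names (`fit_line`, `residuals`, `breusch_pagan`) and the simulated data are illustrative assumptions, not part of any library.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def residuals(x, y):
    """Residuals y_i - y_hat_i from the fitted line."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def r_squared(x, y):
    """R^2 of the simple regression of y on x."""
    my = sum(y) / len(y)
    sse = sum(e * e for e in residuals(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst


def breusch_pagan(x, y):
    """Studentized (Koenker) Breusch-Pagan test for one predictor.

    Returns (LM statistic, p-value). Null: residual variance is constant.
    """
    e2 = [e * e for e in residuals(x, y)]
    # Auxiliary regression: squared residuals on x. Clamp at 0 to absorb
    # tiny negative values caused by floating-point rounding.
    lm = max(0.0, len(x) * r_squared(x, e2))
    # Upper tail of chi-square with 1 df: P(X > lm) = erfc(sqrt(lm / 2)).
    p = math.erfc(math.sqrt(lm / 2))
    return lm, p


# Simulated data (assumed for illustration): same mean line, different noise.
rng = random.Random(0)
x = [rng.uniform(1, 10) for _ in range(500)]
y_const = [2 + 3 * xi + rng.gauss(0, 1) for xi in x]       # constant variance
y_fan = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]  # spread grows with x

for label, y in [("constant variance", y_const), ("fan shape", y_fan)]:
    lm, p = breusch_pagan(x, y)
    print(f"{label}: LM = {lm:.2f}, p = {p:.4g}")
```

With noise that grows in proportion to `x`, the test should reject the null decisively; for the constant-variance data the p-value has no reason to be small. With several predictors the chi-square degrees of freedom would equal the number of predictors, which this one-predictor sketch does not cover.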

Summary statistics to watch:

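Mean residual: should be near zero. For OLS with an intercept, the training residuals average exactly zero by construction, so a clearly non-zero mean (on held-out data, or for a model without an intercept) indicates systematic bias: the model is consistently over- or under-predicting.
SD of residuals: the typical size of a prediction error. Smaller is better, but what counts as "small" depends on the scale of your outcome.
SSE (Sum of Squared Errors): $\sum (y_i - \hat{y}_i)^2$. The raw total of all squared residuals. It shrinks as the model fits better and is the quantity OLS regression directly minimizes. Because it grows with the number of observations, compare it only across models fit to the same data.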
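
As a small worked example, the three statistics can be computed directly from observed and predicted values. This is a sketch with hypothetical numbers; the function name `residual_summary` is an assumption, and the SD uses the sample convention (dividing by $n - 1$), which may differ from the residual standard error reported by regression software.

```python
import math


def residual_summary(y, y_hat):
    """Return (mean residual, sample SD of residuals, SSE)."""
    res = [yi - fi for yi, fi in zip(y, y_hat)]
    n = len(res)
    mean = sum(res) / n
    # Sample standard deviation (n - 1 denominator); an assumed convention.
    sd = math.sqrt(sum((r - mean) ** 2 for r in res) / (n - 1))
    sse = sum(r * r for r in res)
    return mean, sd, sse


y = [3.0, 5.0, 7.0, 10.0]     # hypothetical observed values
y_hat = [2.5, 5.5, 7.0, 9.0]  # hypothetical predictions
mean_res, sd_res, sse = residual_summary(y, y_hat)
print(f"mean = {mean_res:.3f}, SD = {sd_res:.3f}, SSE = {sse:.3f}")
```

The three quantities are linked: with this SD convention, $\text{SSE} = (n-1)\,\text{SD}^2 + n \cdot \text{mean}^2$, so a non-zero mean residual inflates SSE beyond what the spread alone would give.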

Residual Analysis Dashboard

Interactive demo: fit a regression model and display residual plots (Residuals vs. Fitted, Normal Q-Q, Scale-Location). Toggle the model type between a well-specified model and one with heteroscedasticity to see the patterns.

Diagnosis (sample readout):
- Residuals scatter randomly around zero, with no pattern.
- Spread is roughly constant across fitted values (homoscedastic).
- Q-Q plot points follow the diagonal, consistent with normality.

Summary statistics (sample readout): mean residual -0.371, SD of residuals 2.207, SSE 245.64.

Confidence Intervals

A confidence interval gives a range of plausible values for the true population parameter, computed from a sample. The confidence level describes the procedure: under repeated sampling, about 95% of the 95% CIs constructed this way would contain the true value.

Z-score method (population $\sigma$ known; if the data are not normal, rely on the central limit theorem with roughly $n \geq 30$):

$\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$

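T-score method (population $\sigma$ unknown, so the sample standard deviation $s$ stands in for it; use this one in practice):

$\bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}$

Here $t_{\alpha/2,\,n-1}$ is the critical value of the $t$ distribution with $n - 1$ degrees of freedom.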
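
The two formulas can be compared on a small sample. The sketch below uses only the standard library; the sample values, the "known" $\sigma = 0.2$, and the helper `mean_ci` are assumptions made for illustration. The standard library has no $t$ quantile function, so the critical value for 9 degrees of freedom is hard-coded from a standard $t$ table; with SciPy installed, `scipy.stats.t.ppf(0.975, df=n - 1)` would supply it.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_ci(xbar, sd, n, crit):
    """Interval xbar +/- crit * sd / sqrt(n), returned as (low, high)."""
    half_width = crit * sd / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical sample of n = 10 measurements.
sample = [4.8, 5.1, 5.3, 4.9, 5.0, 5.4, 4.7, 5.2, 5.1, 4.9]
n = len(sample)
xbar = mean(sample)

# Z method: requires a known population sigma (assumed here to be 0.2).
SIGMA_KNOWN = 0.2
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
z_ci = mean_ci(xbar, SIGMA_KNOWN, n, z_crit)

# T method: sigma unknown, so use the sample SD and a t critical value.
T_CRIT_DF9 = 2.262  # t_{0.025, 9}, taken from a standard t table
t_ci = mean_ci(xbar, stdev(sample), n, T_CRIT_DF9)

print(f"z interval: ({z_ci[0]:.3f}, {z_ci[1]:.3f})")
print(f"t interval: ({t_ci[0]:.3f}, {t_ci[1]:.3f})")
```

Both intervals are centered on the sample mean. In this example the t interval is wider, reflecting both the larger critical value and the extra uncertainty from estimating the standard deviation.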

⚠ Common Misconception

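A 95% CI does NOT mean "there's a 95% probability the true value is in this interval." That's a Bayesian statement (it describes a credible interval). In the frequentist view, the true parameter is fixed and any one computed interval either contains it or does not; the 95% refers to the procedure. If we repeated the sampling many times, ~95% of the computed intervals would contain the true parameter.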
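
The repeated-sampling interpretation can be checked with a short simulation. This is a sketch under stated assumptions: normally distributed data with a known $\sigma$ (so the z interval applies exactly), and arbitrary illustrative values for the true mean, sample size, and number of repetitions. The function name `coverage` is hypothetical.

```python
import math
import random
from statistics import NormalDist


def coverage(true_mu=10.0, sigma=2.0, n=25, reps=2000, level=0.95, seed=1):
    """Fraction of z intervals (known sigma) that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        # Draw a fresh sample and build its interval around the sample mean.
        xbar = sum(rng.gauss(true_mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= true_mu <= xbar + half_width:
            hits += 1
    return hits / reps


rate = coverage()
print(f"fraction of intervals containing the true mean: {rate:.3f}")
```

The printed fraction should land close to 0.95. Each individual interval either covers the true mean or misses it; the 95% describes how often the procedure succeeds across repetitions.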
