The Setup
Whether you are searching for a home or just wanting to window shop, apps that predict house prices are pretty neat. Let's say that we want to predict house prices for our own app. We have data on houses — how many bedrooms they have, where they're located, whether there's a pool — and we want to build a model that takes those features as input and outputs a predicted price.
Linear regression is one of the oldest and most useful tools for exactly this. But before we learn how it works, we need to understand when it's valid — because the model makes assumptions about your data, and if those assumptions are badly violated, the predictions and coefficient estimates can't be trusted.
Relationship is a straight line
The relationship between each predictor and the outcome is approximately linear. A straight line through the data should capture the trend without any systematic curve.
Observations don't influence each other
Each observation is independent of every other. Knowing the value of one residual tells you nothing about another. This is violated by clustered sampling, repeated measures, and time series data.
Residuals have constant variance
The spread of residuals is roughly the same across all fitted values. A model that's equally uncertain whether predicting a $200k or a $900k house satisfies this assumption.
Errors follow a bell curve
The residuals are approximately normally distributed around zero. This matters most for hypothesis tests on coefficients and for confidence intervals — not for the outcome variable itself.
These assumptions are checked after fitting a model, using residual plots. Violations bias coefficient estimates, corrupt standard errors, and make hypothesis tests unreliable.
With those constraints in mind, here's what we're building toward. We have a dataset of houses. Each house has a sale price (what we want to predict) and one or more features (what we'll use to predict it). The simplest case is one feature, one outcome, one line.
The question is: which line? Any line drawn through the scatter of houses makes some errors — it predicts 280k for a house that sold for 340k, or 420k for one that sold for 390k. The errors are called residuals, and the goal of linear regression is to find the line that makes those errors as small as possible in total.
The next section shows exactly how residuals work — and what "as small as possible" means mathematically.
A real estate analyst fits a linear regression model predicting house price from square footage. She notices that the residuals for small houses are all clustered near zero, but for large houses the residuals vary widely — some $200k too high, some $150k too low. Which assumption is most likely violated?