Variable Relationships
Once you understand each variable individually, you start asking about pairs and groups. This is where EDA starts generating genuine insight rather than just summary numbers.
The pairwise correlation matrix gives you a compact summary of linear relationships between numerical variables. Plot it as a heatmap: color-coded matrices make patterns jump out fast. Values close to +1 mean the variables move together strongly; values close to −1 mean they move in opposite directions; values close to 0 mean little or no linear relationship. A critical caveat: low correlation does not mean no relationship. Nonlinear relationships can be strong, and a linear correlation coefficient can miss them entirely.
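As a minimal sketch of both points, the snippet below builds a small synthetic table (the column names `x`, `linear`, `quadratic`, and `noise` are made up for illustration) and computes its pairwise Pearson correlations with pandas. The `quadratic` column depends strongly on `x`, yet its correlation with `x` should come out near zero because the dependence is not linear.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic, illustrative columns: not taken from any real dataset.
n = 500
x = rng.uniform(-3, 3, size=n)
df = pd.DataFrame({
    "x": x,
    "linear": 2.0 * x + rng.normal(0, 0.5, size=n),   # strong linear dependence on x
    "quadratic": x**2 + rng.normal(0, 0.5, size=n),   # strong but nonlinear dependence on x
    "noise": rng.normal(0, 1, size=n),                # unrelated to x
})

# Pairwise Pearson correlations between all numerical columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

The resulting `corr` DataFrame is what you would hand to a heatmap function (for example `seaborn.heatmap(corr, annot=True)`, assuming seaborn is installed). A scatter plot of `x` against `quadratic` would show the relationship that the matrix hides.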
The most common downstream implication: highly correlated features are largely redundant. Including all of them adds dimensionality without much new information and can destabilize some models; in linear regression, for example, coefficient estimates become unreliable under multicollinearity. We'll come back to this in feature selection (Chapter 7).
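One way to surface redundant features is to list every column pair whose absolute correlation exceeds a cutoff. The helper below is an illustrative sketch: the function name `highly_correlated_pairs`, the 0.9 threshold, and the synthetic `sqft`, `rooms`, and `age` columns are all assumptions, not part of any library or real dataset.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic housing-style data: rooms is constructed to track square footage closely.
rng = np.random.default_rng(1)
sqft = rng.normal(1800, 500, size=300)
houses = pd.DataFrame({
    "sqft": sqft,
    "rooms": sqft / 400 + rng.normal(0, 0.4, size=300),
    "age": rng.uniform(0, 80, size=300),  # independent of the other two
})

pairs = highly_correlated_pairs(houses, threshold=0.9)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.2f}")
```

The threshold is a judgment call rather than a rule; whichever value you pick, the output is a list of candidates to review, not an automatic drop list.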
Figure: Correlation heatmap of pairwise relationships in a sample dataset (for example, Boston-area housing: price, size, and neighborhood features). The data is synthetic and for educational purposes only; it does not represent real data relationships.
What to Look For Beyond Correlation
- Bimodal distributions (two peaks in a histogram) often indicate hidden subgroups worth investigating.
- Clusters in scatter plots suggest natural separability — and may change how you think about modeling.
- Strong nonlinear patterns hint at features that need engineering before they'll be useful to a linear model.
- Distribution differences in box plots by category: if a numerical variable's distribution differs across categories, that is fast visual evidence of a feature-target relationship.
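The box-plot check in the last bullet can also be done numerically. The sketch below computes the quartiles that a box plot would draw, per category, on synthetic data; the `neighborhood` and `price` columns and their values are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic prices for three made-up neighborhoods; "C" is deliberately shifted upward.
df = pd.DataFrame({
    "neighborhood": np.repeat(["A", "B", "C"], 200),
    "price": np.concatenate([
        rng.normal(300, 40, size=200),
        rng.normal(320, 40, size=200),
        rng.normal(500, 60, size=200),
    ]),
})

# Quartiles per category: the same numbers a box plot encodes as box edges and median line.
summary = (
    df.groupby("neighborhood")["price"]
      .quantile([0.25, 0.5, 0.75])
      .unstack()
)
summary.columns = ["q1", "median", "q3"]
print(summary.round(1))
```

If one category's median sits outside another's interquartile range, as it should for "C" here, the distributions differ visibly. To draw the plot itself, `df.boxplot(column="price", by="neighborhood")` works when matplotlib is installed.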
Correlation Is Not Causation
The classic warning still holds. Ice cream sales and shark attacks are correlated because both peak in summer; the missing (confounding) variable is temperature. Correlation is where most analyses begin, not where they end. It is evidence of a relationship to investigate, not evidence of a cause to act on.
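The ice cream example can be simulated. In the sketch below, every number is invented: temperature drives both series, and neither affects the other. The raw correlation comes out strong, but once the linear effect of temperature is removed from each series (a simple form of partial correlation), the remaining correlation should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Made-up data: temperature is the common cause of both variables.
temperature = rng.normal(20, 8, size=n)
ice_cream_sales = 5.0 * temperature + rng.normal(0, 20, size=n)
shark_attacks = 0.3 * temperature + rng.normal(0, 1.5, size=n)

def residuals(y, x):
    """Remove the best-fit linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation before and after controlling for temperature.
raw_r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)[0, 1]

print(f"raw correlation:             {raw_r:.2f}")
print(f"controlling for temperature: {partial_r:.2f}")
```

This only works because the confounder was measured. In real data the hard part is suspecting that such a variable exists in the first place.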
In a housing dataset, square footage and number of rooms have a correlation of 0.91. What is the most important modeling implication?