Visualization in EDA
You'll plot things during EDA primarily for yourself, not for stakeholders. These are throwaway plots — they don't have to be pretty. The goal is to learn something quickly. Save the polish for your final report. (We cover communication visualizations in Chapter 4.)
Figure: The four core EDA plot types (histogram, box plot, scatter plot, and correlation heatmap) applied to a shared dataset.
The EDA Visualization Toolkit
- Histograms — distribution of a single continuous variable. Skewness, peaks, gaps, all visible at a glance.
- Box plots — spread and outliers compactly. Box = IQR, line = median, whiskers = non-outlier extremes, dots = flagged outliers. Great for comparing distributions across categories.
- Scatter plots — relationship between two continuous variables. The most direct way to see correlations, clusters, and nonlinear patterns.
- Pair plots — all pairwise scatter plots in a grid, histograms on the diagonal. Indispensable when you have a modest number of features and want to scan everything at once.
- Correlation heatmaps — color-coded correlation matrix. Faster to read than raw numbers; patterns jump out immediately.
- Bar charts by category — count or distribution broken out by a categorical variable. Quick way to spot imbalance.
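As a rough illustration of the toolkit above, the sketch below draws each plot type with pandas and matplotlib. The dataset is synthetic and the column names (income, age, spend, segment) are hypothetical stand-ins, not data from this chapter. Treat it as one possible starting point rather than a recommended setup.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=0.5, size=n),  # right-skewed
    "age": rng.normal(40, 12, size=n),
    "segment": rng.choice(["a", "b", "c"], size=n, p=[0.6, 0.3, 0.1]),
})
df["spend"] = 0.1 * df["income"] + rng.normal(0, 500, size=n)
numeric = df.select_dtypes("number")

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Histogram: distribution of one continuous variable.
axes[0, 0].hist(df["income"], bins=30)
axes[0, 0].set_title("Histogram: income")

# Box plot: compare one variable's spread across categories.
df.boxplot(column="spend", by="segment", ax=axes[0, 1])
axes[0, 1].set_title("Box plot: spend by segment")

# Scatter plot: relationship between two continuous variables.
axes[0, 2].scatter(df["income"], df["spend"], s=8, alpha=0.5)
axes[0, 2].set_title("Scatter: income vs spend")

# Correlation heatmap: color-coded correlation matrix.
corr = numeric.corr()
im = axes[1, 0].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 0].set_xticks(range(len(corr.columns)))
axes[1, 0].set_xticklabels(corr.columns, rotation=45)
axes[1, 0].set_yticks(range(len(corr.columns)))
axes[1, 0].set_yticklabels(corr.columns)
axes[1, 0].set_title("Correlation heatmap")
fig.colorbar(im, ax=axes[1, 0])

# Bar chart by category: counts per level, useful for spotting imbalance.
df["segment"].value_counts().plot.bar(ax=axes[1, 1])
axes[1, 1].set_title("Bar chart: segment counts")

axes[1, 2].axis("off")  # unused panel
fig.suptitle("")        # clear the title that boxplot(by=...) adds
fig.tight_layout()

# Pair plot: pairwise scatter plots with histograms on the diagonal.
pair_axes = pd.plotting.scatter_matrix(numeric, diagonal="hist", figsize=(8, 8))
```

In a notebook the figures render inline; in a script, calling plt.show() or fig.savefig(...) would display or save them. Libraries such as seaborn offer shorter one-liners for the pair plot and heatmap, if you prefer them.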
Build Your 'First Plots' Routine
Many data scientists build a small set of "first plots" that they run on every new dataset. A useful default routine: shape → head → info → describe → pair plot (if fewer than ~10 features) → correlation heatmap → box plots per numerical feature → bar charts per categorical feature. The first four steps are quick tabular summaries; the rest are plots. The whole routine takes about ten minutes and, as a rule of thumb, tells you roughly 80% of what you need to know before you dig deeper. Build your own version and run it every time.
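One way to package that routine is a single helper you can call on any DataFrame. The sketch below is a minimal version under some assumptions: the function name first_plots and its max_pair_features parameter are hypothetical, it treats every non-numeric column as categorical, and the demo data is synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def first_plots(df, max_pair_features=10):
    """Hypothetical helper: print quick summaries, then build the first plots.

    Returns a dict of the matplotlib figures it created.
    """
    # Tabular summaries: shape, head, info, describe.
    print(df.shape)
    print(df.head())
    df.info()
    print(df.describe())

    num = df.select_dtypes("number")
    cat = df.select_dtypes(exclude="number")  # assumption: non-numeric = categorical
    figs = {}

    # Pair plot, only when the feature count is modest.
    if 2 <= num.shape[1] < max_pair_features:
        axes = pd.plotting.scatter_matrix(num, diagonal="hist", figsize=(8, 8))
        figs["pair"] = axes[0, 0].figure

    # Correlation heatmap.
    if num.shape[1] >= 2:
        corr = num.corr()
        fig, ax = plt.subplots()
        im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
        ax.set_xticks(range(len(corr.columns)))
        ax.set_xticklabels(corr.columns, rotation=45)
        ax.set_yticks(range(len(corr.columns)))
        ax.set_yticklabels(corr.columns)
        fig.colorbar(im, ax=ax)
        figs["corr"] = fig

    # One box plot per numerical feature.
    if num.shape[1] >= 1:
        fig, axes = plt.subplots(1, num.shape[1],
                                 figsize=(3 * num.shape[1], 3), squeeze=False)
        for ax, col in zip(axes[0], num.columns):
            ax.boxplot(num[col].dropna().to_numpy())
            ax.set_title(col)
        figs["box"] = fig

    # One bar chart of counts per categorical feature.
    if cat.shape[1] >= 1:
        fig, axes = plt.subplots(1, cat.shape[1],
                                 figsize=(3 * cat.shape[1], 3), squeeze=False)
        for ax, col in zip(axes[0], cat.columns):
            df[col].value_counts().plot.bar(ax=ax)
            ax.set_title(col)
        figs["bar"] = fig

    return figs


# Usage on a small synthetic dataset (hypothetical columns).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "height": rng.normal(170, 10, size=100),
    "weight": rng.normal(70, 12, size=100),
    "group": rng.choice(["x", "y"], size=100),
})
figs = first_plots(demo)
```

Returning the figures rather than showing them keeps the helper usable both in notebooks and in scripts that save plots to disk. Adjust the thresholds and plot choices to suit your own datasets.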
A histogram of your target variable has two distinct peaks. What is the most likely explanation, and what should you do?