Visualization in EDA

You'll plot things during EDA primarily for yourself, not for stakeholders. These are throwaway plots — they don't have to be pretty. The goal is to learn something quickly. Save the polish for your final report. (We cover communication visualizations in Chapter 4.)

[Interactive figure: EDA Visualization Toolkit, distribution view. Histogram of a single continuous variable (x-axis: sales units, axis labels 38 and 91) with the median and mean marked; a dataset selector switches between Product A and Product B.]

What to look for: Product A is roughly symmetric. Its mean and median are close together, so either is a fair summary. Product B, by contrast, has a spike-heavy distribution.

The four core EDA plot types — histogram, box plot, scatter plot, and correlation heatmap — applied to a shared dataset.
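The mean-versus-median comparison from the "what to look for" note can be checked numerically before you plot anything. The sketch below uses only the standard library; the two lists are made-up illustration data (not the Product A or Product B values from the figure), and `mean_median_gap` is a hypothetical helper name.

```python
import statistics


def mean_median_gap(values):
    """Return mean minus median; a value near zero suggests rough symmetry."""
    return statistics.mean(values) - statistics.median(values)


# Made-up illustration data, not taken from the lesson's dataset.
symmetric = [40, 50, 60, 70, 80]
right_skewed = [40, 42, 45, 50, 200]

print(mean_median_gap(symmetric))     # mean and median coincide
print(mean_median_gap(right_skewed))  # one large value pulls the mean upward
```

A gap that is large relative to the spread of the data is a hint to look at the histogram before trusting the mean as a summary.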

The EDA Visualization Toolkit

  • Histograms — distribution of a single continuous variable. Skewness, peaks, gaps, all visible at a glance.
  • Box plots — spread and outliers compactly. Box = IQR, line = median, whiskers = non-outlier extremes, dots = flagged outliers. Great for comparing distributions across categories.
  • Scatter plots — relationship between two continuous variables. The most direct way to see correlations, clusters, and nonlinear patterns.
  • Pair plots — all pairwise scatter plots in a grid, histograms on the diagonal. Indispensable when you have a modest number of features and want to scan everything at once.
  • Correlation heatmaps — color-coded correlation matrix. Faster to read than raw numbers; patterns jump out immediately.
  • Bar charts by category — count or distribution broken out by a categorical variable. Quick way to spot imbalance.
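To make the box plot anatomy in the list above concrete, here is a minimal sketch that computes the numbers a box plot draws. It assumes the common 1.5 × IQR rule for flagging outliers (the text does not say which rule it uses), and `box_plot_summary` is a hypothetical helper name. Quartile conventions differ between tools, so the exact values may not match a given plotting library.

```python
import statistics


def box_plot_summary(values, whisker_factor=1.5):
    """Compute box plot components for a list of numbers.

    Assumes the 1.5 * IQR outlier rule; needs at least two data points.
    """
    data = sorted(values)
    # Quartiles: the box spans q1..q3 and the line inside it is the median.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Whiskers end at the most extreme points that are still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    # Anything beyond the fences is drawn as an individual dot.
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Made-up data with one obvious extreme value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(summary)
```

Computing the components by hand like this is mostly useful for understanding what the plot encodes; in practice the plotting library does the same arithmetic for you.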

Build Your 'First Plots' Routine

Every data scientist builds a small set of "first plots" they run on every new dataset. A useful default routine: shape → head → info → describe → pair plot (if fewer than ~10 features) → correlation heatmap → box plots per numerical feature → bar charts per categorical feature. It takes about ten minutes and tells you 80% of what you need to know before anything else. Build your own version and run it every time.
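The default routine described above could be wrapped in a single function along these lines. This is a sketch, not a canonical implementation: it assumes pandas and matplotlib are installed, `first_plots` and its `max_pair_features` parameter are hypothetical names, and it uses pandas' built-in `scatter_matrix` for the pair plot rather than any particular library the course may have in mind.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def first_plots(df, max_pair_features=10):
    """Run a 'first plots' pass over df and return the figures keyed by name."""
    # Text summaries: shape, head, info, describe.
    print(df.shape)
    print(df.head())
    df.info()
    print(df.describe())

    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude=["number", "datetime"])
    figures = {}

    # Pair plot, only while the grid stays small enough to read.
    if 2 <= numeric.shape[1] < max_pair_features:
        axes = pd.plotting.scatter_matrix(numeric, diagonal="hist", figsize=(8, 8))
        figures["pair_plot"] = axes[0, 0].figure

    # Correlation heatmap of the numeric columns.
    if numeric.shape[1] >= 2:
        corr = numeric.corr()
        fig, ax = plt.subplots()
        image = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
        ax.set_xticks(range(len(corr.columns)))
        ax.set_xticklabels(corr.columns, rotation=90)
        ax.set_yticks(range(len(corr.columns)))
        ax.set_yticklabels(corr.columns)
        fig.colorbar(image, ax=ax)
        figures["correlation_heatmap"] = fig

    # One box plot per numeric feature.
    for column in numeric.columns:
        fig, ax = plt.subplots()
        ax.boxplot(numeric[column].dropna().to_numpy())
        ax.set_title(column)
        figures[f"box_{column}"] = fig

    # One bar chart of value counts per categorical feature.
    for column in categorical.columns:
        fig, ax = plt.subplots()
        df[column].value_counts().plot.bar(ax=ax)
        ax.set_title(column)
        figures[f"bar_{column}"] = fig

    return figures
```

Returning the figures in a dictionary keeps the function usable in both notebooks and scripts: call `figures["correlation_heatmap"].savefig(...)` to keep one, or `plt.close("all")` to discard them once you have looked.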

Checkpoint

A histogram of your target variable has two distinct peaks. What is the most likely explanation, and what should you do?