Worked Example: The Iris Dataset
The classic Iris dataset is too clean to be realistic, but it's a perfect training ground because every step works and the lessons transfer. It contains 150 flowers, 50 from each of three species (setosa, versicolor, virginica), each measured on four numerical features: sepal length, sepal width, petal length, petal width. Let's walk through a complete EDA pass and extract a modeling plan from it.
Step 1 — Dimensions and Types. 150 rows × 5 columns. The four measurement features are float64. The target (species) is a categorical label, typically loaded as an object (string) column, so it needs encoding for models that expect numeric labels. No missing values, which is rare in real data and is one sign that Iris is a curated teaching dataset.
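The Step 1 checks can be sketched in a few lines of pandas. This is a minimal sketch that assumes the copy of Iris bundled with scikit-learn; the snake_case column names and the `species` column are chosen here for illustration and are not part of that loader's output.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled copy of Iris. Other copies
# (seaborn, the UCI file) differ slightly in naming and in a few values.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
# Map the integer target back to species names so it behaves like a string label.
df["species"] = iris.target_names[iris.target.to_numpy()]

n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()

print(f"{n_rows} rows x {n_cols} columns")
print(df.dtypes)                      # four float64 features plus a string-like label
print(missing_per_column)             # count of missing values in each column
print(df["species"].value_counts())   # class balance
```

Depending on the pandas version, the `species` column may be reported as `object` or as a dedicated string dtype; either way it is a categorical label rather than a measurement.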
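Step 2 — Descriptive Statistics. Mean sepal length ≈ 5.84 cm; median ≈ 5.8 cm, so the distribution is roughly symmetric. Standard deviation ≈ 0.83 cm. For the two sepal features, skewness and excess kurtosis are close to zero, which is consistent with approximately normal distributions. Petal length and petal width tell a different story: their spread is large relative to their means and their excess kurtosis is clearly negative, an early hint of the bimodality that shows up in Step 5.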
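The summary statistics above can be reproduced roughly as follows. As before, this sketch assumes scikit-learn's bundled copy of Iris and illustrative column names; `skew()` and `kurt()` are the pandas sample estimators, and `kurt()` reports excess kurtosis (0 for a normal distribution).

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris, with illustrative column names.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One row per feature, one column per statistic.
summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "std": df.std(),                   # sample standard deviation (ddof=1)
    "skew": df.skew(),
    "excess_kurtosis": df.kurt(),
})
# Spread relative to the mean makes features on different scales comparable.
summary["coef_of_variation"] = summary["std"] / summary["mean"]

print(summary.round(2))
```

Comparing the sepal rows with the petal rows in this table is a quick way to see which features look roughly normal and which deserve a closer look in a histogram.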
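Step 3 — Data Quality. No inconsistencies and nothing alarming in the value ranges. A duplicate check typically flags one to three identical rows, depending on which copy of the dataset you load; with measurements recorded to 0.1 cm, these are plausibly different flowers with identical measurements rather than entry errors. Everything else passes, which again is why this is a teaching dataset. Expect far more work on real data.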
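A minimal sketch of these quality checks, again assuming scikit-learn's bundled copy and illustrative column names. The exact duplicate count depends on which copy of Iris is loaded, so the code reports it rather than assuming a value.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris, with illustrative column names.
iris = load_iris(as_frame=True)
df = iris.data.copy()
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df.columns = feature_cols
df["species"] = iris.target_names[iris.target.to_numpy()]

# Exact duplicate rows (all four measurements and the label identical).
n_duplicates = int(df.duplicated().sum())

# Range check: physical measurements in cm should be strictly positive.
value_ranges = df[feature_cols].agg(["min", "max"]).T
all_positive = bool((df[feature_cols] > 0).all().all())

# Consistency check: only the three expected labels should appear.
expected_labels = {"setosa", "versicolor", "virginica"}
unexpected_labels = set(df["species"]) - expected_labels

print("duplicate rows:", n_duplicates)
print(value_ranges)
print("all measurements positive:", all_positive)
print("unexpected labels:", unexpected_labels)
```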
Step 4 — Variable Relationships
The correlation between petal length and petal width is ≈ 0.96, which is extremely high. Biologically this makes sense: bigger petals are bigger in both dimensions. For modeling, this signals redundancy: we probably don't need both features. The pair plot makes species clusters obvious: setosa is linearly separable from the other two species, while versicolor and virginica overlap slightly. That suggests simple models (k-NN, decision tree, logistic regression) should work well.
Figure: Pair plot of the Iris dataset
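The correlation check in this step can be sketched as below. The 0.9 cutoff for flagging a pair as "redundant" is an assumption made for illustration, not a standard threshold; the dataset copy and column names are the same illustrative choices as in the other sketches.

```python
from itertools import combinations

from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris, with illustrative column names.
iris = load_iris(as_frame=True)
df = iris.data.copy()
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df.columns = feature_cols

# Pearson correlation for every unordered pair of features.
pair_corr = {
    (a, b): float(df[a].corr(df[b]))
    for a, b in combinations(feature_cols, 2)
}

# Rank pairs by absolute correlation, strongest first.
ranked = sorted(pair_corr.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical cutoff: treat |r| > 0.9 as a redundancy warning.
REDUNDANCY_THRESHOLD = 0.9
redundant = [(pair, r) for pair, r in ranked if abs(r) > REDUNDANCY_THRESHOLD]

for (a, b), r in ranked:
    print(f"{a:>12} vs {b:<12} r = {r:+.2f}")
print("flagged as redundant:", redundant)
```

Note that these are pooled correlations over all 150 flowers; computing them per species with `df.groupby(...)` is a natural follow-up when classes differ as much as they do here.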
Step 5 — Visualization. The histogram of petal length is bimodal — two distinct peaks, likely corresponding to setosa (much smaller petals) vs. the other two species. The box plots reveal a few outliers in sepal width. The pair plot confirms the species clusters are visible to the naked eye.
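The plots themselves are the point of this step, but the two observations can also be checked numerically. This sketch assumes scikit-learn's bundled Iris and uses the conventional 1.5 × IQR box-plot rule for outliers; the bin count of 12 is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris, with illustrative column names.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df["species"] = iris.target_names[iris.target.to_numpy()]

# Bimodality: text histogram of petal length (look for a gap between peaks).
counts, edges = np.histogram(df["petal_length"], bins=12)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:4.2f}-{right:4.2f} | {'#' * int(count)}")

# The gap corresponds to setosa versus the other two species.
is_setosa = df["species"] == "setosa"
setosa_max = float(df.loc[is_setosa, "petal_length"].max())
others_min = float(df.loc[~is_setosa, "petal_length"].min())
print("largest setosa petal:", setosa_max, "| smallest other petal:", others_min)

# Box-plot outliers in sepal width: points beyond 1.5 * IQR from the quartiles.
q1 = df["sepal_width"].quantile(0.25)
q3 = df["sepal_width"].quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df.loc[(df["sepal_width"] < low) | (df["sepal_width"] > high), "sepal_width"]
print("sepal width outliers:", sorted(outliers.tolist()))
```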
Step 6 — Feature Engineering Hint from EDA
A new feature like petal_area = petal_length × petal_width shows even cleaner separation between species in histograms. EDA just handed us a feature engineering recipe. This is the goal: EDA gives you a concrete plan for what to do next.
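A quick way to test this idea is to compute the feature and compare its range per species. The sketch below assumes scikit-learn's bundled Iris; `petal_area` is simply the length × width product described above, not a true geometric area.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris, with illustrative column names.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df["species"] = iris.target_names[iris.target.to_numpy()]

# Engineered feature: product of the two highly correlated petal measurements.
df["petal_area"] = df["petal_length"] * df["petal_width"]

# Per-species range of the new feature; non-overlapping ranges mean clean separation.
area_by_species = df.groupby("species")["petal_area"].agg(["min", "median", "max"])
print(area_by_species.round(2))
```

If one species' maximum sits below another's minimum, a single threshold on `petal_area` separates them; where the ranges overlap, the feature helps but does not finish the job on its own.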
The Modeling Plan EDA Produces
- With 150 samples, complex models are overkill — start simple.
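- The categorical target needs encoding for libraries that expect numeric labels (label encoding is enough for a target; one-hot is mainly needed for neural-network outputs).
- Numerical features will benefit from standardization for distance-based or regularized models such as k-NN and logistic regression; tree-based models don't need it.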
- Highly correlated features (petal length + petal width) suggest considering PCA or dropping one.
- Linear separability suggests starting with logistic regression.
EDA output is a plan. If you finished EDA and don't have a plan, you didn't finish EDA.
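As one possible reading of the plan, here is a minimal baseline sketch using scikit-learn: label-encode the target, standardize the features, and fit logistic regression with 5-fold cross-validation. The choice of `max_iter=1000` and `cv=5` is an assumption for illustration, and the resulting scores are a starting point to compare against, not a result claimed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assumption: scikit-learn's bundled copy of Iris.
iris = load_iris()
X = iris.data
species = iris.target_names[iris.target]   # string labels, as in the EDA

# Plan item: encode the categorical target as integers.
y = LabelEncoder().fit_transform(species)

# Plan items: standardize features, start with a simple linear model.
# Putting the scaler inside the pipeline means it is refit on each training
# fold, so the held-out fold does not leak into the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation is a reasonable default for only 150 samples.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = float(scores.mean())

print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", round(mean_accuracy, 3))
```

From this baseline, the remaining plan items (dropping one petal feature, trying PCA, adding `petal_area`) can each be tested as a single change against the same cross-validation setup.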
The correlation between petal_length and petal_width in Iris is 0.96. What is the correct EDA conclusion — and what is NOT warranted?
Think of a real dataset you'd like to work with. Walk through the six EDA steps mentally. At which step do you expect to find the most surprises — and why?