Worked Example: The Iris Dataset

The classic Iris dataset is too clean to be realistic, but it's a perfect training ground because every step works and the lessons transfer. It contains 150 flowers across three species (50 each), each measured on four numerical features: sepal length, sepal width, petal length, petal width. Let's walk through a complete EDA pass and extract a modeling plan from it.

Step 1 — Dimensions and Types. 150 rows × 5 columns. The four measurement features are float64. The target (species) is an object column: it is categorical, so it needs encoding before modeling. There are no missing values, which is rare in real data. This alone tells us Iris is a teaching dataset.
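A minimal sketch of the Step 1 checks, assuming scikit-learn's bundled copy of Iris and pandas are installed. The helper name load_iris_frame and the snake_case column names are illustrative choices, not part of the dataset itself.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return Iris with four float features and a string species column."""
    iris = load_iris(as_frame=True)
    # "sepal length (cm)" -> "sepal_length", and so on; "target" is unchanged.
    df = iris.frame.rename(columns=lambda c: c.replace(" (cm)", "").replace(" ", "_"))
    # scikit-learn stores the target as integer codes; map them back to names
    # so the frame resembles the raw CSV that the text describes.
    names = {code: str(name) for code, name in enumerate(iris.target_names)}
    df["species"] = df.pop("target").map(names)
    return df


df = load_iris_frame()
print(df.shape)               # (rows, columns)
print(df.dtypes)              # species is object or a string dtype, depending on pandas version
print(df.isna().sum().sum())  # total number of missing cells
```

On a real dataset the same three calls are usually where the first surprises appear: numeric columns read as text, or missing values hidden behind placeholder codes.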

Step 2 — Descriptive Statistics. Mean sepal length ≈ 5.84 cm and median ≈ 5.8 cm, so the distribution is roughly symmetric. Its standard deviation is ≈ 0.83 cm. Skewness and kurtosis values reasonably close to zero suggest approximately normal distributions for the sepal features. Petal length and petal width vary far more relative to their means and tell a different story, which the histogram in Step 5 makes visible.
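The summary statistics in Step 2 can be reproduced with a sketch like the one below, again assuming scikit-learn's copy of Iris. The cv column (standard deviation divided by mean) is an addition for illustration; pandas reports bias-corrected skewness and excess kurtosis, so values from other tools may differ slightly.

```python
from sklearn.datasets import load_iris

X = load_iris(as_frame=True).data
X.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One row per feature, one column per statistic.
stats = X.agg(["mean", "median", "std", "skew", "kurt"]).T

# Coefficient of variation: spread relative to the feature's own mean.
stats["cv"] = stats["std"] / stats["mean"]

print(stats.round(2))
```

Comparing mean against median is a quick symmetry check, and the cv column makes it easier to compare spread across features measured on different scales.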

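Step 3 — Data Quality. Almost no duplicate rows: depending on which copy of the dataset you load, one to three rows repeat exactly, which is plausible for measurements recorded to 0.1 cm. No inconsistencies. Nothing alarming in the value ranges. Everything passes, which again is why it's a teaching dataset. Expect far more work on real data.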
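A sketch of the Step 3 quality checks, assuming scikit-learn's copy of Iris (where the target column holds integer species codes). The duplicate count can differ between copies of the dataset and is not necessarily zero, so treat the printed number as something to inspect rather than a pass/fail result.

```python
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
features = df.drop(columns="target")

n_dupes = int(df.duplicated().sum())             # rows identical to an earlier row
ranges = features.agg(["min", "max"])            # smallest and largest value per feature
all_positive = bool((features > 0).all().all())  # lengths and widths should be > 0
class_counts = df["target"].value_counts()       # number of samples per species code

print("exact duplicate rows:", n_dupes)
print(ranges)
print("all measurements positive:", all_positive)
print(class_counts)
```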

Step 4 — Variable Relationships

The correlation between petal length and petal width is ≈ 0.96, which is extremely high. Biologically this makes sense: bigger petals are bigger in both dimensions. For modeling, this signals redundancy: we may not need both features. The pair plot makes species clusters obvious. Setosa is linearly separable from the other two species, while versicolor and virginica overlap slightly, so simple models (k-NN, decision tree, logistic regression) should work well.
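The correlation quoted above can be checked with a short sketch, assuming scikit-learn's copy of Iris and Pearson correlation (the pandas default).

```python
from sklearn.datasets import load_iris

X = load_iris(as_frame=True).data
X.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

corr = X.corr(method="pearson")               # 4 x 4 correlation matrix
r = corr.loc["petal_length", "petal_width"]   # the pair discussed in the text

print(corr.round(2))
print(f"petal_length vs petal_width: r = {r:.2f}")
```

Scanning the whole matrix rather than a single pair is worthwhile: any off-diagonal entry near ±1 flags a candidate for redundancy.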

Figure: Pair plot of the Iris dataset (150 samples · 3 species · 4 features). Axes: SL = Sepal Length (cm), SW = Sepal Width (cm), PL = Petal Length (cm), PW = Petal Width (cm). Points are colored by species (Setosa, Versicolor, Virginica); the diagonal shows per-species histograms.
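A static pair plot similar to the one described here can be sketched with pandas and matplotlib, assuming both are installed alongside scikit-learn. This is not the code behind the original figure: the diagonal in this version shows one overall histogram per feature rather than per-species histograms, and the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X = iris.data
X.columns = ["SL", "SW", "PL", "PW"]  # same abbreviations as the figure

# 4 x 4 grid of scatter plots; points are colored by integer species code.
axes = pd.plotting.scatter_matrix(
    X, c=iris.target, diagonal="hist", figsize=(8, 8), alpha=0.8
)
fig = axes[0, 0].get_figure()
```

Calling fig.savefig("iris_pairplot.png") afterwards writes the grid to disk (the file name is arbitrary).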

Step 5 — Visualization. The histogram of petal length is bimodal — two distinct peaks, likely corresponding to setosa (much smaller petals) vs. the other two species. The box plots reveal a few outliers in sepal width. The pair plot confirms the species clusters are visible to the naked eye.
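Both observations in Step 5 can be checked numerically. The sketch below assumes scikit-learn's copy of Iris (where species code 0 is setosa) and uses the usual box-plot convention of flagging points more than 1.5 × IQR beyond the quartiles; other outlier rules would give different results.

```python
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]

# Box-plot rule: flag values more than 1.5 * IQR outside the quartiles.
sw = df["sepal_width"]
q1, q3 = sw.quantile(0.25), sw.quantile(0.75)
iqr = q3 - q1
outliers = sw[(sw < q1 - 1.5 * iqr) | (sw > q3 + 1.5 * iqr)]
print("sepal_width outliers:", sorted(outliers.tolist()))

# Bimodality check: does setosa occupy its own petal-length range?
is_setosa = df["target"] == 0
setosa_max = df.loc[is_setosa, "petal_length"].max()
others_min = df.loc[~is_setosa, "petal_length"].min()
print("largest setosa petal length:", setosa_max)
print("smallest non-setosa petal length:", others_min)
```

If the largest setosa value is below the smallest value of the other species, the two peaks in the histogram come from non-overlapping groups.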

Step 6 — Feature Engineering Hint from EDA

A new feature like petal_area = petal_length × petal_width shows even cleaner separation between species in histograms. EDA just handed us a feature engineering recipe. This is the goal: EDA gives you a concrete plan for what to do next.
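The engineered feature takes one line to build. The sketch below assumes scikit-learn's copy of Iris and summarizes petal_area per species instead of plotting it; the product is a rectangle approximation used as a feature, not a true measurement of petal area.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]
df["species"] = df["target"].map(lambda code: str(iris.target_names[code]))

# Engineered feature from the text: length times width.
df["petal_area"] = df["petal_length"] * df["petal_width"]

# Per-species range of the new feature; non-overlapping ranges mean clean separation.
summary = df.groupby("species")["petal_area"].agg(["min", "median", "max"])
print(summary.round(2))
```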

The Modeling Plan EDA Produces

With 150 samples, complex models are overkill — start simple.
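The categorical target needs encoding (label encoding is enough for most classifiers; one-hot only if the model requires it).
Numerical features will benefit from standardization, especially for scale-sensitive models such as k-NN and regularized logistic regression.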
Highly correlated features (petal length + petal width) suggest considering PCA or dropping one.
Linear separability suggests starting with logistic regression.
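The plan above can be turned into a first baseline. This is a minimal sketch under stated assumptions, not a recommended final model: it uses scikit-learn, label-encodes the species, standardizes inside a pipeline so scaling is fit only on each training fold, and compares all four features against a version with petal width dropped. The 5-fold split and random_state=0 are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

iris = load_iris()
X = iris.data                               # columns: SL, SW, PL, PW
species = iris.target_names[iris.target]    # string labels, as in the raw data

# Plan item: encode the categorical target as integers.
encoder = LabelEncoder()
y = encoder.fit_transform(species)

# Plan items: standardize the features, start with logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores_all = cross_val_score(model, X, y, cv=cv)          # all four features
scores_drop = cross_val_score(model, X[:, :3], y, cv=cv)  # petal width dropped

print(f"all features:      {scores_all.mean():.3f}")
print(f"drop petal width:  {scores_drop.mean():.3f}")
```

If the two scores are close, the redundancy seen in the correlation matrix carries over to this model; with only 150 samples, small differences between them should not be over-interpreted.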

EDA output is a plan. If you finished EDA and don't have a plan, you didn't finish EDA.

Checkpoint

The correlation between petal_length and petal_width in Iris is 0.96. What is the correct EDA conclusion — and what is NOT warranted?

Checkpoint

Think of a real dataset you'd like to work with. Walk through the six EDA steps mentally. At which step do you expect to find the most surprises — and why?
