Data Quality

After descriptive statistics, run a quality pass. The goal is to surface problems before they become invisible assumptions baked into your model. Four questions drive this pass.

  • Duplicate rows? Duplicates can be exact (same row appears twice) or near-duplicates (same entity with small differences — the same person entered twice with slightly different names).
  • Inconsistent values? A "country" column that mixes "USA", "United States", and "U.S."; a "dates" column in three different formats; an "age" column containing values of 200.
  • Outliers or extreme values? Outliers can be sensor failures, data entry errors, or real-but-rare events. The data alone usually can't tell you which.
  • Values that make sense given domain knowledge? A heart rate of 800 bpm is not a real heart rate. A house listed at $1.00 is almost certainly a placeholder or an entry error, not a deal. Sanity-check each column against what is physically or practically possible.
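The four questions above can be sketched in plain Python. This is a minimal illustration, not a complete quality pipeline: the records, column names, and plausible ranges are all made up for the example, and real limits should come from someone who knows the domain.

```python
from collections import Counter

# Hypothetical records; names, columns, and values are invented for illustration.
rows = [
    {"name": "Ana Silva", "country": "USA", "age": 34, "heart_rate": 72},
    {"name": "Ana Silva", "country": "USA", "age": 34, "heart_rate": 72},    # exact duplicate
    {"name": "ana silva ", "country": "U.S.", "age": 34, "heart_rate": 72},  # near-duplicate
    {"name": "Ben Okoro", "country": "United States", "age": 200, "heart_rate": 800},
]

# 1. Exact duplicates: identical rows that appear more than once.
row_counts = Counter(tuple(sorted(r.items())) for r in rows)
exact_duplicates = sum(n - 1 for n in row_counts.values())

# 2. Near-duplicates: the same entity after light normalization of the name.
name_counts = Counter(r["name"].strip().lower() for r in rows)
repeated_names = {name for name, n in name_counts.items() if n > 1}

# 3. Inconsistent values: list the distinct spellings so a person can map them.
country_spellings = sorted({r["country"] for r in rows})

# 4. Domain sanity checks: assumed plausible ranges, to be confirmed by an expert.
PLAUSIBLE = {"age": (0, 120), "heart_rate": (20, 250)}
violations = [
    (i, col, r[col])
    for i, r in enumerate(rows)
    for col, (lo, hi) in PLAUSIBLE.items()
    if not lo <= r[col] <= hi
]

print(exact_duplicates, repeated_names, country_spellings, violations)
```

Note that the checks only report what they find. Deciding whether to merge, correct, or keep each flagged row is a separate step.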

Domain Knowledge Beats Statistical Anomaly Detection

A statistician can flag an extreme value as a numerical outlier. Only a domain expert can tell you whether it's a sensor failure or a real, important event. Engage your experts before making outlier decisions.
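As a rough sketch of what "flag, then ask" might look like, the snippet below marks values outside 1.5 times the interquartile range and keeps them for review instead of deleting them. The readings are hypothetical, and the 1.5 × IQR rule is a common convention chosen here for illustration, not a threshold this chapter prescribes.

```python
import statistics

# Hypothetical sensor readings; 41.7 could be a glitch or a real event.
readings = [20.1, 19.8, 20.4, 20.0, 19.9, 20.3, 41.7, 20.2]

# Quartiles and the conventional 1.5 * IQR fences.
q1, _, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag, do not delete: record position and value so an expert can review each one.
flagged = [(i, x) for i, x in enumerate(readings) if not low <= x <= high]

print(flagged)
```

The output is a list of questions for a domain expert, not a list of rows to drop.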

The Negative Purchases That Were Returns

A friend working in retail analytics found that a "purchases" column contained negative values for thousands of customers. It turned out those values represented returns — perfectly meaningful data, just undocumented. Without that one piece of context, every downstream model would have been wrong. With it, the negative values were one of the most predictive features in the dataset. Domain knowledge unlocked the value.
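One way such a column might be turned into features, once the meaning of the negative values is known, is to split it into purchases and returns. The customer IDs, amounts, and feature names below are hypothetical, and this is not a reconstruction of what was done in the story above.

```python
# Hypothetical signed amounts per customer; negatives are assumed to be returns.
purchases = {"c1": [40.0, -15.0, 25.0], "c2": [60.0], "c3": [30.0, -30.0]}

features = {}
for customer, amounts in purchases.items():
    bought = sum(a for a in amounts if a > 0)      # total of positive entries
    returned = -sum(a for a in amounts if a < 0)   # total returned, as a positive number
    features[customer] = {
        "gross_purchases": bought,
        "returns": returned,
        # Share of purchased value that came back; 0.0 when nothing was bought.
        "return_rate": returned / bought if bought else 0.0,
    }

print(features)
```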

Checkpoint

Your dataset has a "systolic_blood_pressure" column with one value of 420 mmHg. Human readings typically fall between roughly 80 and 180 mmHg. What should you do?