The Data Quality Assessment

A structured data quality assessment (DQA) should happen before you start modeling. Garbage in, garbage out, and the only way to know whether you have garbage is to check systematically. Ad-hoc quality checks tend to miss whole categories of problems; a structured process makes you look at each category in turn.

The Eight Dimensions of a DQA

  • Profiling — EDA on the dataset: distributions, types, ranges.
  • Completeness — missing values and incomplete records.
  • Accuracy — cross-check against trusted sources or via manual validation.
  • Consistency — across sources, formats, and time periods.
  • Integrity — enforced constraints: unique IDs, valid value ranges, referential integrity.
  • Lineage and provenance — where the data came from, what transformations have been applied.
  • Automated testing — validation rules baked into your pipeline so quality regressions are caught automatically.
  • Continuous monitoring — quality is not a one-time check.
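
Several of these dimensions translate directly into a few lines of pandas. The sketch below is a minimal illustration of completeness, integrity, and consistency checks on toy data. The tables and the column names (field_id, region, soil_ph) are made up for the example, and the 0 to 14 bound is simply the standard pH scale.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
fields = pd.DataFrame({
    "field_id": [1, 2, 2, 4],
    "region": ["north", "North ", "south", "east"],
    "soil_ph": [6.5, None, 15.2, 7.1],
})
# Hypothetical reference table of valid regions.
regions = pd.DataFrame({"region": ["north", "south"]})

# Completeness: share of missing values per column.
missing_share = fields.isna().mean()

# Integrity: IDs should be unique, so list every ID that appears more than once.
duplicate_ids = fields.loc[
    fields["field_id"].duplicated(keep=False), "field_id"
].tolist()

# Integrity: non-missing pH values must fall on the 0-14 scale.
bad_ph = fields[fields["soil_ph"].notna() & ~fields["soil_ph"].between(0, 14)]

# Consistency: normalize formatting first, then check referential integrity
# against the reference table.
normalized_region = fields["region"].str.strip().str.lower()
unknown_regions = sorted(set(normalized_region) - set(regions["region"]))

print(missing_share)
print("Duplicate IDs:", duplicate_ids)
print("Rows with impossible pH:", bad_ph.index.tolist())
print("Regions not in reference table:", unknown_regions)
```

Note the order in the consistency check: "North " only matches "north" after whitespace and case are normalized, so formatting problems are worth separating from genuinely unknown values.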

DQA Walkthrough: Crop Yield Prediction

Step 1 of 8: Profiling

Action

Run df.info() and df.describe() on the crop dataset.

Finding

Rainfall column has extreme values (0 mm and 3,200 mm in the same region). Soil pH has 23% missing values.

Response

Flag rainfall as needing outlier investigation. Prioritize soil pH missingness for the next step.
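
The profiling step above might look something like the following. This is a sketch on invented data, not the walkthrough's actual dataset: the column names (region, rainfall_mm, soil_ph), the values, and the 1,500 mm spread threshold are all assumptions chosen for illustration, so the numbers will not match the findings quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the crop dataset.
crops = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "rainfall_mm": [0.0, 640.0, 700.0, 3200.0, 510.0, 530.0, 560.0, 590.0],
    "soil_ph": [6.1, np.nan, 6.8, 7.0, np.nan, 5.9, 6.3, 6.6],
})

# First look: dtypes, non-null counts, and summary statistics.
crops.info()
print(crops.describe())

# Share of missing values per column, worst first.
missing_share = crops.isna().mean().sort_values(ascending=False)
print(missing_share)

# Rainfall range within each region; a huge spread inside one region
# is a reason to investigate, not proof of an error.
rain_by_region = crops.groupby("region")["rainfall_mm"].agg(["min", "max"])
rain_by_region["spread"] = rain_by_region["max"] - rain_by_region["min"]

MAX_PLAUSIBLE_SPREAD_MM = 1500  # assumed threshold; set it with domain input
suspect_regions = rain_by_region.index[
    rain_by_region["spread"] > MAX_PLAUSIBLE_SPREAD_MM
].tolist()
print("Regions to investigate:", suspect_regions)
```

Grouping by region matters here: 0 mm and 3,200 mm can both be plausible somewhere in a national dataset, but they are suspicious when they appear in the same region.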

Real-time systems need continuous DQA. Tools like Great Expectations and Soda let you encode data quality rules as code and run them every time new data arrives. If you're going into a data-intensive role in industry, these are worth learning.
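
To make "rules as code" concrete, here is a minimal sketch in plain pandas. It deliberately does not use the Great Expectations or Soda APIs; it only illustrates the idea those tools build on. The rule names, the column names, and the validate_batch helper are all hypothetical.

```python
import pandas as pd

# Each rule maps a readable name to a function that returns a boolean
# Series (True means the row passes). Missing values fail the last two
# rules because comparisons with NaN evaluate to False.
RULES = {
    "field_id is present": lambda df: df["field_id"].notna(),
    "rainfall_mm is non-negative": lambda df: df["rainfall_mm"] >= 0,
    "soil_ph within 0-14": lambda df: df["soil_ph"].between(0, 14),
}


def validate_batch(batch: pd.DataFrame) -> dict:
    """Return the number of failing rows for each rule that has failures."""
    failures = {}
    for name, rule in RULES.items():
        n_failed = int((~rule(batch)).sum())
        if n_failed:
            failures[name] = n_failed
    return failures


# Hypothetical batch of newly arrived records.
new_batch = pd.DataFrame({
    "field_id": [10, 11, None],
    "rainfall_mm": [420.0, -5.0, 610.0],
    "soil_ph": [6.4, 7.2, 14.9],
})

failures = validate_batch(new_batch)
print(failures)
```

In a pipeline, a non-empty result would typically stop the load or raise an alert, so that bad records never reach the model silently.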

Checkpoint

You're handed a dataset from an external vendor with no documentation. Walk through the first three DQA steps — profiling, completeness, and accuracy — describing specifically what you'd check and how.