Missing Values: MCAR, MAR, and MNAR

When values are missing within a feature, the right strategy depends on why they're missing. There are three categories, and they need very different handling. Applying the wrong strategy can introduce systematic bias into everything downstream.

MCAR — Missing Completely At Random

No pattern. The missingness is independent of everything — like a sensor glitch that randomly drops 2% of readings. This is the easy case.

Handling: If it's less than ~5% of data, it's generally fine to drop the affected rows or fill in with mean, median, or mode. For time series, forward-fill or back-fill. The key test: missingness should not correlate with any other variable.

MAR — Missing At Random. Missingness in column C is associated with the values in some other column. Example: people with high blood pressure may hesitate to report their weight, so weight is missing more often when blood pressure is high. The missingness is "random" given the other variables, but not unconditionally.

Test for it by dividing into subgroups and checking whether the missingness rate differs significantly across groups. Simple mean/median fills give unreliable results here — use model-based imputation that conditions on the related variables. If 60% or more of a feature is missing, consider dropping the feature entirely.

MNAR — Missing Not At Random. Missingness depends on the missing value itself. Examples: very rich and very poor people skip income fields; older applicants worried about age discrimination leave the age field blank. The right fix is almost always to revisit the data collection mechanism rather than impute — because imputation gives unreliable results no matter how clever the method.

Summary: Which Strategy for Which Type

  • MCAR → drop or simple fill (mean/median/mode)
  • MAR → model-based imputation conditioned on related variables
  • MNAR → fix the collection mechanism if possible; impute with caution and document the limitation
Missingness Classifier

1Does the missingness correlate with other observed columns?

2Could the missing value itself drive whether it's recorded?

3What fraction of values in this column are missing?

4What's the most plausible cause of missingness?

0/4

Placeholder: interactive tool — describe your missing data pattern and get a MCAR/MAR/MNAR classification with recommended strategy.

Checkpoint

In a healthcare dataset, 'weight' is missing more often for patients with higher blood pressure readings. What type of missingness is this, and what's the right imputation strategy?