When Missingness Is a Feature

Sometimes the fact that data is missing is itself informative. If "income field is blank" predicts something about your customers, that pattern is a feature, not a bug — and you should preserve it rather than impute it away.

Two ways to encode missingness explicitly:

  • Sentinel value — use a distinct value like −1, −9999, or a category like "unknown" that the model can learn to recognize as "was missing."
  • Binary missing flag — add a separate 0/1 column indicating whether the original was missing. Keep the imputed value in the original column; add the flag alongside it.

Use these only when the feature is genuinely important and the missingness pattern is meaningfully predictive. Don't add missing flags everywhere just because you can — it adds dimensionality for no gain when the pattern is random.

Credit Scoring and Fraud Detection

Credit scoring is a domain where missingness-as-feature is widely used. The pattern of which fields a borrower fills in — and which they leave blank — is itself predictive of creditworthiness. Models exploit this. The same pattern appears in fraud detection (which fields does the fraudster skip?), insurance underwriting, and many user-facing systems where users self-select what to disclose. In these domains, treating a missing value as "just missing" throws away signal you actually need.

Wearable Sensor Data

Wearable devices produce another clear example: if a user's step count, heart rate, or sleep data goes missing for several consecutive days, the gap itself is often a signal. People tend to stop wearing their devices when they're sick, hospitalized, or going through a difficult period. A model that imputes "average steps" across a missing week treats those days as normal — but a binary flag for "device not worn for 3+ consecutive days" can serve as a proxy for illness or disengagement that's genuinely predictive in health outcome models.

Wearable Missingness Explorer7 missing days

How should we handle the gaps?

Missing days shown as gaps. No imputation.
Week 1
8k
7k
9k
7k
8k
9k
7k
Week 24-day gap
2k
7k
8k
Week 3
8k
8k
9k
9k
7k
8k
8k
Week 43-day gap
8k
8k
6k
7k
RecordedMissingavg: 7,519 steps/day

Toggle between raw gaps, mean imputation, and a binary missing flag to see how each strategy changes what the model can learn from a user's device-wear pattern.

Checkpoint

In a loan application dataset, 'employer_name' is blank for 18% of applicants. A domain expert tells you that self-employed applicants and unemployed applicants both tend to leave this field blank. Should you impute the missing values or encode the missingness?