When Missingness Is a Feature
Sometimes the fact that data is missing is itself informative. If "income field is blank" predicts something about your customers, that pattern is a feature, not a bug — and you should preserve it rather than impute it away.
Two ways to encode missingness explicitly:
- Sentinel value — use a distinct value like
−1,−9999, or a category like"unknown"that the model can learn to recognize as "was missing." - Binary missing flag — add a separate 0/1 column indicating whether the original was missing. Keep the imputed value in the original column; add the flag alongside it.
Use these only when the feature is genuinely important and the missingness pattern is meaningfully predictive. Don't add missing flags everywhere just because you can — it adds dimensionality for no gain when the pattern is random.
Credit Scoring and Fraud Detection
Credit scoring is a domain where missingness-as-feature is widely used. The pattern of which fields a borrower fills in — and which they leave blank — is itself predictive of creditworthiness. Models exploit this. The same pattern appears in fraud detection (which fields does the fraudster skip?), insurance underwriting, and many user-facing systems where users self-select what to disclose. In these domains, treating a missing value as "just missing" throws away signal you actually need.
Wearable Sensor Data
Wearable devices produce another clear example: if a user's step count, heart rate, or sleep data goes missing for several consecutive days, the gap itself is often a signal. People tend to stop wearing their devices when they're sick, hospitalized, or going through a difficult period. A model that imputes "average steps" across a missing week treats those days as normal — but a binary flag for "device not worn for 3+ consecutive days" can serve as a proxy for illness or disengagement that's genuinely predictive in health outcome models.
How should we handle the gaps?
Toggle between raw gaps, mean imputation, and a binary missing flag to see how each strategy changes what the model can learn from a user's device-wear pattern.
In a loan application dataset, 'employer_name' is blank for 18% of applicants. A domain expert tells you that self-employed applicants and unemployed applicants both tend to leave this field blank. Should you impute the missing values or encode the missingness?