Monitoring and Data Drift
Your model's performance in production is not your model's performance in evaluation. The world changes. User behavior shifts. Data pipelines drift. Upstream systems change their schemas. A model trained on pre-pandemic mobility data stopped being reliable once 2020 changed how people moved. A fraud detection model trained in Q1 will be fighting a different set of fraud patterns by Q4.
This is data drift — the distribution of the inputs your model sees in production shifts away from the distribution it was trained on. When that happens, model performance degrades. Silently, gradually, until something breaks loudly enough for someone to notice.
Without monitoring, you have a model. With monitoring, you have a system.
[Interactive chart: student arrival times relative to the class start time (0).]
The drift analogy
Just as the student arrival distribution shifts over a semester, a deployed model's input distribution shifts as user behavior, language, or world events evolve. The model was trained on Week 1 data — by Week 12, it's operating out-of-distribution.
Student arrival times shift over a semester — the same pattern your model faces as production data drifts away from training data.
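To make the analogy concrete, one way to put a number on "the distribution has shifted" is to compare the empirical CDFs of two samples. The sketch below uses the two-sample Kolmogorov-Smirnov statistic, which is a choice made here for illustration and is not named in the text above. The arrival times are synthetic and invented for the example (minutes relative to class start, negative meaning early).

```python
import bisect
import random


def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at a data point.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Synthetic, made-up arrival times in minutes (assumption, not course data).
rng = random.Random(0)
week1 = [rng.gauss(-5, 3) for _ in range(500)]         # mostly early
week1_resample = [rng.gauss(-5, 3) for _ in range(500)]  # same behavior, new sample
week12 = [rng.gauss(2, 4) for _ in range(500)]         # later and more spread out

noise_level = ks_statistic(week1, week1_resample)
drift_level = ks_statistic(week1, week12)
print(f"same distribution: {noise_level:.3f}, shifted: {drift_level:.3f}")
```

Comparing against a resample of the baseline gives a rough sense of how large the statistic is from sampling noise alone, which helps when deciding what counts as a real shift.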
Two Types of Drift
- Data drift (covariate shift): the input distribution changes while the relationship between inputs and outputs stays the same. Example: the age distribution of your users shifts younger. The relationship between age and purchase probability that your model learned is still correct; the model is just seeing more inputs in a range it saw rarely during training.
- Concept drift: the relationship itself changes. Example: a model predicts whether a tweet will go viral. The criteria for "going viral" change as the platform's recommendation algorithm changes. The inputs look the same, but the labels that were correct for the training data are no longer correct for current data.
Concept drift is harder to detect because you need ground truth labels to measure it — and those often arrive with a delay or not at all.
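The difference between the two types can be seen in a toy simulation. This is a minimal sketch under invented assumptions: the "model" is a fixed rule standing in for something learned at training time, and the age distributions and thresholds (40 and 50) are made up for illustration.

```python
import random


def model(age):
    # Stand-in for a trained model: it learned the rule "age > 40 -> buys".
    return age > 40


def accuracy(ages, labels):
    hits = sum(model(a) == y for a, y in zip(ages, labels))
    return hits / len(ages)


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)
train_ages = [rng.gauss(45, 10) for _ in range(2000)]

# Covariate shift: users skew younger, but the true rule is unchanged.
shifted_ages = [rng.gauss(30, 10) for _ in range(2000)]
shifted_labels = [a > 40 for a in shifted_ages]

# Concept drift: same age distribution, but the true rule has moved to 50.
drifted_ages = [rng.gauss(45, 10) for _ in range(2000)]
drifted_labels = [a > 50 for a in drifted_ages]

print("covariate shift: mean age", round(mean(shifted_ages), 1),
      "accuracy", accuracy(shifted_ages, shifted_labels))
print("concept drift:   mean age", round(mean(drifted_ages), 1),
      "accuracy", accuracy(drifted_ages, drifted_labels))
```

In the concept-drift case the input statistics look like training data, so only labels reveal the problem. The toy model is perfect on its learned rule, which is why covariate shift costs it nothing here; a real model is usually less reliable in regions it rarely saw during training, so covariate shift can still hurt accuracy.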
A complete monitoring stack layers five components: resource monitoring, input distribution monitoring, performance tracking, logging, and alerting. Resource monitoring fires first (infrastructure break), input distribution monitoring fires second (data break), and performance tracking fires third (model break), but only once labels arrive. Logging and alerting make the rest actionable.
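The ordering of those checks might look roughly like the sketch below. Everything in it is hypothetical: the `Snapshot` fields, the `run_checks` function, and the limit values are placeholders chosen for illustration, not recommendations or part of any particular monitoring tool.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("model_monitor")


@dataclass
class Snapshot:
    """One monitoring window. All field names are hypothetical."""
    p95_latency_ms: float
    drift_score: float                 # any input-drift statistic
    accuracy: Optional[float] = None   # None until ground-truth labels arrive


def run_checks(snap, latency_limit_ms=500.0, drift_limit=0.25, min_accuracy=0.90):
    """Run the checks in layer order and return the alert messages."""
    alerts = []
    # Layer 1: resources (infrastructure break).
    if snap.p95_latency_ms > latency_limit_ms:
        alerts.append("resource: p95 latency above limit")
    # Layer 2: input distribution (data break). Needs no labels.
    if snap.drift_score > drift_limit:
        alerts.append("input: feature distribution shifted")
    # Layer 3: performance (model break). Only possible once labels exist.
    if snap.accuracy is None:
        logger.info("performance check skipped: labels not yet available")
    elif snap.accuracy < min_accuracy:
        alerts.append("performance: accuracy below floor")
    # Logging and alerting: here just a warning per alert.
    for message in alerts:
        logger.warning(message)
    return alerts


before_labels = run_checks(Snapshot(p95_latency_ms=120.0, drift_score=0.31))
after_labels = run_checks(Snapshot(p95_latency_ms=120.0, drift_score=0.31, accuracy=0.82))
```

The same window is checked twice to show the delay: the input alert is available immediately, while the performance alert can only appear after labels have arrived.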
Population Stability Index
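The Population Stability Index (PSI) is a common metric for detecting input drift. It measures how much a feature's distribution has shifted between a baseline (training time) and the current window. Split the feature's range into $B$ bins, let $e_i$ be the share of baseline values in bin $i$ and $a_i$ the share of current values in the same bin, then:

$$\mathrm{PSI} = \sum_{i=1}^{B} (a_i - e_i)\,\ln\frac{a_i}{e_i}$$

Every term is non-negative, so PSI is zero when the two binned distributions match and grows as they diverge.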
Interpretation (commonly cited rule-of-thumb thresholds):
- PSI < 0.1: no significant shift; model should be stable
- 0.1 ≤ PSI < 0.25: moderate shift; monitor closely
- PSI ≥ 0.25: significant shift; consider retraining
PSI is widely used in financial services (credit scoring, fraud detection) because it has interpretable thresholds and requires no ground truth — you can compute it on input features alone, which matters when labels arrive weeks or months after the prediction.
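As a rough illustration of how PSI can be computed from input features alone, here is a minimal standard-library sketch. The binning scheme (ten bins at the baseline's quantiles), the `eps` floor for empty bins, the 0.1 and 0.25 cut-offs, and the "Monitor" and "Consider retraining" labels are assumptions made for this example; the ages are synthetic.

```python
import bisect
import math
import random


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` measured against `baseline`."""
    ordered = sorted(baseline)
    # Interior bin edges at baseline quantiles, so each bin holds about 1/n_bins of it.
    edges = [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor at eps so an empty bin cannot cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = shares(baseline), shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def status(value):
    """Map a PSI value to a label using rule-of-thumb cut-offs (assumed)."""
    if value < 0.1:
        return "Stable"
    if value < 0.25:
        return "Monitor"
    return "Consider retraining"


# Synthetic user ages (assumption, for illustration only).
rng = random.Random(0)
train = [rng.gauss(35, 8) for _ in range(5000)]
same = [rng.gauss(35, 8) for _ in range(5000)]      # new sample, same population
younger = [rng.gauss(27, 8) for _ in range(5000)]   # population has shifted younger

for name, window in [("same", same), ("younger", younger)]:
    value = psi(train, window)
    print(f"{name}: PSI={value:.3f} -> {status(value)}")
```

Note that no labels appear anywhere in the computation. Quantile bins are used so that no baseline bin is nearly empty; equal-width bins are another common choice and will give somewhat different values.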
[Interactive demo: a slider shifts the age distribution of incoming users; as it moves right, model accuracy degrades and PSI rises. Per-feature PSI readout:]
| Feature | PSI | Status |
|---|---|---|
| age | 0.004 | Stable |
| income | 0.002 | Stable |
| visits | 0.005 | Stable |