Monitoring and Data Drift

Your model's performance in production is not your model's performance in evaluation. The world changes. User behavior shifts. Data pipelines drift. Upstream systems change their schemas. A model trained on pre-pandemic mobility data stops being reliable in 2020. A fraud detection model trained in Q1 will be fighting a different set of fraud patterns by Q4.

This is data drift — the distribution of the inputs your model sees in production shifts away from the distribution it was trained on. When that happens, model performance degrades. Silently, gradually, until something breaks loudly enough for someone to notice.

Without monitoring, you have a model. With monitoring, you have a system.

Figure (interactive): Data Drift Over Time. Three density plots show when students arrive relative to the class start time (0) in Week 1, Week 5, and Week 12. x-axis: Minutes Relative to Class Start Time; y-axis: Density.

The drift analogy

Just as the student arrival distribution shifts over a semester, a deployed model's input distribution shifts as user behavior, language, or world events evolve. The model was trained on Week 1 data — by Week 12, it's operating out-of-distribution.

Student arrival times shift over a semester — the same pattern your model faces as production data drifts away from training data.

Two Types of Drift

  • Data drift (covariate shift): the input distribution $P(X)$ changes while the relationship $P(Y|X)$ stays the same. Example: the age distribution of your users shifts younger. Your model's relationship between age and purchase probability is still correct; it's just seeing more inputs in a range it saw rarely during training.

  • Concept drift: the relationship $P(Y|X)$ itself changes. Example: a model predicts whether a tweet will go viral. The criteria for "going viral" change as the platform's recommendation algorithm changes. The inputs look the same; the labels that were correct for training data are no longer correct for current data.

Concept drift is harder to detect because you need ground truth labels to measure it — and those often arrive with a delay or not at all.
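
The distinction can be made concrete with a toy simulation. The sketch below is illustrative only: the single age feature, the threshold rule standing in for $P(Y|X)$, and the helper names `simulate`, `model`, and `accuracy` are assumptions invented for this example, not part of any real system.

```python
import random
import statistics

def simulate(n, age_mean, threshold, seed):
    """Draw n (age, label) pairs; the label is 1 when age exceeds `threshold`."""
    rng = random.Random(seed)
    ages = [rng.gauss(age_mean, 8.0) for _ in range(n)]
    labels = [1 if age > threshold else 0 for age in ages]
    return ages, labels

def model(age):
    """A frozen 'trained' model: it encodes the rule that held at training time."""
    return 1 if age > 35.0 else 0

def accuracy(ages, labels):
    hits = sum(model(age) == label for age, label in zip(ages, labels))
    return hits / len(ages)

# Training-time world: mean age 35, label rule "age > 35".
train = simulate(5000, age_mean=35.0, threshold=35.0, seed=0)
# Covariate shift: P(X) moves (younger users), the label rule is unchanged.
covariate = simulate(5000, age_mean=28.0, threshold=35.0, seed=1)
# Concept drift: P(X) is unchanged, the label rule itself moves to "age > 40".
concept = simulate(5000, age_mean=35.0, threshold=40.0, seed=2)

for name, (ages, labels) in [
    ("train", train),
    ("covariate shift", covariate),
    ("concept drift", concept),
]:
    print(f"{name:16s} mean age = {statistics.mean(ages):5.1f}  "
          f"accuracy = {accuracy(ages, labels):.3f}")
```

In this toy setup the covariate-shift sample moves the input mean while accuracy stays perfect, because the decision rule is unchanged; a real model would usually lose some accuracy in regions it saw rarely during training. The concept-drift sample leaves the input mean where it was, so input statistics alone give no warning and only labels reveal the drop in accuracy.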

Monitoring Stack
Each layer is listed with what it tracks, its typical signals or tools, and what to watch out for.

Performance Tracking: ground truth metrics (gold standard)
  • What it tracks: Compute model metrics on live data as ground truth labels arrive. Alert when metrics fall below a defined threshold. Direct measurement of what you actually care about.
  • Signals: Accuracy · Precision · Recall · RMSE
  • Watch out for: Labels often arrive with delay; track lag as a first-class metric.

Input Distribution: feature drift detection (leading indicator)
  • What it tracks: Track statistical properties of input features over time. If the mean of a key feature drifts significantly, that's an early warning of performance degradation, detectable before labels arrive.
  • Signals: PSI · KS Test · Mean/Std shift
  • Watch out for: Useful precisely because it requires no ground truth labels.

Resource Monitoring: infra health (standard ops)
  • What it tracks: Standard software operations monitoring, but essential. A spike in latency or error rate is often the first signal that something is wrong, before any ML-specific metric fires.
  • Signals: CPU · Memory · Latency · Error rate
  • Watch out for: Shared with your backend team; don't reinvent the wheel here.

Centralized Logging: prediction audit trail (queryable history)
  • What it tracks: Every prediction logged with its inputs, output, timestamp, and model version. This is what you query when something breaks at 2 a.m.: the audit trail that lets you replay and diagnose failures.
  • Signals: Inputs · Output · Timestamp · Model version
  • Watch out for: Log model version with every prediction to isolate regressions after deploys.

Alerting: outage + quality (two alarm types)
  • What it tracks: Alerts for outages (model unreachable) and for quality degradation (model reachable but performing poorly). The second type is the one most teams forget to build, and the one that silently costs you.
  • Tools: PagerDuty · Prometheus · CloudWatch
  • Watch out for: Quality alerts require thresholds. Set them before you go live, not after.

A complete monitoring stack layers all five. Resource monitoring fires first (infrastructure break), input distribution monitoring fires second (data break), performance tracking fires third (model break) — but only once labels arrive. Logging and alerting make the rest actionable.
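
As a rough illustration of how the logging, performance-tracking, and alerting layers fit together, the sketch below logs each prediction with its model version, joins the log against labels that arrive later, and raises a quality flag when accuracy falls under a threshold. The record fields, the function names `log_prediction` and `evaluate`, the in-memory list standing in for a log store, and the `"v1"` version string are all assumptions made for the example.

```python
import json
from datetime import datetime, timedelta, timezone

def log_prediction(log, request_id, inputs, output, model_version, ts):
    """Append one audit record to `log` (a list standing in for a real log store)."""
    record = {
        "request_id": request_id,
        "inputs": inputs,
        "output": output,
        "model_version": model_version,
        "timestamp": ts.isoformat(),
    }
    json.dumps(record)  # fail early if the record cannot be serialized
    log.append(record)

def evaluate(log, labels, min_accuracy):
    """Join logged predictions with the labels that have arrived so far.

    `labels` maps request_id -> (true_label, label_arrival_time).
    """
    matched = [(r, labels[r["request_id"]]) for r in log if r["request_id"] in labels]
    if not matched:
        return {"accuracy": None, "mean_lag_hours": None, "coverage": 0.0, "alert": False}
    correct = sum(r["output"] == y for r, (y, _) in matched)
    lags = [
        (arrived - datetime.fromisoformat(r["timestamp"])).total_seconds() / 3600
        for r, (_, arrived) in matched
    ]
    acc = correct / len(matched)
    return {
        "accuracy": acc,
        "mean_lag_hours": sum(lags) / len(lags),  # label lag as a tracked metric
        "coverage": len(matched) / len(log),      # share of predictions with labels
        "alert": acc < min_accuracy,              # quality alert, not an outage alert
    }

# Four predictions, one hour apart; labels for the first three arrive a day later.
log = []
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
for i, pred in enumerate([1, 0, 1, 1]):
    log_prediction(log, f"r{i}", {"age": 30 + i}, pred, "v1", t0 + timedelta(hours=i))

labels = {f"r{i}": (1, t0 + timedelta(hours=i + 24)) for i in range(3)}
report = evaluate(log, labels, min_accuracy=0.8)
print(report)
```

Reporting coverage alongside accuracy is a deliberate choice here: an accuracy figure computed on a small labeled fraction of traffic deserves less trust than one computed on most of it.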

Population Stability Index

The Population Stability Index (PSI) is a common metric for detecting input drift. It measures how much a feature's distribution has shifted between a baseline (training time) and the current window:

$$\text{PSI} = \sum_{i} \left(p_i^{\text{actual}} - p_i^{\text{expected}}\right) \cdot \ln\left(\frac{p_i^{\text{actual}}}{p_i^{\text{expected}}}\right)$$

Interpretation:

  • $\text{PSI} < 0.1$: no significant shift; model should be stable
  • $0.1 \leq \text{PSI} < 0.2$: moderate shift; monitor closely
  • $\text{PSI} \geq 0.2$: significant shift; consider retraining

PSI is widely used in financial services (credit scoring, fraud detection) because it has interpretable thresholds and requires no ground truth — you can compute it on input features alone, which matters when labels arrive weeks or months after the prediction.
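
A minimal sketch of the PSI computation follows, assuming the index $i$ runs over bins of the feature's range and that $p_i$ is the share of samples falling in bin $i$. The equal-width binning over the baseline's range, the number of bins, and the small floor `eps` for empty bins are choices made for this example (quantile bins are another common option); the function name `psi` and the synthetic age data are likewise hypothetical.

```python
import math
import random

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a current sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("baseline has no spread; cannot build bins")
    width = (hi - lo) / n_bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range are clamped into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p_exp = proportions(expected)
    p_act = proportions(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(p_act, p_exp))

def status(value):
    """Map a PSI value onto the three interpretation bands."""
    if value < 0.1:
        return "stable"
    if value < 0.2:
        return "monitor"
    return "retrain"

# Synthetic example: baseline ages, a fresh sample from the same distribution,
# and a sample whose mean has shifted younger.
rng = random.Random(0)
baseline = [rng.gauss(35.0, 8.0) for _ in range(5000)]
same = [rng.gauss(35.0, 8.0) for _ in range(5000)]
shifted = [rng.gauss(28.0, 8.0) for _ in range(5000)]

for name, sample in [("same distribution", same), ("shifted younger", shifted)]:
    value = psi(baseline, sample)
    print(f"{name:18s} PSI = {value:.3f}  ({status(value)})")
```

Because each term multiplies a difference by the log of the matching ratio, every bin contributes a non-negative amount, so PSI is zero only when the binned proportions agree exactly.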

Interactive: Drift Detective (Week 1 of 16, drift intensity: None)

Slide right to shift the age distribution of incoming users. Watch model accuracy degrade and PSI rise.

Model Accuracy: 0.867
Avg Age: 33.6 (baseline: 35)

Feature   PSI     Status
age       0.004   Stable
income    0.002   Stable
visits    0.005   Stable

PSI < 0.1: stable  |  0.1–0.2: monitor  |  ≥ 0.2: retrain. Dashed line in charts = training baseline.