Monitoring and Data Drift

Your model's performance in production is not your model's performance in evaluation. The world changes. User behavior shifts. Data pipelines drift. Upstream systems change their schemas. A model trained on pre-pandemic mobility data was of little use once 2020 changed how people moved. A fraud detection model trained in Q1 will be fighting a different set of fraud patterns by Q4.

This is data drift — the distribution of the inputs your model sees in production shifts away from the distribution it was trained on. When that happens, model performance degrades. Silently, gradually, until something breaks loudly enough for someone to notice.

Without monitoring, you have a model. With monitoring, you have a system.

[Interactive figure: Data Drift Over Time. Distribution of student arrival times relative to the class start time (0), charted across the semester.]

The drift analogy

Just as the student arrival distribution shifts over a semester, a deployed model's input distribution shifts as user behavior, language, or world events evolve. The model was trained on Week 1 data — by Week 12, it's operating out-of-distribution.

Student arrival times shift over a semester — the same pattern your model faces as production data drifts away from training data.


Two Types of Drift

Data drift (covariate shift): the input distribution $P(X)$ changes while the relationship $P(Y|X)$ stays the same. Example: the age distribution of your users shifts younger. The relationship your model learned between age and purchase probability is still correct; the model is just seeing more inputs in a range it rarely saw during training, where its estimates tend to be less reliable.

Concept drift: the relationship $P(Y|X)$ itself changes. Example: a model predicts whether a tweet will go viral. The criteria for "going viral" change as the platform's recommendation algorithm changes. The inputs look the same, but the labels that were correct for the training data are no longer correct for current data.

Concept drift is harder to detect because you need ground truth labels to measure it — and those often arrive with a delay or not at all.
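The difference between the two drift types can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions: a single numeric feature, a frozen "model" that learned a threshold of 0.0, and hypothetical function names (`simulate`, `true_label`) that are not part of any library.

```python
import random


def true_label(x, threshold):
    """Ground truth: positive when x exceeds the (possibly moving) threshold."""
    return 1 if x > threshold else 0


def model(x):
    """A frozen model that learned the training-time threshold of 0.0."""
    return 1 if x > 0.0 else 0


def simulate(input_mean, label_threshold, n=5000, seed=0):
    """Return (mean of the inputs, accuracy of the frozen model) for one scenario."""
    rng = random.Random(seed)
    xs = [rng.gauss(input_mean, 1.0) for _ in range(n)]
    correct = sum(model(x) == true_label(x, label_threshold) for x in xs)
    return sum(xs) / n, correct / n


baseline = simulate(input_mean=0.0, label_threshold=0.0)
covariate = simulate(input_mean=1.5, label_threshold=0.0)  # P(X) moves, P(Y|X) fixed
concept = simulate(input_mean=0.0, label_threshold=1.0)    # P(X) fixed, P(Y|X) moves

for name, (mean_x, acc) in [("baseline", baseline), ("covariate", covariate), ("concept", concept)]:
    print(f"{name:10s} input mean={mean_x:+.2f} accuracy={acc:.3f}")
```

Under covariate shift the input mean moves visibly, so input statistics alone reveal it. Under concept drift the inputs are identical to the baseline and only the accuracy, which needs labels, shows the problem. In this toy the frozen model happens to match the true rule everywhere, so covariate shift costs nothing; a real model is usually less accurate in regions it rarely saw in training.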

Monitoring Stack

| Layer | What it tracks | Watch out for |
| --- | --- | --- |
| Performance Tracking (ground truth metrics) | Compute model metrics on live data as ground truth labels arrive. Alert when metrics fall below a defined threshold. Direct measurement of what you actually care about. Metrics: Accuracy · Precision · Recall · RMSE | Gold standard. Labels often arrive with a delay; track lag as a first-class metric. |
| Input Distribution (feature drift detection) | Track statistical properties of input features over time. If the mean of a key feature drifts significantly, that's an early warning of performance degradation, detectable before labels arrive. Metrics: PSI · KS Test · Mean/Std shift | Leading indicator. Useful precisely because it requires no ground truth labels. |
| Resource Monitoring (infra health) | Standard software operations monitoring, but essential. A spike in latency or error rate is often the first signal that something is wrong, before any ML-specific metric fires. Metrics: CPU · Memory · Latency · Error rate | Standard ops. Shared with your backend team; don't reinvent the wheel here. |
| Centralized Logging (prediction audit trail) | Every prediction logged with its inputs, output, timestamp, and model version. This is what you query when something breaks at 2 a.m.: the audit trail that lets you replay and diagnose failures. Fields: Inputs · Output · Timestamp · Model version | Queryable history. Log model version with every prediction to isolate regressions after deploys. |
| Alerting (outage + quality) | Alerts for outages (model unreachable) and for quality degradation (model reachable but performing poorly). The second type is the one most teams forget to build, and the one that silently costs you. Tools: PagerDuty · Prometheus · CloudWatch | Two alarm types. Quality alerts require thresholds. Set them before you go live, not after. |
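The input distribution layer lists the KS test among its checks. As a rough illustration, the two-sample Kolmogorov-Smirnov statistic can be computed with the standard library alone. The function names, the sample data, and the `alpha=0.05` default are assumptions made for this sketch; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
import bisect
import math
import random


def ks_statistic(baseline, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    max_gap = 0.0
    for v in a + b:  # the largest gap always occurs at an observed value
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def ks_drift_alert(baseline, current, alpha=0.05):
    """Return (statistic, alert) using the large-sample critical value."""
    n, m = len(baseline), len(current)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    d = ks_statistic(baseline, current)
    return d, d > critical


rng = random.Random(42)
training_ages = [rng.gauss(35, 8) for _ in range(2000)]  # hypothetical baseline window
younger_ages = [rng.gauss(30, 8) for _ in range(2000)]   # hypothetical drifted window

print(ks_drift_alert(training_ages, training_ages))  # identical windows: statistic 0.0, no alert
print(ks_drift_alert(training_ages, younger_ages))   # mean shifted by 5 years: alert expected
```

With large windows the KS test becomes sensitive to shifts too small to matter, so the statistic itself is often worth tracking alongside the alert flag.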

A complete monitoring stack layers all five. The first three tend to fire in order of how quickly their signal becomes available: resource monitoring first (infrastructure break), input distribution monitoring second (data break), and performance tracking third (model break), since it has to wait for labels. Logging and alerting make the rest actionable.
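To show how the logging, performance tracking, and alerting layers fit together, here is a minimal sketch. It assumes predictions are logged as JSON lines and that labels arrive later, keyed by a request ID. `PredictionRecord`, `log_line`, `quality_report`, and the `accuracy_floor` parameter are hypothetical names chosen for illustration, not the API of any monitoring product.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta, timezone


@dataclass
class PredictionRecord:
    request_id: str
    model_version: str
    timestamp: str  # ISO 8601, UTC
    inputs: dict
    output: int


def log_line(record):
    """Serialize one prediction as a JSON line, ready for any log sink."""
    return json.dumps(asdict(record), sort_keys=True)


def quality_report(records, labels, accuracy_floor):
    """Join predictions with the labels that have arrived so far.

    labels maps request_id -> (true_label, label arrival time as ISO 8601).
    """
    matched = [(r, labels[r.request_id]) for r in records if r.request_id in labels]
    if not matched:
        return {"labeled": 0, "accuracy": None, "max_label_lag_hours": None, "alert": False}
    correct = sum(r.output == y for r, (y, _) in matched)
    lags = [
        (datetime.fromisoformat(arrived) - datetime.fromisoformat(r.timestamp)).total_seconds() / 3600
        for r, (_, arrived) in matched
    ]
    accuracy = correct / len(matched)
    return {
        "labeled": len(matched),
        "accuracy": accuracy,
        "max_label_lag_hours": max(lags),  # label lag tracked as its own metric
        "alert": accuracy < accuracy_floor,  # quality alert, separate from outage alerts
    }


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [
    PredictionRecord(f"req-{i}", "v2", (t0 + timedelta(minutes=i)).isoformat(), {"age": 30 + i}, i % 2)
    for i in range(4)
]
labels = {  # req-3 has no label yet
    "req-0": (0, (t0 + timedelta(hours=24)).isoformat()),
    "req-1": (0, (t0 + timedelta(hours=48, minutes=1)).isoformat()),
    "req-2": (0, (t0 + timedelta(hours=12, minutes=2)).isoformat()),
}

print(log_line(records[0]))
print(quality_report(records, labels, accuracy_floor=0.8))
```

Because every record carries `model_version`, the same report can be computed per version to isolate a regression introduced by a deploy.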


Population Stability Index

The Population Stability Index (PSI) is a common metric for detecting input drift. It measures how much a feature's distribution has shifted between a baseline (training time) and the current window:

$\text{PSI} = \sum_{i} \left(p_i^{\text{actual}} - p_i^{\text{expected}}\right) \cdot \ln\left(\frac{p_i^{\text{actual}}}{p_i^{\text{expected}}}\right)$

Interpretation:

$\text{PSI} < 0.1$ — no significant shift; model should be stable
$0.1 \leq \text{PSI} < 0.2$ — moderate shift; monitor closely
$\text{PSI} \geq 0.2$ — significant shift; consider retraining

PSI is widely used in financial services (credit scoring, fraud detection) because it has interpretable thresholds and requires no ground truth — you can compute it on input features alone, which matters when labels arrive weeks or months after the prediction.
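The PSI formula and thresholds above can be sketched in a few lines. This is an illustrative implementation rather than a reference one: the binning by fixed cut points, the `eps` floor that avoids $\ln(0)$ for empty bins, and the function names are all assumptions of the sketch.

```python
import bisect
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a baseline sample and a current sample.

    bin_edges are sorted interior cut points; a value equal to an edge falls
    in the bin to its right, and the two end bins are open-ended.
    """
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            counts[bisect.bisect_right(bin_edges, v)] += 1
        # Floor each proportion at eps so an empty bin does not produce ln(0).
        return [max(c / len(values), eps) for c in counts]

    p_exp = proportions(expected)
    p_act = proportions(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(p_act, p_exp))


def psi_status(value):
    """Map a PSI value onto the three interpretation bands."""
    if value < 0.1:
        return "stable"
    if value < 0.2:
        return "monitor"
    return "retrain"


# Hypothetical feature with four buckets: uniform at training time, skewed in production.
baseline = [1] * 25 + [2] * 25 + [3] * 25 + [4] * 25
current = [1] * 10 + [2] * 20 + [3] * 30 + [4] * 40
edges = [1.5, 2.5, 3.5]

score = psi(baseline, current, edges)
print(round(score, 3), psi_status(score))  # about 0.228, in the "retrain" band
```

The choice of bins matters: a common convention is to fix the edges from the baseline (for example, its deciles) and reuse them for every later window, so that scores stay comparable over time.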

[Interactive demo: Drift Detective. A slider shifts the age distribution of incoming users across a 16-week window; the demo reports model accuracy, average age against the training baseline of 35, and per-feature PSI for age, income, and visits. Dashed lines in the charts mark the training baseline.]
