CI/CD and Versioning Everything

In software engineering, CI/CD — Continuous Integration / Continuous Deployment — is the practice of automating the testing and release of code. Push a change; tests run automatically; if they pass, the change ships. If they fail, it doesn't.

In ML, CI/CD gets harder. You're not just testing code. You're testing code, data, and models — and each one can fail independently while the others look fine. A model can pass its evaluation metrics on a frozen test set and still perform badly in production because the data distribution has shifted since the test set was created. You need to test all three layers.

[Interactive figure: ML CI/CD Pipeline, with Continuous Integration stages on the left and Continuous Deployment stages on the right.]

ML CI/CD adds two validation layers that software CI/CD lacks: data validation and model evaluation against a holdout baseline. Both must pass before any version reaches production.
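
The two extra gates can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the function names, the null-fraction threshold, and the score comparison are all assumptions made for the example, and a real pipeline would run far richer checks (schema, ranges, distribution drift).

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def validate_data(rows, required_columns, max_null_fraction=0.05):
    """Data gate (hypothetical): required columns must be mostly non-null."""
    if not rows:
        return GateResult(False, ["dataset is empty"])
    reasons = []
    for col in required_columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        fraction = missing / len(rows)
        if fraction > max_null_fraction:
            reasons.append(
                f"{col}: {fraction:.0%} null exceeds {max_null_fraction:.0%}"
            )
    return GateResult(not reasons, reasons)


def evaluate_against_baseline(candidate_score, baseline_score, min_gain=0.0):
    """Model gate (hypothetical): the candidate must reach the baseline score,
    plus an optional margin, on the same holdout set. Higher is better."""
    required = baseline_score + min_gain
    if candidate_score >= required:
        return GateResult(True, [])
    return GateResult(
        False, [f"candidate {candidate_score:.3f} is below required {required:.3f}"]
    )


def can_deploy(data_result, model_result):
    """Both gates must pass before a version is allowed to ship."""
    return data_result.passed and model_result.passed
```

In a CI job, a failing gate would typically exit with a non-zero status so the pipeline stops before the deployment stage.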

The other half of this section is versioning — and it's one of the most neglected practices in ML engineering. In software, you version code with Git. In ML, you have three additional things to version, and failing to track any of them means you cannot debug, reproduce, or audit your system.

Version Everything

Code (Always)
Standard software engineering practice: every model is trained by code, and that code must be reproducible.
If you can't reproduce the training script exactly as it ran six months ago, you can't explain why the model behaved the way it did. Version control is the baseline.
Tools: Git

Data (Critical)
A model trained on different data produces different predictions. Without data versioning, you can't recreate model v17 six months later.
Data changes constantly: rows are added, corrected, deleted. If you overwrote last month's training set, that model version is gone forever, even if the code and weights survived.
Tools: DVC, lakeFS, Delta Lake

Model (Required)
Rollbacks, A/B testing, audit trails, and regulatory compliance all depend on being able to retrieve any previous model artifact.
When production degrades, you need to roll back in minutes, not hours. A model registry stores every trained artifact with its metadata, making rollback a one-line operation.
Tools: MLflow Model Registry, W&B

Config (Often Missed)
Hyperparameters, feature flags, environment settings: a different learning rate is a different model, even if the code and data are identical.
Config is easy to overlook because it doesn't live in a file you think of as 'the model.' But two runs with identical code and data and different configs will produce different outputs. Track it the same way you'd track code.
Tools: Git, environment configs
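
One lightweight way to notice that data or config has changed is to record a content fingerprint alongside each training run. The sketch below uses only the Python standard library. The function names are hypothetical, and this shows the underlying idea rather than how DVC or any other tool listed above works internally.

```python
import hashlib
import json


def fingerprint_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_config(config):
    """Return the SHA-256 hex digest of a JSON-serializable config dict.

    Keys are sorted so that two dicts with the same contents produce
    the same fingerprint regardless of insertion order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A fingerprint only tells you that something changed. To recreate an old model you still need to keep the old data and config themselves, which is what the tools above are for.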

The lineage between all four is what makes a system auditable: Model v17 was trained on Data v9, using Code v42, with Config v3. Without that lineage, you cannot answer "why did the model get worse after we retrained it?"
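
A lineage record can be as simple as four version identifiers stored together. The following is a rough sketch with an in-memory registry standing in for a real model registry. The class and method names are invented for illustration and do not correspond to the MLflow or W&B APIs.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Lineage:
    model: str
    data: str
    code: str
    config: str


class LineageRegistry:
    """Hypothetical in-memory stand-in for a real model registry."""

    def __init__(self):
        self._records = []       # kept in registration order
        self._production = None  # the record currently serving traffic

    def register(self, lineage):
        self._records.append(lineage)
        return lineage

    def promote(self, model):
        """Mark the named model version as the production version."""
        for record in self._records:
            if record.model == model:
                self._production = record
                return record
        raise KeyError(model)

    def production(self):
        return self._production

    def rollback(self):
        """Point production at the version registered just before it."""
        if self._production is None:
            raise ValueError("nothing is in production")
        index = self._records.index(self._production)
        if index == 0:
            raise ValueError("no earlier version to roll back to")
        self._production = self._records[index - 1]
        return self._production


def diff(old, new):
    """Return the inputs (data, code, config) that differ between versions."""
    before, after = asdict(old), asdict(new)
    return {
        key: (before[key], after[key])
        for key in before
        if key != "model" and before[key] != after[key]
    }
```

Suppose a hypothetical Model v16 was trained on Data v8 with the same code and config as v17. Then `diff(v16, v17)` returns `{'data': ('v8', 'v9')}`, which points the investigation at the data, and `registry.rollback()` moves production back to v16.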

Start Versioning on Your First Model, Not Your Tenth

Every team I've talked to that doesn't version data says the same thing: "we'll add it later when the system is more mature." Then "later" never comes, the system grows, and retrofitting versioning onto an existing production system is far harder than building it in from the start.

Versioning everything from day one is the most valuable habit you can build now, when the stakes are low. Your future self, your future colleagues, and your future auditors will thank you.

Tools Worth Knowing

  • MLflow: open-source experiment tracking and model registry. Log parameters, metrics, and artifacts for each training run. Compare runs across experiments. One of the most widely used MLOps tools in industry.
  • DVC (Data Version Control): Git for datasets. Track large files that can't go in Git itself, with pointers stored in the repo and the data in S3 or GCS.
  • Weights & Biases (W&B): experiment tracking, hyperparameter sweeps, model versioning. Popular in research and in companies that want a richer UI than MLflow provides. Students get it for free, so it is worth trying out now!
  • GitHub Actions / GitLab CI: general-purpose CI/CD. Write a YAML file that defines what to run when code is pushed. Many ML teams layer MLflow on top of GitHub Actions for the ML-specific parts.

Checkpoint

A team retrained their demand forecasting model and deployed it. Two weeks later, performance is noticeably worse. They want to roll back to the previous version — but they discover that the training data was overwritten with the new version's training set, and the previous model artifact wasn't saved. What could they have done to avoid this situation?