CI/CD and Versioning Everything
In software engineering, CI/CD — Continuous Integration / Continuous Deployment — is the practice of automating the testing and release of code. Push a change; tests run automatically; if they pass, the change ships. If they fail, it doesn't.
In ML, CI/CD gets harder. You're not just testing code. You're testing code, data, and models — and each one can fail independently while the others look fine. A model can pass its evaluation metrics on a frozen test set and still perform badly in production because the data distribution has shifted since the test set was created. You need to test all three layers.
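One way to catch the distribution shift described above is to compare a feature's values in the training data against recent production values. The sketch below uses a two-sample Kolmogorov-Smirnov statistic written in plain Python. The function names and the 0.2 threshold are illustrative assumptions, not a recommended standard, and a real pipeline would likely check many features.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two numeric samples."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value that appears in either sample.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def has_drifted(reference, current, threshold=0.2):
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, current) > threshold
```

A statistic of 0 means the two samples have identical empirical distributions, and 1 means they do not overlap at all. Where to set the threshold between those extremes is a judgment call that depends on the feature and the sample sizes.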
ML CI/CD adds two validation layers that software CI/CD lacks: data validation and model evaluation against a holdout baseline. Both must pass before any version reaches production.
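As a rough illustration of those two extra layers, the sketch below combines a simple data check with a comparison against a baseline score and only reports a pass when both succeed. Everything here is hypothetical: the function names, the null-fraction limit, and the assumption that a higher score is better. Real data validation would usually cover schema, ranges, and drift as well.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list


def validate_data(rows, required_columns, max_null_fraction=0.05):
    """Return a list of problems found in the dataset (empty means OK)."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    for col in required_columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        fraction = missing / len(rows)
        if fraction > max_null_fraction:
            problems.append(
                f"{col}: {fraction:.0%} null exceeds {max_null_fraction:.0%}"
            )
    return problems


def evaluate_against_baseline(candidate_score, baseline_score, tolerance=0.0):
    """Assumes higher scores are better; fail if the candidate regresses."""
    if candidate_score + tolerance < baseline_score:
        return [
            f"candidate {candidate_score:.3f} below baseline {baseline_score:.3f}"
        ]
    return []


def promotion_gate(rows, required_columns, candidate_score, baseline_score):
    """Both layers must pass before a version is promoted."""
    reasons = validate_data(rows, required_columns)
    reasons += evaluate_against_baseline(candidate_score, baseline_score)
    return GateResult(passed=not reasons, reasons=reasons)
```

In a CI job, a failed gate would typically translate into a non-zero exit code so the pipeline stops before deployment.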
The other half of this section is versioning, one of the most neglected practices in ML engineering. In software, you version code with Git. In ML, you have three additional things to version: data, models, and configuration. Failing to track any of them means you cannot debug, reproduce, or audit your system.
The lineage between all four is what makes a system auditable: Model v17 was trained on Data v9, using Code v42, with Config v3. Without that lineage, you cannot answer "why did the model get worse after we retrained it?"
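A lineage record can be as small as four version identifiers stored together. The sketch below is a minimal in-memory version, assuming hypothetical class names (`Lineage`, `LineageRegistry`); in practice this information would live in a model registry or a database rather than a Python dictionary.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Lineage:
    model_version: str
    data_version: str
    code_version: str
    config_version: str


class LineageRegistry:
    """Hypothetical in-memory registry keyed by model version."""

    def __init__(self):
        self._records = {}

    def register(self, lineage):
        # Refuse to overwrite: a lineage record should be immutable history.
        if lineage.model_version in self._records:
            raise ValueError(f"{lineage.model_version} is already registered")
        self._records[lineage.model_version] = lineage

    def get(self, model_version):
        return self._records[model_version]

    def diff(self, old_version, new_version):
        """Return the inputs that changed between two model versions."""
        old = asdict(self.get(old_version))
        new = asdict(self.get(new_version))
        return {
            key: (old[key], new[key])
            for key in old
            if key != "model_version" and old[key] != new[key]
        }
```

With records like these, the question "why did the model get worse after we retrained it?" starts with a diff. For example, if a hypothetical v16 was trained on Data v8 with the same code and config as v17, then `diff("v16", "v17")` reports only the data change.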
Start Versioning on Your First Model, Not Your Tenth
Every team I've talked to that doesn't version data says the same thing: "we'll add it later when the system is more mature." But "later" never comes, the system grows, and retrofitting versioning onto a production system is far harder than building it in from the start.
Versioning everything from day one is the most valuable habit you can build now, when the stakes are low. Your future self, along with future colleagues and future auditors, will thank you.
Tools Worth Knowing
- MLflow: open-source experiment tracking and model registry. Log parameters, metrics, and artifacts for each training run, and compare runs across experiments. One of the most widely used MLOps tools in industry.
- DVC (Data Version Control): Git for datasets. It tracks large files that can't go in Git itself, storing small pointer files in the repo and the data itself in remote storage such as S3 or GCS.
- Weights & Biases (W&B): experiment tracking, hyperparameter sweeps, and model versioning. Popular in research and in companies that want a richer UI than MLflow provides. Students get this for free, so it is worth trying out now!
- GitHub Actions / GitLab CI: general-purpose CI/CD. Write a YAML file that defines what to run when code is pushed. Many ML teams pair a CI system such as GitHub Actions with MLflow for the ML-specific parts.
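To make the "pointers in the repo, data elsewhere" idea concrete, here is a minimal sketch of a content-addressed pointer file using only the Python standard library. It illustrates the general technique and is not DVC's actual file format or API; the function names, the `.pointer.json` suffix, and the `remote_prefix` argument are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large datasets fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path, remote_prefix):
    """Write a small JSON pointer next to the data file and return its path.

    The pointer is what would be committed to Git; the data itself would be
    uploaded to the (assumed) remote location named in the pointer.
    """
    data_path = Path(data_path)
    sha = file_digest(data_path)
    pointer = {
        "path": data_path.name,
        "sha256": sha,
        "size": data_path.stat().st_size,
        "remote": f"{remote_prefix}/{sha}",
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def verify_pointer(pointer_path):
    """Check that the local data file still matches the recorded hash."""
    pointer_path = Path(pointer_path)
    pointer = json.loads(pointer_path.read_text())
    data_path = pointer_path.with_name(pointer["path"])
    return file_digest(data_path) == pointer["sha256"]
```

Because the pointer records a hash of the content, overwriting the dataset changes the hash, and the mismatch is detectable from the repo history alone. The upload step is deliberately left out of this sketch.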
A team retrained their demand forecasting model and deployed it. Two weeks later, performance is noticeably worse. They want to roll back to the previous version — but they discover that the training data was overwritten with the new version's training set, and the previous model artifact wasn't saved. What could they have done to avoid this situation?