Versioning and Reproducibility

Reproducibility is the goal: someone restarting your work from scratch should arrive at the same results. This sounds obvious until you're six months into a project and can't remember which version of the preprocessing script produced the model that's currently in production.

Best Practices for Data Versioning

  • Use version control — for data transformations alongside code.
  • Write meaningful commit messages — your future self is the primary user.
  • Maintain a data catalog — a metadata file describing each dataset version: what it contains, where it came from, what was cleaned.
  • Consistent folder structure — raw / interim / processed / final. Stick to it.
  • Document and share your versioning practices with the team before work starts, not after.

Increasingly sophisticated tools exist for data versioning: DVC (Data Version Control), MLflow, Weights & Biases (free for students!), LakeFS. These treat data with the same rigor we treat code. If you go into industry, expect to encounter at least one of these on any serious ML team. Learning one as a student is a genuine résumé differentiator.

Checkpoint

Imagine you trained a model three months ago that's now running in production. Your manager asks you to retrain it with new data — but the original preprocessing script is gone and you don't remember the exact steps. What would you have done differently at the start of the project to prevent this situation?