Machine Learning Model and Dataset Versioning

Kurian Benoy (~kurianbenoy)


In this talk we will discuss current best practices for organizing ML projects and why traditional open-source tools like Git and Git-LFS fall short for them.

Currently, the life cycle of a machine learning model goes through the following process:

  • An ML practitioner tries out a new image classification algorithm on an input dataset.
  • She tweaks the algorithm, tries other ideas and fixes bugs, all on her local system.
  • Some of her training runs take a long time, and the code may change while the weights remain the same.
  • She keeps the model weights and evaluation scores for all her runs, and picks which weights to release as the final model once she’s out of time to run more experiments.
  • She publishes her results, with code and the trained weights.
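The bookkeeping in the workflow above can be sketched in a few lines of Python. This is only an illustration of the manual process, not part of DVC or any other tool: the function names (`file_digest`, `record_run`, `best_run`) and the JSON-lines log format are hypothetical choices made for this example.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def record_run(log_path: Path, weights: Path, score: float, note: str) -> dict:
    """Append one run (weights path, weights digest, score, note) to a JSON-lines log."""
    entry = {
        "weights": str(weights),
        "digest": file_digest(weights),
        "score": score,
        "note": note,
    }
    with log_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def best_run(log_path: Path) -> dict:
    """Return the logged run with the highest evaluation score."""
    lines = log_path.read_text().splitlines()
    runs = [json.loads(line) for line in lines if line]
    return max(runs, key=lambda run: run["score"])
```

Calling `record_run` after each training run and `best_run` before release covers the "keep everything, pick the best" step. It still leaves the weights files themselves unversioned, which is the gap the rest of the talk addresses.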

Git can’t handle large amounts of data that run to gigabytes in size. Git-LFS comes with the built-in difficulty of supporting only 2 GB of data at most (a GitHub limitation), and further problems exist.

Data Version Control (DVC, DVC.ORG) is an open-source command-line tool written in Python. We will show how to version datasets with dozens of gigabytes of data and ML models, how to use your favourite cloud storage (S3, GCS, or a bare-metal SSH server) as a data file backend, and how to embrace the best engineering practices in your ML projects. I will also discuss tools on the market for both experiment tracking and dataset versioning, and the best features of each product (PS: no comparison among them).
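The general idea behind versioning large files alongside Git can be sketched as follows: the large file goes into a content-addressed cache, and only a small pointer file is committed. This is a simplified illustration under stated assumptions, not DVC's actual implementation or file format; the names `track` and `restore` and the `.ptr.json` suffix are hypothetical, and a local directory stands in for a remote backend such as S3.

```python
import hashlib
import json
import shutil
from pathlib import Path


def digest_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file.

    The pointer file (hypothetical format) is what would be committed to Git.
    """
    digest = digest_of(data_file)
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    meta = {"path": data_file.name, "sha256": digest, "size": data_file.stat().st_size}
    pointer.write_text(json.dumps(meta))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file next to its pointer from the cache."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["sha256"][:2] / meta["sha256"][2:]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cached, target)
    return target
```

Because the pointer file is only a few bytes, checking out an older Git commit and calling `restore` brings back the matching dataset or model version without Git ever storing the large file itself.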

Talk Outline

  • Startup Adventures
  • Challenges
  • Model and Dataset versioning?
  • How I discovered DVC
  • Use case: Versioning a Cats vs Dogs deep learning problem (8 min)
  • Conclusion


Prerequisites:

  • Should preferably have trained an ML model and worked with datasets larger than 100 MB
  • Working-level knowledge of machine learning
  • A mindset to improve your current workflow

Speaker Info:

Kurian Benoy is an open-source contributor at CloudCV and DVC. He is the lead organiser of School of AI, Kochi, and an AI enthusiast working on Deep Learning and Computer Vision. Kurian is a FOSSASIA Open TechNights winner and gave a talk at the FOSSASIA OpenTech Summit about the team.

I am an active Kaggler, was the first person to introduce Data Version Control on Kaggle, and am among the top 10 contributors to DVC so far.

Speaker Links:

Id: 1272
Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: