Model and Dataset Versioning Practices Using the DVC Tool

Aman Sharma (~algomaster99)


Description:

Python is a prevalent programming language in the machine learning (ML) community. Many Python engineers and data scientists feel the lack of engineering practices such as versioning large datasets and ML models, and the lack of reproducibility. This gap is particularly acute for engineers who have just moved into the ML space.

As an open-source contributor, I was specifically interested in open-source version control systems. Git combined with ad-hoc conventions on top of cloud storage is a common choice, but this toolset has certain limitations. New ML-specific version control systems are being developed to better respond to the current needs of ML teams. DVC is one such tool.

Data Version Control (DVC, dvc.org) is an open-source command-line tool written in Python. I became familiar with it while contributing to the project and was amazed by its broad functionality and efficiency. I will show how to use DVC to version datasets with dozens of gigabytes of data and to version ML models, how to use your favourite storage (S3, GCS, or a bare-metal SSH server) as a backend for data files, and how to embrace the best engineering practices in your ML projects.

Data versioning isn't the same as source code versioning because data files are often huge. DVC helps software engineers who are used to Git and versioning survive in ML projects.
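To make the difference concrete, here is a minimal Python sketch of the general idea behind versioning large files next to Git: the large file is copied into a content-addressed cache, and only a tiny pointer file (which Git handles well) records which version is in use. This is an illustration under stated assumptions, not DVC's actual implementation; the function names `track` and `checkout`, the `.ptr.json` suffix, and the cache layout are all hypothetical.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache and write a small pointer file beside it.

    The pointer file is what would be committed to Git (hypothetical format).
    """
    checksum = file_sha256(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    if not cached.exists():  # identical content is stored only once
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": checksum}, indent=2))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file version that the pointer file refers to."""
    meta = json.loads(pointer.read_text())
    checksum = meta["sha256"]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / checksum[:2] / checksum[2:], target)
    return target
```

In this sketch, switching between dataset versions amounts to switching the pointer file (for example with a Git checkout) and then calling `checkout` to restore the matching data from the cache.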

The talk will include the following topics:

  • Introduction to the speaker and his involvement with DVC [2 minutes]
  • Comparison of data science and software development [3 minutes]
  • ML without dataset and model versioning [4 minutes]
    • Multiple data files lead to excessive disk space consumption
  • Tackling problems in a data science project [4 minutes]
    • Reproducibility using ML pipelines built with DVC
    • Metrics tracking
  • A small DVC demo [5 minutes]
  • Further improvements [4 minutes]
  • Q/A session [5 minutes]

P.S. The schedule above leaves a margin of 3 minutes.

(The outline is tentative and is subject to change.)
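The reproducibility item in the outline rests on a simple idea: a pipeline stage only needs to be re-run when one of its inputs has changed. The following Python sketch shows that idea in isolation. It is a simplified illustration, not how DVC itself is implemented; `run_stage`, the lock-file format, and the stage name used in the test are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, Sequence


def fingerprint(paths: Sequence[Path]) -> Dict[str, str]:
    """Map each dependency path to a hash of its current content.

    read_bytes() keeps the sketch short; a real tool would hash in chunks.
    """
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}


def run_stage(
    name: str,
    deps: Sequence[Path],
    action: Callable[[], None],
    lock_file: Path,
) -> bool:
    """Run `action` only if the dependencies changed since the last run.

    Returns True when the stage was executed and False when it was skipped.
    Simplification: outputs are not checked, only inputs.
    """
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    current = fingerprint(deps)
    if lock.get(name) == current:
        return False  # inputs unchanged, so the previous result is still valid
    action()
    lock[name] = current
    lock_file.write_text(json.dumps(lock, indent=2, sort_keys=True))
    return True
```

Because the recorded hashes live in a small text file, they can be committed to Git together with the code, which ties each result to the exact inputs that produced it.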

A work-in-progress PPT can be found at the following link: https://docs.google.com/presentation/d/1N4mmYe4B357sA1FRTGSaMBz4Vq2DXxlsRvD9WGvTj58/edit?usp=sharing

Here's a short introductory video about DVC: https://drive.google.com/open?id=1-CGxSJ3Qw4b3AS4vkAe3iWTawP4x3IvJ

Prerequisites:

Familiarity with basic Linux commands

Speaker Info:

Aman Sharma is a passionate software developer with active involvement in the open-source community. Since May 2019, I have been actively contributing to DVC by resolving issues, implementing new ideas, and writing documentation for them.

I was introduced to the open-source community earlier this year when I started preparing for Google Summer of Code (GSoC). My contributions to one organization, Vega, became quite significant, which eventually led to a GSoC internship with them.

I am currently pursuing a B.Tech at IIT Roorkee, where I am part of a student group called the Information Management Group, which is driven to assist the institute technologically by developing applications. Our group manages the institute website and the student portal, which runs on the institute intranet.

Speaker Links:

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: