Version Control for Data Science
Alessia Marcolini
Data is the key differentiator between a Machine Learning project and a traditional software project: even if everything else stays stable, changing the data your models are trained upon makes a huge difference.
The best tools for tracking changes are the version control systems (VCS) used in software development, such as Git, Mercurial, and Subversion. They keep track of what was changed in a file, when, and by whom, and synchronize changes to a shared repository so that multiple contributors can work on the same set of files. But these traditional tools aren’t quite sufficient for Machine Learning, where you need to track the data sets and some of the resulting models along with the code itself.
So versioning in Data Science projects can be pretty painful. There are six things that you usually want to keep track of:
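One common way to bring a data set under version control without committing the data itself is to commit a small text file holding a content hash of the data. The sketch below illustrates that idea only; the function names `fingerprint` and `write_pointer` and the `.sha256` suffix are hypothetical choices for this example, not the format of any particular tool.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data sets never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write '<name>.sha256' next to the data file and return its path.

    The pointer file is tiny, so it can be committed to Git even when
    the data set itself is too large to commit (hypothetical convention).
    """
    data_path = Path(data_path)
    pointer = data_path.with_name(data_path.name + ".sha256")
    pointer.write_text(fingerprint(data_path) + "\n")
    return pointer
```

If the data changes, the hash in the pointer file changes too, so an ordinary Git diff shows that the data set is no longer the one the model was trained on.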
- code
- data
- configurations
- resulting models
- performance metrics
- environments / dependencies
Running a Data Science project is an iterative process, and you usually don’t want to commit every time you change one parameter or one performance metric moves. Instead, you'll run a variety of experiments and commit only once you’re satisfied.
This usually means that during the experimentation process you lose track of some of the experiments you ran (e.g. changes to data or dependencies). When you then share your results with your colleagues, they won't have any idea of what you've already tried and will most likely end up redoing a bunch of work. Heck, after a couple of weeks you could end up doing the same.
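A lightweight way to avoid losing experiments between commits is to append one record per run to a log file. The following is a minimal sketch under stated assumptions: the helper names (`current_commit`, `log_experiment`, `load_experiments`), the record fields, and the JSON Lines layout are all hypothetical, and the data hash and model path are simply passed in by the caller.

```python
import json
import platform
import subprocess
import time


def current_commit():
    """Return the current Git commit hash, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or we are not inside a repository with commits.
        return None
    return result.stdout.strip()


def log_experiment(log_path, config, metrics, data_hash=None, model_path=None):
    """Append one experiment record as a JSON line and return it."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "code": current_commit(),
        "data": data_hash,        # e.g. a content hash of the training set
        "config": config,         # hyperparameters and other settings
        "model": model_path,      # where the trained model was saved
        "metrics": metrics,
        "environment": {"python": platform.python_version()},
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_experiments(log_path):
    """Read every logged record back, oldest first."""
    with open(log_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Calling `log_experiment("experiments.jsonl", {"lr": 0.01}, {"accuracy": 0.9})` after each run leaves a trail that colleagues can read with `load_experiments`, even for runs that were never committed. The file name and values here are illustrative only.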
In this talk I will share some best practices to help you better version your ML project, and I will show some existing tools such as DVC, ndim and ReviewNB (to version Jupyter Notebooks).
This talk is aimed at PyData beginners. No specific Machine Learning expertise is required, although some knowledge of Git and the Data Science ecosystem will help you follow the talk.
Alessia Marcolini
Affiliation: Fondazione Bruno Kessler
Enthusiastic and curious ICT student at the University of Trento, with a passion for deep learning. Junior Research Assistant at FBK, working on machine and deep learning solutions for environmental health and food quality. Pythonista, PyCon Italy organizer and Django Girls coach.