Data Version Control: Tool for Iterative Machine Learning

Description

Data Version Control (DVC) is a new open-source tool designed to help data scientists keep track of their ML processes and file dependencies through simple Git-like commands. This presentation walks you through the iterative process of building a machine learning model with DVC.

Abstract:

Introduction

In real life, it is hardly possible to develop a good machine learning model in a single pass. ML modeling is an iterative process, and it is extremely important to keep track of your steps, the dependencies between them, the dependencies between your code and data files, and all the command-line arguments your code runs with. This becomes even more important, and more complicated, in a team environment, where collaboration among data scientists takes a significant share of the team’s effort.

We are pleased to show you a new open-source tool: Data Version Control, or DVC. DVC is designed to help data scientists keep track of their ML processes and file dependencies through simple Git-like commands, such as “dvc run python trainmodel.py data/trainmatrix.p data/model.p”. Your existing ML processes can be easily transformed into reproducible DVC pipelines, regardless of which programming language or tool they use.

Environment preparation

We walk you through an iterative process of building a machine learning model with DVC, using a Stack Overflow posts dataset. First, initialize a Git repository and download the modeling source code that we will use to show DVC in action.

Evaluate the model on the testing dataset.

The result:

```
$ cat data/evaluation.txt
AUC: 0.596182
```

Reproducing pipeline

The one thing to wrap your head around is that DVC automatically derives the dependencies between the steps and builds the dependency graph (DAG) transparently to the user. This graph is used to reproduce the parts of your pipeline that were affected by recent changes. For example, when we change the feature extraction step of the pipeline and reproduce the final result, DVC determines that only three of the seven steps need to be rebuilt and runs only those steps.
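To make the idea of partial reproduction concrete, here is a minimal Python sketch of how a tool could decide which steps to rerun from a dependency graph. It is not DVC's implementation. The seven step names, the file paths, and the `affected_steps` helper are all hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical seven-step pipeline: step -> (input files, output files).
# These names are illustrative and are not taken from the original project.
PIPELINE = {
    "import":    ([], ["data/raw.tgz"]),
    "unpack":    (["data/raw.tgz"], ["data/raw.xml"]),
    "convert":   (["data/raw.xml", "code/convert.py"], ["data/posts.tsv"]),
    "split":     (["data/posts.tsv", "code/split.py"],
                  ["data/train.tsv", "data/test.tsv"]),
    "featurize": (["data/train.tsv", "data/test.tsv", "code/featurize.py"],
                  ["data/train_matrix.p", "data/test_matrix.p"]),
    "train":     (["data/train_matrix.p", "code/train.py"], ["data/model.p"]),
    "evaluate":  (["data/model.p", "data/test_matrix.p", "code/evaluate.py"],
                  ["data/evaluation.txt"]),
}


def affected_steps(pipeline, changed_files):
    """Return the steps that must rerun after changed_files were modified."""
    # Map every output file to the step that produces it.
    producer = {out: step
                for step, (_, outs) in pipeline.items()
                for out in outs}
    # Step-level graph: each step points to the steps it depends on.
    upstream = {step: {producer[f] for f in ins if f in producer}
                for step, (ins, _) in pipeline.items()}

    stale_files = set(changed_files)
    to_run = []
    # Visit steps with dependencies first, so staleness propagates downstream.
    for step in TopologicalSorter(upstream).static_order():
        ins, outs = pipeline[step]
        if stale_files.intersection(ins):
            to_run.append(step)
            stale_files.update(outs)  # outputs of a rerun step are stale too
    return to_run


if __name__ == "__main__":
    # Changing the feature extraction code invalidates three of seven steps.
    print(affected_steps(PIPELINE, ["code/featurize.py"]))
    # → ['featurize', 'train', 'evaluate']
```

In this toy model, a step is rerun only if one of its inputs changed directly or was produced by a step that was itself rerun, which is why the four upstream steps are left alone.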

Data sharing

Not only can DVC streamline your work into a single, reproducible environment, it also makes it easy to share this environment, including the dependency graph (DAG), through Git. This is an exciting collaboration feature that makes it possible to reproduce research results on different computers. Moreover, because DVC does not push data files to Git repositories, you can share your data files through cloud storage services such as AWS S3 or GCP Storage.

Now, another data scientist can use this repository and reproduce the data files the same way you just did. If she does not want to reproduce everything, or does not have enough computational resources to do so, she can sync and lock the shared data files. After that, only the last steps of the ML process will be reproduced.
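The effect of locking shared files can be illustrated with a small Python sketch. It assumes that a "locked" file is simply treated as up to date, so the steps that would produce it are skipped. This is a toy model rather than DVC's actual logic, and the pipeline, file paths, and `steps_to_run` helper are hypothetical.

```python
# Hypothetical pipeline: step -> (input files, output files).
# These names are illustrative and are not taken from the original project.
PIPELINE = {
    "import":    ([], ["data/raw.tgz"]),
    "unpack":    (["data/raw.tgz"], ["data/raw.xml"]),
    "convert":   (["data/raw.xml"], ["data/posts.tsv"]),
    "split":     (["data/posts.tsv"], ["data/train.tsv", "data/test.tsv"]),
    "featurize": (["data/train.tsv", "data/test.tsv"],
                  ["data/train_matrix.p", "data/test_matrix.p"]),
    "train":     (["data/train_matrix.p"], ["data/model.p"]),
    "evaluate":  (["data/model.p", "data/test_matrix.p"],
                  ["data/evaluation.txt"]),
}


def steps_to_run(pipeline, target, locked_files=()):
    """List the steps needed to build target, dependencies first.

    Files in locked_files are assumed to be already synced and up to date,
    so the walk does not go past them.
    """
    producer = {out: step
                for step, (_, outs) in pipeline.items()
                for out in outs}
    locked = set(locked_files)
    ordered, seen = [], set()

    def visit(path):
        # Stop at locked files and at files no step produces (e.g. source code).
        if path in locked or path not in producer:
            return
        step = producer[path]
        if step in seen:
            return
        seen.add(step)
        for dependency in pipeline[step][0]:
            visit(dependency)
        ordered.append(step)  # post-order: a step follows its dependencies

    visit(target)
    return ordered


if __name__ == "__main__":
    # Without locks, every step is needed to rebuild the evaluation.
    print(steps_to_run(PIPELINE, "data/evaluation.txt"))
    # With the feature matrices synced and locked, only the last steps remain.
    print(steps_to_run(PIPELINE, "data/evaluation.txt",
                       ["data/train_matrix.p", "data/test_matrix.p"]))
    # → ['train', 'evaluate']
```

Under this assumption, the collaborator pays only for training and evaluation, while the expensive upstream steps are replaced by the shared files.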