Versioning Machine Learning Experiments vs Tracking Them

When working on a machine learning project it is common to run numerous experiments in search of the combination of algorithm, parameters and data preprocessing steps that yields the best model for the task at hand. To keep track of these experiments, data scientists used to log them in Excel sheets for lack of a better option. Being mostly manual, this approach had its downsides: it was error-prone, inconvenient, slow, and completely detached from the actual experiments.

Luckily, experiment tracking has come a long way over the last few years, and a number of tools have appeared on the market that improve the way experiments can be tracked, e.g. Weights & Biases, MLflow, Neptune. Usually such tools offer an API you can call from your code to log the experiment information. It is then stored in a database, and you use a dashboard to compare experiments visually. With that, once you change your code, you no longer have to worry about forgetting to write the results down, because that’s done automatically for you. The dashboards help with visualization and sharing.

This is a great improvement in keeping track of what has been done, but spotting the experiment with the best metrics in a dashboard does not automatically translate into having that model ready for deployment. You will likely need to reproduce the best experiment first. However, the tracking dashboards and tables you look at are only weakly connected to the experiments themselves, so you may still need to trace your steps back semi-manually to stitch together the exact code, data and pipeline steps. Could this be automated?

In this blog post I’d like to talk about versioning experiments instead of tracking them, and how this can result in easier reproducibility on top of the benefits of experiment tracking.

To achieve this I will be using DVC, an open source tool that is mostly known in the context of data versioning (after all, it’s in the name). However, this tool can do a lot more. For instance, you can use DVC to define ML pipelines, run multiple experiments, and compare metrics. You can also version all the moving parts that contribute to an experiment.

Experiment Versioning

To start versioning experiments with DVC you’ll need to initialize it from any Git repo as shown below. Note that DVC expects your project to be structured in a certain logical way, so you may need to reorganize your folders a bit.

$ dvc exp init -i

This command will guide you to set up a default stage in dvc.yaml.
See https://dvc.org/doc/user-guide/project-structure/pipelines-files.

DVC assumes the following workspace structure:

├── data
├── metrics.json
├── models
├── params.yaml
├── plots
└── src

Command to execute: python src/train.py
Path to a code file/directory [src, n to omit]: src/train.py
Path to a data file/directory [data, n to omit]: data/images/
Path to a model file/directory [models, n to omit]:
Path to a parameters file [params.yaml, n to omit]:
Path to a metrics file [metrics.json, n to omit]:
Path to a plots file/directory [plots, n to omit]: logs.csv
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
default:
  cmd: python src/train.py
  deps:
  - data/images/
  - src/train.py
  params:
  - model
  - train
  outs:
  - models
  metrics:
  - metrics.json:
      cache: false
  plots:
  - logs.csv:
      cache: false
Do you want to add the above contents to dvc.yaml? [y/n]: y

Created default stage in dvc.yaml. To run, use "dvc exp run".
See https://dvc.org/doc/user-guide/experiment-management/running-experiments.

You may also notice that DVC assumes that you store parameters and metrics in files instead of logging them with an API. This means you’ll need to modify your code to read parameters from a YAML file and write metrics to a JSON file.
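As a rough sketch of what that change might look like in `src/train.py`, the snippet below reads parameters from `params.yaml` and writes metrics to `metrics.json`. The function names, the `train` stub and its placeholder return values are hypothetical, and reading YAML assumes the third-party PyYAML package is installed.

```python
import json
from pathlib import Path


def load_params(path="params.yaml"):
    """Read the nested parameter dictionary from a YAML file."""
    # PyYAML is a third-party dependency (assumed installed); importing it
    # here keeps the metrics helper below usable without it.
    import yaml

    with open(path) as f:
        return yaml.safe_load(f)


def write_metrics(metrics, path="metrics.json"):
    """Write scalar metrics as JSON so they can be versioned and diffed."""
    Path(path).write_text(json.dumps(metrics, indent=2) + "\n")


def train(params):
    """Hypothetical stand-in for the real training loop."""
    # A real script would build a model from params["model"] and fit it
    # using the settings in params["train"].
    return {"loss": 0.0, "acc": 0.0}  # placeholder values, not real results
```

With helpers like these, the body of the script reduces to `write_metrics(train(load_params()))`, and nothing in it depends on a tracking service.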

Finally, when initializing, DVC automatically creates a basic pipeline and stores it in a dvc.yaml file. With that, your training code, pipeline, parameters, and metrics now live in files that can be versioned.

Benefits of Experiment-as-Code Approach

Clean code

When set up this way, your code no longer depends on an experiment tracking API. Instead of inserting tracking API calls in your code to save experiment information in a central database, you save it in readable files. These are always available in your repo, your code stays clean, and you have fewer dependencies. Even without DVC, you can read, save, and version your experiment parameters and metrics with Git, though plain Git is not the most convenient way to compare ML experiments.

$ git diff HEAD~1 -- params.yaml
diff --git a/params.yaml b/params.yaml
index baad571a2..57d098495 100644
--- a/params.yaml
+++ b/params.yaml
@@ -1,5 +1,5 @@
 train:
   epochs: 10
-model:
-  conv_units: 16
+model:
+  conv_units: 128
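A raw text diff gets harder to read as parameter files grow. As a rough illustration of what a more convenient comparison could look like without extra tooling, the sketch below flattens two parameter dictionaries and reports only the entries that changed. The helper names `flatten` and `diff_params` are hypothetical, and the values mirror the diff above.

```python
def flatten(d, prefix=""):
    """Turn {"model": {"conv_units": 16}} into {"model.conv_units": 16}."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def diff_params(old, new):
    """Return {name: (old_value, new_value)} for every changed entry."""
    old_flat, new_flat = flatten(old), flatten(new)
    changed = {}
    for name in sorted(old_flat.keys() | new_flat.keys()):
        before, after = old_flat.get(name), new_flat.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


old = {"train": {"epochs": 10}, "model": {"conv_units": 16}}
new = {"train": {"epochs": 10}, "model": {"conv_units": 128}}
print(diff_params(old, new))  # {'model.conv_units': (16, 128)}
```

The same helper would work on two versions of `metrics.json` loaded with `json.load`, since both files are plain nested dictionaries.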

Reproducibility

Experiment tracking databases do not capture everything you need to reproduce an experiment. One important piece that is often missing is the pipeline to run the experiment end to end. Let’s take a look at the `dvc.yaml` file, the pipeline file that has been generated.

$ cat dvc.yaml
stages:
  default:
    cmd: python src/train.py
    deps:
    - data/images
    - src/train.py
    params:
    - model
    - train
    outs:
    - models
    metrics:
    - metrics.json:
        cache: false
    plots:
    - logs.csv:
        cache: false

This pipeline captures the command to run the experiment, parameters and other dependencies, metrics, plots, and other outputs. It has a single `default` stage, but you can add as many stages as you need. When treating all aspects of an experiment as code, including the pipeline, it becomes easier for anyone to reproduce the experiment.

Reduce noise

In a dashboard, you can see all of your experiments, and I mean ALL of them. At a certain point you will have so many experiments that you have to sort, label, and filter them simply to keep up. With experiment versioning you have more flexibility in what you share and how you organize things.

For instance, you can try an experiment in a new Git branch. If something goes wrong or the results are uninspiring, you can choose not to push the branch. This way you can reduce some unnecessary clutter that you would otherwise encounter in an experiment tracking dashboard.

At the same time, if a particular experiment looks promising, you can push it to your repo along with your code so that the results stay in sync with the code and pipeline. The results are shared with the same people who can see the code, and they are already organized under your existing branch name. You can keep iterating on that branch, start a new one if an experiment diverges too much, or merge into your main branch to make it your primary model.

Why use DVC?

Even without DVC, you can change your code to read parameters from files and write metrics to other files, and track changes with Git. However, DVC adds a few ML-specific capabilities on top of Git that can simplify comparing and reproducing the experiments.

Large Data Versioning

Large data and models aren’t easily tracked in Git, but DVC can track them in your own storage while staying Git-compatible. When initialized, DVC starts tracking the `models` folder: Git ignores it, but DVC still stores and versions it, so you can back up versions anywhere and check them out alongside your experiment code.
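The general idea behind this kind of tracking can be pictured with a small content-addressed store. Large files live in a separate cache keyed by their hash, and only the short hash needs to be recorded next to the code. The sketch below is a simplified illustration, not DVC’s actual implementation. The function names, the flat cache layout and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def store(path, cache_dir):
    """Copy a file into the cache under its content hash; return the hash."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    target = Path(cache_dir) / digest
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return digest


def restore(digest, cache_dir, path):
    """Recreate a file in the workspace from a previously recorded hash."""
    shutil.copy2(Path(cache_dir) / digest, path)
```

Because only the hash has to be committed, switching between experiment versions in Git stays cheap, and the matching model file can be restored from the cache afterwards.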

Single-command reproducibility

Codifying the entire experiment pipeline is a good first step towards reproducibility, but it still leaves it to the user to execute that pipeline. With DVC you can reproduce the entire experiment with a single command. It also checks for cached inputs and outputs and skips recomputing anything that has been generated before, which can save a lot of time.

$ dvc exp run
'data/images.dvc' didn't change, skipping
Stage 'default' didn't change, skipping

Reproduced experiment(s): exp-44136
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

    dvc exp branch <exp> <branch>
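The skipping behaviour in that output can be pictured as a hash comparison: a stage is rerun only when one of its dependencies has changed since the last run. The sketch below is a simplified illustration of that idea and not DVC’s implementation. It handles file dependencies only, and `run_if_changed`, the JSON state file and the use of SHA-256 are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_if_changed(deps, command, state_file):
    """Run `command` only if a dependency's hash differs from the last run.

    Returns True if the command ran, False if it was skipped.
    """
    state_path = Path(state_file)
    current = {str(dep): file_hash(dep) for dep in deps}
    previous = json.loads(state_path.read_text()) if state_path.exists() else None

    if current == previous:
        return False  # nothing changed, so the earlier outputs are still valid

    command()
    # Record the hashes only after the command succeeds.
    state_path.write_text(json.dumps(current, indent=2))
    return True
```

Calling `run_if_changed` twice in a row with unchanged dependencies runs the command once and skips it the second time.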

Better branch organization

While Git branching is a flexible way to organize and manage experiments, there are often too many experiments to fit any Git branching workflow. DVC tracks experiments so you don’t need to create commits or branches for each one:

$ dvc exp show
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃Experiment               ┃ Created      ┃    loss ┃    acc ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│workspace                │ -            │ 0.25183 │ 0.9137 │
│mybranch                 │ Oct 23, 2021 │       - │      - │
│├──9a4ff1c [exp-333c9]   │ 10:40 AM     │ 0.25183 │ 0.9137 │
│├──138e6ea [exp-55e90]   │ 10:28 AM     │ 0.25784 │ 0.9084 │
│├──51b0324 [exp-2b728]   │ 10:17 AM     │ 0.25829 │ 0.9058 │
└─────────────────────────┴──────────────┴─────────┴────────┘

Once you decide which of these experiments are worth sharing with the team, they can be converted into Git branches:

$ dvc exp branch exp-333c9 conv-units-64
Git branch 'conv-units-64' has been created from experiment 'exp-333c9'.
To switch to the new branch run:

    git checkout conv-units-64

This way you avoid cluttering your repo with branches and can focus on comparing only the promising experiments.

Conclusion

To summarize, experiment versioning allows you to codify your experiments in such a way that your experiment logs are always connected to the exact data, code, and pipeline that went into the experiment. You have control over which experiments end up shared with your colleagues for comparison, and this can prevent clutter.

Finally, reproducing a versioned experiment becomes as easy as running a single command, and it can even take less time than the original run if some of the pipeline steps have cached outputs that are still valid.