Evaluating Experiments AFTER Training, with MissingLink

Improving a machine learning model is a lot like debugging software. When debugging, you often find yourself wishing, after the process completes, that you could have inspected one more variable that wasn’t logged. Without a time machine, you can’t do anything but start over because the process has already shut down, but all you lose is time. In machine learning, where training can take hours and cost money, this challenge escalates quickly. In this blog post, we describe ways to ease the pain of evaluating an experiment and show how you can analyze your training session, even after it has ended, with MissingLink’s Experiment Management.

Life After Process Death on the MissingLink Dashboard

Jupyter notebooks are a good first step towards debugging live Python objects in a training script. But even with notebooks, your kernel will eventually die, it doesn’t remember results from previous cell runs, and it’s hard to collaborate with multiple team members. The MissingLink service supports working with and without notebooks, and we solve the kernel death problem by sending your metrics to a web dashboard. We built the dashboard to let your team see every experiment, whether it’s live right now or long gone. After you integrate the Python SDK with a few lines of code, the dashboard is populated with every run of your script. The cool thing about having a service that auto-tracks hyperparameters, code, data and metrics is that everyone has access to every run’s results. When multiple people are working on improving one model, this is a critical capability. The number of changes, ideas and tests can be large, and they don’t always work well together. Visibility and transparency are the most basic requirements for collaboration and for bringing everyone’s changes together to move the team forward.
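To make the idea concrete, here is a minimal sketch of what "metrics that outlive the process" means. It is not the MissingLink SDK: `log_metric`, `read_metrics` and the JSON Lines file are hypothetical stand-ins for a service that receives your metrics, chosen only because they run anywhere.

```python
import json
import os
import tempfile
import time


def log_metric(path, experiment_id, step, name, value):
    """Append one metric reading to a JSON Lines file (hypothetical stand-in for a dashboard)."""
    record = {
        "experiment": experiment_id,
        "step": step,
        "metric": name,
        "value": value,
        "time": time.time(),
    }
    # Opening in append mode keeps earlier readings, including those from past runs.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_metrics(path):
    """Load every reading back, for example from a fresh process or notebook kernel."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "metrics.jsonl")
    for step, acc in enumerate([0.61, 0.74, 0.79]):
        log_metric(log_path, "exp-001", step, "accuracy", acc)
    print(len(read_metrics(log_path)), "readings are still available")
```

A local file only solves the process-death half of the problem. Sharing results across a team is what a hosted dashboard adds on top.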

Analyzing a Single Experiment

This is what a single experiment looks like. From the main page you can toggle between metrics and see the hyperparameters, code, data and command line arguments. It’s all there, forever. Storing the metrics pays off in the long term, when you deploy a model to production and have to circle back to understand how it was trained. In the short term, it helps you analyze why an idea worked or didn’t.

Analyzing Multiple Experiments

One experiment running by itself is a rare sight. Any team with even the most basic AI infrastructure is running a handful of experiments at a time, if not tens or hundreds. Consolidating all the results can be a tremendous task without the right tools. MissingLink’s dashboard lets you select multiple experiments and compare them. The comparison links to your private git repositories for code diffs and lets you compare any metric and every hyperparameter, per batch or per epoch.
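As a rough illustration of what a hyperparameter comparison boils down to, the sketch below keeps only the settings that differ between runs. The function name `diff_hyperparameters` and the example runs are made up for this post; the dashboard does this for you and this is not how it is implemented.

```python
def diff_hyperparameters(experiments):
    """Return only the hyperparameters whose values differ across experiments.

    `experiments` maps an experiment name to its hyperparameter dict.
    A key that an experiment does not define is reported as None.
    """
    all_keys = sorted({key for params in experiments.values() for key in params})
    differing = {}
    for key in all_keys:
        values = {name: params.get(key) for name, params in experiments.items()}
        # Compare by repr so unhashable values such as lists can be compared too.
        if len({repr(v) for v in values.values()}) > 1:
            differing[key] = values
    return differing


if __name__ == "__main__":
    runs = {
        "exp-a": {"lr": 0.01, "batch_size": 32, "optimizer": "adam"},
        "exp-b": {"lr": 0.001, "batch_size": 32, "optimizer": "adam"},
        "exp-c": {"lr": 0.01, "batch_size": 64, "optimizer": "adam", "dropout": 0.5},
    }
    for key, values in diff_hyperparameters(runs).items():
        print(key, values)
```

Hiding the settings that every run shares (here, `optimizer`) is what makes a comparison across many experiments readable.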

Discovering the One

After a day of launching 100 experiments with different code and parameter changes, it gets pretty tricky to find the needle in the haystack: the good results. Assuming you’re optimizing for accuracy and have set up auto-tracking, this is what that looks like:

But if you’re paying attention, you might be wondering, which accuracy is this? Is it the first one measured? The last one? An average? A median? We just shipped a new feature – the ability to decide just that. When clicking a sort button, you can choose which metric value is the one shown and sorted by. This feature also works on the main experiment charts page!
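To see why the choice matters, here is a small sketch that reduces each experiment’s accuracy history to a single number and sorts by it. The names `SUMMARIES`, `summarize` and `rank_experiments`, the `min` and `max` options, and the sample histories are all illustrative assumptions, not part of MissingLink.

```python
import statistics

# Ways to reduce a metric's full history to the one value shown in a table.
SUMMARIES = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "min": min,
    "max": max,
    "mean": statistics.mean,
    "median": statistics.median,
}


def summarize(values, how="last"):
    """Collapse a non-empty list of metric readings into one number."""
    return SUMMARIES[how](values)


def rank_experiments(histories, how="last", descending=True):
    """Sort experiments by the chosen summary; experiments with no readings are skipped."""
    scored = [(name, summarize(values, how)) for name, values in histories.items() if values]
    return sorted(scored, key=lambda pair: pair[1], reverse=descending)


if __name__ == "__main__":
    accuracy = {
        "exp-a": [0.50, 0.80, 0.70],
        "exp-b": [0.60, 0.65, 0.75],
        "exp-c": [0.90, 0.60, 0.72],
    }
    for how in ("first", "last", "max", "median"):
        print(how, [name for name, _ in rank_experiments(accuracy, how)])
```

With these made-up histories, sorting by the last value puts exp-b on top, while sorting by the maximum puts exp-c on top. The same table can crown a different winner depending on which value you sort by.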

In my case, the best last value for accuracy came from this impressive specimen of an experiment:

Recalling the One

In the previous section we found the best model we could. We might discuss it as a team, choose to commit to it and improve it, or ship it to production. The communication around this experiment can be stored in MissingLink, right next to the metrics, weights, data, and everything else about the experiment. There are three main locations for it: the display name, which is always visible from the experiments table; tags, which group related experiments; and the note, which appears when you hover over it with your mouse. You can always modify any of these.

After you’ve labeled, tagged and noted a special experiment for a special model, you can find it through a powerful and intuitive search tool:

If we want to make sure we don’t forget an experiment, we can label it in a variety of ways. A highly visible display name is easy to notice while you’re scrolling, a more in-depth note helps when zooming into the experiment, and a tag works when a group of experiments has something in common. The search tool helps us find what we’re looking for in any one of these fields, so we never misplace a model we care about.
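Searching across all three fields can be sketched in a few lines. The `ExperimentLabel` class, the `search` function and the sample experiments below are hypothetical and only illustrate the idea of one query matching a display name, a tag or a note. The real search tool may behave differently.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentLabel:
    """The three free-text fields described above (hypothetical structure)."""
    display_name: str = ""
    tags: list = field(default_factory=list)
    note: str = ""


def search(experiments, query):
    """Return ids of experiments whose name, tags or note contain the query, ignoring case."""
    needle = query.casefold()
    hits = []
    for exp_id, label in experiments.items():
        haystack = [label.display_name, label.note, *label.tags]
        if any(needle in text.casefold() for text in haystack):
            hits.append(exp_id)
    return hits


if __name__ == "__main__":
    experiments = {
        "exp-001": ExperimentLabel("Baseline CNN", ["baseline"], "First run, default settings"),
        "exp-042": ExperimentLabel("The One", ["candidate", "production"], "Best last accuracy so far"),
        "exp-077": ExperimentLabel("Bigger batch", ["candidate"], "Slower, no accuracy gain"),
    }
    print(search(experiments, "candidate"))
    print(search(experiments, "accuracy"))
```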

Where Did the Weights Go!?

A cool feature when you launch experiments through resource management is that every file written to /output/ is saved and associated with that experiment as an “artifact”. You can find them in the “artifacts” tab. The usual use case is outputs of the experiment, such as model checkpoints, outlier data points, a TensorBoard file, etc.

When you go back to a past experiment, you can get all the artifacts associated with it. You’ll never again scratch your head wondering where that model was saved.
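On the training-script side, this only requires writing your files under the output directory. The sketch below shows one way to do that. The helper names, the JSON checkpoint format and the temporary-directory fallback for local runs are assumptions made for this example, and only the /output path comes from the description above.

```python
import json
import os
import tempfile


def artifact_path(filename, output_dir="/output"):
    """Return a path under the output directory, creating parent folders as needed."""
    path = os.path.join(output_dir, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path


def save_checkpoint(weights, epoch, output_dir="/output"):
    """Write a toy checkpoint as JSON; a real script would use its framework's own save call."""
    path = artifact_path(os.path.join("checkpoints", f"epoch-{epoch:03d}.json"), output_dir)
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"epoch": epoch, "weights": weights}, f)
    return path


if __name__ == "__main__":
    # Assumption: /output may not exist outside a managed run, so fall back to a temp dir.
    out = "/output" if os.path.isdir("/output") else tempfile.mkdtemp()
    print(save_checkpoint([0.1, -0.3, 0.7], epoch=3, output_dir=out))
```

Passing the output directory as a parameter keeps the same script usable on a laptop and in a managed run.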

The goal driving the design of this dashboard is to keep what you need where you need it: everything about an experiment is always in one place, and it gets stored there seamlessly. MissingLink’s ideal is to let you keep your workflow while making it easier, more reproducible, and something you can work on as a team.

We’re Here to Help

Processes die, but your ability to analyze what happened doesn’t have to. MissingLink’s experiment management dashboard helps teams collaborate on machine learning research and keep their sanity. The best part is that all of this can be easily integrated into your own projects with just a few lines of code. We’re happy to schedule a demo to show you how. Drop us a note.