It’s not just about the data – it’s what you do with it

A week ago I co-presented a webinar with CloudFactory titled “How to Accelerate Data Labeling and DL Training”. In this blog post I’ll share the core of the presentation and the challenges it covered. If you’d like to hear more about the different solutions – make sure to catch the recording.

Quality data isn’t enough – you need quality labels

Every machine learning solution starts with finding data you can leverage to solve a problem. Whether you found it on the internet or assembled your own – it’s only the start of the journey for that data. How do you go about labeling it and making sure it’s labeled correctly? My co-hosts for the webinar were CloudFactory, experts in managing data labeling workforces, and this is how they present the spectrum of solutions:

Hiring full or part time labelers – completely in-house. This might be a good solution if you have a constant and large stream of data that needs labeling.

Using a specialized labeling workforce – this is what CloudFactory offers. This is an especially tricky segment to get right. How do you ensure that the labeling is done properly? How do you make sure the right people are analyzing the right data? Some companies have extremely specific needs, such as medical experts for labeling, where not even CloudFactory has such a workforce at hand. They do, however, build tools to help manage and scale teams of labeling experts.

Crowd-sourcing – using solutions like Amazon Mechanical Turk, you can get cheap labels, but they are often of poor quality because the labelers don’t have adequate training and context.

With the data starts the research

The moment you get enough labeled data – you’re going to try and build a model to match it. Many teams build their AI research workflows differently. As part of my role at MissingLink – I’ve surveyed over 100 AI teams, learning what they’re trying to achieve and the challenges they face. A result of that survey is this chart depicting what a neural network developer ideally wants to work on.

Even though that’s what they want to work on, many data scientists find themselves working on entirely different things: a ton of AI infrastructure tasks that these folks don’t really want to own but have to. Here are a few questions that pop up once you get your data ducks in a row and start building models:

Experiments

Running experiments. The first step after you run your Python script is to check the loss and the accuracy, or whatever metric you use. You’re going to want to run the model many times with many different hyperparameters and tweaks to find a meaningful improvement to your metrics. But how do you remember which run had the best metrics?
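To make the question concrete, here is a minimal sketch of one way to keep track of runs: append each run’s hyperparameters and final metrics to a JSON Lines file, then query it for the best run. This is not how any particular tool does it. The names `log_run`, `best_run` and `runs.jsonl` are made up for illustration, and so are the numbers.

```python
import json
import tempfile
from pathlib import Path


def log_run(log_path, hyperparams, metrics):
    """Append one run (hyperparameters plus final metrics) as a JSON line."""
    record = {"hyperparams": hyperparams, "metrics": metrics}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def best_run(log_path, metric="accuracy", higher_is_better=True):
    """Return the logged run with the best value for the given metric."""
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["metrics"][metric])


# Hypothetical runs with made-up numbers, purely for illustration.
log_file = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_file, {"lr": 0.1, "batch_size": 32}, {"loss": 0.92, "accuracy": 0.71})
log_run(log_file, {"lr": 0.01, "batch_size": 64}, {"loss": 0.41, "accuracy": 0.88})
log_run(log_file, {"lr": 0.001, "batch_size": 64}, {"loss": 0.55, "accuracy": 0.83})

winner = best_run(log_file, metric="accuracy")
print(winner["hyperparams"])
```

A flat file like this works for one person on one machine. It stops being enough once several people run experiments on several machines, which is where the rest of the challenges come in.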

Data

Imagine having one terabyte of data – suddenly it becomes a challenge to collaborate on it, for a variety of reasons.

Data grows with new data coming back from production, and it changes with new labels from your data labeling workforces, like CloudFactory.

Most of the time you only want a part of the data. Even if it’s local on your drive – it takes hours to process a terabyte.

On a team – you want to be able to share and collaborate on it. Does each person have their own, unique copy of the 1TB of data?

To work effectively – you need to be able to reproduce your results.
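One small piece of the reproducibility puzzle is knowing exactly which data a result was produced from. As a rough sketch, assuming the data lives as files under a single directory, you could fingerprint the dataset (or a subset of it) by hashing file paths and contents, and store that fingerprint next to your experiment results. The function names and file names below are hypothetical, and hashing a full terabyte this way would itself be slow, so treat it as an illustration of the idea rather than a production approach.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of one file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dataset_fingerprint(root, pattern="*"):
    """Hash the sorted (relative path, file digest) pairs under root."""
    root = Path(root)
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob(pattern) if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(file_digest(path).encode("ascii"))
    return h.hexdigest()


# Tiny made-up dataset: two label files in a temporary directory.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "cat_001.txt").write_text("label: cat")
(data_dir / "dog_001.txt").write_text("label: dog")

before = dataset_fingerprint(data_dir)
cats_before = dataset_fingerprint(data_dir, "cat_*")

# A relabel arrives from the labeling workforce.
(data_dir / "dog_001.txt").write_text("label: wolf")

after = dataset_fingerprint(data_dir)
cats_after = dataset_fingerprint(data_dir, "cat_*")

print(before != after)            # the full dataset changed
print(cats_before == cats_after)  # the cat subset did not
```

If two teammates get the same fingerprint, they trained on the same files. If a relabel or a new batch from production changes it, you know the earlier results are no longer directly comparable.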

Machines

An entirely different set of challenges arises from needing to spin machines up and down: picking the right kind of machines, and moving data, code, metrics and hyperparameters in and out of them. What we really want is one easy button that just does everything.

Go check out the webinar

If you think this blog post is a bit of a tease – it totally is. Make sure to check out the webinar to get the full presentation and join in on the discussion. If this interests you – make sure to drop me a line at [email protected].