A Brief History of Training Data

Robert Munro, Jul 3

An old friend, Aman Naimat, recently hosted a conference that brought together people with a common profile: leaders in technology who have shipped useful Machine Learning models for real-world use cases.

It is a surprisingly small group of people who fit that profile today.

For all the attention being paid to Machine Learning, there are very few people in leadership roles at technology companies who both have the technical skills to build Machine Learning models and have built Machine Learning systems that are central to the success of a real-world application.

The people at the conference came from very different industries: self-driving cars, marketing, finance, hardware, accommodation, media, security, etc.

Despite this variety of industries, there were a number of threads that everyone had in common, and one was training data: everyone valued data as highly as algorithms for their success.

I led a talk and breakout session on training data, which inspired this article.

A Quarter-Century of Training Data

There are interesting cycles in the history of training data.

In the 1990s, before Machine Learning dominated AI, programmers hard-coded rules to improve the performance of their systems, based on the behavior of their models.

When Machine Learning came to dominate almost 20 years later, we returned to similar Human-in-the-Loop systems, but with non-expert human annotators creating the training data based on model behavior.

In the middle of this 20-year span, between the 1990s and 2010s, the promise of Machine Learning was limited by the prohibitive expense of getting labeled training data.

This resulted in an academic focus on testing different algorithms on a relatively small number of canonical datasets.

That academic focus hasn’t changed much today.

The rise of pay-as-you-go training data options in the late 2000s, starting with Amazon’s Mechanical Turk (MTurk), changed the way that people viewed training data creation.

There was a small movement in academia at this time, with Active Learning rising as a strategy for selecting the right data for human labeling.

But the largest change in the late 2000s was in industry, not academia.

Training data and algorithms have been equally important for everyone building real-world Machine Learning models since this time.

There was another repeat cycle in the early-to-mid 2010s.

The data-hungry neural models of that time once again required an amount of training data that was prohibitively expensive for most use cases.

This led to slow initial industry adoption of neural methods, with the notable exception of a handful of Computer Vision use cases where the change in accuracy was significant enough to create new use cases.

Today, advances in adaptive neural models and transfer learning have meant that smaller datasets can be enough to get state-of-the-art performance in focused applications of Machine Learning.

Creating Training Data

We strategize about training data in similar ways across many different use cases. The training data questions today are:

How much data do we need?

Who are the right people to annotate the data and how do we evaluate annotation quality?

Can we bypass paying for training data with synthetic data or pre-trained models?

On the algorithm side, how can we adapt models quickly to new labeled data and how can we interpret the uncertainty of our models to help sample the right unlabeled data for human review?

Just as the methods for algorithms have evolved dramatically over the last 20 years, so have the methods for creating good training data.

It was good to share a lot of these at the conference, as they are not as widely discussed as algorithms in Machine Learning circles.
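To make one of those questions concrete, here is a minimal sketch of how model uncertainty could be used to pick unlabeled items for human review. It uses least-confidence sampling, which is one common Active Learning strategy and not necessarily the one any team at the conference used. The function names, item ids, and probabilities are all hypothetical and only there for illustration.

```python
def least_confidence(probs):
    """Score uncertainty from a predicted label distribution.

    Returns 0.0 when the model puts all probability on one label and
    1.0 when the distribution is uniform. Assumes at least two labels.
    """
    n = len(probs)
    return (1.0 - max(probs)) * n / (n - 1)


def sample_for_review(predictions, k):
    """Return the ids of the k items the model is least confident about."""
    ranked = sorted(
        predictions,
        key=lambda item_id: least_confidence(predictions[item_id]),
        reverse=True,  # most uncertain first
    )
    return ranked[:k]


# Hypothetical model outputs: item id -> probability for each of three labels.
predictions = {
    "doc_a": [0.98, 0.01, 0.01],
    "doc_b": [0.40, 0.35, 0.25],
    "doc_c": [0.70, 0.20, 0.10],
    "doc_d": [0.34, 0.33, 0.33],
}

# The two items to send to human annotators next.
selected = sample_for_review(predictions, 2)
print(selected)
```

In a Human-in-the-Loop system, the selected items would be labeled, added to the training data, and the model retrained before sampling again.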

How Does AI Diversity Apply to Training Data?

The biggest open question today is: how does AI diversity apply to training data?

In Discriminating Systems: Gender, Race, and Power in AI, Myers West, Whittaker, and Crawford talk about the importance of diverse representation among the people creating AI.

Their focus is on the algorithms and the people creating Machine Learning models.

Extending their argument, there can be even greater disparity on the training data side of Machine Learning.

Algorithm-focused technologies disproportionately make life easier for wealthy people.

If you are a programmer, then chances are that a model that you create will contribute to your income for as long as that model is deployed.

However, training data-focused technologies disproportionately squeeze value out of less wealthy people.

If you create training data for that same model, you will most likely get paid once for your time, while the person who built the algorithm on your data will continue to benefit.

When the people creating the algorithms in the 1990s were also the people creating the data (or rules), we had to value both contributions equally.

I hope that this is a cycle too, so that we can return to a fairer system for compensating people for the value they create in training data.