Repeatable sampling of data sets in BigQuery for machine learning

Doing machine learning on distributed data sets is methodologically similar to working with data that fits in memory: train your algorithm on one subset of the data, validate it on another subset, and finally test it on yet another. In this post, we’ll discuss how to pull data from BigQuery (the no-ops data warehouse that is part of Google Cloud Platform) into machine-learning-ready data sets. We’ll use the Airline On-time Performance data, a 70-million-row data set from the U.S. Bureau of Transportation Statistics that is available to all users in BigQuery as the airline_ontime_data.flights data set.

Naively, the easiest way to sample data in BigQuery is to use the built-in random function in a WHERE clause:
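Such a query might look like the following sketch, written in BigQuery’s legacy SQL dialect (the selected field names are illustrative; the WHERE clause is the part that matters):

```sql
#legacySQL
SELECT
  date,
  airline,
  departure_delay,
  arrival_delay
FROM
  [airline_ontime_data.flights]
WHERE
  RAND() < 0.8
```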

The RAND() function returns a value between 0 and 1, so approximately 80% of the rows in the data set will be selected by this query. However, there are several problems with using this sampling method for machine learning:

You want to create three data sets: training, validation, and testing. While the query above gets you 80% of the data, it is not nearly as easy to get the 20% that was not selected, let alone split that 20% into two parts.

The RAND() function returns a different value each time it is run, so if you run the query again, you will get a different 80% of the rows. It is not repeatable. This wreaks havoc if you are experimenting with different machine learning models in order to choose the best one.

The order of rows in a BigQuery result set is not guaranteed; it is essentially the order in which different workers return their results. So even if you could set a random seed to make RAND() repeatable, you still would not get repeatable results. You would have to add an ORDER BY clause to explicitly sort the data (on an id field) before applying RAND(), and that is not always going to be possible.

A repeatable way to split your data set

For machine learning, you want repeatable sampling of the data you have in BigQuery. One way to achieve this is to use the last few digits of a hash function applied to the field that you are using to split your data.

Let’s say that you are building a machine learning algorithm to predict arrival delays. You might want to split up your data by date and get approximately 80% of the days in the data set:
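A sketch of such a query in legacy SQL, where the HASH function is available (in standard SQL, FARM_FINGERPRINT plays the same role), might look like this:

```sql
#legacySQL
SELECT
  date,
  airline,
  departure_delay,
  arrival_delay
FROM
  [airline_ontime_data.flights]
WHERE
  ABS(HASH(date)) % 10 < 8
```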

This is now repeatable—because the HASH function returns the same value any time it is invoked on a specific date, you can be sure you will get the same 80% of data each time. If you want to split your data by arrival_airport (so that 80% of airports are in the training data set), compute the HASH on arrival_airport instead of date.

It is also pretty clear what you have to do to get the validation data: change the < 8 in the query above to == 8, and for the testing data, change it to == 9. This way, you get 10% of the samples in validation and 10% in testing.
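Concretely, the validation split might be sketched as follows (same illustrative fields as before):

```sql
#legacySQL
SELECT
  date,
  airline,
  departure_delay,
  arrival_delay
FROM
  [airline_ontime_data.flights]
WHERE
  ABS(HASH(date)) % 10 == 8   -- validation set; use == 9 for the testing set
```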

Repeatable sampling

This is good if you want 80% of the entire data set for training, but what if you want a smaller data set than what you have in BigQuery to play around with? The flights data set has 70 million rows, and perhaps what you want is a smaller sample of about a million flights. How would you pick 1 in 70 flights, and then 80% of those as training?

You cannot simply pick 1 in 70 rows with one modulo operation and then pick 8 in 10 of those with another. Can you figure out why? If you keep only the hash values that are divisible by 70, they are all also divisible by 10, so a second modulo-10 condition would select every one of them. That second modulo operation is useless as a splitter.
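One way around this, sketched below following the same hashing idea, is to make the second modulo operate at a coarser granularity than the first. Hash values divisible by 70 fall into exactly ten residue classes modulo 700 (0, 70, 140, …, 630), so keeping the eight classes below 560 yields approximately 80% of the 1-in-70 sample:

```sql
#legacySQL
SELECT
  date,
  airline,
  departure_delay,
  arrival_delay
FROM
  [airline_ontime_data.flights]
WHERE
  ABS(HASH(date)) % 70 == 0        -- keep roughly 1 in 70 days
  AND ABS(HASH(date)) % 700 < 560  -- of those, keep roughly 80% for training
```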

Note: In the code above, your one million flights come from only 1/70th of the days in the data set. This may be precisely what you want: for example, you may be modeling the full spectrum of flights on a particular day when experimenting with the smaller data set. However, if what you want is 1/70th of the flights on any particular day, use RAND() (as shown in the first code sample) and save the result as a new BigQuery table, for repeatability. From this smaller table, you can sample 80% of the dates using HASH(). Because this new table is only a million rows and is used only for experimentation, the duplication of data may be acceptable.
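That RAND()-based variant might be sketched as follows; the destination table name here is hypothetical, and you would set it via the “Destination Table” option when running the query:

```sql
#legacySQL
-- Save this result as a new table (e.g., mydataset.flights_sampled)
-- so the sample is frozen and therefore repeatable.
SELECT
  date,
  airline,
  departure_delay,
  arrival_delay
FROM
  [airline_ontime_data.flights]
WHERE
  RAND() < 1/70
-- Then split the saved table repeatably, e.g.:
--   SELECT ... FROM [mydataset.flights_sampled]
--   WHERE ABS(HASH(date)) % 10 < 8
```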

Valliappa (Lak) Lakshmanan is currently a Tech Lead for Data and Machine Learning Professional Services for Google Cloud. His mission is to democratize machine learning so that it can be done by anyone anywhere using Google's amazing infrastructure, without deep knowledge of statistics or programming or ownership of a lot of hardware. Before Google, he led a team of data scientists at the Climate Corporation and was a Research Scientist at NOAA National Severe Storms Laboratory, working on machine learning applications for severe weather diagn...