How Kooky Kaggle Contests Advance Data Science

Kaggle.com lets organizations and individuals post problems to its massive community of scientists. The prospect of crowdsourcing some of the world’s thorniest data conundrums is evidently a popular one: more than 82,000 people from 100 countries have signed on, and many of them have submitted at least one entry to the more than 250 different contests hosted since the Website opened its virtual doors in 2010.

Each Kaggle problem set is run as a competition, complete with prizes, deadlines, and rules aplenty. (Kaggle takes a percentage cut off the top in administrative fees.) The Website also has a blue-chip roster of customers who conduct privately sponsored contests. “This is because some of their data is too sensitive to be public,” CEO Anthony Goldbloom told me. Examples include Microsoft, which used Kaggle to improve gesture recognition on the Xbox; NASA, for better dark matter imaging tools; and GE, for more accurate airline arrival time estimation.

“There are some pretty amazing people who compete,” says Goldbloom. “And some enter 80 or more times per contest, devoting a lot of their time.” Even Goldbloom has tried his hand at a few, although he isn’t highly ranked.

Kaggle’s success has led to imitators, including India-based CrowdAnalytix.com, Innocentive.com (for the life sciences), and TunedIT.org (mainly for education and research projects). But Kaggle has been around the longest and has the largest talent pool to draw on.

Here are five contests, somewhat off the beaten path, that illustrate the depth and breadth of the Website’s reach and influence.

This contest awarded just $500, but almost 100 teams entered; it isn’t always about the dough. One of the top entries was from Jason Karpeles, who ranks in the top ten of all Kagglers and has participated in 36 different contests. Karpeles isn’t your typical data scientist: he has economics degrees and an MBA from Duke and works as a marketing forecaster in Texas.

“I don’t know if it is impressive or pathetic the number of contests that I have entered,” he said. He signed up early in Kaggle’s history and admits that he is “obsessed with the site.” His total dollar winnings are minuscule, especially in comparison to his total time spent on various contests.

Why enter so many contests? Mainly for his own self-education. “Being in a Kaggle contest is a lot like getting a post-graduate education,” he said. “It is also a good way to sharpen my skills, expand my knowledge and see how to manipulate particular data sets that I don’t often come into contact with.”

He added: “I was afraid that I might fall behind in the marketplace because data science is moving so quickly.” Kaggle is a great way to get his name out there, he feels, and it allows him to see how well he performs across various industries. “Get out and start doing something, just to try it. Don’t be afraid of failure, or your ranking. Experience is the best teacher.”

During World War II, the science of operations research got its start in efforts to track German submarine movements and keep Allied ships from getting torpedoed. So it is somewhat fitting that a current Kaggle contest, which ends in April, is doing something similar. Only instead of tracking German subs, contestants are analyzing audio recordings of whales to keep the animals from colliding with transatlantic ships.

Cornell University’s Bioacoustic Research Program, which boasts experience in identifying endangered whale species, has deployed a 24/7 buoy network to keep ships from colliding with a particular species of whale. The contest will pay out $10,000 for the best detection algorithm; so far, there are 137 teams hard at work on this contest, including two graduate students who have inevitably called their team Free Willyzx and another team named Herman Melville.

This one paid out $3,000 to a Slovakian and was a bit of fun. “Santa needs help choosing the route he takes when delivering presents around the globe. Every year, Santa has to visit every boy and girl on his list. It’s a tough challenge, and Santa admits he scored a B- on his combinatorial optimization final.” The winner had to find two shortest-distance paths through a route of chimneys.

How many of us have been insulted by a comment posted online? What, are you stupid or something? Exactly. This contest tried to predict when something would be considered insulting to someone else. Or as the contest introduction puts it, “create a generalizable single-class classifier which could operate in a near real-time mode, scrubbing the filth of the Internet away in one pass.”

It wasn’t all that altruistic: security vendor Impermium sponsored the contest, looking to “identify new ways to defend against malicious language and social spam online, and help clean up the web by scrubbing away unwanted obscenities from user-generated content.” Not surprisingly, the competition found out that people tend to be most abusive between 9:00 P.M. and 10:00 P.M.

This was big money: the contest offered a prize of $10,000 and drew 50 entries. The winner was Vivek Sharma, who has entered numerous Kaggle contests. He and other top finishers were offered a job interview at the company along with the prize purse. While they ultimately did not hire anyone, “the Kaggle competition was useful and we were able to examine many interesting algorithms,” Impermium’s PR rep wrote in an email. Their engineering team had a fresh perspective on the problem and “helped ensure against tunnel vision.”

This competition, held last year, was sponsored by the William and Flora Hewlett Foundation, with the top prize of $60,000 going to a team named “SirGuessalot” that could match the average of two human teachers grading high school essays. The team submitted more than 140 different attempts before winning the top prize. “It almost sounds like science fiction,” said Goldbloom.