The Best Bracket Big Data Can Build

March Madness isn’t over, but one thing is certain: no one is going to win “Buffett’s Billion.”

Before even half of the NCAA men’s college basketball tournament games had finished, every bracket entered into Yahoo’s contest to win a billion dollars had at least one red strikethrough. And while some may blame upsets like No. 14 seed Mercer knocking out No. 3 seed Duke, or No. 12 seed Harvard beating No. 5 seed Cincinnati, the odds were never in anyone’s favor but Warren Buffett’s, who offered a cool billion to anyone picking a perfect bracket.

Despite the long odds, the offer generated more interest than usual in the creation of the perfect bracket. And while luck may play as much of a role as hard math in determining winners, it hasn’t stopped statisticians and mathematicians from trying their hands at creating the ultimate algorithm.

Credit: Rowan McNaught, Kaggle.com

One website, Kaggle, is even offering a prize, albeit significantly less than $1 billion, for the best-performing prediction model. The good news is that someone will in fact win the $15,000 that the competition’s sponsor, Intel, has offered. But Kaggle’s is not your average pool. Kaggle is a website where data enthusiasts and experts pit their skills against each other for the chance to win prizes from companies seeking to solve problems through crowdsourcing. And the prize goes not to the best bracket, but to the model that performs best throughout the tournament. Competitors use their models to assign a likelihood score to each of the possible matchups, so bracket-busting upsets don’t completely knock you out of contention.
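To see why scoring likelihoods softens the blow of an upset, here is a minimal sketch in Python. It assumes the models are scored with log loss, a common metric for probabilistic predictions; the article does not name the metric Kaggle actually uses, and the function name, probabilities, and game results below are all hypothetical.

```python
import math

def log_loss(predictions, outcomes, eps=1e-15):
    """Average penalty for a set of win probabilities; lower is better.

    predictions: predicted probability that the favorite wins each game
    outcomes: 1 if the favorite won that game, 0 if it was upset
    """
    total = 0.0
    for p, favorite_won in zip(predictions, outcomes):
        # Clamp so a prediction of exactly 0 or 1 cannot produce log(0).
        p = min(max(p, eps), 1 - eps)
        total += math.log(p) if favorite_won else math.log(1 - p)
    return -total / len(outcomes)

# Hypothetical four-game slate: the favorite is upset in the second game.
outcomes = [1, 0, 1, 1]

# One model is nearly certain of every favorite; the other hedges.
confident = [0.99, 0.99, 0.99, 0.99]
hedged = [0.80, 0.70, 0.80, 0.80]

print("confident model:", round(log_loss(confident, outcomes), 3))
print("hedged model:   ", round(log_loss(hedged, outcomes), 3))
```

Under this assumed metric, the overconfident model pays heavily for the single upset, while the hedged model loses only a little on that game and stays competitive. Neither is eliminated the way a conventional bracket is after one wrong pick.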

Will Cukierski, a data scientist at Kaggle, says that unlike the basketball challenge, most of the competitions they host are real problems that big companies like Amazon and Facebook want to solve. In one, insurance company Allstate wanted to predict likely insurance claim payments based on characteristics of cars involved in accidents. Prizes are typically around $25,000, Cukierski says, though the largest was $3 million. Over the course of a competition, competitors can see where their models stand on the leaderboard, which displays their score and rank.

While people from a variety of fields compete on Kaggle, Cukierski says that there’s a common thread: “the ability to manipulate data and use predictive modeling.” There are a lot of students on the site who take advantage of the chance to collaborate and learn, plus “the physicists, the econometrics people, statisticians, actuaries, business people.” Despite the prizes, the competitions are more of a hobby than a source of income for most people. “It’s kind of the cold reality of crowdsourcing,” Cukierski says. “It’s almost impossible to pay people on an hourly basis…. If you do the actual math and compute the expected values and all that stuff, you’ll find it’s not worth it just for the prizes alone.”

Most of the problems on Kaggle require so-called “big data” to solve. Such an approach is useful for problems that are “data-hungry,” Cukierski says, meaning that they will “improve as you feed in more and more data.” One example is a movie recommendation engine, such as the one that Netflix uses. In fact, Netflix held a competition similar to those on Kaggle to improve its recommendations, and awarded the $1 million prize in 2009. A problem like that, Cukierski says, is “very nuanced,” and requires a model that can consider a vast array of parameters.

However, Cukierski thinks that the use of big data has become too trendy. “The whole big data idea is really within a big hype cycle,” mainly driven by a particular software framework for dealing with information, called Hadoop. “It’s not that Hadoop isn’t useful,” Cukierski says, but when companies look to it to solve small problems, “the people who are data scientists and actually statistically literate are kind of laughing, because you don’t need Hadoop to do most problems.”

Boyd Davis, vice president and general manager of Intel’s Datacenter Software Division, is hoping that Kaggle’s basketball competition will help show big data’s potential to businesses that haven’t yet embraced it. “It’s still hard, particularly for business leaders who aren’t technology people, to get their arms around,” Davis says. Starting with March Madness brackets, for which many people already use some level of statistics and multiple data sources, is a good way to introduce the idea of big data. “The Kaggle competition will give us a chance to show how much of a better result, hopefully, you can get if you actually use a lot more data sources and then harness them with data analytics,” Davis says. (Intel recently launched the Intel Data Platform, based on Hadoop, for companies to process big data, which Cukierski thinks is at least part of the reason that Intel sponsored the competition.)

Cukierski agrees with Davis that using sports is a good way to introduce people to the field of data science. Making brackets “is one of the few places where people will actually tolerate some amount of statistics in their real life,” he says. “They don’t realize it, but underneath the covers they’re doing a kind of rough mathematical modeling.”

The views expressed are those of the author and are not necessarily those of Scientific American.
