I've been working as the Social Media Editor and a staff writer at Forbes since October 2011. Prior to that, I worked as a freelance writer and contributor here. On this blog, I focus on futurism, cutting-edge technology, and breaking research. Follow me on Twitter: @thealexknapp. You can email me at aknapp@forbes.com.

Data Scientists Figured Out What Songs You'll Like - In 24 Hours

A visualization of some of the data that scientists worked with to figure out what songs people will like. (Credit: Greg Mead of Musicmetric)

It should be no surprise that the music business is also a big data business. There are thousands of bands and artists out there, each with their own unique sound. And while some artists have been able to make their way on their own in an era of social media and digital sales, most of the money in the music industry is still made by the record companies. And when it comes to figuring out which bands to sign and how to promote them, they rely on a lot of marketing data. That need is what led EMI to produce its One Million Interview Dataset - an intense set of interviews with music fans talking about their attitudes and behaviors when it comes to music. The interviews, each lasting around 20 minutes, were conducted in many different countries.

Gathering this information is an arduous but impressive undertaking. It can provide real insights into how music affects people, and also help figure out what kinds of music people are likely to buy. Sorting through that much data is a daunting task, but that didn’t stop the data scientists who regularly compete in Kaggle competitions. Last month, 138 competitive data teams got together for a 24-hour ‘Data Science Hackathon,’ which was put together by Data Science London on data competition company Kaggle’s platform and sponsored by EMI and EMC.

“What they were trying to predict was the rating that a person would give to a given song,” Anthony Goldbloom, President of Kaggle, explained to me on the phone. “The prediction was based on the age, geography, and questions about music preferences. They were also asked to give descriptive words for different songs.”

One of the first things that the competitors determined is that traditional marketing approaches didn’t work. Factors like age and socioeconomic data weren’t accurate predictors of how people would rate songs. Instead, general interests and attitudes were much better drivers of predictions. Not that they didn’t learn anything based on people’s ages.

“As it turns out, older, retired people were much less discriminating and more open in their musical taste than younger people, which is the opposite of the stereotype,” noted Goldbloom.

The winner of the contest was Shanda Innovations, a tech incubator of the Chinese Shanda Corporation. They applied machine learning techniques to the data to reach their victory, but one of their biggest obstacles wasn’t the math – it was human beings themselves.

In a blog post about their victorious algorithm, the team noted, “We were very surprised to find that the variation of the track scores given by different people was a lot more than we expected. For instance, User ID 41072 scored 100 to track 156 whereas User ID 41286 gave merely 4 to the same track! It was very interesting to find that people were so different in music preference and we believed that was why so many different types of music existed.”

What did help was a technique that Goldbloom referred to as “collaborative filtering.” That is, the teams – not just Shanda, but many of them – paid attention to the descriptive words people used in rating songs, found commonalities between them, and used that as a basis to predict ratings. You can check out their open-sourced solutions here.
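To make the idea of predicting ratings from shared descriptive words concrete, here is a minimal sketch in Python. It is not any team's actual solution: the listeners, words, and ratings are invented, and the helper names (`jaccard`, `predict_rating`) are hypothetical. It uses a simplified neighborhood-style approach, estimating one listener's rating for a track as a weighted average of other listeners' ratings, where the weight is how much their word choices overlap.

```python
def jaccard(a, b):
    """Overlap between two sets of descriptive words, from 0.0 to 1.0."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def predict_rating(target_words, others):
    """Similarity-weighted average of other listeners' ratings.

    others: list of (words, rating) pairs for the same track.
    Returns None when no listener shares any word with the target.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for words, rating in others:
        weight = jaccard(target_words, words)
        weighted_sum += weight * rating
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Invented data: the words three listeners used, and the 0-100 rating
# each of them gave to one track.
others = [
    ({"playful", "catchy"}, 90),
    ({"playful", "upbeat"}, 80),
    ({"noisy", "superficial"}, 10),
]

# A new listener whose vocabulary overlaps with the first two listeners only.
new_listener_words = {"playful", "catchy", "upbeat"}
prediction = predict_rating(new_listener_words, others)
print(prediction)
```

With this made-up data the estimate comes out at about 85, because the third listener shares no words with the new one and so carries no weight. Real entries would presumably combine many more signals than word overlap alone.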

Even more interesting, though, is that the top-rated teams also looked at the descriptive words that people who didn’t like a song used, and compared them to the words people who liked the song used. What they discovered were some strong correlations between positive and negative keywords. For example, someone who didn’t like a song might call it “superficial” while someone who liked it would call it “playful.”

“Basically, one person’s ‘noisy’ is another person’s ‘inspirational,’” said Goldbloom.
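One way to look for the kind of word pairings described above can be sketched as follows. This is only an illustration under stated assumptions: the reviews are invented, the cut-off of 50 between "liked" and "disliked" is an assumption, and `opposing_word_pairs` is a hypothetical helper, not code from any competing team. For each track, it pairs every word used by a listener who disliked it with every word used by a listener who liked it, and counts how often each pairing turns up.

```python
from collections import Counter
from itertools import product

# Invented reviews: (track_id, rating on a 0-100 scale, words the listener chose).
reviews = [
    (1, 95, {"playful", "inspirational"}),
    (1, 5, {"superficial", "noisy"}),
    (2, 88, {"playful"}),
    (2, 12, {"superficial"}),
]

LIKE_THRESHOLD = 50  # assumed cut-off between "liked" and "disliked"


def opposing_word_pairs(reviews, threshold=LIKE_THRESHOLD):
    """Count (negative word, positive word) pairs used for the same track."""
    liked = {}     # track_id -> word sets from listeners who liked the track
    disliked = {}  # track_id -> word sets from listeners who did not
    for track_id, rating, words in reviews:
        bucket = liked if rating >= threshold else disliked
        bucket.setdefault(track_id, []).append(words)

    pairs = Counter()
    for track_id, negative_sets in disliked.items():
        for neg_words in negative_sets:
            for pos_words in liked.get(track_id, []):
                # Every negative word is paired with every positive word.
                pairs.update(product(neg_words, pos_words))
    return pairs


pairs = opposing_word_pairs(reviews)
print(pairs.most_common(1))
```

In this toy data the most frequent pairing is "superficial" with "playful", which appears for both tracks. Raw co-occurrence counts like these are only a starting point; they say nothing on their own about how strong or reliable a correlation is.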

Comments

Quite sad: why does nobody mention Data Science London and all the parties involved in this project? Our non-profit organisation and our community members are among the driving forces behind this project. http://musicdatascience.com The fact that all the winners have documented and open sourced their code is also not mentioned. http://musicdatascience.com/the-winners/

We often think about retail, mobile and professional service companies as the primary users of big data – trying to figure out what type of products we like, how we use our phones to get them, what our financial history is like, and so on. It’s great to be reminded by this example of the music industry that big data analytics is beneficial across all industries and for all consumers. http://bit.ly/Qy62pL

I feature my music on Jango and have a diverse audience. Jango’s data indicates I have slightly more male than female listeners with age groups, depending on the song, spiking at 25-34 or 35-44. Top three states are California, Texas and Florida. I haven’t a clue. Are we attempting to make sense out of the nonsensical? I believe some consumers approve products because they are afraid not to, due to perceived peer pressure. buckbaran.com