Spotify Embraces Hortonworks, Dumps Cloudera

World's largest music service switches Hadoop distributions to take advantage of Hortonworks Hive improvements, support services.

Spotify, the 24-million-user-strong music service based in Stockholm and London, announced Monday that it's migrating its massive, 690-node Hadoop cluster from Cloudera's software distribution to the Hortonworks Data Platform (HDP) and Hortonworks enterprise support.

Among the largest Hadoop implementations in Europe, Spotify's cluster is used to develop analytics that drive the company's personalized services, such as Spotify Radio. It also supports data-driven analyses for advertisers and partners. For example, Spotify can segment listeners to help advertisers place ads. It can also run geospatial analyses of listening patterns to help record labels and artists determine optimal concert locations.

"[Hortonworks'] true open source approach and the work they have done to improve the Apache Hive data warehouse system aligns well with our needs," said Wouter de Bie, team lead for data infrastructure at Spotify, in a statement. "We use Hive extensively for ad-hoc queries and for the analysis of large data sets."

Most Hadoop software distributors have supported the so-called SQL-on-Hadoop movement this year -- Cloudera with Impala, IBM with Big SQL, MapR with Drill, and Pivotal with HAWQ -- but Hortonworks is alone in doing so by focusing on improving Hadoop's existing Hive interface through its Stinger initiative.

Hive relies on behind-the-scenes MapReduce processing, which has a reputation for being slow, but Hortonworks executives insist that the company's design changes will deliver a 100X performance improvement, yielding ad-hoc query results within "a handful of seconds."

"Spotify is undertaking some really innovative work in the data analytics field and realized the need for a deep level of open source Apache Hadoop domain experience and expertise," commented Herb Cunitz, president of Hortonworks, in a statement.

Spotify launched in 2008 and soon thereafter set up a 30-node cluster on Amazon Web Services. The company switched to an on-premises 60-node cluster less than two years ago and quickly scaled it out to today's 690 nodes. The company collects more than 200 gigabytes of compressed user activity data per day and has more than 4 petabytes of capacity in its cluster.

Spotify could not be reached in time to comment on whether it's simply using Cloudera's distribution of open source software or also employing its commercial management software and support services. Spotify is said to have a highly skilled, 12-plus-engineer internal Hadoop team that would seem quite capable of running Hadoop independently. That team developed Luigi, which Spotify has since contributed to open source; it is a Python framework for batch data processing, dependency resolution and monitoring of Hadoop.

"The cultural fit was an important factor in our selection and we have appreciated Hortonworks' relaxed, helpful and open approach," said Wouter de Bie. "We were looking for a true partner relationship and the team at Hortonworks [is] committed to enabling the overall ecosystem."

Check out this big presentation from Wouter de Bie on Spotify's implementation and uses of Hadoop: http://bit.ly/153evDr. I didn't see any mention of Cloudera in the slides, so I suspect Spotify is another of the many enterprises that have been setting up and supporting Hadoop clusters on their own (without the benefit of support from the likes of Cloudera or Hortonworks). That's clearly changing now at Spotify with the selection of Hortonworks, but I'm still waiting to hear whether it was actually using proprietary Cloudera management software and/or support services.
