Meta

A blog about analytics and music

Sparkling Song Recommendations

Having taught myself a bit of Python, I was keen to start using Spark. Spark is an open-source project from Apache that builds on the ideas of MapReduce. It can still use HDFS, but where Hadoop processes data on disk, Spark runs things in memory, which can dramatically increase processing speeds. Spark also comes with a great machine learning library. Where Hive on Hadoop limits you to data processing and manipulation, Spark lets you build models at scale. Spark can be called from its native Scala, or through APIs for Python and Java (release 1.4 also adds an R API).

I decided to take the Last.fm dataset I have been playing with recently and generate some recommendations for its users based on their listening habits.

There are a number of ways of building recommendation engines.

Content based

This is where you use information (metadata) about the items you are recommending, along with user preferences towards it. For example, if recommending films, you might suggest more films by a director that a user likes.

Collaborative Filtering

These methods use other people's preferences to help generate recommendations. There are a few types:

User to User

This is where you find similar users according to their views on items and use them to generate recommendations. Person A and Person B have very similar tastes in movies. Person B really liked a film that Person A hasn't seen, so that film is a good recommendation for Person A.

Item to Item

This is the flip of user to user. Lots of the people who like Movie A also like Movie B, so anyone who liked Movie A and hasn't seen Movie B is likely to enjoy it.

Latent Factors

These assume there are underlying factors in the relationships between users and items that can be exploited to provide recommendations. To find these underlying factors, dimensionality reduction techniques such as SVD or PCA are used.
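To make the collaborative filtering idea concrete, here is a toy user-to-user sketch in plain Python. The users, films and scores are all made up, and real systems use far more robust similarity measures, but the shape of the idea is the same:

```python
from math import sqrt

# Toy ratings: user -> {film: score}. All names and numbers are invented.
ratings = {
    "A": {"Film1": 5, "Film2": 4, "Film3": 1},
    "B": {"Film1": 5, "Film2": 5, "Film3": 1, "Film4": 4},
    "C": {"Film3": 5, "Film4": 1},
}

def cosine(u, v):
    """Cosine similarity between two users, over the films they share."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[f] * v[f] for f in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Suggest films the most similar user rated but `user` hasn't seen."""
    others = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    _, nearest = max(others)
    seen = set(ratings[user])
    return sorted(f for f in ratings[nearest] if f not in seen)

print(recommend("A", ratings))  # B is most similar to A, so A gets Film4
```

Item-to-item filtering is the same calculation with the table transposed: similarity between films over the users they share.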

If you want to learn more about recommendation systems, Coursera has this great free course

The machine learning library in Spark offers an SVD-based latent-factor form of collaborative filtering, and the data I have suits this approach, so that is the method I will use. It uses the Alternating Least Squares (ALS) method for resolving the SVD. You can learn more about this here. For the sake of simplicity I also decided to implement the implicit version of the algorithm. The explicit version requires users to rate items; here I am implicitly assuming that listening to an artist means the user likes them.

To build a recommendation model at this scale I was going to need AWS infrastructure. But before I dived into that and started spending money, I thought it would be best to build the process locally. I installed Spark on my Mac, which was super easy with these instructions

Once installed, you then need to fire up a Spark instance. To do this, go to the Spark folder you have just created, open a terminal, and enter

./bin/pyspark

That's it. You now have a Spark instance running with Python. The local process to create a recommendation engine, with annotations, can be found in this gist.

It took me a while to get going with Spark. Some things were really straightforward: reading data (using the lines and split functions) and sampling data (the sample function). However, coming from a SQL mindset, I found joining data confusing. In order for the latent-factor ALS method to work I had to store my users and items as unique integers rather than strings. This required me to create a unique lookup (which was easy using the zipWithUniqueId function), then join this to my data and replace the string ID with the integer ID. This was more tricky.
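In Spark the lookup comes from calling zipWithUniqueId() on the distinct users; in plain Python the same idea is just enumerate over the distinct keys. A small sketch, with made-up users and play counts:

```python
# Emulating what users.distinct().zipWithUniqueId() gives you in Spark:
# each distinct string ID paired with a unique integer.
# The user names and play counts here are invented.
plays = [("alice", "Radiohead", 120), ("bob", "Radiohead", 3), ("alice", "Elbow", 40)]

# Build the string -> integer lookup (sorted only to make the result stable).
user_lookup = {user: uid for uid, user in enumerate(sorted({u for u, _, _ in plays}))}

print(user_lookup)  # {'alice': 0, 'bob': 1}
```

Note that Spark's zipWithUniqueId() guarantees uniqueness but not consecutive numbering; for ALS, uniqueness is all that matters.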

Spark works in tuples, i.e. your data is stored as an ID and a value. It's easy to think about: if you have, say, (ID, Value1) and (ID, Value2), you can join on ID to give (ID, {Value1, Value2}). However, it gets more tricky when you have multiple values and an ID. You always have to think of the tuple as an ID and the multiple values associated with it: (ID, {Value1, Value2, Value3, …}). So if your data is stored as (ID, Value1, Value2), you need to map it to a tuple so that it is stored as (ID, {Value1, Value2}); then you can perform the join. In my case (user, item, rating) is mapped to (user, {item, rating}), then joined to (user, newUserID). Then I map again to get (newUserID, item, rating). If you wanted to join on multiple IDs, you could map to ({ID1, ID2, …}, {Value1, Value2, Value3, …}) and then join.
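The map–join–map reshaping above can be sketched in plain Python, emulating how an RDD join pairs values by key. In Spark each step would be an rdd.map(...) or rdd.join(...); the data below is invented:

```python
# Emulate reshaping (user, item, rating) -> (newUserID, item, rating).
# Play counts and user names are made up for illustration.
plays = [("alice", "Radiohead", 120), ("bob", "Radiohead", 3), ("alice", "Elbow", 40)]
lookup = [("alice", 0), ("bob", 1)]  # (user, newUserID), as from zipWithUniqueId()

# Step 1: map (user, item, rating) -> (user, (item, rating)) so user is the key.
keyed = [(user, (item, rating)) for user, item, rating in plays]

# Step 2: join on the key, like keyed.join(lookup) in Spark, which yields
# (user, ((item, rating), newUserID)) for each matching pair.
ids = dict(lookup)
joined = [(user, (value, ids[user])) for user, value in keyed if user in ids]

# Step 3: map again to swap in the integer ID: (newUserID, item, rating).
result = [(uid, item, rating) for _, ((item, rating), uid) in joined]

print(sorted(result))  # [(0, 'Elbow', 40), (0, 'Radiohead', 120), (1, 'Radiohead', 3)]
```

The same pattern works for the item IDs: key by item, join to the item lookup, map the integer back in.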

Implementing the recommendations part was simple: you just train your model with ALS.trainImplicit(). Rank is the number of underlying latent factors you want to find, and iterations is the number of alternating iterations (i.e. passes resolving the SVD) you want to run. Then you can just apply your model using .predictAll()

So once I'd developed my process locally on a small sample, it was time to switch to AWS. I moved my data into S3 and fired up a cluster with Spark installed. This was really easy through the AWS console: go to EMR, then Create Cluster. In the software configuration section you just need to add Spark in the additional applications settings.
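The same cluster can also be created from the AWS CLI rather than the console. A sketch, where the cluster name, release label, instance type and count are placeholders rather than recommendations:

```shell
# Spin up an EMR cluster with Spark as an installed application.
# All values here are illustrative; pick sizes to suit your data and budget.
aws emr create-cluster \
  --name "spark-recommender" \
  --release-label emr-4.0.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles
```

Remember to terminate the cluster when the job is done, since EMR bills for as long as it runs.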