In-Memory Analytics

This benchmark uses Apache Spark and runs a collaborative filtering algorithm in-memory on a dataset of user-movie ratings. The metric of interest is the time, in seconds, to compute movie recommendations.

The explosion of accessible human-generated information necessitates automated analytical processing to cluster, classify, and filter it. Recommender systems are a subclass of information filtering systems that seek to predict the 'rating' or 'preference' a user would give to an item. Recommender systems have become extremely common in recent years and are applied in a variety of domains; the most popular are movies, music, news, books, research articles, search queries, social tags, and products in general. Because these applications are I/O-intensive, most of them now run in memory. This benchmark runs the alternating least squares (ALS) algorithm provided by Spark MLlib.

Running the Benchmark

The benchmark runs the ALS algorithm on Spark through the spark-submit script distributed with Spark. It takes two arguments: the dataset to use for training, and the personal ratings file to give recommendations for. Any remaining arguments are passed to spark-submit.

The cloudsuite/movielens-dataset image has two datasets (one small and one large), and a sample personal ratings file.
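One way to make these files available to the benchmark is to create a data container from this image and later mount its volumes with --volumes-from. A sketch, assuming the image exposes its files through Docker volumes (the container name movielens-data is arbitrary):

```shell
# Create a (stopped) container whose volumes hold the datasets
# and the sample personal ratings file
docker create --name movielens-data cloudsuite/movielens-dataset
```

Any container name works, as long as it matches the --volumes-from flag used when launching the benchmark container.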

To run a benchmark with the small dataset and the provided personal ratings file:
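A sketch of such an invocation, assuming the benchmark image is published as cloudsuite/in-memory-analytics, a dataset container named movielens-data exists, and the dataset files sit under /data (the image name and paths are assumptions):

```shell
# First argument: training dataset; second: personal ratings file
docker run --rm --volumes-from movielens-data cloudsuite/in-memory-analytics \
    /data/ml-latest-small /data/myratings.csv
```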

Tweaking the Benchmark

Any arguments after the two mandatory ones are passed to spark-submit and can be used to tweak execution. For example, to ensure that Spark has enough memory allocated to be able to execute the benchmark in-memory, supply it with --driver-memory and --executor-memory arguments:
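A sketch of such a run, again assuming an image named cloudsuite/in-memory-analytics, a dataset container named movielens-data, and dataset paths under /data (all assumptions); the memory sizes are illustrative:

```shell
# Arguments after the two mandatory ones are forwarded to spark-submit
docker run --rm --volumes-from movielens-data cloudsuite/in-memory-analytics \
    /data/ml-latest /data/myratings.csv \
    --driver-memory 2g --executor-memory 2g
```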

Multi-node deployment

This section explains how to run the benchmark using multiple Spark workers (each running in a Docker container) that can be spread across multiple nodes in a cluster. For more information on running Spark with Docker, look at cloudsuite/spark.
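A rough sketch of such a deployment follows. The image entrypoints, container names, master URL, and port are assumptions based on a typical standalone Spark-on-Docker setup; consult cloudsuite/spark for the authoritative commands, and replace MASTER_IP with the master node's address:

```shell
# On the master node: start a Spark master
docker run -d --net host --name spark-master cloudsuite/spark master

# On each worker node: start a worker pointing at the master
docker run -d --net host --name spark-worker cloudsuite/spark worker \
    spark://MASTER_IP:7077

# Run the benchmark against the cluster instead of local mode
docker run --rm --net host --volumes-from movielens-data \
    cloudsuite/in-memory-analytics \
    /data/ml-latest /data/myratings.csv --master spark://MASTER_IP:7077
```

Here --master, like the memory flags, is simply forwarded to spark-submit, which is how the benchmark is pointed at the standalone cluster.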