How do recommendation systems know what you might like?

Create powerful recommendations with Hadoop and in-memory analytics

By Wayne Thompson, Manager of Data Sciences Technologies, SAS

If you’ve ever shopped on Amazon or streamed from Pandora or Netflix, you’ve experienced recommendation systems in action. These systems analyze historical behavior, such as past purchases, and make recommendations in real time, while you’re still browsing.

Recommendation systems take several approaches. The models that drive them can be:

Content-based, recommending items whose descriptions resemble those of items a customer has looked at or purchased in the past.

Community-based, recommending items that similar customers have liked and purchased in the past.
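As a rough illustration of the content-based approach, the sketch below ranks unseen items by how many descriptive keywords they share with what a customer has already viewed. The catalog, the keywords and the function names are all hypothetical, and Jaccard overlap is just one simple similarity measure chosen for the example.

```python
# Hypothetical catalog: item -> set of descriptive keywords (illustrative data).
CATALOG = {
    "space_opera_novel": {"fiction", "space", "adventure"},
    "mars_survival_novel": {"fiction", "space", "survival"},
    "italian_cookbook": {"cooking", "italian", "recipes"},
    "astronomy_guide": {"nonfiction", "space", "science"},
}


def jaccard(a, b):
    """Share of keywords two descriptions have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_by_content(viewed, catalog, top_n=2):
    """Rank unseen items by similarity to the keywords of viewed items."""
    # Build one keyword profile from everything the customer has viewed.
    profile = set().union(*(catalog[item] for item in viewed))
    scores = {
        item: jaccard(profile, keywords)
        for item, keywords in catalog.items()
        if item not in viewed
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


recommendations = recommend_by_content({"space_opera_novel"}, CATALOG)
print(recommendations)
```

A community-based model would replace the keyword profile with the ratings or purchases of similar customers.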

Recommendation systems are further broken into those that take direct or indirect feedback:

Explicit, based on customer ratings or tags (think restaurant reviews on a scale of 1 to 5).

Implicit, inferred from behavior rather than stated preferences, for example suggesting items that have historically been purchased together.
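To make the implicit case concrete, here is a minimal sketch that counts how often pairs of items show up in the same order and suggests the most frequent partners of a given item. The baskets and function names are hypothetical, and raw co-occurrence counts stand in for whatever measure a production system would use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets (illustrative data, one set per order).
BASKETS = [
    {"burger", "fries", "soda"},
    {"burger", "fries"},
    {"salad", "soda"},
    {"burger", "soda"},
]


def co_purchase_counts(baskets):
    """Count how often each unordered pair of items appears in one basket."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes (a, b) a canonical key regardless of set order.
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
    return counts


def bought_with(item, counts, top_n=2):
    """Items most often purchased together with `item`."""
    partners = Counter()
    for (a, b), n in counts.items():
        if a == item:
            partners[b] += n
        elif b == item:
            partners[a] += n
    # Most frequent first; ties broken alphabetically.
    return sorted(partners, key=lambda other: (-partners[other], other))[:top_n]


pair_counts = co_purchase_counts(BASKETS)
suggestions = bought_with("burger", pair_counts)
print(suggestions)
```

No customer ever rated anything here; the signal comes entirely from what ended up in the same basket.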

Recommendation systems have been deployed across a wide range of industries and contexts. When you get coupons printed out at the supermarket checkout, the offers are tailored based on your previous purchases. Online retailers and entertainment providers make real-time recommendations or show advertisements based on what a customer is viewing or listening to. Social networking sites such as LinkedIn and Facebook suggest additional connections or friends based on your existing network.

How do recommendation systems work?

To find the best recommendations for individuals or groups, you need analytics that can solve a tough computational and processing challenge – fast. You need to be able to analyze masses of data in real time, not later when the customer and the opportunity are gone.

That’s why many organizations are moving to Hadoop for recommendation engines. Hadoop is an open-source software framework for running applications on large clusters of commodity hardware. Hadoop delivers enormous processing power – the ability to handle virtually limitless concurrent tasks and jobs – which makes it a remarkably low-cost complement to a traditional enterprise data infrastructure.

Hadoop brings a high-tech twist to the adage “many hands make light work”: data is stored on the local disks of a distributed cluster of servers. Processing is pushed to the data rather than vice versa and runs in parallel, which is where the speed comes from.

In a typical scenario, you would have huge volumes of customer data stored in Hadoop – in multiple data nodes and one head node, also referred to as a general or edge node. Once the data is loaded into memory in the Hadoop cluster, in-memory analytics software takes over. The system divides to conquer. The data nodes compute recommendation models in parallel, and then return the best user-item combinations to the head node at the edge of the cluster for decision making.
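The divide-and-conquer pattern described above can be sketched in a few lines. In this simplified illustration, worker threads stand in for data nodes and a plain function stands in for the head node; none of it is Hadoop or SAS API, and the partitions, scores and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical partitions: each "data node" holds predicted scores
# for its own slice of users as (user, item, score) rows.
PARTITIONS = [
    [("ann", "book_a", 4.8), ("ann", "book_b", 3.1), ("bob", "book_c", 4.2)],
    [("cat", "book_a", 2.5), ("cat", "book_d", 4.9), ("dan", "book_b", 4.0)],
]


def local_top(partition, n):
    """Work done on one data node: keep its n highest-scoring pairs."""
    return heapq.nlargest(n, partition, key=lambda row: row[2])


def head_node_merge(partitions, n):
    """Work done on the head node: merge per-node results into a global top n."""
    # Threads here only simulate nodes working in parallel.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        local_results = list(pool.map(lambda part: local_top(part, n), partitions))
    merged = [row for result in local_results for row in result]
    return heapq.nlargest(n, merged, key=lambda row: row[2])


best_pairs = head_node_merge(PARTITIONS, n=3)
print(best_pairs)
```

Because each node returns only its own top n rows, the head node merges a handful of candidates instead of scanning the full data set.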

Suppose an online retailer wants to make recommendations based on data about 1 million users, 500,000 books and 5 million book ratings. Given the size of the data set, this would be a computationally intensive job that could take hours or days to process using traditional analytical software. With distributed processing and in-memory analytics, it runs in minutes and delivers accurate recommendations while the customer is still shopping the site.
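A quick back-of-the-envelope calculation with the figures above shows why the problem is hard: there are 500 billion possible user-book pairs, and only about 0.001 percent of them carry a rating. The variable names below are just for illustration.

```python
users = 1_000_000
books = 500_000
ratings = 5_000_000

possible_pairs = users * books        # every user-book combination
density = ratings / possible_pairs    # fraction of pairs that have a rating
ratings_per_user = ratings / users    # average number of ratings per user

print(f"{possible_pairs:,} possible user-book pairs")
print(f"{density:.6%} of pairs are rated")
print(f"{ratings_per_user:.0f} ratings per user on average")
```

The model has to estimate preferences for the vast majority of pairs from an average of five ratings per user.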

Choosing the best algorithm

As mentioned earlier, there are several approaches to arriving at a relevant recommendation, and therefore a choice of algorithm(s) to use in recommendation models. For instance:

K nearest neighbor is a collaborative filter based on measures of association between items or users.

Matrix factorization is a way to create latent factors representing groups of items or families of items.

Association rules automatically recommend related items as you browse or place an item in the cart.

Clustering groups similar users or items together to streamline analysis of massive data matrices.

Slope One estimates preferences for new items based on average difference in preference value (ratings) between a new item and the other items a user prefers.
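Of the algorithms listed, Slope One is the easiest to show in a few lines. The sketch below uses the common weighted variant, in which each average rating difference counts in proportion to the number of users who rated both items; the ratings and the function name are hypothetical.

```python
# Hypothetical explicit ratings on a 1-to-5 scale (illustrative data).
RATINGS = {
    "ann": {"book_a": 5.0, "book_b": 3.0, "book_c": 2.0},
    "bob": {"book_a": 3.0, "book_b": 4.0},
    "cat": {"book_b": 2.0, "book_c": 5.0},
}


def slope_one_predict(ratings, user, target):
    """Weighted Slope One: predict `user`'s rating for the item `target`."""
    numerator = 0.0
    total_weight = 0
    for item, user_rating in ratings[user].items():
        if item == target:
            continue
        # Rating differences (target - item) from users who rated both.
        diffs = [
            r[target] - r[item]
            for r in ratings.values()
            if target in r and item in r
        ]
        if not diffs:
            continue
        avg_diff = sum(diffs) / len(diffs)
        # Shift the user's own rating by the average difference,
        # weighted by how many users support that difference.
        numerator += (user_rating + avg_diff) * len(diffs)
        total_weight += len(diffs)
    return numerator / total_weight if total_weight else None


predicted = slope_one_predict(RATINGS, "bob", "book_c")
print(round(predicted, 2))
```

The function returns None when no other user has rated both the target and any item the user has rated, which is the cold-start case the other algorithms also have to handle.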

Which algorithm will do the best job for your situation? Test them. Have the different algorithms compete against each other to identify which works best for a specific business problem. The champion model could even be one that combines different analytical techniques.
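One simple way to run such a competition is to score each candidate on ratings it has not seen and keep the one with the lowest error. The sketch below does this with two deliberately basic challengers and root mean squared error (RMSE); the data, the model functions and the choice of metric are assumptions made for the example.

```python
import math

# Hypothetical (user, item, rating) triples split into training and held-out sets.
TRAIN = [("ann", "a", 5), ("bob", "a", 4), ("ann", "b", 2), ("cat", "b", 1)]
HELD_OUT = [("cat", "a", 5), ("bob", "b", 2)]


def global_mean_model(train):
    """Challenger 1: always predict the overall average rating."""
    mean = sum(r for _, _, r in train) / len(train)
    return lambda user, item: mean


def item_mean_model(train):
    """Challenger 2: predict each item's average rating."""
    overall = global_mean_model(train)
    by_item = {}
    for _, item, r in train:
        by_item.setdefault(item, []).append(r)
    means = {item: sum(rs) / len(rs) for item, rs in by_item.items()}
    # Fall back to the overall average for items never seen in training.
    return lambda user, item: means.get(item, overall(user, item))


def rmse(model, held_out):
    """Root mean squared error of a model on held-out ratings."""
    errors = [(model(u, i) - r) ** 2 for u, i, r in held_out]
    return math.sqrt(sum(errors) / len(errors))


challengers = {
    "global_mean": global_mean_model(TRAIN),
    "item_mean": item_mean_model(TRAIN),
}
scores = {name: rmse(model, HELD_OUT) for name, model in challengers.items()}
champion = min(scores, key=scores.get)
print(champion, scores)
```

The same harness works for any model that can be wrapped as a function of user and item, including a blend of several techniques.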

The distributed, parallel processing capacity of Hadoop overcomes the bandwidth and coordination issues associated with processing complex algorithms against billions of records. In-memory analytics can quickly solve complex problems using big data, making ever more accurate recommendations over time as it “learns.” Together, these technologies can deliver more relevant recommendations to all customer touch points in real time, for both rated and non-rated items.

The last piece of the puzzle

We have discussed how recommendation systems work and how to build recommendation models quickly, but a vital piece is still missing: how to deploy those models. Event stream processing (ESP) serves as the engine for collecting data, filtering or aggregating it, and deploying recommendations within various ESP windows in real time. That completes the recommendation ecosystem, from data collection and preprocessing to modeling and deployment.
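The idea of a window over an event stream can be illustrated without any ESP product. The sketch below keeps only the most recent events, aggregates them, and emits a recommendation as each new event arrives. It is not the API of any ESP engine; the class, its count-based window and the "trending item" rule are hypothetical simplifications.

```python
from collections import Counter, deque


class SlidingWindowRecommender:
    """Keeps the most recent `size` view events and recommends what is trending."""

    def __init__(self, size):
        # A bounded deque drops the oldest event automatically.
        self.window = deque(maxlen=size)

    def on_event(self, user, item):
        """Ingest one event and return the top item this user has not viewed."""
        self.window.append((user, item))
        # Filter: ignore items the user already viewed inside the window.
        seen = {i for u, i in self.window if u == user}
        # Aggregate: count views of the remaining items in the window.
        counts = Counter(i for _, i in self.window if i not in seen)
        if not counts:
            return None
        # Most viewed first; ties broken alphabetically.
        return min(counts, key=lambda i: (-counts[i], i))


recommender = SlidingWindowRecommender(size=4)
stream = [("ann", "burger"), ("bob", "fries"), ("cat", "fries"), ("dan", "burger")]
outputs = [recommender.on_event(user, item) for user, item in stream]
print(outputs)
```

In a real deployment the scoring step would call a trained recommendation model rather than a simple count.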

So, would you like some fries with that?

Wayne Thompson, Manager of Data Science Technologies at SAS, is a globally renowned presenter, teacher, practitioner and innovator in the fields of data mining and machine learning. He has worked with the world’s largest organizations to help them harness analytics for high performance. During his 20 years at SAS, Thompson has been credited with bringing many landmark SAS Analytics technologies to market. His current focus is on self-service data mining tools for business analysts, outlier detection and description, entity analytics and recommendation engines, with an emphasis on in-memory analytics optimized for Hadoop.