I'm coming off a data science buzz from the Open Data Science Conference (ODSC) in Santa Clara. I was a huge fan of the crowd that was there: genuinely curious people with very substantive questions. We had some wonderful conversations, and I appreciate the insight and thought so many of my colleagues are putting into their work. I was invited to give two talks: a 4-hour training session on Data Science 101, and a 45-minute talk on recommendation systems.

The goal of my talk was to go through the thought process and critical thinking involved in developing a recommender system and refining the model. My slides are below, and they're pretty self-explanatory. What I want to talk about here are some of the great follow-up questions and emails I got, particularly around developing a baseline and normalizing your data.

There are a lot of tutorials about how to build a basic recommendation system, but few on how to refine one. Let's assume you've built out your recommendation system, and are able to validate it. One way of measuring validity is a basic root mean squared error (RMSE) measure. I won't get into the debate on whether it's a *good* error measurement or not, but let's say it's the one we use. Here's an article on Data Science Central about it, if you're so inclined.
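For anyone who hasn't computed RMSE by hand, here is a minimal sketch using only the standard library. The function name `rmse` and the ratings are made up for illustration; they are not from my talk or from any particular library.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out ratings and the model's predictions for them.
actual_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual_ratings, predicted_ratings))  # 0.75
```

Because the errors are squared before averaging, one badly missed rating counts for more than several small misses.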

What you will likely find is an okay model that has more error than you'd like. That is, your model is good, but not great, at predicting what your audience wants. Normalization is a great first step (and in some cases, the most impactful step) in fine-tuning your model. Normalization will be the most effective if you have a very diverse group of users or items, and/or you do not have many data points per item/user. In data science speak, both contribute to your variance.

Normalization is one way to approach this problem. The idea is to create a baseline for 'normal' and then produce an offset for each item or user. The theory is that each user and item can vary from the expectation. Think about it this way - some people are just naturally more cheerful or more grumpy. In terms of items, some are just perceived better or worse than others - for example, there is a huge cult of Mac that will just love every Apple-branded product. Brand recognition will impact an item's perception even before the rating happens.

Normalization is a fairly easy process. First, get some mid-range rating value (probably the median) across all users, and do the same across all products. You'll come out with "people tend to rate products at 3.5" or "products tend to get a 3.5 rating".

Next, take the median score for each user across all products, and the median score across all ratings for each product. Yes, that means you will need some minimum number of ratings per product and per user, so this works better the more data you have. You'll come out with "user x tends to rate products at a 3.8" or "product x tends to have a rating of 3.3".

Third, subtract each score from the baseline to get the offset. In my example above, user x would have an offset of 3.5 - 3.8 = -0.3, and product x would have an offset of 3.5 - 3.3 = 0.2.

To implement, you take your predicted value from your model - so that's a user/item prediction in your matrix - and add in the offsets. Let's say my naive model predicted that user x would give a rating of 4.0 to product x. With normalization, I apply both offsets and arrive at 4.0 - 0.3 + 0.2 = 3.9. My actual predicted value for user x reviewing product x is 3.9.
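The steps above can be sketched in a few lines of Python. This is an illustration under some assumptions, not a reference implementation: the helper names `offsets` and `adjust`, the `min_ratings` cutoff, the choice to give sparse users or products an offset of zero, and the sample ratings are all mine. I also use a single baseline (the median of all ratings) for both users and products, and I keep the sign convention from the text, where the offset is the baseline minus the median score.

```python
from collections import defaultdict
from statistics import median

def offsets(ratings, key_index, baseline, min_ratings=1):
    """Offset per user (key_index=0) or per product (key_index=1).

    Convention from the text: offset = baseline - median score.
    Keys with fewer than min_ratings ratings get 0.0 (an assumption).
    """
    grouped = defaultdict(list)
    for row in ratings:
        grouped[row[key_index]].append(row[2])
    return {
        key: (baseline - median(scores)) if len(scores) >= min_ratings else 0.0
        for key, scores in grouped.items()
    }

def adjust(prediction, user_offset, product_offset):
    """Add both offsets to the naive model's prediction."""
    return prediction + user_offset + product_offset

# Hypothetical (user, product, rating) triples.
ratings = [
    ("u1", "p1", 4.0), ("u1", "p2", 5.0), ("u1", "p3", 4.0),
    ("u2", "p1", 2.0), ("u2", "p2", 3.0), ("u2", "p3", 3.0),
]

baseline = median(r for _, _, r in ratings)
user_offsets = offsets(ratings, 0, baseline)
product_offsets = offsets(ratings, 1, baseline)

# The worked numbers from the text: baseline 3.5, user median 3.8,
# product median 3.3, naive prediction 4.0.
print(round(adjust(4.0, 3.5 - 3.8, 3.5 - 3.3), 2))  # 3.9
```

In practice you would raise `min_ratings` so that a user or product with only a rating or two doesn't get an offset built on noise.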

Of course, this nuance has even further nuance. It's a bit hand-wavy to say 'baseline is average across all users or products.' There may be category-specific characteristics that are a better way to develop a baseline. For example, maybe a new laptop should be judged against the baseline of all laptops, and not all products. This is particularly important if you're, say, Amazon and have everything from crayons to chandeliers.
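To make the category idea concrete, here is a small sketch comparing a product's offset against the global baseline with its offset against its own category's baseline. The product names, categories, and ratings are invented for illustration, and the offset again follows the baseline-minus-median convention used earlier.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (product, category, rating) triples, made up for illustration.
ratings = [
    ("laptop_a", "laptops", 5.0), ("laptop_a", "laptops", 4.0), ("laptop_a", "laptops", 5.0),
    ("laptop_b", "laptops", 4.0), ("laptop_b", "laptops", 4.0), ("laptop_b", "laptops", 3.0),
    ("crayon_a", "crayons", 3.0), ("crayon_a", "crayons", 2.0), ("crayon_a", "crayons", 3.0),
    ("crayon_b", "crayons", 2.0), ("crayon_b", "crayons", 3.0), ("crayon_b", "crayons", 2.0),
]

by_product = defaultdict(list)
by_category = defaultdict(list)
category_of = {}
for product, category, rating in ratings:
    by_product[product].append(rating)
    by_category[category].append(rating)
    category_of[product] = category

# One baseline over everything versus one baseline per category.
global_baseline = median(rating for _, _, rating in ratings)
category_baseline = {c: median(scores) for c, scores in by_category.items()}

global_offset = {p: global_baseline - median(s) for p, s in by_product.items()}
category_offset = {
    p: category_baseline[category_of[p]] - median(s) for p, s in by_product.items()
}

print(global_offset["laptop_b"])    # -1.0
print(category_offset["laptop_b"])  # 0.0
```

Against all products, `laptop_b` looks well above normal; against other laptops, it is exactly typical. Which baseline is appropriate depends on your catalog.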

Normalization is just one of many ways to refine your model, and I talk about others in the presentation. I'll also post the video from my talk when it's available.

PCA is a much-used and poorly understood way of reducing the number of features in your analysis. At this year's PyBay, I gave a lecture on the intuition behind PCA. What drew me to this lecture was the challenge of explaining something that appears mathematically intimidating in a more accessible way.

You can look through the talk for the lesson itself, but what I enjoyed about designing it was actively trying to stay away from equations and code. It's easy to fall back on yet another Python tutorial or yet another linear algebra explanation. Let's be honest, most newcomers to data science apply tools like PCA without fully understanding what they do and why. Most will tell you the following:

1) You reduce dimensions the same way you do feature selection, but without 'losing information' (whatever that means!)

2) You type in a line of code and voilà! Dimensions instantly reduced, and you plow ahead with your classification or clustering.

A deeper understanding helps you become a better data scientist. It's easy and tempting to plug and play code, but a good data scientist knows when not to use a tool:

First, a data scientist should understand the appropriate use of dimensionality reduction. In many cases, it's approached incorrectly - a user will say, "I want to project my data down to n dimensions." Instead, check your scree plot for the cumulative percent variance explained with each additional dimension, to make sure you're picking up an adequate amount of variance.
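I kept the talk itself free of code, but for readers who want to see the numbers behind that plot, here is a minimal NumPy sketch. The function names, the 0.95 threshold, and the synthetic data are assumptions of mine for illustration. It centers the data, gets the variance of each component from the singular values, and reports how many components are needed to reach the threshold, instead of fixing n up front.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative fraction of variance explained by successive components."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()

def components_needed(X, threshold=0.95):
    """Smallest number of components whose cumulative variance reaches threshold."""
    cumulative = cumulative_explained_variance(X)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))

# Synthetic data: 5 observed features driven by 2 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

cumulative = cumulative_explained_variance(X)
print(np.round(cumulative, 4))
print(components_needed(X))
```

The threshold is a judgment call, and 95% is only a placeholder here. The point is to let the variance tell you how many dimensions to keep.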

Second, PCA doesn't change your data. It's just a shift in perspective used to combat the curse of dimensionality. In my talk, I use the example of the duck/rabbit optical illusion to illustrate how a different perspective can make a projection look completely different.
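One way to convince yourself of this is to rotate some data into its principal components and then rotate it back. This is a small NumPy sketch on made-up random data, not code from the talk. It keeps every component, and in that case the original data comes back unchanged. Information is only given up when you choose to drop components afterwards.

```python
import numpy as np

# Made-up correlated data: 100 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))

mean = X.mean(axis=0)
# Rows of Vt are the principal directions of the centered data.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

scores = (X - mean) @ Vt.T   # the data seen from the new perspective
X_back = scores @ Vt + mean  # rotate back to the original axes

print(np.allclose(X, X_back))  # True
```

The total variance is also the same before and after the rotation. PCA just redistributes it so that the first few axes carry most of it.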

We comprehend dimensionality reduction every day - on our phones, televisions, laptops and movie theaters. An image is just a two-dimensional representation of a three-dimensional scene. Think about the picture below - it is clearly three-dimensional, but we have no trouble understanding what it conveys, as the third dimension, though useful, is not necessary for our understanding of what the image portrays. In PCA terms, we're able to collapse the data by removing one dimension without sacrificing the information we need.

Finally, it's important to note that you do lose interpretability for a non-data science audience. A simple way to explain it is that you separate the signal from the noise. However, it's much easier to explain a beta coefficient of a linear model than to explain the loadings that make up the eigenvector of each component. It's quite important to keep that in mind when choosing to perform dimensionality reduction on your model. For people in sensitive data environments (banking or healthcare), you may have to stick to basic feature selection.