Category Archives: Data Science

Last week I was invited to give a talk at the Big Data Innovation Summit in Boston. My colleague, Patrick Philips, and I gave a talk about how LinkedIn uses crowdsourcing to improve its machine learning algorithms and data products. Since this post is not about that talk, I will just mention that we got great feedback and you can watch it here.

Parallel to the data innovation summit, there was another conference held in Boston, the Sports Analytics Innovation Summit. Patrick and I are big sports fans, our work involves a lot of analysis, and we had a free pass because both conferences were organized by the same group, so we decided to drop by.

We weren’t wrong. It was incredibly interesting to see the difference between our world, that of data-product-oriented consumer internet companies, and the world of sports analytics. While the state of the art for internet companies is analyzing terabytes of data using distributed big data frameworks and applying machine learning algorithms such as deep neural networks, the state of the art for sports analytics is, to put it mildly, different.

One of the most informative sessions at the conferences was given by the data analysts of the National Football League. In this session, the people in charge of analytics for the NFL explained that since data gathering and cleaning is a task that needs to be performed by all NFL teams, the NFL has built a new platform for teams to consume this data.

Platform is a pretty big word. One might imagine this platform as something similar to Google Analytics, where a team’s coach could log in and view fancy graphs and charts about his team’s performance. It’s not exactly like that. It’s more like an Excel spreadsheet holding data very similar to what you might find on Yahoo! Sports or ESPN. Actually, it’s not just like an Excel spreadsheet; it is exactly an Excel spreadsheet, one that is emailed to the teams in the league every week.

The spreadsheet contains a lot of tabs and a lot of canned reports about which teams played, how the players performed, which players were on the field for any given play, and links to videos of the plays. Pretty straightforward stuff. But there was also something a little different there, something that immediately caught my eye because it was a data product, and one that is, in concept, very similar to many of the products we develop at LinkedIn. The product was named “Similar Running Backs”, which sounds a lot like the “Similar Profiles”, “Similar Companies” and “Similar Schools” LinkedIn data products.

The way the NFL analysts explained the idea behind Similar Running Backs is that every year the teams need to renegotiate contracts with their players. To make the negotiations (for contracts that sometimes exceed ten million dollars a year) efficient, it is very helpful for the teams to understand how similar players are compensated. So the league created this tool for the teams as part of its new platform, and the first version of the tool compared running backs.

Here is how it works: you select a player from a drop-down list and then select two numbers that represent the similarity range of the players you are looking for. The smaller the range, the fewer players will fit the criteria, and vice versa. For example, the values 95% and 105% will return the players whose regular season stats are between 95% and 105% of the corresponding statistics of the selected player.
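A minimal sketch of that range filter in Python might look like the following. The function name, the stat names and all the numbers are made up for illustration; the NFL's actual spreadsheet logic was not shown, so this is only one reading of the description above.

```python
def similar_players(stats, selected, low=0.95, high=1.05):
    """Return the players whose every stat lies between low and high
    times the corresponding stat of the selected player.

    Assumes all stats are non-negative, as rushing totals are here.
    """
    target = stats[selected]
    matches = []
    for name, values in stats.items():
        if name == selected:
            continue
        if all(low * target[k] <= values[k] <= high * target[k] for k in target):
            matches.append(name)
    return matches


# Hypothetical players and numbers, purely for illustration.
stats = {
    "Player A": {"games": 16, "attempts": 200, "yards": 900},
    "Player B": {"games": 16, "attempts": 205, "yards": 880},
    "Player C": {"games": 12, "attempts": 150, "yards": 600},
}

narrow = similar_players(stats, "Player A")           # 95% to 105%
wide = similar_players(stats, "Player A", 0.6, 1.4)   # 60% to 140%
```

With the narrow range only Player B qualifies; widening the range lets Player C in as well, which is the "smaller range, fewer players" behavior described above.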

Now let’s look at real data and see how the algorithm works. Note: for this analysis I only looked at players who played at least 10 games and had at least 10 rushing attempts in the 2012 season. Let’s see what data we have. The NFL used the following stats to assess running back similarity: number of games played, rushing attempts made, total rushing yards, rushing yards per game, rushing yards per attempt and rushing touchdowns.

Issue #1

Since the job of the running back is to carry the ball through the defensive line, which is basically a set of about seven 300 lb. guys, running backs tend to get injured a lot. This causes them to miss a lot of games and makes it hard to compare players who played every game in a season to players who played only some of them.

Solution #1

Normalize the data by the number of games played. That is, instead of counting total rushing attempts and total rushing touchdowns, use rushing attempts per game and rushing touchdowns per game.
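As a rough sketch, the per-game normalization could be written like this. The field names and the two example players are hypothetical, and the four output stats are an assumption about which rates are kept after this step.

```python
def per_game(player):
    """Convert season totals into per-game (and per-attempt) rates."""
    games = player["games"]
    return {
        "attempts_per_game": player["attempts"] / games,
        "yards_per_game": player["yards"] / games,
        "yards_per_attempt": player["yards"] / player["attempts"],
        "touchdowns_per_game": player["touchdowns"] / games,
    }


# Hypothetical players: same production rate, but one missed six games.
full_season = {"games": 16, "attempts": 320, "yards": 1280, "touchdowns": 8}
injured = {"games": 10, "attempts": 200, "yards": 800, "touchdowns": 5}

full_rates = per_game(full_season)
injured_rates = per_game(injured)
```

The season totals of these two players differ a lot, but their per-game rates come out identical, so an injury no longer hides the similarity. Because the analysis only keeps players with at least 10 games and 10 attempts, the divisions are safe.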

Now that we have the stats right, let’s try to find all the players within a +/-5% range on every statistic.

Issue #2

Not all stats are alike. For example, in 2012 the top player by rushing yards was Adrian Peterson with 131 yards per game, while the lowest player, Jorvorskie Lane, rushed for only 0.8 yards per game, about 160 times less. This means that the gaps between players in rushing yards per game can be very significant. In comparison, in the rushing yards per attempt category, the top player, Cedric Peerman, rushed for 7.2 yards per attempt, while the lowest player rushed for 1 yard per attempt, which is only 7.2 times lower. Since the spread between players on these two metrics is so different, it doesn’t make sense to treat a 5% difference on each of them as if it symbolized the same “similarity”. Being within 5% on rushing yards per game means two players are very similar, while being within 5% on rushing yards per attempt does not.

Solution #2

Normalize the data to have the same units of distance. What we want to do here is to transform our measurements to have the same range. One way to do so is to use the standard deviation. Standard deviation is a measure of how spread out our values are. Think of a bell curve: the wider the bell curve, the higher the standard deviation. We want the bell curves for all of our stats to look similar to each other. To accomplish this, we can normalize our data by the standard deviation. (This post is too short to explain why this concept works. Feel free to read more in this article.)
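Here is a small sketch of what dividing by the standard deviation does. Only the extreme values (131, 0.8, 7.2 and 1) come from the numbers quoted above; the values in between are invented, and whether the original analysis also subtracted the mean (a z-score) is not stated, so plain division is an assumption.

```python
from statistics import pstdev


def scale_by_std(values):
    """Divide every value in a column by that column's standard deviation."""
    sd = pstdev(values)
    return [v / sd for v in values]


# Extremes taken from the post; middle values are made up.
yards_per_game = [131.0, 60.0, 25.0, 0.8]
yards_per_attempt = [7.2, 4.5, 3.0, 1.0]

scaled_ypg = scale_by_std(yards_per_game)
scaled_ypa = scale_by_std(yards_per_attempt)

spread_before = (pstdev(yards_per_game), pstdev(yards_per_attempt))
spread_after = (pstdev(scaled_ypg), pstdev(scaled_ypa))
```

Before scaling, the two columns have very different spreads; afterwards both have a standard deviation of 1, so a given distance means the same thing on either stat.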

Now that we have the right stats and they are all comparable, we can start looking at which players are similar to each other. Remember, since we have only four stats to work with, the most similar players will have all four of their stats within 95%-105% of each other. Less similar players will match on only three stats, or two, and so on.
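The counting rule above can be sketched as a pairwise comparison that records how many stats fall inside the range. The player names and values are hypothetical, and the exact rule used to draw the original graphs is not given, so treat this as an assumption.

```python
from itertools import combinations


def matching_stats(a, b, low=0.95, high=1.05):
    """Count the stats on which b lies within [low, high] times a's value.

    Note that this check is measured relative to a, so it is not
    perfectly symmetric between the two players.
    """
    return sum(1 for x, y in zip(a, b) if low * x <= y <= high * x)


def similarity_edges(players, min_matches=3):
    """List the pairs of players that match on at least min_matches stats."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(players.items(), 2):
        count = matching_stats(a, b)
        if count >= min_matches:
            edges.append((name_a, name_b, count))
    return edges


# Hypothetical values: attempts/game, yards/game, yards/attempt, touchdowns/game.
players = {
    "Back A": (10.0, 40.0, 4.0, 0.20),
    "Back B": (10.2, 41.0, 4.1, 0.40),
    "Back C": (20.0, 90.0, 4.5, 0.50),
}

edges = similarity_edges(players)
```

Back A and Back B agree on three of the four stats and so form an edge; Back C matches neither of them on any stat.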

It appears there are no two players in the league who are very similar on all four stats, but there are some who share three. Here is a visualization of these similarities:

We can see from this graph that there are three pairs of players that are very similar to each other in their stats and a cluster of six players who are also similar to one another.

Issue #3

Darius Reynaud is similar to D.J. Ware, who is similar to Le’Ron McClain, who is similar to Jason Snelling, yet the first and the last are not very similar to each other. While neither had high output, Jason Snelling rushed for twice as much per game and per attempt as Darius Reynaud.

Issue #4

This similarity metric is too coarse. It’s all or nothing: either the players are within 5% of each other on most stats or they aren’t. This stays true even if we reduce the number of stats on which players have to be similar to two, as can be seen in this graph.

We still get only pairs of players who are similar to each other, but it is hard to see how they compare to other players.

Issue #5

This similarity product has too many levers. We need to tell it both the range that defines two players as similar and how many stats have to fall within that range.

Solution to #3, #4 and #5

We can provide a visualization that:

Displays all the players

Uses a continuous similarity metric where closer means more similar instead of the binary similar or not similar we used before

Doesn’t need any levers

To achieve that, we will just cluster all the players into groups and then display them on a chart where similar players are close to each other and dissimilar players are far apart.
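The post does not say which clustering algorithm was used, so the following is only a sketch under the assumption of plain k-means with Euclidean distance on the standardized stats. The players and their values are hypothetical, and the step that projects the players onto a flat chart is left out.

```python
import math


def initial_centroids(points, k):
    # Deterministic start: take the first point, then repeatedly add the
    # point that is farthest from every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, iterations=20):
    """Return a cluster label (0..k-1) for every point."""
    centroids = initial_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every player to the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the players assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Hypothetical standardized stats for four made-up running backs.
players = {
    "Back A": (2.0, 2.1, 1.0, 1.9),
    "Back B": (1.9, 2.0, 1.1, 2.0),
    "Back C": (0.2, 0.1, 0.5, 0.0),
    "Back D": (0.3, 0.2, 0.4, 0.1),
}

groups = dict(zip(players, k_means(list(players.values()), k=2)))
near = math.dist(players["Back A"], players["Back B"])
far = math.dist(players["Back A"], players["Back C"])
```

The distance itself is the continuous similarity metric: `near` is much smaller than `far`, and no range or stat count has to be supplied. The number of groups is still a choice, but it is the only one.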

Now we can see all the players in a single graph, separated into five groups. The red group (number 1) is the superstars, guys like Adrian Peterson, Marshawn Lynch and Arian Foster. These are the guys with the most rushing attempts, the highest yardage per game and by far the most touchdowns. The group closest to it, in magenta (number 5), is the second-tier guys. These are very productive running backs, just not as productive as the guys in the first group. But while these two groups are interesting, pretty much every football fan could break these players into these three buckets. What is more interesting is who makes up the other three groups.

The second magenta group (number 3) contains our least productive players. These players rushed for only 4 yards per game on average, with each attempt advancing them slightly more than 2 yards. The blue group (number 4) is made up of players who, while rushing for 5 times the yards per game and twice the yardage per attempt, managed to score about the same number of touchdowns, 0.07 a game. The green group is made up of players who are very similar to the blue group, only twice as effective at scoring touchdowns.

While this analysis does not provide a myriad of insights that are not already known to subject matter experts, it does provide a nice and robust framework for understanding player similarities at a single glance. Also, while it’s very easy to compare players according to only their rushing abilities, things become more complicated when we add more dimensions to look at, like fumbles and catches. Which running backs finished last season most similar to each other in terms of all these stats combined? This is a much harder question to answer, and I will let you guess in the comments, but the answer could once again be easily displayed as points on a flat surface.

A lot of articles have been written about data science being the greatest thing since sliced bread (including this one by me). Data products are the driving force behind new multi-billion dollar companies, and a lot of the things we do today on a day-to-day basis have machine learning algorithms behind them. But unfortunately, even though data science is a concept invented in the 21st century, in practice the state of data science is more similar to that of software engineering in the mid-20th century.

The pioneers of data science did a great job of making it very accessible and fairly easy to pick up, but since its beginning circa 2005, not much effort has been made to bring it up to par with modern software engineering practices. Machine learning code is still code, and like any software that reaches production environments it should follow standard software engineering practices, such as modularity, maintainability and quality (among many others).

The talk Scalable and Flexible Machine Learning, which I gave with Christopher Severs at multiple venues, reflected our frustration with this issue and proposed a solution that we feel brings us closer to where data science should be.

The first thing to understand is that data science is mostly manipulation of data. The data is often large and complex, but these manipulations are commonly found in non-data-science code bases as well. Moreover, since a data scientist often doesn’t know in advance which functionality achieves the desired result, for example whether the right metric is the mean or the median, the argument for modular code becomes much stronger.

Our second proposition was that the current tooling that data scientists use is inadequate for the type of systems that end up in production. Real-world data processing is in many cases at least as complicated as regular software tasks such as fetching data from a database, passing messages between mobile devices or throttling the bit rate of a streamed video, yet the tools being used for it are, at one end of the spectrum, the SQL-like Hive and the very basic scripting language Pig. Although these languages can be extended using user-defined functions, those functions are very ad hoc in nature and very difficult to reuse. At the other end of the spectrum, there is vanilla Java MapReduce, which consists of an awful lot of boilerplate code that has very little to do with the actual desired functionality.

We tried to propose a more modern alternative that combines the best of both worlds: as concise and high-level as Pig and Hive, yet with the power of a full programming language like Java. Our technology of choice was Scalding, which is a high-level abstraction framework over MapReduce written in Scala. It has all of the functionality of Pig and Hive, sometimes achieved with fewer lines of code. The fact that it is written in Scala, a modern language that runs on the Java Virtual Machine, is a great feature since all of the Java libraries can be reused. But Scala is not just a modern version of Java; it is a language designed for the functional paradigm.

I won’t go into what a functional programming language is in this post, or into all the features that make it better for data processing, but if I had to provide just a single piece of evidence to support my case, it would be that MapReduce is a very functional concept. Both map and reduce are higher-order functions, a cornerstone of functional programming. It is not a coincidence that Google chose this simple functional paradigm as the basis of its entire data processing framework.
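To make the "higher-order function" point concrete, here is a tiny single-machine illustration using Python's built-in map and functools.reduce. It is only an analogy for the idea, not Hadoop MapReduce or Scalding code, and the records are invented.

```python
from functools import reduce


def parse(line):
    """Map step: turn one raw record into a (player, yards) pair."""
    name, yards = line.split(",")
    return name, int(yards)


def add_to_totals(totals, pair):
    """Reduce step: fold one (player, yards) pair into the running totals."""
    name, yards = pair
    updated = dict(totals)
    updated[name] = updated.get(name, 0) + yards
    return updated


# Hypothetical carry-by-carry records.
records = ["Player A,12", "Player B,7", "Player A,30", "Player B,3"]

# Both map and reduce take another function as their first argument,
# which is what makes them higher-order functions.
totals = reduce(add_to_totals, map(parse, records), {})
```

The caller only supplies the two small functions; how the records are iterated (or, in a real cluster, distributed) is left to map and reduce themselves.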

Our last point addressed the tendency of many data scientists to overcomplicate their algorithms. These data scientists begin their search for a solution by reading all the literature on the topic of their problem. One of the problems with this approach is that it tends to overcomplicate a solution for the sake of a minor gain. The best example of this phenomenon is the Netflix Challenge. Netflix offered one million dollars to a team that would beat its in-house algorithm by at least 10%. The winning algorithm was so complicated that Netflix decided to pay the prize without actually implementing it. In many cases, the first algorithm can be an off-the-shelf algorithm that can be found in almost any machine learning package, for example PageRank for ranking or collaborative filtering for recommendations.

To sum it all up, if you are writing your data processing code in complete disregard of all the engineering principles of the last few decades and developing your algorithms according to the state of the art publications without trying the simple ones first – You’re Doing It Wrong!

My first position as a data scientist was in Israeli intelligence, in a unit that is the equivalent of the American National Security Agency (NSA). I do not know anything about PRISM, the NSA’s surveillance program, but my experience both in government positions and afterwards working for two big data companies helps me understand what drives the people who work on it.

The first thing to understand about this problem is that it is not a big data problem. It is a huge data problem. Words cannot describe the amounts of data we are talking about. Storing all the communication information from Google, Facebook and the wireless carriers would require multiple data centers, each the size of several football fields. Such amounts of data cannot be processed by people; instead, intelligence organizations rely on sophisticated algorithms to do most of the work of finding the needle in the haystack they are looking for.

There are a lot of different challenges intelligence agencies are trying to solve. For this post I will focus on a single challenge – which individuals present a threat to national security. This problem is referred to in the machine learning world as a classification problem. Classification is the problem of identifying to which category an observation belongs, for example: given the level of force applied to a car’s door handle determine whether the alarm should go off, given an image determine whether it contains a human face, or given a person’s communication records determine whether she poses a threat.

The way to build a good classification model is to have a good training set. A training set is a set of records for which you already know the answer. If we take the face recognition challenge, the training set will have some images that contain faces and some that do not. It is very easy to come up with an extensive training set for the face recognition challenge. Not so much for terrorists. There are just not that many cases of real threats to national security to learn from, and those that exist differ considerably from each other. A very small training set leads to a very inaccurate classification model.

In the case of national security, this is simply not good enough. Every threat that the model fails to detect might result in a very bad outcome. After 9/11 no one wants to be the person who let a terrorist attack happen on his watch. This concern drives data scientists to constantly try to improve their models. Unfortunately, it is not always possible to improve models using the same training data. So when data scientists exhaust their modeling capabilities, they turn to getting more data. But in the case of identifying people who don’t want to be identified, the most useful data is not the kind you would find in public records.

At a certain point, the fear of being responsible for people’s lives numbs the sense of right and wrong. People start to see the model and forget about its consequences. When every percentage point of improvement is another terrorist that will be detected before committing something terrible, privacy takes a backseat.

But privacy is not necessarily less important than safety. In some sense, privacy is safety and “National Security” is not a magic term that allows the government to trample over basic rights. In a democratic society, the government is required to do better than paternalize its people. Drastic times call for drastic measures, but these are not drastic times; so drastic measures call for drastic explanations.

Disclaimer: This post does not represent the official opinion of anyone but myself. I have no knowledge whatsoever of PRISM or equivalent programs. This post is just my educated guess about what is driving the people who work on it.

My romance with data science began when someone recommended the book “Moneyball” by Michael Lewis. If you haven’t read it, please do. At the very least, watch the movie. Moneyball is the story of the transformation of the Oakland A’s baseball team from one of the worst teams in baseball into a team that set the American League record of 20 wins in a row. One of the main reasons for this transformation was their reliance on statistics instead of the gut feeling and domain expertise by which baseball had always been managed. To understand how amazing this transformation is, let’s look at baseball numbers. The average team payroll is about 90 million dollars a year, which, counted in millions, is roughly the average number of wins in a season. This means that each win costs a baseball team about a million dollars. I don’t know how much the A’s paid Paul DePodesta, their data scientist, but I’m sure he was well worth it. Of course most data scientists do not work for sports teams, but the stakes are even higher with internet companies considering current valuations. Companies like Google, Facebook and my current employer, LinkedIn, have grown immensely and become Fortune 500 companies with 10 times fewer employees than other companies on that list. There are a multitude of reasons for these companies’ success, but their proficiency in handling data is surely one of them.

In 2009, Google released a paper titled The Unreasonable Effectiveness of Data, and though I agree with most of it, I would argue that the effectiveness comes first and foremost from the data scientists themselves. Most people see the role of the data scientist as one who takes a problem, gathers some data, applies some machine learning algorithm and gets results in the form of a chart. Let’s look at this process more closely.

Problems – More important than solving a data problem is finding the right data problem to solve. The right problem is one that can move the needle on a metric important to your business. However, most of the time, the business problem is not a data problem. Great data scientists use data to find a chain of causes and effects that leads them to a data problem whose solution also solves the business problem. For example,

Netflix wants to retain customers after their trial period – problem

Customers who watch more movies are more likely to sign up – cause/effect #1

Customers who discover good movies will watch more movies – cause/effect #2

Build a system to recommend movies to customers – solution

At first glance it doesn’t seem that building a recommender system for movies helps Netflix retain more customers, but the chain of causes and effects shows why it does. Using this technique not only helps to solve the problem, it also provides valuable data and insights to the rest of the company and significantly lowers the business risk of the project.

Data – Without context, data is just bytes on your hard drive. In the age of big data, people tend to measure companies by the amount of data they store, which is no more reasonable than measuring software by the number of lines of code in it. Not all data is created equal. A good data scientist knows the value of each data set in her possession; a great one will also know the value of those that are not, and how to get them. One of the best examples of this principle can be found in the Google Image Search project. The people who worked on this project realized that getting more labels on the images they had would yield better results than improving their machine learning algorithms. In 2006, Google released a game in which two players would receive the same image and their goal was to describe what they were seeing. If both players used the same word, they would get points, and Google would get an invaluable piece of information about the image. The conception of the game did not involve fancy PhD-level statistics, but a very clever sense of general problem solving.

Results – This is what truly matters. Michael Lewis did not write the book about the excellent analysis Paul DePodesta performed on his computer to prove that the way scouts analyze players is wrong. In fact, sabermetrics, the field of baseball analysis that Paul DePodesta based his analysis on, has existed since 1964, almost 40 years before Billy Beane, the A’s general manager, implemented it. The legend was born only after the analysis led to something that mattered: a record for most consecutive wins and a place in the playoffs despite the team losing its three best players at the beginning of the season. The same goes for data scientists: the work does not end once the analysis has been completed. In fact, it just begins. Great data scientists know the difference between theory and practice and will follow through on their ideas to see them to completion.

In summary, data science is a great tool to have in a company’s toolbelt and can have a disproportionate impact on its achievements. Great data scientists understand the business needs of a company, use data to find the best solution and make this solution a reality.