A melting pot of statistics, machine learning and data visualization

It’s no secret that I enjoy basketball, but I’ve often wondered about the carbon footprint of 30 teams each playing an 82-game season. Ultimately, that’s 2,460 flights across the whole of the USA, each carrying 30+ individuals.

For these reasons, I decided to investigate the average distance travelled by each NBA team during the 2013-2014 NBA season. In order to do so, I had to obtain the game schedule for the whole 2013-2014 season, but also the distances between arenas in which games are played. While obtaining the regular season schedule was straightforward (a shameless copy and paste), for the distance between arenas, I first had to extract the coordinates of each arena, which could be achieved using the geocode function in the ggmap package.

Once the coordinates of all NBA arenas were obtained, I could use this information to compute the pairwise distance matrix between the arenas. However, I first had to define a function to compute the distance between two latitude-longitude pairs.
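As a rough illustration of the distance step, here is a minimal Python sketch (the analysis itself was done in R). It assumes the haversine great-circle formula and a mean Earth radius of 6371 km. The names `haversine_km` and `distance_matrix` are hypothetical, and the coordinates are approximate city centres standing in for arenas, not geocoded arena locations.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_matrix(coords):
    """coords maps a name to (lat, lon); returns dist[a][b] in km for every pair."""
    return {a: {b: haversine_km(*coords[a], *coords[b]) for b in coords}
            for a in coords}


# Approximate city-centre coordinates, used here only as stand-ins for arenas
coords = {
    "Portland": (45.52, -122.68),
    "Boston": (42.36, -71.06),
    "Miami": (25.76, -80.19),
}
dist = distance_matrix(coords)
print(round(dist["Portland"]["Boston"]), "km")
```

A nested dictionary keeps lookups readable (`dist["Portland"]["Boston"]`); with 30 arenas the full matrix has only 900 entries, so there is no need for anything more elaborate.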

Using the function above and the coordinates of NBA arenas, the distance between any two given NBA arenas can be computed with the following lines of code. Computing the distance matrix between all NBA arenas:

By performing this operation on all pairs of NBA teams, we can compute a distance matrix, which can be used in conjunction with the 2013-2014 regular season schedule to compute the total distance travelled by each NBA team. Finally, all that was left was to visualize the data in an attractive manner. I find the googleVis package a great resource for that, as it provides a convenient interface between R and the Google Chart Tools API. Because wordpress.com does not support JavaScript, you can view the interactive graph by clicking on the image below.

Total distance (in km) travelled by all NBA teams during the 2013-2014 NBA regular season
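To make the schedule step concrete, here is a small Python sketch of how a distance matrix and an ordered list of game venues could be combined. It rests on assumptions that may not match the original analysis: the team starts at its home arena, flies directly from one game venue to the next, and flies home after its last game. The function name `total_travel_km` and the toy distances are made up for illustration.

```python
def total_travel_km(home, venues, dist):
    """Sum the legs of a season that starts at `home` and visits `venues` in order.

    venues: arena names where the team plays, in schedule order
            (its own arena for home games).
    dist:   nested dict with dist[a][b] = distance in km between arenas a and b.
    """
    total, current = 0.0, home
    for venue in venues:
        total += dist[current][venue]  # zero when the team stays in the same arena
        current = venue
    return total + dist[current][home]  # assumed flight home after the last game


# Toy symmetric distances in km (made-up numbers, not real arena distances)
dist = {
    "A": {"A": 0, "B": 100, "C": 250},
    "B": {"A": 100, "B": 0, "C": 200},
    "C": {"A": 250, "B": 200, "C": 0},
}
total = total_travel_km("A", ["A", "B", "C", "A"], dist)
print(total)
```

Real teams sometimes return home between road games or stay on the road for long trips, so the direct-leg assumption matters; changing it would change the totals.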

Incredibly, we see that the aggregate distance travelled by NBA teams amounts to 2,108,806 km! I hope the players have some kind of frequent flyer card… We can take this a step further by computing the amount of CO2 emitted by each NBA team during the 2013-2014 season. The NBA charters standard Airbus A319 planes, which according to the Airbus website emit an average of 9.92 kg of CO2 per km. Again, you can view the interactive graph of CO2 by clicking on the image below.

Total amount of CO2 (in kg) emitted by all NBA teams during the 2013-2014 NBA regular season
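The league-wide CO2 figure follows from multiplying the two numbers quoted above. The short Python sketch below does just that. It assumes a constant emission rate per kilometre and ignores take-off, landing and load effects, so it is a back-of-the-envelope estimate rather than a proper emissions model.

```python
TOTAL_KM = 2_108_806   # aggregate distance quoted in the post
KG_CO2_PER_KM = 9.92   # A319 emission figure quoted in the post


def co2_kg(distance_km, rate=KG_CO2_PER_KM):
    """CO2 in kg for a given distance, assuming a constant rate per km."""
    return distance_km * rate


total_co2_kg = co2_kg(TOTAL_KM)
print(f"{total_co2_kg:,.0f} kg of CO2 (about {total_co2_kg / 1000:,.0f} tonnes)")
```

The same function can be applied to each team's travelled distance to get the per-team breakdown.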

Not surprisingly, Oregon and California-based teams travel and pollute the most, since the NBA is mid-east / east coast heavy in its distribution of teams. It is somewhat ironic that the hipster / recycle-crazy / eco-friendly citizens of Portland also host the most polluting NBA team 🙂
What is also interesting is to plot the trail of flights (or pollution) left by NBA teams throughout the season.

Great circle maps of all airplane flights completed by NBA teams during the 2013-2014 regular season.

I’ve been thinking about designing an algorithm that finds the NBA season schedule with minimal carbon footprint, which is essentially an optimization problem. The only issue is that there are a huge number of constraints to consider, such as Christmas Day games, first day of season games, etc. More on that later.
As usual, all the relevant code for this analysis can be found on my github account.

While good draft picks and deft management can help you win championships, there is no doubt that NBA teams can massively gain, or lose, by trading players with one another. Here, I played around with some publicly available data given at basketball-reference.com, and had a look at the numbers behind all trades undertaken in the NBA from 1948 to present.

After some quick Python scraping and data cleaning, I first looked at the overall number of trades that were performed in the NBA during the period 1948-present.

Total number of trades completed by all NBA teams active between the 1948 and 2014 seasons.

Clearly, we see that the number of trades grows as we move along the years, which can probably be attributed to many factors such as the increasing ease of travelling/mobility and the growing number of teams in the NBA (of course, the 2014 season is still ongoing, so the numbers are not all in yet!). Next, I set out to look at whether any NBA teams showed preferential attachment with one another, i.e. do any NBA teams show a preference towards trading with one another rather than with other teams? This could easily be summarized by constructing an adjacency matrix M of dimension N x N (where N is the number of NBA teams), in which each cell M(i, j) gives the number of trades made between teams i and j. For simplicity, I restricted the analysis to teams that are currently active.
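For readers who want to see what such a matrix looks like in practice, here is a minimal Python sketch. It assumes each trade has already been reduced to a pair of team abbreviations. The function name `trade_adjacency` and the trade records are made up for illustration, and multi-team trades would need to be split into pairs first.

```python
def trade_adjacency(trades, teams):
    """Build a symmetric N x N matrix M where M[i][j] counts trades between teams i and j."""
    index = {team: k for k, team in enumerate(teams)}
    n = len(teams)
    M = [[0] * n for _ in range(n)]
    for a, b in trades:
        # Skip trades involving a team outside the list (e.g. inactive franchises)
        if a in index and b in index and a != b:
            i, j = index[a], index[b]
            M[i][j] += 1
            M[j][i] += 1
    return M


teams = ["BOS", "LAL", "NYK"]
# Made-up trade records, not real data
trades = [("BOS", "LAL"), ("LAL", "BOS"), ("NYK", "BOS"), ("SEA", "LAL")]
M = trade_adjacency(trades, teams)
for team, row in zip(teams, M):
    print(team, row)
```

Because a trade between i and j is the same event seen from both sides, the matrix is symmetric and its diagonal stays at zero.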

While the plot above is pretty(ish), it is not very informative. The data is a lot more instructive if visualized as a mixed correlation plot (use the corrplot package!… we’ll also conveniently ignore the obvious caveat that I have not normalized the data for how long each team has been in the NBA….)

Adjacency matrix of number of trades completed between all pairs of currently active NBA teams. This includes all historical trade data from 1948 to present.

We can take this a step further and ask ourselves which teams have had the most success with trades, and also what are the best individual trades ever performed in the history of the NBA? For this, I collected the win share data associated with each trade. The win share (WS) metric is an estimate of the number of wins produced by a player, and is a good way to determine how many victories an NBA player contributed to his team during his tenure there (more details can be found here). By computing the differential win share per trade (WS gained in trade minus WS lost in trade), it is possible to gain an insight into the quality of each trade.
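The differential is simple enough to sketch in a few lines of Python. The example below assumes each trade is recorded from one team's point of view as (team, WS gained, WS lost). The function name `mean_ws_differential` and all numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_ws_differential(trades):
    """trades: iterable of (team, ws_gained, ws_lost); returns team -> mean differential."""
    diffs = defaultdict(list)
    for team, gained, lost in trades:
        diffs[team].append(gained - lost)  # differential win share for this trade
    return {team: mean(values) for team, values in diffs.items()}


# Made-up win share values, not real trade data
trades = [("LAL", 50.0, 10.0), ("LAL", 5.0, 15.0), ("DAL", 20.0, 20.0)]
result = mean_ws_differential(trades)
print(result)
```

Keeping the full list of differentials per team, rather than only the mean, is what makes it possible to plot the whole distribution for each franchise.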

Distribution of win shares gained or lost by each team in the NBA. This includes all historical trade data from 1948 to present.

In the plot above, I marked inactive teams with a hyphen. We can see that the three currently active teams with the highest mean win share per trade are the LA Lakers, Dallas Mavericks and LA Clippers. In terms of win shares, the three greatest trades ever completed were:

Ultimately, I would like to create an interactive version of the plot above, so that details of each trade appear when the mouse is hovered over any given point. This is currently a work in progress that I will eventually publish here. All the relevant code for this analysis can be found on my github account.

NBA basketball is one of the sports I enjoy watching the most. As I was ordering my (undisclosed amount)th beer while watching a game during after-work hours, it occurred to me how often I had seen sparsely populated arenas during games, with large areas of seats going unoccupied. This got me thinking about the average fan attendance for NBA teams, what factors could be influencing attendance, and ultimately, which NBA team had the most loyal fans.

After some online browsing, Python scraping and data cleansing, I was able to obtain a good amount of data from the awesome guys at basketball-reference.com. Unfortunately, I could not find any records of fan attendance before 1981, so this analysis will be restricted to the period from 1981 to 2013 (with records for 2002-2006 also missing). First, I wanted to see if there were any trends in NBA fan attendance per season.

Fan attendance for each NBA team during the seasons 1981 to 2013. Years marked with a red asterisk represent seasons shortened by a lockout. Data for the years 2002 to 2006 were not available.

The two most striking features of the plot above are the obvious increase in fan attendance from 1981 to 1995, and the stagnation thereafter. This makes sense, since this period is widely regarded as the golden era and renaissance of basketball, full of rivalries and Hall of Fame players in their prime. Unsurprisingly, the years 1999 and 2012, which were both shortened by ~4 months due to a lockout, saw a drop in total fan attendance (purely as a result of fewer games being played; if I were more rigorous, I would normalize for this and also the overall US population, but I wanted to visualize the raw numbers).

Next, I investigated whether team success (the net number of wins per season) during a season could be an indicator of fan attendance. Not surprisingly, teams that won more also attracted more fans (doh!). This was true regardless of the conference in which the team was (East or West).

Fan attendance as a function of number of wins for all NBA teams during the period of 1981-2013

I also looked at whether fans were more attracted by teams that scored a lot, or by teams that put an emphasis on defense. However, I had to consider historical trends in scoring, and adjust for the fact that defenses/offenses have gotten more sophisticated over time. Therefore, I decided to look at the fan attendance numbers of each NBA team during a given season, and plot that as a function of the team’s deviation from the median number of points scored by all teams during that season. The plot below shows the aggregate of all points after considering each individual season between 1981 and 2013. Interestingly, although teams that score more attract more fans, it seems that good defense is even more likely to attract crowds.

Fan attendance as a function of the number of points scored for and against the home team. To adjust for the variability in offensive/defensive points scored at each season, the attendance numbers are plotted against the home team’s deviation from the season average.
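The per-season adjustment described above can be sketched in Python as follows. It assumes one row per team and season, and it uses the median as the reference point, as the text describes (the caption mentions the season average, which would work the same way). The function name `deviation_from_season_median` and the point totals are invented for illustration.

```python
from collections import defaultdict
from statistics import median


def deviation_from_season_median(rows):
    """rows: list of (season, team, points); returns (season, team, deviation) tuples."""
    by_season = defaultdict(list)
    for season, _team, points in rows:
        by_season[season].append(points)
    medians = {season: median(values) for season, values in by_season.items()}
    # Subtract each season's own median so different eras become comparable
    return [(season, team, points - medians[season]) for season, team, points in rows]


# Made-up points-per-game values, not real data
rows = [
    (1985, "A", 110.0), (1985, "B", 100.0), (1985, "C", 105.0),
    (2005, "A", 95.0), (2005, "B", 99.0), (2005, "C", 91.0),
]
deviations = deviation_from_season_median(rows)
print(deviations)
```

Applying the same function to points conceded gives the defensive counterpart, so both can be plotted against attendance on a common scale.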

Of course, the caveat of the above plot is that teams that score a lot and/or defend well are more likely to win, and thus attract more fans. Indeed, winning teams usually develop bandwagon fans and thus inflate their attendance numbers. Therefore, I sought to find out who were the most loyal fans in the NBA. In my mind, the mark of a truly loyal fanbase is one that shows up to support its team regardless of win/loss ratio. For these reasons, I plotted the fan attendance of each NBA team normalized per number of wins.
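One way to express this loyalty measure is attendance divided by wins, as in the Python sketch below. The function name `attendance_per_win` and the figures are hypothetical. The ratio is undefined for a winless team, which the sketch simply rejects.

```python
def attendance_per_win(attendance, wins):
    """Total attendance divided by number of wins; higher suggests a more loyal fanbase."""
    if wins <= 0:
        raise ValueError("wins must be positive for the ratio to be defined")
    return attendance / wins


# Made-up totals: team -> (total attendance, total wins)
totals = {"X": (600_000, 20), "Y": (700_000, 50), "Z": (650_000, 26)}
ranking = sorted(totals, key=lambda team: attendance_per_win(*totals[team]), reverse=True)
print(ranking)
```

A team that keeps filling seats while losing scores high on this ratio, which is the behaviour the measure is meant to reward.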

And so the most loyal fanbases are the good people of Memphis, Minnesota and Toronto!

I will add all the relevant code to my github account soon (basically as soon as I’ve commented it!)

This post is going to differ slightly from the data-orientated material that I usually publish. I was recently playing around with the Google Trends API and came across some very interesting… well… trends. There has definitely been a huge amount of publicity surrounding “Big Data”, maybe even too much. For those of us who have been working in academia, large datasets were becoming a natural day-to-day occurrence that, in my opinion, was a byproduct of the ever-increasing computational power at our disposal. While there is no doubt that we have arrived in an era in which diverse data can be continuously collected in large volumes, this will only be of any use if statistically and computationally savvy individuals are put to the task of analyzing and retrieving the most relevant elements of the data. Here, I will show just how much Big Data and Data Science have been on everyone’s mind for the last few years.

A search for Big Data reveals the sharp growth in searches from 2011 onwards:

Google trend data for searches of the phrase “Big Data”.

We can also see the origin of these searches across the globe at the level of countries and cities, with particularly strong clusters located in India and South Korea.

Geotagging of countries where Google searches involving the phrase “Big Data” were made.

Geotagging of cities where Google searches involving the phrase “Big Data” were made.

Alongside the rise of Big Data came the acknowledgement that data scientists would be required to analyze this data, which was reflected by the sharp increase in searches for “data scientist” and also “data scientist jobs”.

Google trend data for searches of the phrase “Data Scientist”.

Google trend data for searches of the phrase “Data Science Jobs”.

Clearly, the interest is there and doesn’t appear to be slowing down for now. On that note, I would love to start a project to predict whether a trend is here to stay based on historical Google Trends data. That would be a neat little side-project!

I have been watching the awesome Netflix show “House of Cards” and have been fascinated by the devious schemes that Underwood is constantly plotting. The show often mentions approval ratings, and it got me wondering what Obama’s ratings currently were, and those of all past US presidents for that matter. However, I didn’t have much luck finding publicly available data that was a) easily accessible and b) free (granted, I was quite lazy in my search).

Ultimately, I resorted to scraping the Roper Center website for the data that I needed. Below is the distribution of approval ratings for each president from Roosevelt (when records began) to Obama.

JFK, Bush-Sr and Eisenhower rank as the top three presidents with the highest approval ratings during their tenure in the presidential office. However, the variation in ratings for Bush-Sr was considerably larger. Similarly, Truman and Bush-Jr have large variance in their ratings, but were also the two most unpopular presidents. As we can also see, Obama does not rank very high amongst presidential approval ratings, with only four other presidents having lower ratings (although it should be noted that Obama still has three remaining years to bump up his average).

Below is the breakdown of approval ratings for each individual president. Note some of the sharp peaks that we see for some presidents, like the spike in approval ratings for Bush-Jr after the 9/11 tragedy; or the drop in ratings for Bush-Sr and Nixon after the start of the Iraq war and Watergate, respectively.

I recently discovered the Capitol Words API and have had some fun playing around with it. One of the categories in the API allows you to search for the words spoken by the senators of each state in the USA, and I was interested in finding out the number of times the word “gun” was recorded in a state bill between January 2012 and December 2013.

As we can see, the most densely populated states of New York, California, Illinois and, to a lesser extent, Texas, mention the word “gun” the most often. It is interesting (but not surprising) to note that the more Republican and pro-gun Midwestern states are conspicuously quiet about mentioning guns. We can also track the monthly frequency with which the word “gun” was mentioned in state bills between January 2012 and December 2013:

The sharp peak we observe across many states in April 2013 illustrates the national response and outrage that followed the tragic Boston marathon bombing and subsequent shootings. We can also see that the state of California shows some peaks in February, June and November 2013, which can be associated with the Christopher Dorner shooting, the June 7 Santa Monica shooting and the November 1 LAX shooting.

Finally, we can explore the underlying relationship between references to “education” in state bills and those to “gun” and “shooting”. Again, the obvious outliers are Connecticut, California and Illinois, which all refer to education an unusually large number of times. Interestingly, if these three outliers were removed, we could argue that a decent linear fit (with positive coefficient) could be achieved between the number of times the word “education” is stated in a bill and that of “gun” and “shooting”. In that case, we could interpret this as education being mentioned as a result of gun crime and shootings (a causal analysis will be in order for future work, namely finding the average lag time between shooting events and the reaction of statesmen).

Relationship between the number of times the words “shooting” and “education” were mentioned in state bills between January 2012 and December 2013

Relationship between the number of times the words “gun” and “education” were mentioned in state bills between January 2012 and December 2013
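The "remove the outliers, then fit a line" idea can be sketched in Python with an ordinary least-squares fit. Everything here is illustrative: the function name `fit_line`, the state labels and the word counts are invented, and the outlier set is chosen by hand rather than by any statistical criterion.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up (gun mentions, education mentions) per state; "OUT" plays the outlier
counts = {"S1": (10, 25), "S2": (20, 45), "S3": (30, 65), "OUT": (15, 400)}
outliers = {"OUT"}  # hand-picked, mirroring the visual judgement in the text

kept = [pair for state, pair in counts.items() if state not in outliers]
slope, intercept = fit_line([x for x, _ in kept], [y for _, y in kept])
print(f"education = {slope:.2f} * gun + {intercept:.2f}")
```

A positive slope on the remaining points would be consistent with the interpretation above, though it says nothing about causality.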

Although I have a heavy background in statistics (and therefore am primarily an R user), I find that my overall knowledge of computer science is generally lacking. Therefore, I have recently delved deeper into learning about data structures, their associated ADTs and how they can be implemented. At the same time, I am using this as an opportunity to play around with more unfamiliar languages like Julia and, to a lesser extent, Python.

In Part 1 of 2 of this series, I investigated some of the properties of dynamic arrays in R, Python and Julia. In particular, I was interested in exploring the relationship between the length of an array and its size in bytes, and how this was handled by different languages. To that end, I wrote extremely simple functions that repeatedly appended an integer (here, the number 1) to a vector (or one-dimensional array) and extracted its size in bytes at each step.
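A Python version of such a function might look like the sketch below. It assumes CPython and uses `sys.getsizeof`, which reports the size of the list object itself (its header plus its array of references), not the integers it points to. The function name `list_sizes` is a placeholder. The exact byte counts depend on the interpreter version and platform, so they may differ from the numbers quoted in this post.

```python
import sys


def list_sizes(n):
    """Append the integer 1 to a list n times, recording the list's size in bytes each time."""
    values = []
    sizes = [sys.getsizeof(values)]  # size of the empty list
    for _ in range(n):
        values.append(1)
        sizes.append(sys.getsizeof(values))
    return sizes


sizes = list_sizes(100)
print(sizes[:10])
```

Plotting `sizes` against the list length shows a staircase: the size stays flat for several appends and then jumps when the list is reallocated.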

A call to each of these functions using n=100 yields the following plot.

The results of this experiment are quite striking. Python starts with an empty array of size 72 bytes, which increases to 104 as soon as an element is appended to the array. Therefore, Python automatically adds (104-72) = 32 = 4 x 8 bytes. This means that Python extends the array so that it can store four object references, three more than are actually needed. By the 5th insert, we have added (136-72) = 64 = 8 x 8 bytes. The pattern continues, with the array growing to hold 4, 8, 16, 25, 35, 46, 58, 72 and 88 references. Interestingly, the position i after which the array is next extended can be read off its current size through the relation i = (size_in_bytes_at_i - 72) / 8.
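The relation between size and capacity can be checked with a short Python sketch. It assumes CPython, where a list's reported size is a fixed overhead plus one pointer per allocated slot. Rather than hard-coding 72 and 8, it measures the empty-list size and the pointer width on the running interpreter. The helper names `capacity` and `resize_points` are hypothetical.

```python
import struct
import sys

POINTER_BYTES = struct.calcsize("P")  # width of one object reference on this platform
EMPTY_BYTES = sys.getsizeof([])       # fixed overhead of an empty list


def capacity(lst):
    """Number of reference slots currently allocated for lst (CPython-specific)."""
    return (sys.getsizeof(lst) - EMPTY_BYTES) // POINTER_BYTES


def resize_points(n):
    """(length, capacity) pairs at each append that changed the allocated capacity."""
    values, points, last = [], [], capacity([])
    for _ in range(n):
        values.append(1)
        cap = capacity(values)
        if cap != last:
            points.append((len(values), cap))
            last = cap
    return points


points = resize_points(100)
print(points)
```

Each pair shows the list length at which a reallocation happened and the capacity it grew to. The precise growth sequence varies between Python versions, but the capacity should never be smaller than the length.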

For R, I found this link, which does a great job of explaining the mechanisms behind the memory management of R vectors.

Finally, it appears that the push!() function in Julia does not perform any kind of preemptive dynamic reallocation of memory. However, I observed that repeatedly calling push!() was fast (more on that later) but not constant time. Further investigation led me to the C source for appending to arrays in Julia, which suggests that Julia performs an occasional exponential reallocation of the buffer (although I am not sure, so please correct me if I am wrong!).

Next, I looked at the efficiency of each language when dealing with dynamic arrays. In particular, I was interested in how quickly each language could append values to an existing array.

The results clearly show that Julia is far quicker than both Python and R. It should be noted that when defining a function in Julia, the first pass will actually compile and run. Therefore, subsequent calls are generally faster than the first (an important fact to consider during performance benchmarking). Overall, it seems that the statements made by the Julia community are true in this context, namely Julia is a lot faster than Python and R (will add a C++ benchmark eventually 🙂 )
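For the Python side, a timing harness along these lines would do the job. This is a sketch rather than the benchmark behind the results above: the function names `append_n` and `time_append` are placeholders, and it takes the best of several runs so that one-off costs (warm-up in Python, or the first compiled call in a language like Julia) do not dominate the measurement.

```python
import timeit


def append_n(n):
    """Append the integer 1 to an initially empty list n times."""
    values = []
    for _ in range(n):
        values.append(1)
    return values


def time_append(n, repeats=5):
    """Best-of-`repeats` wall-clock time in seconds for n appends."""
    return min(timeit.repeat(lambda: append_n(n), number=1, repeat=repeats))


elapsed = time_append(100_000)
print(f"100,000 appends took {elapsed:.4f} s (best of 5)")
```

Absolute timings depend heavily on the machine and interpreter version, so only comparisons made on the same setup are meaningful.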