Fresher than seeing your model doesn't have heteroscedastic errors

While I am fairly confident I am the only member of the dataists that is a sports fan (Hilary has developed a sports filter for her Twitter feed!), I am certain I am the only America football fan. For those of you like me, this past weekend marked the conclusion of another NFL season and the beginning of the worst stretch of the year for sports (31 days until Selection Sunday for March Madness).

To celebrate the conclusion of the season I—like many others—went to a Super Bowl party. As is the case with many Super Bowl parties, the one attended had a pool for betting on the score after each quarter. For those unfamiliar with the Super Bowl pool, the basic idea is that anyone can participate in the pool by purchasing boxes on a 10×10 grid. This works as follows: when a box is purchased the buyer writes his or her name in any available box. Once all boxes have been filled in, the grid is then labeled 0 through 9 in random order along the horizontal and vertical axis. These numbers correspond to score tallies for both the home and away teams in the game. Below, is an example of such a pool grid from last year’s Super Bowl.

After the end of each quarter of the game, whichever box corresponds to the trailing digit of the scores for each team wins a portion of the pot in the pool. For example, at the end of the first quarter of Super Bowl XLIV the Colts led the Saints 10 to 0. From the above example, Charles Newkirk would have won the first quarter portion of the pot for having the box on zeroes for both axes. Likewise, the final score of last year’s Super Bowl was New Orleans, 31; Indianapolis, 17. In this case, Mike Taylor would have won the pool for the final score in the above grid with the box for 1 on the Colts and 7 for the Saints.

This weekend, as I watched the pool grid get filled out and the row and column numbers drawn from a hat, I wondered: what combinations give me the highest probability of winning at the end of each quarter? Thankfully, there’s data to help answer this question, and I set out to analyze it. To calculate these probabilities I scraped the box scores from all 45 Super Bowls on Wikipedia, and then created heat maps for the probabilities of winning in each box based on this historical data. Below are the results of this analysis.

First Quarter

Half Time

Third Quarter

Final

The results are an interesting study in Super Bowl scoring as the game progresses. You have the highest chance of winning the first quarter portion of the pool if you have a zero box for either team, and the highest overall chance of winning anything of you have both zeroes. This makes sense, as it is common for one team to go scoreless after the first quarter. After the first quarter, however, you winning chances become significantly diluted.

Into half time having a zero box is good, but having a seven box gives you nearly the same chance of winning. Interestingly, into the third quarter it is best to have a seven box for the home team, while everything else is basically a wash. With the final score, everything is basically a wash, as teams are given more opportunity to score and thus adding variance to the counts for each trailing digit. That said, by a very slight margin having a seven box for either the home or away team provides a better chance of winning the final pool.

So, next year, when you are watching the numbers get assigned to the grid; cross your fingers for either a zero or a seven box. If you happen to draw a two, five, or six, consider your wager all but lost. Incidentally, one could argue that a better analysis would have used all historic NFL scoring. Perhaps, though I think most sports analysts would agree that the Super Bowl is a unique game situation. Better, therefore, to focus only on those games, despite the small-N.

Finally, the process of doing this analysis required mostly heavy-lifting on data wrangling; including, scraping the data from Wikipedia, then slicing and counting the trailing digits of the box scores to produce the probabilities for each cell. For those interested, the code is available on my github repository.

There are two R functions used in this analysis, however, that may be of general interest. First, a function that converts an integer into its Roman Numeral equivalent, which is useful when scraping Wikipedia for Super Bowl data (graciously inspired by the Dive Into Python example).

Second, the function used to perform the web scraping. Note that R’s XML package, and the good fortune that the Wikipedia editors have a uniform format for Super Bowl box scores, makes this process very easy.

How would you rank the popularity of a programming language? People often discuss which languages are the best, or which are on the rise, but how do we actually measure that?

One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has their advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish among language with a vocal minority versus those that are actually have large communities.

Solution: measure both, and compare.

This week John Myles White and I set out to gather data that measured both the number of projects using various languages, as well as their community sizes. While neither metric has a straightforward means of collection, we decided to exploit data on Github and StackOverflow to measure each respectively. Github provides a popularity ranking for each language based on the number of projects, and using the below R function we were able to collect the number of questions tagged for each language on StackOverflow.

The above chart shows the results of this data collection, where high rank values indicate greater popularity, i.e., the most popular languages on each dimension are in the upper-right of the chart. Even with this simple comparison there are several items of note:

Metrics are highly correlated: perhaps unsurprisingly, we find that these ranks have a correlation of almost 0.8. Much less clear, however, is whether extensive use leads to large communities, or vice-a-versa?

Popularity is tiered: for those languages that conform to the linear fit, there appears to be clear separation among tiers of popularity. From “super-popular” cluster in the upper-right, to the more specialized languages in the second tier, and then those niche and deprecated languages in the lower-left.

What’s up with VimL and Delphi?: The presence of severe outliers may be an indication of weakness in these measures, but they are interesting to consider nonetheless. How is it popular that VimL could be the 10th most popular language on Github, but have almost no questions on StackOverflow? Is the StackOverflow measure actually picking up the opaqueness of languages rather than the size of their community? That might explain the position of R.

We Dataists have a much more shallow language toolkit than is represented in this graph. Having worked with my co-authors a few times, I know we primarily stick to the shell, Python, R stack; and to a lesser extent C, Perl and Ruby, so it is difficult to provide insight as to the position of many of these languages. If you see your favorite language and have a comment, please let us know.

Yesterday the good folks at IA Ventures asked me to lead off the discussion of data visualization at their Big Data Conference. I was rather misplaced among the high-profile venture capitalists and technologist in the room, but I welcome any opportunity to wax philosophically about the power and danger of conveying information visually.

I began my talk by referencing the infamous Afghanistan war PowerPoint slide because I believe it is a great example of spectacularly bad visualization, and how good intentions can lead to disastrous result. As it turns out, the war in Afghanistan is actually very complicated. Therefore, by attempting to represent that complex problem in its entirety much more is lost than gained. Sticking with that theme, yesterday I focused on three key things—I think—data visualization should do:

Make complex things simple

Extract small information from large data

Present truth, do not deceive

The emphasis is added to highlight the goal of all data visualization; to present an audience with simple small truth about whatever the data are measuring. To explore these ideas further, I provided a few examples.

As the Afghanistan war slide illustrates, networks are often the most poorly visualized data. This is frequently because those visualizing network data think it is a good idea to include all nodes and edges in the visualization. This, however, is not making a complex thing simple—rather—this is making a complex thing ugly.

Below is an example of exactly this problem. On the left is a relatively small network (V: ~2,220 and E:~4,400) with weighted edges. I have used edge thickness to illustrate weights, and used a basic force-directed algorithm in Gephi to position the nodes. This is a network hairball, and while it is possible to observe some structural properties in this example, many more subtle aspects of the data are lost in the mess.

On the right are the same data, but I have used information contained in the data to simplify the visualization. First, I performed a k-core analysis to remove all pendants and pendant chains in the data; an extremely useful technique I have mentioned several times before. Next, I used the weighted in-degree of each node as a color scale for the edges, i.e., the dark the blue the higher the in-degree of the node the edges connect to. Then, I simply dropped the nodes from the visualization entirely. Finally, I added a threshold weight for the edges so that any edges below the threshold are drawn with the lightest blue scale. Using these simple techniques the community structures are much more apparent; and more importantly, the means by which those communities are related are easily identified (note the single central node connecting nearly all communities).

To discuss the importance of extracting small information from large data I used the visualization of the WikiLeaks Afghanistan War Diaries that I worked on this past summer. The original visualization is on the left, and while many people found it useful, its primary weakness is the inability to distinguish among the various attack types represented on the map. It is clear that activity gradually increased in specific areas over time; however, it is entirely unclear what activity was driving that evolution. A better approach is to focus on one attack type and attempt glean information from that single dimension.

On the right I have extracted only the ‘Explosive Hazard’ data from the full set and visualized as before. Now, it is easy to see that the technology of IEDs were a primary force in the war, and as has been observed before, the main highway in Afghanistan significantly restricted the operations of forces.

Finally, to show the danger of data deception I replicated a chart published at the Monkey Cage a few months ago on the sagging job market for political science professors. On the left is my version of the original chart published at the Monkey Cage. At first glance, the decline in available assistant professorships over time is quite alarming. The steep slope conveys a message of general collapse in the job market. This, however, is not representative of the truth.

Note that in the visualization on the left the y-axis scales go from 450 to 700, which happen to be the limits of the y-axis data. Many data visualization tools, including ggplot2 which is used here, will scale their axes by the data limits by default. Often this is desirable; hence the default behavior, but in this case it is conveying a dishonest perspective on the job market decline. As you can see from the visualization on the right, by scaling the y-axis from zero the decline is much less dramatic, though still relatively troubling for those of us who will be going on the job market in the not distant future.

These ideas are very straightforward, which is why I think they are so important to consider when doing your own visualizations. Thanks again to IA Ventures for providing me a small soap box in front of such a formidable crowd yesterday. As always, I welcome any comments or criticisms.

Last Monday I—humbly—joined a group of NYC’s most sophisticated thinkers on all things data for a half-day unconference to help O’Reily organize their upcoming Strata conference. The break out sessions were fantastic, and the number of people in each allowed for outstanding, expert driven, discussions. One of the best sessions I attended focused on issues related to teaching data science, which inevitably led to a discussion on the skills needed to be a fully competent data scientist.

As I have said before, I think the term “data science” is a bit of a misnomer, but I was very hopeful after this discussion; mostly because of the utter lack of agreement on what a curriculum on this subject would look like. The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and where data science fits.

What is clear, however, is that one needs to learn a lot as they aspire to become a fully competent data scientist. Unfortunately, simply enumerating texts and tutorials does not untangle the knots. Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowdedmarket of ideas, I present the Data Science Venn Diagram.

How to read the Data Science Venn Diagram

On Monday we spent a lot of time talking about “where” a course on data science might exist at a university. The conversation was largely rhetorical, as everyone was well aware of the inherent interdisciplinary nature of the these skills; but then, why have I highlighted these three? First, none is discipline specific, but more importantly, each of these skills are on their own very valuable, but when combined with only one other are at best simply not data science, or at worst downright dangerous.

For better or worse, data is a commodity traded electronically; therefore, in order to be in this market you need to speak hacker. This, however, does not require a background in computer science—in fact—many of the most impressive hackers I have met never took a single CS course. Being able to manipulate text files at the command-line, understanding vectorized operations, thinking algorithmically; these are the hacking skills that make for a successful data hacker.

Once you have acquired and cleaned the data, the next step is to actually extract insight from it. In order to do this, you need to apply appropriate math and statistics methods, which requires at least a baseline familiarity with these tools. This is not to say that a PhD in statistics in required to be a competent data scientist, but it does require knowing what an ordinary least squares regression is and how to interpret it.

In the third critical piece—substance—is where my thoughts on data science diverge from most of what has already been written on the topic. To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. On the flip-side, substantive expertise plus math and statistics knowledge is where most traditional researcher falls. Doctoral level researchers spend most of their time acquiring expertise in these areas, but very little time learning about technology. Part of this is the culture of academia, which does not reward researchers for understanding technology. That said, I have met many young academics and graduate students that are eager to bucking that tradition.

Finally, a word on the hacking skills plus substantive expertise danger zone. This is where I place people who, “know enough to be dangerous,” and is the most problematic area of the diagram. In this area people who are perfectly capable of extracting and structuring data, likely related to a field they know quite a bit about, and probably even know enough R to run a linear regression and report the coefficients; but they lack any understanding of what those coefficients mean. It is from this part of the diagram that the phrase “lies, damned lies, and statistics” emanates, because either through ignorance or malice this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created. Fortunately, it requires near willful ignorance to acquire hacking skills and substantive expertise without also learning some math and statistics along the way. As such, the danger zone is sparsely populated, however, it does not take many to produce a lot of damage.

I hope this brief illustration has provided some clarity into what data science is and what it takes to get there. By considering these questions at a high level it prevents the discussion from degrading into minutia, such as specific tools or platforms, which I think hurts the conversation.

I am sure I have overlooked many important things, but again the purpose was not to be speific. As always, I welcome any and all comments.