But given this is live data that will change as more polls are added, I thought it best to use a plot that automatically updates and is interactive. So this gave me my first chance to use rCharts by Ramnath Vaidyanathan, as seen at October’s meetup.

There are still a lot of things I am learning, including how to use a categorical x-axis natively on line charts and how to insert chart titles. I found a workaround for the categorical x-axis by using tickFormat, but it is not pretty. I would also like to find a way to quickly switch between a line chart and a bar chart. Fitting more labels onto the x-axis or perhaps adding a scroll bar would be nice too.

The class starts with the very basics such as variable types, vectors, data.frames and matrices. After that we explore munging data with aggregate, plyr and reshape2. Once the data is prepared we will use ggplot2 to visualize it and then fit models using lm, glm and decision trees.

An often requested feature for Hadley Wickham's ggplot2 package is the ability to vertically dodge points, lines and bars. There has long been a function to shift geoms to the side when the x-axis is categorical: position_dodge. However, no such function exists for vertical shifts when the y-axis is categorical. Hadley usually responds by saying it should be easy to build, so here is a hacky patch.

All I did was copy the old functions (position_dodge, collide, pos_dodge and PositionDodge) and make them vertical by swapping y's with x's and height with width. It's hacky and not tested, but it seems to work, as I'll show below.

Compare that to the multiplot function in coefplot that was built using position_dodge and coord_flip.

multiplot(mod1, mod2, shorten = FALSE, names = c("Base", "Interaction"))

With the exception of the ordering and plot labels, these charts are the same. The main benefit here is that avoiding coord_flip still allows the plot to be faceted, which was not possible with coord_flip.

Hopefully Hadley will be able to take these functions and incorporate them into ggplot2.

Visually, we see that until 2011 the Giants preferred to run on first and second down. Third down is usually a do-or-die down so passes will dominate on third-and-long. The grey vertical lines mark Super Bowls XLII and XLVI.

A friend of mine has told me on numerous occasions that since 1960 the Yankees have not won a World Series while a Republican was President. Upon hearing this my Republican friends (both Yankee and Red Sox fans) turn incredulous and say that this is ridiculous. So I decided to investigate. To be clear, this in no way shows causality; it just checks the numbers.

The plot above shows every Yankee win (and loss) since 1960 and the party of the President at the time. It is clear that all nine Yankees World Series wins came while a Democrat inhabited the White House. The fluctuation plot below shows Yankee wins both before and after 1960, and the complete lack of a block for Republican/Post-1960 simply makes the case.

With tonight’s Mega Millions jackpot estimated to be over $640 million there are long lines of people waiting to buy tickets. Of course you always hear about the probability of winning which is easy enough to calculate: Five numbers ranging from 1 through 56 are drawn (without replacement) then a sixth ball is pulled from a set of 1 through 46. That means there are choose(56, 5) * 46 = 175,711,536 possible different combinations. That is why people are constantly reminded of how unlikely they are to win.
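The count quoted above can be checked in a couple of lines of base R. This is only a quick sketch of the arithmetic; the variable names are my own.

```r
# Ways to choose 5 distinct numbers from 1 through 56 (order does not matter)
white_ball_combos <- choose(56, 5)

# Each of those can be paired with any one of the 46 possible sixth balls
total_combos <- white_ball_combos * 46

# Probability that any single ticket hits the jackpot
p_win <- 1 / total_combos

# Print the total with thousands separators for readability
format(total_combos, big.mark = ",")
```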

But I want to see how likely it is that SOMEONE will win tonight. So let’s break out R and ggplot!

As of this afternoon it was reported (sorry no source) that two tickets were sold for every American. So let’s assume that each of these tickets is an independent Bernoulli trial with probability of success of 1/175,711,536.

Running 1,000 simulations we see the distribution of the number of winners in the histogram above.
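A simulation along these lines might look like the sketch below. The number of tickets is an assumption on my part: I take "two tickets for every American" to mean roughly 2 x 310 million, a population figure that does not come from the report. The seed and variable names are likewise just for illustration.

```r
set.seed(2012)  # arbitrary seed so the sketch is reproducible

# Probability that one ticket wins, from the combination count above
p_win <- 1 / (choose(56, 5) * 46)

# ASSUMPTION: about 310 million Americans, two tickets each
n_tickets <- 2 * 310e6

# Each simulation draws the number of winning tickets, treating every
# ticket as an independent Bernoulli trial with probability p_win
winners <- rbinom(n = 1000, size = n_tickets, prob = p_win)

# Distribution of the number of winners across the 1,000 simulations
table(winners)

# Share of simulations in which at least one ticket won
mean(winners >= 1)

# Analytical probability that at least one ticket wins: 1 - (1 - p)^n,
# computed with log1p/expm1 for numerical stability
p_someone <- -expm1(n_tickets * log1p(-p_win))
p_someone
```

The `winners` vector can be handed to ggplot2 (or `hist()`) to draw the histogram. Under the independence assumption the simulated share should land close to the analytical `p_someone`.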

With the Super Bowl only hours away now is your last chance to buy your boxes. Assuming the last digits are not assigned randomly you can maximize your chances with a little analysis. While I’ve seen plenty of sites giving the raw numbers, I thought a little visualization was in order.

In the graph above (made using ggplot2 in R, of course) the bigger squares represent greater frequency. The axes are labelled “Home” and “Away” for orientation, but in the Super Bowl that probably doesn’t matter too much, especially considering that Indianapolis is (Peyton) Manning territory so the locals will most likely be rooting for the Giants. Further, I believe Super Bowl XLII, featuring the same two teams, had a disproportionate number of Giants fans. Bias disclaimer: GO BIG BLUE!!!

Below is the same graph broken down by year to see how the distribution has changed over the past 20 years.

Fig. 1: This graph shows received and sent text messages by month. Notice the spike in July 2010.

A few weeks ago my iPhone for some reason erased ALL of my previous text messages (SMS and MMS) and it was as if I was starting with a new phone. After doing some digging I discovered that each time you sync your iPhone a copy of its text message database is saved on your computer which can be accessed without jailbreaking.

My original intent was to take the old database and union it with the new database for all the texting I had done since then, thus restoring all of my text messages. But once I got into the SQLite database I realized that I had a ton of information on my hands that was begging to be analyzed. It also didn’t hurt that I was in a lovely but small Vermont town for the week without much else to do at night.

My first finding, as seen above, is that my text messaging spiked after my girlfriend and I broke up around July of last year. Notice that for both years there is a dip in December. That’s because in 2009 I was in Burma during December and for 2010 the data stopped on December 6th when the last backup was made. A simple t-test confirmed that my texting did indeed increase after the breakup.

Fig. 2: This graph shows my text messaging pattern over time for both men and women. Notice the crossover around August 2010.

More interesting is that before my girlfriend and I broke up last year I texted more men than women, but shortly after we broke up that flipped. I don’t think that needs much of an explanation. The above graph and further analysis exclude her and family members because they would bias the gender effect. Being a good statistician I ran a Poisson regression to see if there really was a significant change. The coefficient plot below (which is on the logarithmic scale) shows that my texting with males increased after the breakup (or Epoch) by 74% (calculated by summing the coefficients for “Epoch”, “Male” and “Male:Epoch” and then exponentiating) while my texting with females increased 127%.

Fig. 3: Here the “Male” coefficient seems statistically insignificant but its direction makes sense so it is left in the model. The “Intercept” is interpreted as the texting rate with females before the breakup, “Epoch” is the increase with females after the breakup, “Intercept” plus “Male” is the rate with males before the breakup. “Epoch” combined with “Male:Epoch” is the change in rate for texts with males after the breakup.
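To make the interpretation in the caption concrete, here is a small sketch of a Poisson regression with the same terms. The counts are hypothetical numbers invented for illustration, not my texting data, and the data frame and variable names are assumptions.

```r
# HYPOTHETICAL counts, chosen only to make the arithmetic easy to follow
texts <- data.frame(
  Male  = c(0, 0, 1, 1),   # 0 = female, 1 = male
  Epoch = c(0, 1, 0, 1),   # 0 = before the breakup, 1 = after
  Count = c(40, 80, 50, 75)
)

# Poisson regression with an interaction, on the log scale
fit <- glm(Count ~ Male * Epoch, data = texts, family = poisson)
b <- coef(fit)

# "Epoch" alone is the change for females after the breakup
female_change <- unname(exp(b["Epoch"]) - 1)

# "Epoch" plus "Male:Epoch" is the change for males after the breakup
male_change <- unname(exp(b["Epoch"] + b["Male:Epoch"]) - 1)

c(female = female_change, male = male_change)
```

Because this toy model has one parameter per cell, the fitted changes simply reproduce the raw ratios (80/40 and 75/50). With real data the coefficients would also come with standard errors, which is what the coefficient plot displays.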

A great way to visualize the results of a regression is to use a Coefficient Plot like the one to the right. I’ve seen people on Twitter asking how to build this, and there has been an option available using Andy Gelman’s coefplot() in the arm package. Not knowing this I built my own (as seen in this post about taste testing tomatoes), and they both suffered from the same problems: long coefficient names often got cut off by the left margin of the graph, and the name of the variable was prepended to all the levels of a factor. One big difference between his and mine is that his does not include the Intercept by default. Mine includes the intercept with the option of excluding it.

I managed to solve the latter problem pretty quickly using some regular expressions. Now the levels of factors are displayed alone, without the factor name prepended. As for the former, I fixed that yesterday by taking advantage of ggplot by Hadley Wickham, which deals with the margins better than I do.
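The general idea of the regular expression fix can be sketched as follows. This is not the code from my function, just a minimal illustration using the built-in iris data; it assumes factor names contain no regex metacharacters.

```r
# Fit a model with a factor predictor; R prepends the factor name to each level
fit <- lm(Sepal.Length ~ Species, data = iris)
coef_names <- names(coef(fit))
coef_names

# Strip each factor's name from the start of the coefficient names.
# fit$xlevels is a named list with one entry per factor in the model.
clean_names <- coef_names
for (factor_name in names(fit$xlevels)) {
  clean_names <- sub(paste0("^", factor_name), "", clean_names)
}
clean_names
```

Anchoring the pattern with `^` keeps the substitution from touching a factor name that happens to appear in the middle of another coefficient's name.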

Both of these changes made for a vast improvement over what I had available before. Future improvements will address the sorting of the coefficients displayed and allow users to choose their own display names for the coefficients.

The function is in this file, is called plotCoef() and is very customizable, down to the color and line thickness. I kept my old version, plotCoefBase(), in the file in case some people are averse to using ggplot, though no one should be. I sent the code to Dr. Gelman to hopefully be incorporated into his function, which I’m sure gets used by a lot more people than mine will. Examples of my old version and of Dr. Gelman’s are after the break.