The intention behind doing the replication is two-fold: firstly, published papers presumably do “proper stats”, so by pinching the methodology, I can hopefully pick up some tricks about how you’re supposed to do this stats lark by apprenticing myself, via the web, to someone who has done something similar; secondly, it provides an answer scheme, of a sort: I can check my answers with the answers in the back of the book published paper. I’m also hoping that by working through the exercise, I can start to frame my own questions and start testing my own assumptions.

So… let’s finally get started. The data I’ll be using for now is pulled from my scraper of the FormulaOne website (personal research purposes, blah, blah, blah;-). If you’re following along, from the Download button grab the whole SQLite database and save it into whatever directory you’re going to be working in in R…

…because this is an R-based replication… (I use RStudio for my R tinkerings; if you need to change directory when you’re in R, setwd('~/path/to/myfiles')).

To load the data in, and have a quick peek at what’s loaded, I use the following recipe:

f1 = dbConnect(drv ="SQLite", dbname="f1com_megascraper.sqlite")
#There are a couple of ways we can display the tables that are loaded
#Directly:
dbListTables(f1)
#Or via a query:
dbGetQuery(f1, 'SELECT name FROM sqlite_master WHERE type="table"')

The last two commands each display a list of the names of the database tables that can be loaded in as separate R dataframes.

To load the data in from a particular table, we run a SELECT query on the database. Let’s start with grabbing the same data that the paper analysed – the results from 2009:

We’re now in a position whereby we can start to look at the data, and do some analysis of it. The paper shows a couple of example scatterplots charting the “qualifying” position against the final race position. We can go one better in single line using the ggplot2 package:

At first glance, this may all look very confusing, but just take your time, actually look at it, see it you can spot any patterns or emergent structure in the way the points are distributed. Does it look as if the points on the chart might have a story to tell? (If a picture saves a thousand words, and it takes five minutes to read a thousand words, the picture might actually take a couple of minutes to read get across the same information…;-)

First thing to notice: there are lots of panels – one for each race. The panels are ordered left to right, then down a row, left to right, down a row, etc, in the order they took place during the season. So you should be able to quickly spot the first race of the season (top left), the last race (at the end of the last/bottom row), and races in mid-season (middle of them all!)

Now focus on a single chart and thing about that the positioning of the points actually means. If the qualifying position was the same as the race position, what would you expect to see? That is, if the car that qualified first finished first; the car that qualified second finished second; the car that qualified tenth finished tenth; and so on. As the axes are incrementally ordered, I’d expect to see a straight line, going up at 45 degrees if the same space was given on each axis (that is, if the charts were square plots).

We can run a quick experiment to see how that looks:

#Plot increasing x and increasing y in step with each other
ggplot() + geom_point(aes(x=1:20, y=1:20))

Looking over the race charts, some of the charts have the points plotted roughly along a straight line – Turkey and Abu Dhabi, for example. If the cars started the race roughly according to qualifying position, this would suggest the races may have been rather processional? On the other hand, Brazil is all over the place – qualifying position appears to bear very little relationship to where the cars actually finished the race.

Even for such a simple chart type, it’s amazing how much structure we might be able to pull out of these charts. If every one was a straight line, we might imagine a very boring season, full of processions. If every chart was all over the place, with little apparent association between qualifying and race positions, we might begin to wonder whether there was any “form” to speak of at all. (Such events may on the surface provide exciting races, at least at the start of the season, but if every driver has an equal, but random, chance of winning, it makes it hard to root for anyone in particular, makes it hard to root for an underdog versus a favourite, and so on.)

Again, we can run a quick experiment to see what things would look like if each car qualified at random and was placed at random at the end of the race:

Note that I have: i) filtered the data to get just the data required for each race; ii) added a little bit more styling with the theme_bw() call; iii) used the gridExtra package to generate the side-by-side plot.

Here’s what the data looked like in the original paper:

NOT the same… Hmmm… How about if we plot the grid.int position vs. the final position? (Set x = grid.int and tweak the xlab()):

So – the “qualifying” position in the published paper is actually the grid position? Did I not read the paper closely enough, or did they make a mistake? (So much for academic peer review stopping errors getting through. Like, erm, the error in the visualiastion of the confidence interval that also got through? ;-)

Generating similar views for other years should be trivial – simply tweak the year selector in the SELECT query and then repeat as above.

In the limit, I guess everything could be wrapped up in a single function to plot either the facetted view, or, if a list of explicitly named races are passed in to the function, as a set of more specific race charts. But that is left as an exercise for the reader…;-)

Okay – enough for now… am I ever going to get even as far as blogging the correlation analysis., let alone the regressions?! Hopefully in the next post in this series!

PS I couldn’t resist… here’s the summary set of charts for the 2012 season, in respect of grid positions vs. final race classifications:

Was it as you remember it?!! Were Australia and Malaysia all over the place? Was there a big crash amongst the front runners in Belgium?!

PPS Rather than swap this blog with too many posts on this topic, I intend to start a new uncourse blog and post them there in future… details to follow…

Rate this:

Share this:

Like this:

Related

4 comments

Some not very stats-y responses – by the looks of the race charts, in both years, the first race was amongst the most random. My first question would be “is this because of changes to the car regulations?” and therefore are ther teams under/over-performing compared to past performance because their engineers have just got this one new thing right?

Also by the looks of them, most of them seem to show a positive correlation and it be interesting to see what the Pearson’s R value is for them, especially for betting purposes, because if you had a race circuit where the relationship between starting grid slot and final position was something like 0.99 in multiple years, you’d be far more comfortable putting money on the person on pole winning.

Your stats have indeed pulled out the huge crash at Belgium that took out Alonso, Hamilton, Perez and Grosjean and severely hampered Kobyashi and Maldonado. My race notes for Australia just say that it was exciting not why and Malaysian got hit by Biblical levels of rain which tends to add chaos.