Monday, December 31, 2012

Spreadsheets for dogsled races: calculating run times

Many dogsled race organizers have become very good at publishing the standings online as they come in from the checkpoints. The Top Of The World 350 has been an example of diligence, and having done this myself, I know it is not always an easy task.

But sometimes we want to use the data right away for our own purposes. Maybe we want the tables sorted by arrival in the next checkpoint rather than by departure from the previous one, or we're interested in run times and want to compare them. So I thought I'd give an example laying in three easy steps how I used my usual tool set this morning to calculate the TOTW mushers' run times between Chicken and the finish in Tok. In the end I'll also list a few alternatives for those who use other computer software.

Step 1: Copy the race stats table into a spreadsheet

I want to use the table entitled "Chicken to Tok Finish Line- 75 Miles" to calculate run times between Chicken and Tok. The run time is simply the arrival time in Tok minus the departure time in Chicken, but calculating this musher by musher, even with the help of a handy time difference calculator, is tedious. Therefore, I'd rather have the table in a spreadsheet I can manipulate. Microsoft Excel would work perfectly fine, but then I might want to share the result with my friends online, so an online spreadsheet like the one in Google Docs is even better. (Also I have Excel installed only on my work computer in the office, not at home.) If you've never used a spreadsheet, no fear -- this is an easy introduction to them.

I use an older Macbook Pro with OS X Snow Leopard, but all of these steps would work identically on Windows or even GNU/Linux, if you use the Google Chrome browser with the (free) "Table Capture" extension installed. This is a very simple extension that makes it particularly easy to get a table from a web page into a Google Spreadsheet or to copy it anywhere else really, like a MS Word document. Here is a screenshot of how it works (click to embiggen, as they say):

When you click the little red table icon in the address bar (right edge, next to a few other tool icons), a list of all tables on the page appears, and scrolling down the list each table is highlighted (red frame) in turn. If you hit "To Google Doc", a new Google spreadsheet opens (you're asked to log into your Google account if you aren't), and you're asked to use the "paste" function to transfer the table:

Voilà.

Step 2: Clean up the spreadsheet

The previous step leaves us with the data correctly inserted in the spreadsheet, but before we can use them, it is time for a few clean-up tasks. The easy ones first:

We want to give the spreadsheet a name, such as "TOTW 350 Chicken - Tok Finish - unofficial".

Then we observe that the first row, the one that says "Bib Musher..." is not a data row but contains the labels of the data below. To make Google understand that, I grab the little grey-striped handle at the top-left (red arrow just above the number 1) and pull it down one row. This handle separates the data from the header, and you can just pull it down. It ensures that when the table is re-sorted, the labels always stay on top.

Third, Crispin Studer's arrival time was indicated as 24:11:00. But when we calculate differences, we would rather have this as 0:11:00, 11 minutes after midnight, so I changed the 24 to a 0.

Google correctly recognizes the data as times, which is great, and formats it accordingly. But we still have the problem that when we want to calculate the difference between "Time Into Tok" and "Time Out Of Chicken", the times aren't all from the same day: some are Dec 30, and some Dec 31. Also, if we want to compare times across the entire race, there may be several days between points-in-time. So we have to bite the bullet and add dates to the times. The most straightforward way is to double-click on the first time field (Lance Mackey - Time Out Of Chicken - 6:52:00), then click once more in front of the text, and to type in "12/30/2012 " (the space at the end is important). Once you're done with the first row (Lance), you can select/copy the bit of date text and use "paste" for all the other fields. Careful, some of the times in the "Time Into Tok" column need 12/31/2012 " instead.

Phew, that was the most tedious step of the entire task. Luckily this isn't the Iditarod with 150 time fields to edit! The end result of the cleaned-up spreadsheet looks like this:

Step 3: Calculate run times

The run time is now simply the difference between the Time Into Tok (column E) and the Time Out Of Chicken (column C). I add a column at the end to hold this data and call it "Run time", and start at the top with Lance Mackey, row 2. The formula for his run time is "=E2 - C2", with a minus sign, and an equal sign to indicate that this is a formula, not text, that I'm typing into the first field of my new column. I could also type "=", then select Lance's Time Into Tok, then "-", then select his Time Out Of Chicken. The result appears as soon as I hit Return, and oops! It's a number like 0.39236111...! No panic, Google simply needs to be told that the difference between two date-time stamps is a time (I'm not sure why it can't figure this one out by itself). The format of the field needs to be changed via the menu:

That is, go to Format > Number > Time, and the 0.39... will change into a time (9:25:00). Now I only have to grab hold of the little blue square at the bottom-right of the new field and pull it down the entire column (at least for the mushers who HAVE times in both the Time Into Tok and the Time Out Of Chicken fields), and there we have it -- run times:

We can even sort the table by run time (the down arrow next to the column letter G), and see that Jake had the fastest time on the last leg, Joar had a fabulous run time as well, and also that Matt Hall stepped on the gas again.

That's it! It took me a lot less time to do than to eat my breakfast.

At this point you may have some objections.

Objection 1: I don't use Chrome!

The main difficulty is to get that damn HTML table out of the web page and into a spreadsheet. There are many ways to do this, but unfortunately they tend to be fiddly and not work on every computing platform the same. You could try to just select the entire table text with the mouse, copy, open the spreadsheet, and then paste, but whether this works depends on a rather large number of factors: the web page itself (how the table code is written behind the scenes), the operating system you're using (which manages the clipboard for copy-paste), the browser (for interfacing with the clipboard) and the spreadsheet software. I've had good luck and bad luck. Microsoft browsers on Windows with Excel usually work well together, but that's not what I use at home. However, for many browsers, there are now extensions available that make copying tables easier. For Chrome, the Table Capture extension works great for me. For Firefox, I've seen a few extensions that do this and for this article tried two:

Objection 2: Math?! Date formats?! Are you kidding me?!

It's really not hard. If you've never used a spreadsheet formula before, you'll probably be surprised how easy these tools make it to get really useful small tasks done. The main point to understand is that you can simply subtract, say, 5 pm on the 30th of December from 2 am on the 31st and get a time difference -- how many hours, minutes and seconds are between these two points-in-time -- and that the computer needs to be fed clean and consistent values to be able to understand times, dates and timestamps (the combination of a date and a time-of-day, simply speaking). So data needs to be in a format the computer can read, and for Google in the USA language setting, "12/30/2012 19:36" works fine. It will also understand the international (ISO) format "2012-12-31 19:36" for example. I admit that getting this right is the most fiddly part of the entire thing, but in the end it's just about getting used to conventions of how to write dates and times.

Objection 3: I'm a programmer and I really don't want to edit data fields by hand!

OK, if you aren't a programmer, stop reading. No harm done.

Seriously, the following comes with a jargon warning.

You are a programmer? You have heard the term "regular expression" before? OK, on your head be it!

Google Docs does in fact offer regular expression find-and-replace, but it's not well documented. So with this information, here's what I did:

For the non-programmers who have read on out of curiosity, this means in words:

Find any text in the spreadsheet that has at least one digit (\d+) at the beginning of the cell (^) followed by a colon (:)

Replace it with the text "12/30/12 " followed by itself (that is, the text we just found), referred to as $1.

I then hand-edited the small number of cells that needed 31 instead of 30. Google takes care of reformating the 12 into 2012. The parentheses are required. Otherwise you'll get an error message.

That's it! But really, editing the 30 or so fields by hand wasn't that hard either.