Replicating the 2013 Warmest 100

I’ve replicated the methods used by the Warmest 100 crew to check their results.

I managed to collect 15,588 total votes compared with their 17,800, so my list will be slightly less accurate, though I’ve used a slightly more complex matching technique than @flossinspace did, so that might help things a bit.

My Method

Firstly, I wrote a Python script to get a list of images (and some metadata, like creation time) tagged with “hottest100” from Instagram, using their API and the python-instagram library. I found 2,950 images. Then I used wget to download the images themselves, at standard resolution. I used the creation time to exclude any images created before voting opened, because a lot of them were for the 2012 list.
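For the curious, the collection step looked roughly like this. It’s a minimal sketch, assuming the python-instagram client’s tag_recent_media wrapper and its pagination behaviour (as best I recall that library); the credentials, file names and the voting-open date are all placeholders.

```python
from datetime import datetime
from instagram.client import InstagramAPI

# Placeholder credentials: you need to register an Instagram API client.
api = InstagramAPI(client_id='YOUR_CLIENT_ID', client_secret='YOUR_CLIENT_SECRET')

VOTING_OPENED = datetime(2013, 12, 1)  # placeholder; use the real opening date

urls = []
media, next_url = api.tag_recent_media(tag_name='hottest100', count=50)
while True:
    for m in media:
        # Skip anything posted before voting opened; those are
        # mostly images from the 2012 list.
        if m.created_time >= VOTING_OPENED:
            urls.append(m.images['standard_resolution'].url)
    if not next_url:
        break
    media, next_url = api.tag_recent_media(tag_name='hottest100',
                                           with_next_url=next_url)

# One URL per line, so wget can batch-download them afterwards:
#   wget --input-file=urls.txt --directory-prefix=images/
with open('urls.txt', 'w') as f:
    f.write('\n'.join(urls))
```

Writing the URLs to a file keeps the download step separate, so wget can be re-run without hitting the API again.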

I wrote another Python script to OCR the images using Tesseract, parallelised with the multiprocessing library so it would run faster on my 8-core machine. This processed all the images in about 5.5 minutes.
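Here’s the shape of that parallel OCR pass, in sketch form. It assumes the tesseract binary is on the path and the images are in an images/ directory (both illustrative); Tesseract writes its text output to a .txt file next to each image.

```python
import glob
import os
import subprocess
from multiprocessing import Pool

def ocr_image(path):
    """Run Tesseract over one image and return the recognised text."""
    base = os.path.splitext(path)[0]
    subprocess.check_call(['tesseract', path, base])  # writes <base>.txt
    with open(base + '.txt') as f:
        return f.read()

if __name__ == '__main__':
    images = sorted(glob.glob('images/*.jpg'))
    # One worker per core: on an 8-core machine this gives roughly
    # an 8x speedup, since each image is independent of the others.
    pool = Pool(processes=8)
    texts = pool.map(ocr_image, images)
    with open('corpus.txt', 'w') as out:
        out.write('\n'.join(texts))
```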

I then passed the corpus of text through some data-cleansing scripts I wrote (in Python), to fix up some of the obvious mis-codings from the OCR process: things like backquotes instead of apostrophes, L0fdy instead of Lordy, F1ume instead of Flume. This gave me a file with 55,377 lines, many of which were blank, nonsense, or not votes.
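The cleansing itself was essentially a substitution table. Here’s a cut-down sketch, with a few illustrative entries rather than my full list:

```python
import re

# A few illustrative OCR fix-ups; the real script had many more.
SUBSTITUTIONS = [
    ('`', "'"),           # backquotes where apostrophes should be
    ('L0fdy', 'Lordy'),
    ('F1ume', 'Flume'),
]

def clean_line(line):
    """Apply the known OCR mis-coding fixes to one line of text."""
    for wrong, right in SUBSTITUTIONS:
        line = line.replace(wrong, right)
    # Collapse runs of whitespace and strip leading/trailing junk.
    return re.sub(r'\s+', ' ', line).strip()

with open('corpus.txt') as src, open('cleaned.txt', 'w') as dst:
    for line in src:
        dst.write(clean_line(line) + '\n')
```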

Then I ran a matching and counting program I wrote in Python to tally up the results. I was looking for lines that were either identical, or a close enough match that they were obviously supposed to be identical. Basically this compensates for the mis-codings of the OCR process by finding close matches: Lnrde, L0rde, 10rde, l5rde, etc. are all probably supposed to be Lorde, so they all get counted as the same thing. Otherwise I’d have to combine the variants manually, and that would be super time-consuming, not to mention boring.

My matching algorithm cascaded through matching methods from a fast dictionary exact match, through some simple partial matching methods, to the quite slow Nilsimsa locality sensitive hash used by @flossinspace. I tuned the algorithm to be conservative, to reduce the risk of over-counting lines that weren’t actually the same, and I kept a tally of all the matches that were made so I could double-check them.
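In sketch form, the cascade looks like the code below. I’ve used difflib’s SequenceMatcher to stand in for the simple partial-matching stage, and the nilsimsa package (its Nilsimsa class and compare_digests helper, which scores two digests from -128 to 128) for the last stage; the thresholds shown are illustrative, not the values I actually tuned.

```python
from collections import Counter
from difflib import SequenceMatcher
from nilsimsa import Nilsimsa, compare_digests  # pip install nilsimsa

def find_canonical(line, counts, digests,
                   ratio_threshold=0.9, nilsimsa_threshold=90):
    """Cascade from cheap to expensive matching; return the canonical
    key that `line` should be counted under."""
    # 1. Fast exact match: have we seen this exact string already?
    if line in counts:
        return line
    # 2. Simple partial matching: a cheap similarity ratio against known keys.
    for key in counts:
        if SequenceMatcher(None, line, key).ratio() >= ratio_threshold:
            return key
    # 3. Slow fallback: Nilsimsa locality-sensitive hash comparison.
    digest = Nilsimsa(line).hexdigest()
    for key in counts:
        if compare_digests(digest, digests[key]) >= nilsimsa_threshold:
            return key
    # No match anywhere: this line becomes a new canonical key.
    digests[line] = digest
    return line

def tally(lines):
    """Count votes, folding near-identical lines into one key."""
    counts, digests = Counter(), {}
    for line in lines:
        counts[find_canonical(line, counts, digests)] += 1
    return counts
```

Higher thresholds make the matcher more conservative: fewer false merges, at the cost of occasionally counting two mangled versions of the same vote separately.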

The matching program was also written to run in parallel on my multi-core machine using a simple map/reduce approach, which made a significant difference. The first non-parallel version of the program took 192 minutes to run, so a bit over 3 hours. In contrast, the parallel version completed in just over 18 minutes.
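The map/reduce structure is simple: each worker runs the tally() from the previous sketch over its own chunk of the corpus (the map), and the partial counts get summed at the end (the reduce). One honest caveat about this sketch: a proper reduce step also has to fuzzy-merge near-identical keys that landed in different chunks, which the plain Counter sum below glosses over.

```python
from collections import Counter
from multiprocessing import Pool

def chunked(items, n):
    """Split a list into n roughly equal pieces, one per worker."""
    size = (len(items) + n - 1) // n
    return [items[i:i + size] for i in range(0, len(items), size)]

if __name__ == '__main__':
    with open('cleaned.txt') as f:
        lines = [line.strip() for line in f if line.strip()]

    # Map: each worker tallies its own chunk using the cascade above.
    pool = Pool(processes=8)
    partial_counts = pool.map(tally, chunked(lines, 8))

    # Reduce: merge the partial tallies into one. (A real reduce would
    # also fuzzy-merge near-identical keys across chunks here.)
    totals = sum(partial_counts, Counter())
    for song, votes in totals.most_common():
        print(votes, song)
```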

SPOILER WARNING

The list I got is below, so if you don’t want to see my results, don’t click through. You’ve been warned. I mean, honestly, what are you doing reading this post in the first place if you don’t want to see spoilers?

The final tallies broadly concur with the Warmest 100, but there are some significant differences.

Now we wait for the big day to see how many we got right! Stats is fun!

My results concur with @flossinspace and the Warmest 100 guys on who will win, but they differ from the Warmest 100 at numbers 3, 4 and 5, which surprised me. It’ll be fascinating to watch on the day and see which approach was more accurate.

I’ve included the vote tally so you can see how few votes separate many of the entries, and how many songs nearly made the top 100 but didn’t quite.