Last Hottest 100 Revision

This list is generated using a more sophisticated method (yet again) after a discussion over Twitter I had with @chrisjrn about the kinds of bias there might be in the sample.

I refined the list with three techniques.

Firstly, I addressed a line-wrapping problem with the images. Because of the way the votes are displayed when you finish putting them in, songs get wrapped onto a second line if they’re longer than about 30–40 characters. That unfairly penalised long song/artist names, because the matching algorithm uses lines as a delimiter (it needs some sort of stop character to know when to stop counting text as a possible song name to match). I wrote some code to “unwrap” these lines most of the time, though some will still be wrapped, because the unwrapping itself relies on the character recognition getting things part of the way there.
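A minimal sketch of the unwrapping heuristic — the function name and the exact wrap width are my own assumptions, not the real code: any line at or beyond the wrap width is treated as cut off, and the following line is joined onto it.

```python
WRAP_WIDTH = 35  # approximate wrap point; an assumed value for illustration


def unwrap_lines(lines):
    """Join lines that appear to have been wrapped onto a second line.

    Heuristic: a line at or beyond the wrap width was probably cut off,
    so the following line is treated as its continuation.
    """
    unwrapped = []
    buffer = ""
    for line in lines:
        line = line.strip()
        if buffer:
            unwrapped.append(buffer + " " + line)
            buffer = ""
        elif len(line) >= WRAP_WIDTH:
            buffer = line  # probably wrapped; hold it for the next line
        else:
            unwrapped.append(line)
    if buffer:  # a long final line with nothing after it
        unwrapped.append(buffer)
    return unwrapped
```

A heuristic like this will still miss lines the OCR mangled badly enough that their length no longer looks “wrapped”, which is why some stay broken.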

Secondly, I moved to only counting “complete” ballots. In the list of images, some of the song/artist titles get munged as part of the OCR process. In my previous code, I was counting any song I could unmunge enough to find a match, but that meant I was discarding songs semi-randomly.

Only, songs aren’t independent of each other: a person’s taste in music isn’t completely random, so their choices of songs won’t be random, and each “ballot” will have some internal similarity. If I only count complete ballots, I bias the sample towards readable ballots as a whole, rather than towards the individual song names that happen to be easier to OCR than others.

The new code counts a “complete” ballot as one that the program can match between 8 and 10 songs on. The others are marked for human intervention and could be fixed up by hand if I wanted to, which I don’t.
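The completeness rule above can be sketched like this — the names and data shape are hypothetical, but the logic (8–10 matched songs counts as complete, everything else is set aside for human review) comes straight from the description:

```python
def is_complete(matched_songs, low=8, high=10):
    """A ballot counts as "complete" if between `low` and `high` songs matched."""
    return low <= len(matched_songs) <= high


def split_ballots(ballots):
    """Separate ballots into complete ones and ones needing human review.

    `ballots` maps a ballot id to the list of songs matched on that image.
    """
    complete, needs_review = {}, {}
    for ballot_id, matches in ballots.items():
        target = complete if is_complete(matches) else needs_review
        target[ballot_id] = matches
    return complete, needs_review
```

Only the `complete` dictionary feeds into the tally; `needs_review` is where the fix-it-by-hand ballots would go, if anyone could be bothered.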

Finally, I swapped the locality-sensitive hash for a much faster string-similarity measure: a Levenshtein distance ratio. This is a different way to measure how alike two strings are, and the Python implementation here is much, much faster than the pure-Python Nilsimsa hash I was using. Each run now completes in 19 seconds instead of about 20 minutes.
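The fast implementation I used is a compiled extension, but the idea fits in a few lines of pure Python. This is a sketch only: the normalisation below (one minus distance over the longer string's length) is one common definition of the ratio, not necessarily the exact formula the library uses.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def ratio(a, b):
    """Similarity in [0, 1]: 1.0 means an exact match."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

Matching then becomes “accept the candidate song whose ratio against the OCR’d line is highest, above some threshold”, rather than comparing hash digests.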

I found the idea of a Levenshtein distance via the PyPI page for Fuzzy while looking into soundex implementations, courtesy of an in-passing suggestion from a colleague earlier today.

I ran both the “ballots only” and the “all matched songs” matching programs, and the results are beyond the click-through. They’re remarkably close, but there are a couple of significant differences that suggest to me that the ballot method will prove the better predictor.

We’ll find out in a little over 24 hours. I’m looking forward to it!

Here are the rankings for both methods in one table for a side-by-side comparison.