I need to write a script, probably in Ruby, that will take one block of text and compare a number of transcriptions of recordings of that text to the original to check for accuracy. If that's just completely confusing, I'll try explaining another way...

I have recordings of several different people reading a script that is a few sentences long. These recordings have all been transcribed back to text a number of times by other people. I need to take all of the transcriptions (hundreds) and compare them against the original script for accuracy.

I'm having trouble even conceptualising the pseudocode, and wondering if someone can point me in the right direction. Is there an established algorithm I should be considering? The Levenshtein distance has been suggested to me, but this seems like it wouldn't cope well with longer strings, considering differences in punctuation choices, whitespace, etc.--missing the first word would wreck the entire algorithm, even if every other word were perfect. I'm open to anything--thank you!

Edit:

Thanks for the tips, psyho. One of my biggest concerns, however, is a situation like this:

Original Text:

I would've taken that course if I'd known it was available!

Transcription:

I would have taken that course if I'd known it was available!

Even with a word-wise comparison of tokens, this transcription will be marked as quite errant, even though it's almost perfect, and this is hardly an edge-case! "would've" and "would have" are commonly pronounced extremely similarly, especially in this part of the world. Is there a way to make the approach you suggest robust enough to deal with this? I've thought about running a word-wise comparison both forward and backward and building a sort of composite score, but this would fall apart with a transcription like this:

3 Answers

Tokenize your input into words (convert a string containing words, punctuation, etc. into an array of lowercase words, without punctuation).

Use the Levenshtein distance (wordwise) to compare the original array with the transcription arrays.
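The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the method names (tokenize, levenshtein, word_distance) and the tokenizing regex are my own assumptions, so adjust them to your data.

```ruby
# Split text into lowercase word tokens, dropping punctuation but keeping
# apostrophes inside words (e.g. "i'd"). The regex is an assumption and only
# handles ASCII letters and digits.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+(?:'[a-z]+)*/)
end

# Standard dynamic-programming Levenshtein distance over two arrays.
# Each insertion, deletion, or substitution of an element costs 1.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_with_index do |x, i|
    curr = [i + 1]
    b.each_with_index do |y, j|
      cost = x == y ? 0 : 1
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Word-wise distance between the original script and one transcription.
def word_distance(original, transcription)
  levenshtein(tokenize(original), tokenize(transcription))
end
```

With the sentence from the question, the "would have" transcription comes out at a distance of 2 (one substituted word, one inserted word), and a transcription that drops only the first word comes out at 1, so a missing first word does not throw off the rest of the comparison.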

Possible improvements:

You could introduce tokens for punctuation (or replace them all with a simple token like '.').

The Levenshtein distance algorithm can be modified so that mistyping a character as one that is close to it on the keyboard generates a smaller distance. You could potentially apply a similar idea here: when comparing individual words, use the Levenshtein distance between them (normalized so that its value ranges from 0 to 1, for example by dividing it by the length of the longer of the two words), and then use that value in the "outer" word-wise distance calculation.
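Here is one way the nested "inner/outer" distance might look. Treat it as a sketch under assumptions: the names edit_distance, word_cost and soft_word_distance are made up for illustration, I use the normalized inner distance as the substitution cost of the outer calculation, and the keyboard-proximity weighting is not implemented.

```ruby
# Edit distance over two sequences. The block returns the substitution cost
# (between 0 and 1) for a pair of elements; insertions and deletions cost 1.
def edit_distance(a, b)
  prev = (0..b.length).map(&:to_f)
  a.each_with_index do |x, i|
    curr = [(i + 1).to_f]
    b.each_with_index do |y, j|
      substitution = prev[j] + yield(x, y)
      curr << [prev[j + 1] + 1, curr[j] + 1, substitution].min
    end
    prev = curr
  end
  prev.last
end

# Character-level distance between two words, scaled to 0..1 by dividing by
# the length of the longer word.
def word_cost(w1, w2)
  return 0.0 if w1 == w2
  chars = edit_distance(w1.chars, w2.chars) { |c, d| c == d ? 0 : 1 }
  chars / [w1.length, w2.length].max
end

# "Outer" distance over arrays of words: near-miss words such as a small
# typo are penalized less than completely different words.
def soft_word_distance(original_words, transcribed_words)
  edit_distance(original_words, transcribed_words) { |x, y| word_cost(x, y) }
end
```

The effect is that a transcription with "takn" instead of "taken" is charged a fraction of a word rather than a whole one, while a missing or extra word still costs a full 1.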

It's hard to say what algorithm will work best with your data. My tip is: make sure you have some automated way of visualizing or testing your solution. This way you can quickly iterate and experiment with your solution and see how your changes affect the end result.

EDIT:
In response to your concerns:

The easiest way would be to start with normalizing the shorter forms (using gsub):

str.gsub("n't", ' not').gsub("'d", " had").gsub("'re", " are")

Note that you can even expand "'s" to " is", even though that is not always grammatically correct. If "John's" means "John is", you will get it right. If it means "owned by John", then most likely both texts will contain the same form, so expanding both "incorrectly" will not increase the distance. The remaining case is when it should mean "John has", but then "'s" will probably be followed by "got", so you can handle that easily as well.
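To make that concrete for the "would've" example from the question, here is a rough sketch of a normalization step. The CONTRACTIONS table and the expand_contractions name are hypothetical, and the list is deliberately incomplete; irregular forms such as "won't" have to be listed before the generic "n't" rule.

```ruby
# Hypothetical expansion table (an assumption, extend as needed). Ruby hashes
# keep insertion order, so the irregular forms are replaced first.
CONTRACTIONS = {
  "won't" => "will not",
  "can't" => "can not",
  "n't"   => " not",
  "'ve"   => " have",
  "'re"   => " are",
  "'ll"   => " will",
  "'d"    => " had",
  "'s"    => " is"
}.freeze

# Lowercase the text and expand each contraction in turn with gsub.
def expand_contractions(text)
  CONTRACTIONS.reduce(text.downcase) do |str, (short, long)|
    str.gsub(short, long)
  end
end
```

Run on both the original and the transcription from the question, this produces the same string for each ("i would have taken that course if i had known it was available!"), so the word-wise distance between them becomes 0. Expanding "'d" to " had" is not always right ("I'd" can also mean "I would"), but as long as both texts get the same treatment it does not add to the distance.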

You will probably also want to deal with numeric values (1st = first, etc.). Generally you can probably improve the result by doing some preprocessing. Don't worry if it's not always 100% correct; it should just be correct enough :)
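For the numeric case, a small lookup applied before tokenizing may be enough. This is only a sketch: the ORDINALS table and the normalize_numbers name are assumptions, and anything not in the table is left untouched.

```ruby
# Hypothetical lookup table for ordinals (an assumption, extend as needed).
ORDINALS = {
  "1st" => "first",
  "2nd" => "second",
  "3rd" => "third"
}.freeze

# Replace known ordinals like "1st" with their spelled-out form; unknown
# ones (e.g. "4th" with the table above) are returned unchanged.
def normalize_numbers(text)
  text.gsub(/\b\d+(?:st|nd|rd|th)\b/i) { |match| ORDINALS.fetch(match.downcase, match) }
end
```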

After experimenting with the issues I noted in this question, I found that the Levenshtein Distance actually takes these problems into account. I don't fully understand how or why, but can see after experimentation that this is the case.