Snippets from a computing educator and researcher.

What's the best language model?: part 2

In part 1 of this post, I talked about the range of language models we could use. But that still leaves the question of which is best.

As it's not obvious which is the best language model, we'll perform an experiment to find out. We'll take some samples of text from our corpus of novels, encipher them with random keys, then try to break the key. We'll try each language model on every sample ciphertext and count how many each one gets right.

When developing things like this, it's often easier to start from the end point and then build the tools we need to make that work.

As we're just running some tests in a small experimental harness, I'll break some rules of good programming and keep various constants in global variables, where it makes life easier.

Testing some models

Let's assume we have some models to test, in a list called models and a list of message_lengths to try. We want to build a dict of dicts: the outer dict has one element for each model, and the inner dicts have one element for each test message length.

…but it does highlight that we need two pieces of information for each model: a name we can use when talking about it, and a func, the function which we call to use that model.

Given that, we can eval_one_model by just making trials number of random ciphertexts, trying to break each one, and counting successes when the breaking function gets it right. (As we'll be testing tens of thousands of ciphertexts, the print is there just to reassure us the experiment is making progress.)

Generating random cipher text

The approach we'll use is to take a lot of real text and then pull samples out of it. At first sight, an alternative approach would be to generate random text from the letter frequencies, but that won't help when we come to test bigram and trigram models.

To read the text, we make use of the sanitise function defined earlier. We just read the three novels we have lying around, join them together, sanitise them, and call that our corpus:
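A sketch of that corpus-building step. The sanitise here is a minimal stand-in for the one defined earlier in the post, and the filenames would be whichever three novels you have to hand:

```python
def sanitise(text):
    # Minimal stand-in for the sanitise function defined earlier in the
    # post: keep only the letters, lower-cased.
    return ''.join(c.lower() for c in text if c.isalpha())

def load_corpus(filenames):
    # Read the novels, join them together, sanitise the lot, and call
    # that our corpus.
    return sanitise(''.join(open(f, encoding='utf-8').read()
                            for f in filenames))
```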

To generate a random piece of ciphertext, we pick a random start position in the corpus (taking care it's not too close to the end of the corpus), pick out a slice of corpus of the appropriate length, pick a random key, and encipher the sample with that key. We return both the key and the ciphertext, so that eval_one_model can work.
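Those steps can be sketched as below. The tiny corpus string and the Caesar encipher function are illustrative stand-ins (the real corpus is the sanitised novels, and the post's own encipher function plays the caesar_encipher role); the corpus lives in a global, in keeping with the rule-breaking noted above.

```python
import random

# Stand-in corpus; the real one is the three sanitised novels.
corpus = 'itisatruthuniversallyacknowledgedthatasinglemaninpossession'

def caesar_encipher(text, key):
    # Shift each lowercase letter forward by `key` places.
    return ''.join(chr((ord(c) - ord('a') + key) % 26 + ord('a'))
                   for c in text)

def random_ciphertext(message_length):
    # Pick a start position (taking care it's not too close to the end of
    # the corpus), slice out the sample, and encipher it with a random
    # key. Return both, so the evaluation can check the break succeeded.
    start = random.randrange(len(corpus) - message_length)
    key = random.randrange(26)
    return key, caesar_encipher(corpus[start:start + message_length], key)
```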

We need to test all combinations of these metrics and scalings. For the sake of consistency, we'll use the same norm for both vector scalings.

In addition, the norm-based measures return the distance between two vectors, while the cipher breaking method wants to maximise the similarity of the two pieces of text. That means that, for some comparisons, we want to invert the function result to turn the distance into a similarity.

Let's start with what we know. For each metric for comparing two vectors, we need the func that does the comparison, an invert flag to say if this is finding a distance not a similarity, and a name. For each scaling, we need the corpus_frequency for the English counts we're comparing to, the scaling for scaling the sample text, and the name for this scaling.
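The two lists might look something like this. The norm functions and the english_counts fragment here are illustrative, not the post's exact definitions, but the shape of each entry follows the description above:

```python
import math

def l2(a, b):
    # Euclidean distance between two frequency dicts.
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def cosine_similarity(a, b):
    # Cosine of the angle between two frequency dicts, treated as vectors.
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    return dot / (math.sqrt(sum(v * v for v in a.values()))
                  * math.sqrt(sum(v * v for v in b.values())))

def normalise(counts):
    # Scale a dict of counts so its values sum to 1.
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# One entry per metric: the comparison func, an invert flag saying whether
# the result is a distance (to be turned into a similarity), and a name.
metrics = [
    {'func': l2, 'invert': True, 'name': 'l2'},
    {'func': cosine_similarity, 'invert': False, 'name': 'cosine'},
]

# One entry per scaling: the pre-scaled English counts to compare against,
# the scaling applied to the sample text, and a name for the results.
english_counts = {'e': 12, 't': 9, 'a': 8}  # illustrative fragment only
scalings = [
    {'corpus_frequency': normalise(english_counts),
     'scaling': normalise,
     'name': 'normalised'},
]
```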

All that's left is make_frequecy_compare_function. It takes all sorts of parameters, but we want it to return a function that takes just one: the text being scored. The trick is that, inside make_frequecy_compare_function, we can refer to all its parameters. If we create a function in that context and return it, the returned function can still access those parameters! This is termed a closure: the returned function encloses the parameters that were in scope when it was created. See the Wikibooks and Wikipedia articles.

An example might make it clearer (taken from Wikibooks). make_adder(x) returns a function which adds x to some other number. This returned function remembers the value of x when it was created.
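The example, following the Wikibooks version, looks like this:

```python
def make_adder(x):
    # adder encloses x: it can still use it after make_adder has returned.
    def adder(n):
        return n + x
    return adder

add1 = make_adder(1)
add5 = make_adder(5)
```

Calling add1(3) gives 4, and add5(3) gives 8: each closure carries its own x around with it.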

That's the idea. The functions returned by make_adder, which I've called add1 and add5, remember the "number to add" which was used when the closure was created. Now, even outside make_adder, we can use that closure to add 1 or 5 to a number.

This little example isn't that useful, but we use the same concept of closures to create the scoring function we need here. We build a closure which implements the scoring function we want, so that when the closure is passed a piece of text, it returns the appropriate score.
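A sketch of that closure, keeping the post's spelling of the function name. The frequencies helper here is a stand-in for the letter-counting function from earlier in the post, and the parameter names follow the metric and scaling descriptions above:

```python
from collections import Counter

def frequencies(text):
    # Stand-in for the letter-counting function defined earlier in the post.
    return Counter(text)

def make_frequecy_compare_function(target_frequency, frequency_scaling,
                                   metric, invert):
    def frequency_compare(text):
        # All four parameters above are still in scope here: the closure
        # remembers them after make_frequecy_compare_function returns.
        frequency = frequency_scaling(frequencies(text))
        score = metric(target_frequency, frequency)
        # Distances are negated so that bigger is always better.
        return -score if invert else score
    return frequency_compare
```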

We now have all the pieces in place to do the experiment! Apart from one thing…

Writing results

Now we've generated all the results with the call to eval_models, we need to write them out to a file so we can analyse them. The standard csv library writes CSV files for us, and just about every spreadsheet and data analysis package reads them. We use the library to create a csv.DictWriter object, which writes dicts to a CSV file. The only tweak is that we add the model's name to each row of results so that everything appears nicely labelled.
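A sketch of that writing step, assuming the dict-of-dicts shape described above (results[model name][message length] = success count); the function and file names here are illustrative:

```python
import csv

def write_results(results, filename='results.csv'):
    # One column per message length, plus a 'name' column for the model.
    message_lengths = sorted(next(iter(results.values())))
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['name'] + message_lengths)
        writer.writeheader()
        for name, counts in results.items():
            row = dict(counts)
            row['name'] = name  # label each row with the model's name
            writer.writerow(row)
```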

What does this show us? For "long" ciphertexts (20 letters or more) it doesn't really matter what language model we use, as all of them perform just about perfectly.

For short ciphertexts, the n-gram models significantly outperform the norm-based models. This means the n-gram models win out both on performance and on ease of use and understanding.

But what's really surprising to me is how short the ciphertexts can be and still be broken. Even with just five characters of Caesar-enciphered text, the trigram model gets it right about 75% of the time, and even a very naïve unigram approach gets the right answer nearly 50% of the time.