Breaking Secret Codes with Mathematica

Mathematica can make you feel like a computational superman. Armed with that attitude and some schoolboy knowledge of cryptography, I turned my attention to cipher breaking this week, only to discover buried kryptonite.

The weakness of ciphers (where you swap every occurrence of a particular letter in your message with the same different letter) is that they don’t change the patterns of letters. The simplest attack that exploits this fact is frequency analysis. The most common letter in English is “e”, and so it follows that the most common character in an encoded message (assuming the message is written in English) will correspond to “e”. And so on through the alphabet.

Mary Queen of Scots famously lost her head when Queen Elizabeth’s spymaster broke Mary’s cipher using frequency analysis. I figured that if sixteenth century spies could do it by hand, I should be able to automate it in Mathematica in about 10 minutes.
Let’s get started. First I want to generate random test ciphers.

For this post, I am going to look at the simplest case by limiting myself to ciphers where upper and lowercase letters are the same (i.e., “e” and “E”) and map to the same symbol, and punctuation and spaces are not encoded. The approach would be the same with a larger character set. Here is one of the 4 X 1026 such ciphers:

Encoding a message with this cipher is so direct, I won’t bother creating a function:

Reversing the process is trivial, if you know the encoding key. (This is another weakness of ciphers, that you need secure key exchange).

OK, two minutes in and that cipher’s implemented. Now let’s write the frequency attack code. First we need to sort the letters in a text into frequency order.

Now all that we need to break the cipher is to pair up the characters in the message sorted by frequency with the letters in some calibration text, also sorted by frequency. By not hard-coding the frequency order, this code will work for other languages, as long as you provide calibration text in the right language. It will also take into account writing style, if you have sample text from the original author.

There we go—cipher breaking through frequency analysis implemented in just a couple of lines! Let’s test. I will encode the first 10,000 characters of Pride and Prejudice (in lowercase).

And for the calibration text, I will use the last 341,000 characters of the book (about half of it).

Here is our guessed key:

And here is the message decoded:

Kryptonite! Superman is on his knees! Why hasn’t this worked?

After some debugging angst and a few experiments, it finally dawned on me that my schoolboy theory about how easy it is to crack ciphers isn’t all it was cracked up to be. What was left of my admiration for my school math teacher took another hit! The problem is that the difference in frequencies between some of the letters is less than 1%, but the standard deviation of those characters’ frequencies on 10,000 character samples can be up to about 0.5%, making it reasonably likely that one letter will be in the wrong place in the frequency order. Let’s look for examples at “s” and “r”. We can make a probability distribution out of successive samples from our text.

If we look at “s” and “r”, their frequencies are perilously close compared to their standard deviations.

Using those distributions, we see that the more common letter, “s”, will actually only appear ahead of “r” in the rankings 54% of the time. In other words, 46% of the time the frequency analysis will get them the wrong way round.

When you accumulate all of the chances of ordering errors, the chance that the frequency analysis will actually fully decode your message becomes vanishingly small. Things improved surprisingly little as the length of the sample text increased. Even when I used the whole first half of the book, the result was unintelligible. How useless is that?

But kryptonite never quite stops Superman. And as I looked carefully at the decoded text, I realized that some of the letters were actually correct.

The first word of our message should have been “chapter”, and the frequency analysis has got the “…ter” right. Perhaps frequency analysis has worked better than it appears. How could I make further progress? Two approaches sprang to mind:

1) Use further frequency analysis—frequencies of letter pairs (“th”, “sh”, “ed” will be high in English), including double letters (“oo”, “ee”,“tt”, etc.); frequencies of first letters of words and last letters of words; frequencies by word length (e.g., one letter words will mostly be “I” and “a”); and so on. There are a lot of ways to slice that data.

2) Our letter ordering might be wrong, but it is probably close; we could try perturbing the order—move letters up and down the order slightly and see if that improves the result.

But in the end I used neither of those. For both approaches I needed a way to resolve conflicting suggestions. One obvious answer was to see how many valid English words were in the result. If two approaches give two different suggestions for what a letter maps to, we will go with the one that improves the number of valid words in the message.

Here is a function to pull all the words that are not in the dictionary. (Note that from this point onward, it DOES matter that I have not encoded punctuation. In a real-world scenario, I would need to determine if substrings are valid, not just whole words, and would need a different approach for punctuation.)

At this point, faced with quite a bit more work than I had planned, it occurred to me that we could take an even simpler approach of spell-checking the text and use this improvement test on the suggested corrections. OK, slightly more involved than just spell-checking, but that’s the basic concept.

For each invalid word, we get the list of dictionary words that are the same length…

…and find the nearest in EditDistance. If there are several equally near, then we ignore them, since we are more likely to feed ourselves false information in what is already a guessing process.

Having picked the closest known word, we then align the characters and delete the ones that match up so that we are left with the corrections. We then turn these into replacement rules.

The next step is to take all of the suggested correction rules that we have discovered by this means and sort them by how common they are. There is no point trying to apply contradictory rules, so I delete all of the less common rules that map to or from the same characters as the popular suggestion.

Some of these suggested replacements will be right, and will improve things; some will be spurious and make things worse. My intuition is that popular suggestions will be better than unpopular ones, so my next step is to take the most popular n suggestions, apply them, and count the number of invalid words. We then pick the n that minimizes remaining invalid words.

That was wasn’t the trivially simple code I set out to create, and I am way past my 10-minute target, but pleasingly, the code usually does a very good job on 10,000 character texts, though it can depend on the cipher that it is trying to crack.

With a little ingenuity backed by his superpowers, our computational Superman has beaten his foe!

Great post… (BTW, for those interested in the history of codes and code breaking, I *highly* recommend “The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography” by Simon Singh)

Suggestion: After arriving at your first guess solution (i.e., the failed one), it would be cool to just have a dynamic interface set up so users could crack the rest by hand.

Here’s a marketing idea: solve the K4 with M, the seemingly elusive 97 characters of the Kryptos in front of the CIA headquarters (although 6 characters are known by now, the word “Berlin”). 21 years of crypto-research and a yahoo discussion group of 2000 people, and 91 characters remain completely impenetrable until today.

McLoone’s discussion of his evolving program was very interesting. I’d like to see how well his program would work if the punctuation and the spaces between words were also enciphered. Knowing where the words are is probably the most valuable information for solving the cipher puzzles in the daily paper.

@Harry
I gave some thought to punctuation, but decided it would make the blog item too long. Frequency analysis will work on punctuation about as well as it does on the letters. But there is another characteristic of punctuation that I suspect is more powerful and that is distribution analysis. Spaces might have similar frequencies to a vowel, but their distribution will be much more consistant. Typically ever 5-6 letters, rarely every 2-3 letters and rarely separated by more than 10-14 letters. Likewise periods might be no more frequent than a “k” but appear quite regularly. I think a good program use different methods to pick off different symbols with different frequency and distribution characteristics.

Great post.. Just a thought: I would think that HammingDistance would be a better choice than EditDistance for this problem, because we only care about character substitutions. I don’t know how much effect it would have on the final solution – I’m still working through it – but anyway, thought it was worth mentioning.