Cryptanalysis of the Simple Substitution Cipher

The Simple substitution cipher is one of the simplest ciphers, simple enough that it can usually be broken with pen and paper in a few minutes.
On this page we will focus on automatic cryptanalysis of substitution ciphers, i.e. writing programs to solve these ciphers for us.

The substitution cipher is more complicated than the Caesar and Affine ciphers.
In those cases, the number of keys were 25 and 311 respectively. This allowed a brute force solution of trying all possible keys.
The number of keys possible with the substitution cipher is much higher, around 2^88 possible keys. This means we cannot test them all, we have to 'search' for good keys.

We will be using a 'hill-climbing' algorithm to find the correct key. For this approach, we need a way of determining how similar a piece of text is to english text.
This is called rating the 'fitness' of the text. A piece of text very similar to english will get a high score (a high fitness),
while a jumble of random characters will get a low score (a low fitness). For this we will use a fitness measure based on quadgram statistics. For a guide on how to generate quadgram statistics, and some python code for rating the fitness of text, see this tutorial. This method works by first determining
the statistics of english text, then calculating the probability that the ciphertext comes from the same distribution. An incorrectly deciphered (i.e. using the wrong key)
message will probably contain sequences e.g. 'QKPC' which are very rare in normal english. In this way we can rank different decryption keys, the decryption key we want is the
one that produces deciphered text with the highest likelyhood.

The hill-climbing algorithm looks like this:

Generate a random key, called the 'parent', decipher the ciphertext using this key. Rate the fitness of the deciphered text, store the result.

Change the key slightly (swap two characters in the key at random), measure the fitness of the deciphered text using the new key.

If the fitness is higher with the modified key, discard our old parent and store the modified key as the new parent.

Go back to 2, unless no improvement in fitness occurred in the last 1000 iterations.

As this cycle proceeds, the deciphered text gets fitter and fitter, the key becomes better until either the solution appears, or,
the solution is not found. In this case the run has failed and must be repeated with a different starting key. This means the hill-climbing algorithm is stuck in a 'local maximum',
where there are no simple changes that can be made to the key to improve fitness, and yet it is not at the true solution. If this happens you can run the
algorithm again with a different parent in the hope it may reach the true solution this time. In the implementation below,
we may restart the algorithm 100's of times in the search for the best key.

The algorithm depends on the fitness function correctly distinguishing whether the plaintext from one key is better than plaintext from another.
This is done, as explained earlier, by comparing quadgram statistics from the plaintext to quadgram statistics of english text.
If the statisitics match closely, we say that the fitness is high. However, this system fails when the true plaintext does not have statistics similar to english,
this example comes from Simon Singhs "The Code Book":

From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags

This is full of unusual quadgrams, so we expect it to have a fairly low score. The hill-climbing algorithm will most likely find a key that gives a
piece of garbled plaintext which scores much higher than the true plaintext. This is a limitation of any algorithm based on statistical properties of text,
including single letter frequencies, bigrams, trigrams etc. One possible way to overcome this problem, at the expense of algorithm speed, is to try to find
words in the plaintext and base the fitness on the presence of these.

A final mention should be made about breaking short ciphers. In general, you will have trouble breaking ciphers less than 100 characters in length. This is because the statistics of short messages can deviate significantly from the long term statistics of english. In addition, there is a theoretical limit, given by the unicity distance which says that 28 characters are required to get a unique decryption, any cipher shorter is not breakable unless more information is available such as a crib.

The fitness of our first plaintext attempt is -2304.04, something we should be able to improve on. By swapping 'Y' and 'B' in the key,
and deciphering again we get a fitness of -2200.78, an improvement! But the text is still a long way off being readable.
We must continue with this procedure until there are no two letters we can swap in the key that will result in an improvement of the fitness.

After many iterations of this approach, the final key that was found was

Provided here is python code for breaking the Substitution cipher. The code here uses pycipher for the cipher itself. It implements the steps described above, using the ngram_score.py file available on the quadgram statistics page.