Creating a Substitution Cipher with C#

I thought for a bit of fun I'd invent a new cipher and, depending on how
'strong' it turned out to be, perhaps use it in my projects to encrypt string
literals, files etc.

Also, I wanted to base this on traditional 'substitution' principles, which are
easily understand, rather than the complex mathematical algorithms which are
used to manipulate bit patterns in modern cryptography.

So what is a substitution cipher?

In the simplest kind of substitution cipher, one simply substitutes one letter
for another. Here's a basic program which does that using a key which consists
of all 26 letters of the alphabet:

Notice that you don't necessarily need to substitute different letters in
every case and here we leave the letter 'z' unchanged. Notice also that the
'plain text' is usually presented as a lower case string with all punctuation
except spaces removed.

How easy is it to 'crack' a substitution cipher?

A substitution cipher is not very secure and can be attacked in the following
main ways:

Various studies have shown that the letters of the alphabet occur in roughly the same frequencies in a piece of English text. For example, the commonest letters are: E, T, A and O and the least common are: X, J, Q and Z. By calculating the number of times each letter occurs in the cipher text and comparing it with the frequencies in a table (such as the one at http://en.wikipedia.org/wiki/Letter_frequency), you can usually guess what some of the letters represent. Of course, the longer the cipher text or the more extracts you have, the easier this is.

English has two one-letter words ('a' and 'I') and many common two and three letter words. A knowledge of these helps one to guess the meaning of other letters in the cipher.

Although 'E' is the commonest letter in English, it is far less common as the first letter of a word. Similarly, although 'O' and 'A' are also 'high frequency' letters, they don't often occur at the end of words.

Some two or three letter combinations of letters are quite common and others never occur at all. By analysing the frequency of digraphs and trigraphs (as they are technically called) in the cipher text, one may be able to guess what further letters in the cipher mean.

There are some letters which are frequently doubled (ss, ee, tt, ff etc) and others which are very seldom doubled in English (aa, uu, yy). A study of doubled letters in the cipher text may therefore yield some further clues to deciphering the cipher text.

Can anything be done to make a substitution
cipher more secure?

If all letters had the same frequencies and there were no spaces between words,
then it would be far harder to crack. This suggests that we should represent the
most frequent letters by not just one but by several different letters or
symbols and we should also represent spaces by a number of symbols as well so
it's difficult to tell where a word begins or ends. For each substitution, one
of the available symbols should be chosen at random. So, for example, in a given
cipher sometimes 'd' might be represented by 'a' and sometimes by 'w'.

By rounding each letter's frequency to the nearer 2% and then allocating one
symbol for each 2% in the Wikipedia table, I found that the number of symbols
needed would be as follows:

Number of Symbols

Letters

7

Space

6

E

5

T

4

A, I, O

3

H, N, R, S

2

C, D, L, M, U

1

B, F, G, J, K, P,Q, V, W, X, Y, Z

It will be noticed that a space has been given 7
symbols compared to E's 6 symbols because it says in the Wikipedia article that
the former is about 7% more frequent which would give it a percentage of 13.87%.

Also, after looking at some other studies, I decided to allocate 4 (rather than
3) symbols to 'I' and, for various other reasons, I decided to allocate 2
(rather than 1) symbols to 'U', 'C' and 'M'.

This means that we need a total of 64 symbols to represent letters and spaces.

Of course, the lowest frequency letters : Z, X, Q, J , K, V and (to some extent)
B are going to be under-represented in the cipher text even though they have
only been allocated one symbol. Moreover, the symbols for U, C and M are also
going to be under-represented as I've decided to allocate them two symbols each
when they don't strictly merit this treatment.

What about capitalization, punctuation and numerals?

In traditional substitution ciphers, capitalization and punctuation (except
spaces) are usually ignored with the result that the text can be challenging to
read even after you've deciphered it!

This is unsatisfactory and so I decided to allow for both in my cipher.

I found a study which indicated that, on average, every 100 letters would
include 94 lower case and 6 upper case letters. To be consistent with the other
assumptions, we therefore need three symbols (6%) to indicate that the following
letter should be capitalized. Bearing in mind that capital letters often begin
words or sentences, the relative frequency of capital letters in English is not
the same as the relative frequency of lower case letters but, in the interests
of simplicity, I decided to ignore this.

According to the Wikipedia article, the frequency of punctuation and numerals is
equivalent to a letter between 'A' and 'T' in the table and so needs four
symbols (say), split equally between the two, to indicate the nature of the
following symbol. However, we can use this 'following symbol' to 'even up' the
occurrence of the 10 letters which were identified as being under-represented in
the previous section which is a useful bonus.

By far the most common punctuation symbols are the comma and period. These are
followed by quotes, apostrophes and dashes. So, by allocating the least used
letters to the commonest punctuation marks, we can address the unbalance to some
extent.

Although one might expect that the numerals would occur with equal frequency, in
fact 0 and 1 are the most frequent and, in general, the smaller digits are
slightly more frequent than the larger ones. We can again use this fact to
further address the imbalance in the less frequently used letters.

So, the total number of symbols we need has now increased to: 64 + 3 + 2 + 2 =
71.

OK, but what are we going to do about initial, final and double letters?

Even though it's now going to be difficult for someone to apply the frequency
tables for initial and final letters to the cipher text, there's still the
problem of the initial and final words of the text as a whole.

Fortunately, this is easy to deal with - all we need to do is to add a random
number of 'noise'' symbols to the beginning and end of the cipher text, say
between 1 and 5 at each end (the average length of a word is about 5 letters).
These symbols will have the same frequency as they would otherwise have in the
middle of words.

For double letters, I decided to create a special symbol which doubles the
previous one so that adjacent symbols will always be different in the cipher
text.

This means that we need a grand total of 72 symbols in our 'alphabet'. As well
as the 62 alphanumeric symbols: 'A' to 'Z', 'a' to 'z' and '0' to '9', I added
10 other symbols which should be available on most keyboards, namely: !, #, $,
%, &, *, +, =, ? and @.

All 72 of these symbols have an equal chance of appearing in the noise symbols
at the beginning and end of the cipher text which makes them difficult to
distinguish from 'normal' symbols.

The same key will, of course, be used for both encryption and decryption and
will be 74 characters long to encompass the 72 symbol alphabet, plus the number
of noise symbols (which can be represented as a single upper case letter as
there are only 25 combinations) plus a further character which I'll discuss
later. I've also allowed for a special 'no noise' key (see later).

What about control codes, foreign letters, mathematical symbols and the like?

The only control codes which actually affect the layout of text nowadays are:
\r, \n, and \t which represent carriage return, new line and tab respectively.
All of these occur quite frequently in a long piece of text and it's common for
\r and \n to occur together.

I decided to represent \n by (capitals) space, \t by (numerals) space and \r by
(numerals) E. As 'space' and 'E ' can be represented by 7 and 6 symbols and
'capitals' and 'numerals' by 3 and 2 symbols respectively, there are a large
number of possible combinations available which will make these characters
difficult to spot.

The other control codes can be dealt with using a (punctuation)(double) prefix.

The 32 unicode punctuation symbols with code points between 160 and 191 fit
nicely with the symbols below 127 into the 'punctuation' category.

The 64 accented letters with code points between 192 and 255 can be dealt with
using a (capitals)(double) prefix and any other unicode characters with code
points above 255 can be represented by a (letter)(double) prefix followed by a
'base 64' three symbol code.

To give complete unicode coverage, the non-printing character DEL (127) is
represented by (numerals) P and the Euro symbol (Unicode 8364) which appears on
many European keyboards as (numerals) Y, rather than by the five symbol code
that would otherwise apply.

Where's the source code so I can try this?

Although some of this sounds complicated, in fact it's quite easy to code and
the download for this article contains the complete class, which I've called
FASC ('Frequency Adjusted Substitution Cipher') all of whose methods are static.

If you want to use this class in your projects, then it can either be pasted in
as it stands or it can be compiled to a class library first and then a reference
to the .dll added to your project.

How should I use this class in practice?

You can if you wish generate a key and then use that key to encrypt/decrypt a
string or a whole file. It's also possible to generate a special key which
produces cipher text without any noise, if you want to minimize the length of
the cipher text for some reason.

However, a better way of proceeding is to generate a key file which consists of
72 different keys, each of which contains a unique symbol (the 74th character
mentioned earlier). Each time a string or file is encrypted a key is selected at
random from this file and used to encrypt the plain text. The cipher text is
then prefixed with the symbol used (a single character), in a slightly disguised
form.

When you then decrypt the cipher text, the key symbol is read, the appropriate
key is loaded from the file and the decipherment then takes place using this
key.

The beauty of this system is that you never need to deal with the keys - it's
all done for you in the background. Moreover, if you were encrypting a bunch of
stuff in the same program, different keys would probably be used to encrypt each
one making it much more difficult for someone to decipher it.

Of course, you still have to know which key file was used to encrypt a piece of
text; there's no way to record that as the system is currently constituted. You
might also want to encrypt the key file using, say, windows data security and
bury it somewhere in the filing system where it cannot be easily identified.

By default, an unencrypted key file called default.fkf is generated and placed
in the same directory as the assembly itself.

Even if you use the single key system, then the 74th character in the key is
still used as a rough and ready way of telling whether the cipher text was
produced with this key or not.

Can I have an example?

Here's an example which generates an individual key and then encrypts the famous
'pangram' (a phrase containing all 26 letters of the alphabet) mentioned earlier
which has now been decorated with some upper case letters, numerals and
punctuation marks.

Clearly, a substitution cipher of this nature is not to going to be anywhere
near as 'cryptographically strong' as modern algorithms such as Rijndael but it
should certainly prevent casual inspection of confidential information and won't
be too easy for professional hackers to crack even if they know the underlying
basis. It also works quickly and on text of any length. You can even encode an empty string!

Although it's certainly possible to encrypt text written in other Western
European languages which use the Latin alphabet (including French, German,
Spanish, Portuguese and Italian) using this cipher, it will not be as secure
because the frequency of letters in those languages will differ from English.
Any native speaker of those languages would therefore be better developing his
or her own version of the cipher using the same principles.

Although , for completeness, the FASC cipher can encode all Unicode characters,
it is not suitable for encoding binary, as opposed to text, files.