The Caesar Cipher

Authors: Chris Savarese and Brian Hart '99

One of the simplest examples of a substitution cipher is the Caesar cipher, which is said to have been
used by Julius Caesar to communicate with his army. Caesar is
considered to be one of the first persons to have ever employed
encryption for the sake of securing messages. Caesar decided that
shifting each letter in the message would be his standard algorithm,
and so he informed all of his generals of his decision, and was then
able to send them secured messages. Using the Caesar Shift (3 to the
right), the message,

"RETURN TO ROME"

would be encrypted as,

"UHWXUQ WR URPH"

In this example, 'R' is shifted to 'U', 'E' is shifted to 'H', and
so on. Now, even if the enemy did intercept the message, it would be
useless, since only Caesar's generals could read it.

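Thus, the Caesar cipher is a shift cipher, since the ciphertext alphabet
is derived from the plaintext alphabet by shifting each letter a certain
number of places. For example, if we use a shift of 19, then we get the
following pair of plaintext (top row) and ciphertext (bottom row)
alphabets:

Plaintext   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Ciphertext  T U V W X Y Z A B C D E F G H I J K L M N O P Q R S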

To encipher a message, we perform a simple substitution by looking up
each of the message's letters in the top row and writing down the
corresponding letter from the bottom row. For example, the message

THE FAULT, DEAR BRUTUS, LIES NOT IN OUR STARS BUT IN OURSELVES.

would be enciphered as

MAX YTNEM, WXTK UKNMNL, EBXL GHM BG HNK LMTKL UNM BG HNKLXEOXL.

Essentially, each letter of the alphabet has been shifted nineteen
places ahead in the alphabet, wrapping around the end if necessary.
Notice that punctuation and blanks are not enciphered but are copied
over as themselves.
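
The enciphering rule just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name caesar_encipher is hypothetical, and the handling of lowercase letters is an assumption, since the examples in the text use capitals only.

```python
import string

def caesar_encipher(message: str, shift: int) -> str:
    """Shift each letter `shift` places ahead in the alphabet, wrapping past Z."""
    result = []
    for ch in message:
        if ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        elif ch in string.ascii_lowercase:
            # Assumption: lowercase letters are shifted within a-z.
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            # Punctuation and blanks are copied over as themselves.
            result.append(ch)
    return "".join(result)

plaintext = "THE FAULT, DEAR BRUTUS, LIES NOT IN OUR STARS BUT IN OURSELVES."
print(caesar_encipher(plaintext, 19))
# prints: MAX YTNEM, WXTK UKNMNL, EBXL GHM BG HNK LMTKL UNM BG HNKLXEOXL.
```

Taking the remainder modulo 26 is what produces the wrap-around at the end of the alphabet.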

Breaking a Caesar Cipher (Cryptanalysis)

Can a computer guess what shift was used in creating a Caesar cipher?
The answer, of course, is yes. But how does it work?

The unknown shift is one of 26 possible shifts. One technique might be
to try each of the 26 possible shifts and check which of these
resulted in readable English text. But this approach has
limitations. The main problem is that the computer would need a
comprehensive dictionary in order to be able to recognize the
words of any given cryptogram.
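
As a rough sketch of this brute-force idea, the Python below simply lists all 26 candidate decipherments and leaves the "is it readable English?" judgment to a human reader. The names caesar_decipher and all_shifts are hypothetical, and the sketch assumes an uppercase message.

```python
def caesar_decipher(ciphertext: str, shift: int) -> str:
    """Undo a Caesar shift by moving each capital letter `shift` places back."""
    out = []
    for ch in ciphertext:
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            # Blanks and punctuation were never enciphered, so keep them.
            out.append(ch)
    return "".join(out)

def all_shifts(ciphertext: str) -> list[tuple[int, str]]:
    """Return (shift, candidate plaintext) for every possible shift."""
    return [(s, caesar_decipher(ciphertext, s)) for s in range(26)]

for shift, candidate in all_shifts("WXTK UKNMNL"):
    print(f"{shift:2d}  {candidate}")
# Among the 26 lines printed, the one for shift 19 reads DEAR BRUTUS.
```

Picking the readable line automatically is the hard part, which is why the dictionary problem mentioned above arises.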

A better approach makes use of statistical data about English letter
frequencies. It is known that in a text of 1000 letters, the various
letters of the English alphabet occur with about the following
frequencies:

Letter       A   B   C   D   E   F   G   H   I   J   K   L   M
Frequency   73   9  30  44 130  28  16  35  74   2   3  35  25

Letter       N   O   P   Q   R   S   T   U   V   W   X   Y   Z
Frequency   78  74  27   3  77  63  93  27  13  16   5  19   1

This information can be useful in deciding the most likely shift used
on a given enciphered message. Suppose the enciphered message is:

K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ.

We can tally the frequencies of the letters in this enciphered message:

Letter       A   B   C   D   E   F   G   H   I   J   K   L   M
Count        0   1   2   4   3   0   0   0   3   0   4   1   0

Letter       N   O   P   Q   R   S   T   U   V   W   X   Y   Z
Count        4   1   4   3   1   6   0   0   4   0   7   5   0
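
A tally like this is easy to produce by machine. The following is a minimal sketch using Python's standard collections.Counter; the function name tally_letters is hypothetical, and folding lowercase into uppercase is an assumption not stated in the text.

```python
from collections import Counter
import string

def tally_letters(text: str) -> dict[str, int]:
    """Count how often each letter A-Z occurs, ignoring blanks and punctuation."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    # Include letters that never occur, so the tally always has 26 entries.
    return {letter: counts[letter] for letter in string.ascii_uppercase}

ciphertext = "K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ."
tally = tally_letters(ciphertext)
print(tally["X"], tally["S"], tally["T"])
# prints: 7 6 0
```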

We can now shift the two tallies against each other so that the large
and small frequencies from each distribution roughly match up. For
example, if we try a shift of ten on the previous example, we get the
following correspondence between English language frequencies and the
letter frequencies in the message.

English Language Frequencies (top two rows) and Enciphered Message
Frequencies (bottom two rows)

English      A   B   C   D   E   F   G   H   I   J   K   L   M
Frequency   73   9  30  44 130  28  16  35  74   2   3  35  25
Cipher       K   L   M   N   O   P   Q   R   S   T   U   V   W
Count        4   1   0   4   1   4   3   1   6   0   0   4   0

English      N   O   P   Q   R   S   T   U   V   W   X   Y   Z
Frequency   78  74  27   3  77  63  93  27  13  16   5  19   1
Cipher       X   Y   Z   A   B   C   D   E   F   G   H   I   J
Count        7   5   0   0   1   2   4   3   0   0   0   3   0
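
The side-by-side comparison for a trial shift can be generated rather than written out by hand. The sketch below is one possible way to do it; the name align is hypothetical, and ENGLISH_PER_1000 simply restates the English frequencies tabulated in the text.

```python
from collections import Counter
import string

# English letter counts per 1000 letters, A through Z, as tabulated in the text.
ENGLISH_PER_1000 = [73, 9, 30, 44, 130, 28, 16, 35, 74, 2, 3, 35, 25,
                    78, 74, 27, 3, 77, 63, 93, 27, 13, 16, 5, 19, 1]

def align(ciphertext: str, shift: int) -> list[tuple[str, int, str, int]]:
    """Pair each English letter with the cipher letter `shift` places ahead of it."""
    letters = string.ascii_uppercase
    counts = Counter(ch for ch in ciphertext.upper() if ch in letters)
    rows = []
    for i, plain in enumerate(letters):
        cipher = letters[(i + shift) % 26]
        rows.append((plain, ENGLISH_PER_1000[i], cipher, counts[cipher]))
    return rows

ciphertext = "K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ."
for plain, english, cipher, count in align(ciphertext, 10):
    print(f"{plain} {english:4d}   {cipher} {count:2d}")
```

Each printed line shows an English letter with its frequency next to the cipher letter it would become under the trial shift, so peaks and bare spots can be compared at a glance.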

Note that in this case the large frequencies for cipher X and Y
correspond to large frequencies for English N and O, and the bare spots
for cipher T and U correspond to bare spots for English J and K. Also, an
isolated large frequency for cipher S corresponds to a similar one for
English I. In view of this evidence we needn't worry too much about the
drastic mismatch for English E, which is usually the most frequent
letter in a sample of English text.

If we now apply this substitution to the message we get:

A TALE TOLD BY AN IDIOT, FULL OF SOUND AND FURY, SIGNIFYING NOTHING.

Using the Chi-square Statistic

The chi-square statistic allows us to measure how closely a shift of
the English frequency distribution matches the frequency distribution
of the secret message. Here's an algorithm for computing the chi-square
statistic, with the letters of the alphabet numbered from 0 to 25:

Let ef(c) stand for the English frequency of letter c of the alphabet

Let mf(c) stand for the frequency of letter c in the message

For each possible shift s between 0 and 25:

    Set ChiSquare(s) to 0

    For each letter c of the alphabet:

        Add the square of mf((c + s) mod 26), divided by ef(c), to ChiSquare(s)

That is, for a given letter, say 'a', we take the message frequency of
the letter that 'a' becomes under the shift, square it, and then divide
by the English frequency of 'a'. For a given shift, say 5, we do this
for each of the 26 letters of the alphabet and add up the results to
get ChiSquare(5). Repeating this for every possible shift gives 26
different chi-square values. The shift s for which the number
ChiSquare(s) is smallest is the most likely candidate for the shift
that was used to encipher the message. Because the total number of
letters in the message is the same for every shift, this sum differs
from the usual chi-square sum of (observed - expected)^2 / expected only
by a constant scale and offset, so both are smallest for the same shift.
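
The procedure above might be sketched in Python as follows. This is a sketch under stated assumptions rather than a definitive implementation: the names chi_square_scores and most_likely_shift are hypothetical, raw letter counts stand in for mf(c) (rescaling them would not change which shift scores lowest), and ENGLISH_PER_1000 restates the English frequencies given earlier in the text.

```python
from collections import Counter
import string

# ef(c): English letter counts per 1000 letters, A through Z, from the text.
ENGLISH_PER_1000 = [73, 9, 30, 44, 130, 28, 16, 35, 74, 2, 3, 35, 25,
                    78, 74, 27, 3, 77, 63, 93, 27, 13, 16, 5, 19, 1]

def chi_square_scores(ciphertext: str) -> list[float]:
    """Return ChiSquare(s) for every shift s from 0 to 25."""
    letters = string.ascii_uppercase
    counts = Counter(ch for ch in ciphertext.upper() if ch in letters)
    mf = [counts[letter] for letter in letters]  # mf(c) as raw counts
    scores = []
    for s in range(26):
        # Sum of mf((c + s) mod 26)^2 / ef(c) over all letters c.
        total = sum(mf[(c + s) % 26] ** 2 / ENGLISH_PER_1000[c] for c in range(26))
        scores.append(total)
    return scores

def most_likely_shift(ciphertext: str) -> int:
    """Pick the shift with the smallest score (smallest shift wins a tie)."""
    scores = chi_square_scores(ciphertext)
    return min(range(26), key=lambda s: scores[s])

ciphertext = "K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ."
print(most_likely_shift(ciphertext))
# prints: 10
```

Rare English letters such as J, Q and Z have small ef(c), so any shift that lines them up with common cipher letters is penalized heavily. With very short messages the smallest score can still point to the wrong shift, so the result is best treated as a strong guess to be checked by deciphering.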