Simple substitution

One of the most common (and very easy to crack) ciphers is
substitution. One sometimes sees these in a newspaper somewhere near the
crossword puzzle. This scheme is also the one used in Poe's story ``The Gold
Bug'', and the Sherlock Holmes ``Mystery of the Dancing Men''. The idea is
straightforward: choose a rearrangement of the letters of the alphabet, and
replace each letter in the plaintext by its corresponding one.

Such substitutions are very simple to implement in Maple. We will use the
CharacterMap function from the StringTools package4.2which does exactly what we want.
First, we define two ``alphabets''- one for our plain text, and one for our
encrypted text. The second is a permutation of the first.

Note that our message contains a spaces which are preserved in the
encryption process, because the CharacterMap function only modifies
those characters which are found in the first string. If a character isn't
found, it is left alone. Thus, if we try to scramble a message containing
punctuation or upper-case letters, they aren't modified by Scramble.
Since our Cryptabet is all upper-case, a message which began with
mixed case will be distorted.

>
Scramble("He said I need direction; I head for the door.");
Unscramble(

It is traditional in pen-and-paper cryptography to use a 26-letter alphabet
(that is, to remove spaces and punctuation and ignore any distinction
between upper and lower case letters). If you wish to do this, it could be
done4.4using LowerCase (which translates all letters to lower case),
and a combination of Select and IsLower to remove anything
that isn't a letter.

In general we won't adhere to this convention in this chapter.
Rather, we will treat spaces and punctuation like any other
character. This is not a common approach, but one we happen to like.

There are two big drawbacks to such a substitution cipher. One
primary problem is that the key for enciphering and deciphering is the
permutation: you need to remember an ordering of the entire alphabet.

A second problem, shared with all such monoalphabetic substitutions
(i.e., those that substitute one letter for another), is that they
aren't that hard to break. If you have a long enough message,
you can use the fact that certain letters occur much more frequently
than others to help tease out the original text. In English text
the letter ``e'' is the occurs most frequently, followed by ``t'',
``a'', ``n'', ``i'', and so on. Spaces occur about twice as often as
the letter ``e'', assuming they are encrypted, rather than just
omitted. Once you guess a few letters, common words like ``the'' and
``and'' fall into place, giving more clues.

A variation on the arbitrary permutation discussed above is the ``Caesar
cipher'', named after Julius Caesar, who supposedly invented it himself.
This is also what was done by a ``Captain Midnight Secret Decoder Ring'',
popular in the 1930s and 40s. Here we convert our alphabet to numeric
equivalents (with, say A=0, B=1, and so on), add an offset to each numeric
equivalent (legend has it that Caesar used an offset of 3), then re-encode
the numbers as letters. We do the addition modulo the length of the
alphabet; that is, when we shift a Y by 3, we get B, since
24 + 3 = 27,
and
27 1 (mod 26). However, we won't always restrict ourselves
to a 26 letter alphabet-- we may want punctuation, numbers, and to preserve
upper- and lower-case distinctions.4.5

While we could easily implement the Caesar cipher without converting our
text to numbers, doing so makes it easier to understand and sets us up for
more complicated ciphers we will need later.

Treating characters as numbers

Computers already have a built-in numeric representation of the
alphabet-- the ASCII encoding.4.6At their lowest level, digital computers operate only on numeric
quantities-- they just manipulate ones and zeros. The smallest piece
of information is the state of a switch: it is off (0) or it is on
(1). This is called a bit, which is short for binary
digit. A group of 8 bits is called a byte, and is a
single ``character'' of information.4.7A byte can have 28 = 256 different states:

00000000, 00000001, 00000010, 00000011, ..., 11111111

which are generally interpreted either as representing the integers
0 through 255, or the highest bit is often taken to represent the
sign, in which case we can represent integers between -128 and
127. (We get an ``extra'' negative number because we don't need
-0, so the byte 10000000 can mean -128 instead.)
For notational convenience, bytes are often written in hexadecimal
(that is, in base 16), grouping the bits into two blocks of 4 bits
(sometimes called a nybble, half a byte), and using the
characters A-F for the integers 10-15. Thus, the byte
01011111 would be written as 5F in hexadecimal, since 0101 means
22 + 20 = 5, and 1111 is
23 + 22 + 21 + 20 = 15.

So how does a computer represent characters? There is an agreed upon
standard encoding of characters, punctuation, and control codes into
the first 127 integers. This is called ASCII,4.8an acronym for American
Standard Code for Information Interchange. Extended ASCII encodes
more characters into the range 128-255 (this is where you would find
characters for å and ±, for example), although the characters
found in this range vary a bit. Not all programs can deal successfully with
characters in the extended range4.9

Maple has a built-in routine to allow us to convert characters to and from
their ASCII equivalents. If it is called with a list of numbers, it returns
a string of characters, and if called with a character string, out comes a
list of numbers. For example,

>
convert("How do I work this?",bytes);

>
convert([72, 117, 104, 63],bytes);

We can use this to make a list of the ASCII codes. Maple prints a
when the ASCII code is not a printable character. We make the table
for characters up through 127 because the meanings in extended ASCII differ
depending on what character set you use. Feel free to make your own table
of all 255 characters. (In the interests of making a nicely formatted
table, we have used a somewhat complicated command. The much simpler
seq([i,convert([i],bytes)],i=0..127); would have produced the
same information, just not as neatly.)

Note that some of the characters (such as ASCII code 10) print as a backslash
and a letter, like
\n. These are codes for special control
characters:
\n indicates a newline,
\t is a tab, and
so on. Since Maple indicates a string by printing it inside double
quotes, it indicates the double quote4.10character (ASCII 34) using
\'', and the backslash character (ASCII 92) with a pair of
backslashes. While it doesn't print out in Maple, ASCII 127 is the control
sequence DEL (sometimes printed as
^?). This sometimes is
interpreted as ``delete the previous character''; if your output
device interprets it that way, characters can seem to mysteriously vanish.

Another special character is ASCII code 0: note that it prints as nothing,
not even a blank space (ASCII 32 is the space character).
This is the null character, and is typically used internally
to indicate the end of a string. If a null occurs in the middle of a string,
usually what comes after
it won't print out unless you take special steps. To see this for
yourself, try a command like
convert([73,116,39,115,0,103,111,110,101,33],bytes).
You'll see that the characters following the 0 don't print out.

The StringTools package also contains Char which gives the
character corresponding to a decimal number, and Ord which gives the
number corresponding to a character. We'll generally use
convert, although it makes little difference.

Now we are nearly ready to write a version of the Caesar cipher.
The basic idea is that for each character of the plaintext, we convert it a
numerical representation, add a pre-specified offset modulo 127, then
convert back to characters. Unlike the Scramble
function we wrote in §2.1, we needn't declare
an alphabet explicitly because we will use the ASCII codes. However, in
addition to the Maple command
convert( ,bytes), we will use another new command, map.

The map command allows us to apply a function to every element of a
vector or list in one call. For example, to compute the square root
of each of the first 10 integers, we could do

>
map(sqrt,[1,2,3,4,5,6,7,8,9,10]);

If the function we wish to apply takes two variables, we can
give the second one as a third argument to map. For example, the
command

>
map(modp,[1,2,3,4,5,6,7,8,9,10],3);

computes n mod 3 as n ranges over the list
[1, 2,..., 10].
(Recall that n mod p is the remainder after dividing n by p.)

We finally are ready.
See if you can understand it before reading the
explanation below. Remember to read from the middle outward.
Note that we are defining the function we are mapping directly as
the first argument.

If you don't see what is going on in Caesar, try to think of it as a
composition of several functions, each of which you DO understand. That is,
Caesar works as follows

First, we convert the input string (called plain) into a list
of numbers, using the call convert(plain,bytes). Note that
this is done on the third line, the innermost function.

Next, we apply the function
x ->(x+shift-1) mod 127 + 1
to each of numbers in the resulting list using map. This
call to map has two arguments: the definition of the shift,
and what to apply it to (the innermost function
convert(plain,bytes)).

Finally, we convert the list back to characters again. This is the
outermost function: it begins on the second line (immediately after
the -> and ends with the final parenthesis.

Viewing it as a composition, we have

: = convert-1omapoconvert.

Note that we didn't really compute
; instead we subtracted one
from the sum, computed mod 127, then added 1 again. Why?
Consider what would happen if, for example, our character was ``k''
(ASCII code 107) and we shifted by 20. Then
(107 + 20) mod 127
gives 0, which is the code for the null byte, indicating the end of
a string. This would be a problem. So instead, we add 1 afterward
to ensure that our answer is in the range
1,..., 127. And just
to avoid confusing ourselves, we subtract the 1 first, so that
shifting by 0 doesn't change anything.

Aside from the fact that this function is a bit confusing to
read and offers essentially no cryptographic security, what else is wrong
with it? Consider the following example:

>
Caesar("Who put the benzedrine in Mrs. Murphy's Ovaltine?", 86);

In addition to the fact that the result looks funny, if we were to try to
decipher this from printed output, we would have a little (but not much)
trouble, because the characters ``.'' and ``?'', after encoding, print out
as .4.11 This happens because the numeric values of these
characters, after shifting by 86, are 5 and 22, respectively. Neither
of those corresponds to a printing character, so we can't tell them apart.
Of course, if we had no intention of dealing with our encoded message in
printed form, this would be no problem. The actual character values are
different, they just print out the same.

How can we get around this? We can restrict ourselves to printing
characters using a variation of the idea in Scramble of
§2.1, that is, giving an explicit Alphabet
that we will deal in.

Before doing this, however, it will be helpful to discuss the proc
command, which allows us to construct functions that consist of more than
one Maple command.

Footnotes

The StringTools package was introduced in Maple 7, and the string
data type was introduced in Maple V release 5. Everything we do in this
chapter can easily be implemented without using StringTools- in
fact, the original version of this chapter was written for Maple Vr4
using names rather than strings.
Since Maple 6 is still in widespread use (Spring 2002), we'll
try to include alternatives to using this package when practical.

In Maple Vr5 and Maple 6, we could define CharacterMap as
CharacterMap := (old,new,txt)->cat(seq(new[SearchText(txt[i],old)],i=1..length(txt)));
This will only work if all characters in the txt correspond to ones
in old, so a space would have to be added to both Alphabet
and Cryptabet to get this example to work.

For example, the title of this chapter (fsqFsHn sGGousG) is actually the
result of using a Caesar cipher on its original title. A 53-character
alphabet consisting of all the upper-case letters, a space, and all the
lower-case letters was used, so the space in the middle may not correspond
to a space in the original. The three Gs that occur should be a good hint
to decoding it.

In fact, while most computers these days use ASCII, it is not
the only such arrangement. IBM mainframes, for example,
primarily use another encoding called EBCDIC. But this makes
no real difference to our discussion.

The next larger unit in a computer is a word. However,
different makes of computers have different sizes of words.
Most computers today have 4 byte words (32 bits), but word sizes
of 2 bytes (16 bits) and 8 bytes (64 bits) are not uncommon.
8-bit bytes are now essentially universal although in the
early days of computing, there were computers which used other
sizes, such as 6 bit bytes.

ASCII is not the only such standard, although it is currently
the most commonly used. Prior to the widespread use of personal
computers in the 1980s, EBCDIC (Extended Binary Coding for Data
Interchange and Communication) tended to be in more common use,
because it was what IBM mainframes used. There are many other
such assignments.

In the late 1980s and early
1990s, there was a big push to make programs ``8-bit clean'', that
is, to be able to handle these ``unusual'' characters properly.
There are a number of distinct assignments of characters to
the upper 128 positions. One common one is Latin-1
(ISO 8859-1), which contains the characters used in the
majority of western European languages. Other assignments include
characters for Scandinavian languages, Arabic, Cyrillic,
Greek, Hebrew, Thai, and so on.
Since the late 1990s, a multi-byte character encoding called
Unicode (ISO 10646) has been in use; this universal
character set allows computers to handle alphabets (such as
Chinese) that need more than 255 characters.
While use of Unicode is becoming more widespread, single byte
character sets like ASCII will likely predominate for quite a
long while.

The ASCII code does not have distinct codes for the open-quote (``)
and close-quote ('') characters used in typesetting. Neither do most
computer keyboards. However, the apostrophe (') and grave accent or
backquote (`) characters have distinct ASCII codes (39 and 96, respectively).