The gaffer says something longer and more
complicated. After a while, Waterhouse (now wearing his cryptoanalyst hat,
searching for meaning midst apparent randomness, his neural circuits exploiting
the redundancies in the signal) realizes that the man is speaking heavily
accented English.

“Cryptonomicon” by Neal Stephenson

A message encrypted with the transposition cipher can have
thousands of possible keys. Your computer can still easily brute-force this
many keys, but you would then have to look through thousands of decryptions to
find the one correct plaintext. This is a big problem for the brute-force
method of cracking the transposition cipher.

When the computer decrypts a message with the wrong key, the
resulting plaintext is garbage text. We need to program the computer to be able
to recognize if the plaintext is garbage text or English text. That way, if the
computer decrypts with the wrong key, it knows to go on and try the next
possible key. And when the computer tries a key that decrypts to English text,
it can stop and bring that key to the attention of the cryptanalyst. Now the
cryptanalyst won’t have to look through thousands of incorrect decryptions.

A computer can’t understand English. At least, not in the way that human beings like
you or I understand English. Computers don’t really understand math, chess, or
lethal military androids either, any more than a clock understands lunchtime.
Computers just execute instructions one after another. But these instructions
can mimic very complicated behaviors that solve math problems, win at chess, or
hunt down the future leaders of the human resistance.

Ideally, what we need is a Python function (let’s call it isEnglish()) that has a string passed to it and then returns
True if the string is English text and False if it’s random gibberish. Let’s take a look at some
English text and some garbage text and try to see what patterns the two have:

One thing we can notice is that the English text is made up
of words that you could find in a dictionary, but the garbage text is made up
of words that you won’t. Splitting up the string into individual words is easy.
There is already a Python string method named split()
that will do this for us (this method will be explained later). The split() method just sees when each word begins or ends by
looking for the space characters. Once we have the individual words, we can
test to see if each word is a word in the dictionary with code like this:
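As a rough illustration of that idea, the check could be written as one long chain of comparisons. The function name isHardCodedWord and the three sample words are made up for this sketch; a real version would need tens of thousands of comparisons.

```python
# Hypothetical sketch: compare a word against hard-coded English words.
def isHardCodedWord(word):
    word = word.upper()
    if word == 'AARDVARK' or word == 'ABACUS' or word == 'ABANDON':
        return True
    # ...and so on for every other word in the English language.
    return False

print(isHardCodedWord('abacus'))   # True
print(isHardCodedWord('xqzvk'))    # False
```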

We can write code like that, but we probably
shouldn’t. The computer won’t mind running through all this code, but you
wouldn’t want to type it all out. Besides, somebody else has already typed out
a text file full of nearly all English words. These text files are called dictionary
files. So we just need to write a function that checks if the
words in the string exist somewhere in that file.

Remember, a dictionary file is a text file that
contains a large list of English words. A dictionary value is a Python
value that has key-value pairs.

Not every word will exist in our “dictionary file”. Maybe
the dictionary file is incomplete and doesn’t have the word, say, “aardvark”.
There are also perfectly good decryptions that might have non-English words in
them, such as “RX-686” in our above English sentence. (Or maybe the plaintext
is in a different language besides English. But we’ll just assume it is in
English for now.)

And garbage text might just happen to have an English word
or two in it by coincidence. For example, it turns out the word “augur” means a
person who tries to predict the future by studying the way birds are flying.
Seriously.

So our function will not be foolproof. But if most of the
words in the string argument are English words, it is a good bet to say that
the string is English text. It is very unlikely that a ciphertext will
decrypt to English if decrypted with the wrong key.

The dictionary text file will have one word per line in
uppercase. It will look like this:

AARHUS
AARON
ABABA
ABACK
ABAFT
ABANDON
ABANDONED
ABANDONING
ABANDONMENT
ABANDONS

…and so on. You can download this entire file (which has
over 45,000 words) from http://invpy.com/dictionary.txt.

Our isEnglish() function will
have to split up a decrypted string into words, check if each word is in a file
full of thousands of English words, and if a large enough share of the words are English
words, then we will say that the text is in English. And if the text is in
English, then there’s a good bet that we have decrypted the ciphertext with the
correct key.

And that is how the computer can understand if a string is
English or if it is gibberish.

The detectEnglish.py program
that we write in this chapter isn’t a program that runs by itself. Instead, it
will be imported by our encryption programs so that they can call the detectEnglish.isEnglish() function. This is why we don’t
give detectEnglish.py a main()
function. The other functions in the program are all provided for isEnglish() to call.

7. # (There must be a "dictionary.txt" file in this directory with all English
8. # words in it, one word per line. You can download this from
9. # http://invpy.com/dictionary.txt)

These comments at the top of the file give instructions to
programmers on how to use this module. They give the important reminder that if
there is no file named dictionary.txt in the same
directory as detectEnglish.py then this module will
not work. If the user doesn’t have this file, the comments tell them they can
download it from http://invpy.com/dictionary.txt.

detectEnglish.py

10. UPPERLETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
11. LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'

Lines 10 and 11 set up a few variables that are constants,
which is why they have uppercase names. UPPERLETTERS
is a variable containing the 26 uppercase letters, and LETTERS_AND_SPACE
contains these letters (and the lowercase letters returned from UPPERLETTERS.lower()) as well as the space character, the
tab character, and the newline character. The tab and newline characters are
represented with the escape characters \t and \n.

detectEnglish.py

13. def loadDictionary():
14.     dictionaryFile = open('dictionary.txt')

The dictionary file sits on the user’s hard drive, but we
need to load the text in this file as a string value so our Python code can use
it. First, we get a file object by calling open()
and passing the string of the filename 'dictionary.txt'.
Before we continue with the loadDictionary() code,
let’s learn about the dictionary data type.

The dictionary data type has values which can
contain multiple other values, just like lists do. In list values, you use an
integer index value to retrieve items in the list, like spam[42].
For each item in the dictionary value, there is a key used to retrieve it.
(Values stored inside lists and dictionaries are also sometimes called items.)
The key can be an integer or a string value, like spam['hello']
or spam[42]. Dictionaries let us organize our
program’s data with even more flexibility than lists.

A dictionary’s values are typed out as key-value pairs,
which are separated by colons. Multiple key-value pairs are separated by
commas. To retrieve values from a dictionary, just use square brackets with the
key in between them (just like indexing with lists). Try typing the following
into the interactive shell:

>>> spam = {'key1':'This is a value', 'key2':42}
>>> spam['key1']
'This is a value'
>>> spam['key2']
42
>>>

It is important to know that, just as with lists, variables
do not store dictionary values themselves, but references to dictionaries. The
example code below has two variables with references to the same dictionary:
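A minimal sketch of that behavior follows. The variable names spam and eggs and the sample key are placeholders chosen for this illustration.

```python
spam = {'hello': 42}
eggs = spam              # eggs now refers to the same dictionary, not a copy
eggs['hello'] = 99       # change the dictionary through eggs...
print(spam)              # {'hello': 99}  ...and spam sees the change too
print(spam is eggs)      # True
```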

The len() function can tell you
how many items are in a list or how many characters are in a string, but it can
also tell you how many items are in a dictionary as well. Try typing the
following into the interactive shell:
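For instance, a short session along these lines shows the count growing as keys are added (the keys and values here are arbitrary sample data):

```python
spam = {}
print(len(spam))   # 0
spam['name'] = 'Al'
spam['pet'] = 'Zophie the cat'
spam['age'] = 89
print(len(spam))   # 3
```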

The in operator can also be used
to see if a certain key value exists in a dictionary. It is important to
remember that the in operator checks if a key exists
in the dictionary, not a value. Try typing the following into the interactive
shell:
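A small example of the difference between checking keys and checking values, using made-up sample data:

```python
eggs = {'foo': 'milk', 'bar': 'bread'}
print('foo' in eggs)                 # True: 'foo' is a key
print('milk' in eggs)                # False: 'milk' is a value, not a key
print('blah blah blah' not in eggs)  # True
print('milk' in eggs.values())       # True: values() lets you search the values
```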

Dictionaries are like lists in many ways, but there are a
few important differences:

1. Dictionary items are not accessed by position. There is no integer index
for a “first” or “last” item in a dictionary like there is in a list. (Python
3.7 and later do remember the order in which keys were inserted, but older
versions keep no order at all.)

2. Dictionaries do not have concatenation with the + operator. If
you want to add a new item, you can just use indexing with a new key. For
example, foo['a new key'] = 'a string'.

3. Lists only have integer index values that range from 0 to
the length of the list minus one. But dictionaries can have any key. If you
have a dictionary stored in a variable spam, then
you can store a value in spam[3] without needing
values for spam[0], spam[1],
or spam[2] first.

In the loadDictionary() function,
we will store all the words in the “dictionary file” (as in, a file that has
all the words in an English dictionary book) in a dictionary value (as in, the
Python data type.) The similar names are unfortunate, but they are two
completely different things.

We could have also used a list to store the string values of
each word from the dictionary file. The reason we use a dictionary is because
the in operator works faster on dictionaries than
lists. Imagine that we had the following list and dictionary values:

>>> listVal = ['spam', 'eggs', 'bacon']
>>> dictionaryVal = {'spam':0, 'eggs':0, 'bacon':0}

Python can evaluate the expression 'bacon'
in dictionaryVal a little bit faster than 'bacon' in
listVal. The reason is technical and you don’t need to know it for the
purposes of this book (but you can read more about it at http://invpy.com/listvsdict).
This faster speed doesn’t make that much of a difference for lists and
dictionaries with only a few items in them like in the above example. But our detectEnglish module will have tens of thousands of items,
and the expression word in ENGLISH_WORDS will be
evaluated many times when the isEnglish() function
is called. The speed difference really adds up for the detectEnglish
module.
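One way to see the difference for yourself is to time both lookups with the standard timeit module. This is only a sketch: the 50,000 made-up "words" stand in for the real dictionary file, and the exact timings will vary from computer to computer.

```python
import timeit

# Stand-in for a big word list (assumption: 50,000 fake "words").
words = [str(n) for n in range(50000)]
listVal = words
dictionaryVal = {}
for w in words:
    dictionaryVal[w] = None

target = '49999'   # the last item, so the list has to be searched to the end

listTime = timeit.timeit(lambda: target in listVal, number=200)
dictTime = timeit.timeit(lambda: target in dictionaryVal, number=200)

print('list:', listTime, 'seconds')
print('dict:', dictTime, 'seconds')
```

On most machines the dictionary lookup should come out far ahead, and the gap widens as the number of items grows.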

The split() string method returns
a list of several strings. The “split” between each string occurs wherever a
space is. For an example of how the split() string
method works, try typing this into the shell:
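For illustration, here is split() called on a sample sentence (the sentence itself is just an example chosen to have eight words):

```python
sentence = 'My very energetic mother just served us Nutella.'
words = sentence.split()
print(words)
# ['My', 'very', 'energetic', 'mother', 'just', 'served', 'us', 'Nutella.']
print(len(words))   # 8
```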

The result is a list of eight strings, one string for each
of the words in the original string. The spaces are dropped from the items in
the list (even if there is more than one space). You can pass an optional
argument to the split() method to tell it to split
on a different string other than just a space. Try typing the following into
the interactive shell:

>>> 'helloXXXworldXXXhowXXXareXXyou?'.split('XXX')
['hello', 'world', 'how', 'areXXyou?']
>>>

detectEnglish.py

16.     for word in dictionaryFile.read().split('\n'):

Line 16 is a for loop that will
set the word variable to each value in the list dictionaryFile.read().split('\n'). Let’s break this
expression down. dictionaryFile is the variable that
stores the file object of the opened file. The dictionaryFile.read()
method call will read the entire file and return it as a very large string
value. On this string, we will call the split()
method and split on newline characters. This split()
call will return a list value made up of each word in the dictionary file
(because the dictionary file has one word per line.)

This is why the expression dictionaryFile.read().split('\n')
will evaluate to a list of string values. Since the dictionary text file has
one word on each line, the strings in the list that split()
returns will each have one word.

None is a special value that you
can assign to a variable. The None value represents the lack of a
value. None is the only value of the data type NoneType.
(Just like how the Boolean data type has only two values, the NoneType data
type has only one value, None.) It can be very
useful to use the None value when you need a value
that means “does not exist”. The None value is
always written without quotes and with a capital “N” and lowercase “one”.

For example, say you had a variable named quizAnswer which holds the user's answer to some
True-False pop quiz question. You could set quizAnswer
to None if the user skipped the question and did not
answer it. Using None would be better because if you
set it to True or False
before assigning the value of the user's answer, it may look like the user gave
an answer for the question even though they didn't.

Calls to functions that do not return anything (that is,
they exit by reaching the end of the function and not from a return statement) will evaluate to None.
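A short sketch of both uses of None described above. The function name sayHello is invented for this example; quizAnswer is the variable from the pop quiz scenario.

```python
def sayHello():
    print('Hello!')      # no return statement in this function

result = sayHello()
print(result is None)    # True
print(type(None))        # <class 'NoneType'>

quizAnswer = None        # the user has not answered the question yet
if quizAnswer is None:
    print('Question skipped.')
```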

detectEnglish.py

17.         englishWords[word] = None

In our program, we only use a dictionary for the englishWords variable so that the in
operator can find keys in it. We don’t care what is stored for each key, so we
will just use the None value. The for loop that starts on line 16 will iterate over each
word in the dictionary file, and line 17 will use that word as a key in englishWords with None stored
for that key.

After the for loop finishes, the englishWords dictionary will have tens of thousands of
keys in it. At this point, we close the file object since we are done reading
from it and then return englishWords.
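Putting the pieces described so far together, the whole function might look roughly like the sketch below. The lines not shown in the listings above (creating the empty dictionary, closing the file, and returning the result) are reconstructed from the surrounding description. The filename parameter is an addition for this sketch so the function can be tried on a small sample file.

```python
import os
import tempfile

def loadDictionary(filename='dictionary.txt'):
    # filename is a parameter only in this sketch; the text hard-codes it.
    dictionaryFile = open(filename)
    englishWords = {}
    for word in dictionaryFile.read().split('\n'):
        englishWords[word] = None   # the value is never used, only the key
    dictionaryFile.close()
    return englishWords

# Try it on a tiny stand-in for dictionary.txt.
samplePath = os.path.join(tempfile.mkdtemp(), 'sample_dictionary.txt')
sampleFile = open(samplePath, 'w')
sampleFile.write('AARON\nABACK\nABANDON')
sampleFile.close()

sampleWords = loadDictionary(samplePath)
print('ABACK' in sampleWords)   # True
print('XQZVK' in sampleWords)   # False
```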

detectEnglish.py

21. ENGLISH_WORDS = loadDictionary()

Line 21 calls loadDictionary()
and stores the dictionary value it returns in a variable named ENGLISH_WORDS. We want to call loadDictionary()
before the rest of the code in the detectEnglish
module, but Python has to execute the def statement
for loadDictionary() before we can call the
function. This is why the assignment for ENGLISH_WORDS
comes after the loadDictionary() function’s code.

detectEnglish.py

24. def getEnglishCount(message):
25.     message = message.upper()
26.     message = removeNonLetters(message)
27.     possibleWords = message.split()

The getEnglishCount() function
will take one string argument and return a float value indicating the fraction of
words in it that are recognized as English. The value 0.0 will
mean none of the words in message are English words
and 1.0 will mean all of the words in message are English words, but most likely getEnglishCount() will return something in between 0.0 and 1.0. The isEnglish() function will use this return value to help decide
whether it returns True or False.

First we must create a list of individual word strings from
the string in message. Line 25 will convert it to
uppercase letters. Then line 26 will remove the non-letter characters from the
string, such as numbers and punctuation, by calling removeNonLetters().
(We will see how this function works later.) Finally, the split()
method on line 27 will split up the string into individual words that are
stored in a variable named possibleWords.

So if the string 'Hello there. How are
you?' was passed when getEnglishCount() was
called, the value stored in possibleWords after lines
25 to 27 execute would be ['HELLO', 'THERE', 'HOW', 'ARE',
'YOU'].

detectEnglish.py

29.     if possibleWords == []:
30.         return 0.0 # no words at all, so return 0.0

If the string in message was
something like '12345', all of these non-letter
characters would have been taken out of the string returned from removeNonLetters(). The call to removeNonLetters()
would return the blank string, and when split() is
called on the blank string, it will return an empty list.

Line 29 does a special check for this case, and returns 0.0. This is done to avoid a “divide-by-zero” error (which
is explained later on).

detectEnglish.py

32.     matches = 0
33.     for word in possibleWords:
34.         if word in ENGLISH_WORDS:
35.             matches += 1

The float value that is returned from getEnglishCount()
ranges between 0.0 and 1.0.
To produce this number, we will divide the number of the words in possibleWords that are recognized as English by the total
number of words in possibleWords.

The first part of this is to count the number of recognized
English words in possibleWords, which is done on
lines 32 to 35. The matches variable starts off as 0. The for loop on line 33 will
loop over each of the words in possibleWords, and checks
if the word exists in the ENGLISH_WORDS dictionary.
If it does, the value in matches is incremented on
line 35.

Once the for loop has completed,
the number of English words is stored in the matches
variable. Note that technically this is only the number of words that are
recognized as English because they existed in our dictionary text file. As far
as the program is concerned, if the word exists in dictionary.txt,
then it is a real English word. And if it doesn’t exist in the dictionary file,
it is not an English word. We are relying on the dictionary file to be accurate
and complete in order for the detectEnglish module
to work correctly.

Returning a float value between 0.0
and 1.0 is a simple matter of dividing the number of
recognized words by the total number of words.
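The counting and dividing steps can be sketched on their own as follows. The function name fractionRecognized and the five-word SAMPLE_WORDS dictionary are stand-ins invented for this example; the real module uses ENGLISH_WORDS loaded from the dictionary file.

```python
# Tiny stand-in for the ENGLISH_WORDS dictionary (assumption for this sketch).
SAMPLE_WORDS = {'HELLO': None, 'THERE': None, 'HOW': None, 'ARE': None, 'YOU': None}

def fractionRecognized(possibleWords):
    if possibleWords == []:
        return 0.0                      # no words at all, so nothing to divide
    matches = 0
    for word in possibleWords:
        if word in SAMPLE_WORDS:
            matches += 1
    # Divide the recognized words by the total number of words.
    return float(matches) / len(possibleWords)

print(fractionRecognized(['HELLO', 'THERE', 'XQZVK', 'WPLMN']))   # 0.5
print(fractionRecognized([]))                                     # 0.0
```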

However, whenever we divide numbers using the / operator in Python, we should be careful not to cause a
“divide-by-zero” error. In mathematics, dividing by zero has no meaning. If we
try to get Python to do it, it will result in an error. Try typing the
following into the interactive shell:

>>> 42 / 0
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    42 / 0
ZeroDivisionError: int division or modulo by zero
>>>

But a divide by zero can’t possibly happen on line 36. The
only way it could is if len(possibleWords) evaluated
to 0. And the only way that would be possible is if possibleWords were the empty list. However, our code on
lines 29 and 30 specifically checks for this case and returns 0.0. So if possibleWords had
been set to the empty list, the program execution would have never gotten past
line 30 and line 36 would not cause a “divide-by-zero” error.

The value stored in matches is an
integer. However, we pass this integer to the float()
function which returns a float version of that number. Try typing the following
into the interactive shell:

>>> float(42)
42.0
>>>

The int() function returns an
integer version of its argument, and the str()
function returns a string. Try typing the following into the interactive shell:

>>> float(42)
42.0
>>> int(42.0)
42
>>> int(42.7)
42
>>> int("42")
42
>>> str(42)
'42'
>>> str(42.7)
'42.7'
>>>

The float(), int(),
and str() functions are helpful if you need a
value’s equivalent in a different data type. But you might be wondering why we
pass matches to float()
on line 36 in the first place.

The reason is to make our detectEnglish
module work with Python 2. Python 2 will do integer division when both values
in the division operation are integers. This means that the result will be
rounded down. So using Python 2, 22 / 7 will
evaluate to 3. However, if one of the values is a
float, Python 2 will do regular division: 22.0 / 7
will evaluate to 3.142857142857143. This is why line
36 calls float(). This is called making the code backwards
compatible with previous versions.

Python 3 always does regular division no matter if the
values are floats or ints.
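If you only have Python 3 available, the // floor division operator gives a feel for what Python 2 does with two integers. This is a small illustration and not part of detectEnglish.py; the variable names are invented.

```python
regular = 22 / 7           # regular division in Python 3
floored = 22 // 7          # floor division: the result Python 2 gives for 22 / 7
portable = float(22) / 7   # regular division in both Python 2 and Python 3

print(regular)    # 3.142857142857143
print(floored)    # 3
print(portable)   # 3.142857142857143
```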

The previously explained getEnglishCount()
function calls the removeNonLetters() function to
return a string that is the passed argument, except with all the numbers and
punctuation characters removed.

The code in removeNonLetters() starts
with a blank list and loops over each character in the message
argument. If the character exists in the LETTERS_AND_SPACE
string, then it is added to the end of the list. If the character is a number
or punctuation mark, then it won’t exist in the LETTERS_AND_SPACE
string and won’t be added to the list.

Line 42 checks if symbol (which
is set to a single character on each iteration of line 41’s for loop) exists in the LETTERS_AND_SPACE
string. If it does, then it is added to the end of the lettersOnly
list with the append() list method.

If you want to add a single value to the end of a list, you
could put the value in its own list and then use list concatenation to add it.
Try typing the following into the interactive shell, where the value 42 is added to the end of the list stored in spam:

>>> spam = [2, 3, 5, 7, 9, 11]
>>> spam
[2, 3, 5, 7, 9, 11]
>>> spam = spam + [42]
>>> spam
[2, 3, 5, 7, 9, 11, 42]
>>>

When we add a value to the end of a list, we say we are appending
the value to the list. This is done with lists so frequently in Python that
there is an append() list method which takes a
single argument to append to the end of the list. Try typing the following into
the shell:

>>> eggs = []
>>> eggs.append('hovercraft')
>>> eggs
['hovercraft']
>>> eggs.append('eels')
>>> eggs
['hovercraft', 'eels']
>>> eggs.append(42)
>>> eggs
['hovercraft', 'eels', 42]
>>>

Using the append() method is faster than putting a value in a list and adding
it with the + operator, because an expression like spam + [42] builds an
entirely new list, while the append() method modifies the existing list
in-place to include the new value. You should prefer the append() method for
adding values to the end of a list.

detectEnglish.py

44.     return ''.join(lettersOnly)

After line 41’s for loop is done,
only the letter and space characters are in the lettersOnly
list. To make a single string value from this list of strings, we call the join() string method on a blank string. This will join the
strings in lettersOnly together with a blank string
(that is, nothing) between them. This string value is then returned as removeNonLetters()’s return value.
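Based on the description above, the complete function probably looks something like this sketch. The loop lines are reconstructed from the explanation and are not copied from the listing.

```python
UPPERLETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LETTERS_AND_SPACE = UPPERLETTERS + UPPERLETTERS.lower() + ' \t\n'

def removeNonLetters(message):
    lettersOnly = []
    for symbol in message:
        if symbol in LETTERS_AND_SPACE:
            lettersOnly.append(symbol)   # keep letters, spaces, tabs, newlines
    return ''.join(lettersOnly)

print(removeNonLetters('Hello there. How are you?'))   # Hello there How are you
print(repr(removeNonLetters('12345')))                 # ''
```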

48.     # By default, 20% of the words must exist in the dictionary file, and
49.     # 85% of all the characters in the message must be letters or spaces
50.     # (not punctuation or numbers).

The isEnglish() function will
accept a string argument and return a Boolean value that indicates whether or
not it is English text. But when you look at line 47, you can see it has three
parameters. The second and third parameters (wordPercentage
and letterPercentage) have equal signs and values
next to them. These are called default arguments. Parameters that have
default arguments are optional. If the function call does not pass an argument
for these parameters, the default argument is used.

If isEnglish() is called with
only one argument, the default arguments are used for the wordPercentage
(the integer 20) and letterPercentage
(the integer 85) parameters. Table 12-1 shows
function calls to isEnglish(), and what they are
equivalent to:

Table 12-1. Function calls with and without default arguments.

Function Call                              Equivalent To
isEnglish('Hello')                         isEnglish('Hello', 20, 85)
isEnglish('Hello', 50)                     isEnglish('Hello', 50, 85)
isEnglish('Hello', 50, 60)                 isEnglish('Hello', 50, 60)
isEnglish('Hello', letterPercentage=60)    isEnglish('Hello', 20, 60)

When isEnglish() is called with
no second and third argument, the function will require that 20% of the words
in message are English words that exist in the
dictionary text file and 85% of the characters in message
are letters. These percentages work for detecting English in most cases. But
sometimes a program calling isEnglish() will want looser
or more restrictive thresholds. If so, a program can just pass arguments for wordPercentage and letterPercentage
instead of using the default arguments.
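To watch default arguments at work without needing the dictionary file, here is a toy function with the same parameter list. The name describeThresholds is invented for this sketch; it simply returns the three values it ends up with.

```python
def describeThresholds(message, wordPercentage=20, letterPercentage=85):
    # Return the values the parameters received, defaults included.
    return (message, wordPercentage, letterPercentage)

print(describeThresholds('Hello'))                        # ('Hello', 20, 85)
print(describeThresholds('Hello', 50))                    # ('Hello', 50, 85)
print(describeThresholds('Hello', 50, 60))                # ('Hello', 50, 60)
print(describeThresholds('Hello', letterPercentage=60))   # ('Hello', 20, 60)
```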

A percentage is a number between 0 and 100 that shows how
much of something there is proportional to the total number of those things. In
the string value 'Hello cat MOOSE fsdkl ewpin' there
are five “words” but only three of them are English words. To calculate
the percentage of English words, you divide the number of English words by the
total number of words and multiply by 100. The percentage of English
words in 'Hello cat MOOSE fsdkl ewpin' is 3 / 5 * 100, which is 60.

Table 12-2 shows some percentage calculations:

Table 12-2. Some percentage calculations.

Number of English Words   Total Number of Words   English Words / Total   * 100   =   Percentage
3                         5                       0.6                     * 100   =   60
6                         10                      0.6                     * 100   =   60
300                       500                     0.6                     * 100   =   60
32                        87                      0.3678                  * 100   =   36.78
87                        87                      1.0                     * 100   =   100
0                         10                      0                       * 100   =   0

The percentage will always be between 0% (meaning none of
the words) and 100% (meaning all of the words). Our isEnglish()
function will consider a string to be English if at least 20% of the words are
English words that exist in the dictionary file and 85% of the characters in
the string are letters (or spaces).

detectEnglish.py

51.     wordsMatch = getEnglishCount(message) * 100 >= wordPercentage

Line 51 calculates the percentage of recognized English
words in message by passing message
to getEnglishCount(), which does the division for us
and returns a float between 0.0 and 1.0. To get a percentage from this float, we just have to
multiply it by 100. If this number is greater than
or equal to the wordPercentage parameter, then True is stored in wordsMatch.
(Remember, the >= comparison operator evaluates
expressions to a Boolean value.) Otherwise, False is
stored in wordsMatch.

detectEnglish.py

52.     numLetters = len(removeNonLetters(message))
53.     messageLettersPercentage = float(numLetters) / len(message) * 100
54.     lettersMatch = messageLettersPercentage >= letterPercentage

Lines 52 to 54 calculate the percentage of letter characters
in the message string. To determine the percentage
of letter (and space) characters in message, our
code must divide the number of letter characters by the total number of
characters in message. Line 52 calls removeNonLetters(message). This call will return a string
that has the number and punctuation characters removed from the string. Passing
this string to len() will return the number of
letter and space characters that were in message.
This integer is stored in the numLetters variable.

Line 53 determines the percentage of letters by getting a float
version of the integer in numLetters and dividing
this by len(message). The return value of len(message) will be the total number of characters in message. (The call to float()
was made so that if the programmer who imports our detectEnglish
module is running Python 2, the division done on line 53 will always be regular
division instead of integer division.) Note that, as written, line 53 will
cause a “divide-by-zero” error if message is the blank string, because len(message) is then 0.

Line 54 checks if the percentage in messageLettersPercentage
is greater than or equal to the letterPercentage
parameter. This expression evaluates to a Boolean value that is stored in lettersMatch.

detectEnglish.py

55.     return wordsMatch and lettersMatch

We want isEnglish() to return True only if both the wordsMatch
and lettersMatch variables contain True, so we put them in an expression with the and operator. If both the wordsMatch
and lettersMatch variables are True,
then isEnglish() will declare that the message
argument is English and return True. Otherwise, isEnglish() will return False.

The dictionary data type is useful because, like a list, it
can contain multiple values. However, unlike the list, we can index values in it
with string values instead of only integers. Most of the things we can do
with lists we can also do with dictionaries, such as passing them to len() or using the in and not in operators on them. In fact, using the in operator on a very large dictionary value executes much
faster than using in on a very large list.

The NoneType data type is also a new data type introduced in
this chapter. It only has one value: None. This
value is very useful for representing a lack of a value.

We can convert values to other data types by using the int(), float(), and str() functions. This chapter brings up “divide-by-zero”
errors, which we need to add code to check for and avoid. The split() string method can convert a single string value
into a list value of many strings. The split()
string method is sort of the reverse of the join()
string method. The append() list method adds a value
to the end of the list.

When we define functions, we can give some of the parameters
“default arguments”. If no argument is passed for these parameters when the
function is called, the default argument value is used instead. This can be a
useful shortcut in our programs.

The transposition cipher is an improvement over the Caesar
cipher because it can have hundreds or thousands of possible keys for messages
instead of just 26 different keys. A computer has no problem decrypting a
message with thousands of different keys, but to hack this cipher, we need to
write code that can determine if a string value is valid English or not.

Since this code will probably be useful in our other hacking
programs, we will put it in its own module so it can be imported by any program
that wants to call its isEnglish() function. All of
the work we’ve done in this chapter is so that any program can do the
following:

>>> import detectEnglish
>>> detectEnglish.isEnglish('Is this sentence English text?')
True
>>>

Now armed with code that can detect English, let’s move on
to the next chapter and hack the transposition cipher!