Snippets from a computing educator and researcher.

What's the best language model?: part 1

In the post on breaking ciphers, I asserted that the bag of letters model (a.k.a. the "random monkey typing" model) was the best one for checking if a piece of text is close to English. Is that true?

But before we get there, what are some language models we could use?

Bags of letters and similar

We've already seen the "bag of letters" model in the post on breaking ciphers. But there are other ways of categorising text. Rather than using single letters, we can use consecutive pairs of letters (called bigrams), consecutive triples of letters (trigrams) and more (generally called n-grams). Single letters are unigrams.

Unigrams have the advantage of being easy to calculate. But when we come to transposition ciphers (where the message is encrypted by changing the order of the letters), the unigram counts stay exactly the same, so they can't tell a transposed message from the original. For these ciphers, 2-, 3-, 4-, or higher n-grams are useful. The disadvantage of longer n-grams is that there are many more possible n-grams, so each individual one occurs much less frequently (see figure to the right, showing relative frequency of unigrams, bigrams, and trigrams for English; click the image for a larger version). This means you may see few (or no!) occurrences of many n-grams in your training set, and things can go wrong when you're trying to break new messages.

But, if you can get away with it, larger n-grams are generally better for breaking ciphers.

Implementing n-grams

How do you find n-grams? With a little bit of string slicing:

def ngrams(text, n):
    """Return all contiguous n-letter slices of text, using a sliding window."""
    return [text[i:i+n] for i in range(len(text) - n + 1)]
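A quick check of what this produces (repeating the function here so the snippet runs on its own):

```python
def ngrams(text, n):
    # Slide a window of width n along the text, one position at a time
    return [text[i:i+n] for i in range(len(text) - n + 1)]

ngrams("hello", 2)   # ['he', 'el', 'll', 'lo']
ngrams("hello", 3)   # ['hel', 'ell', 'llo']
```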

Frequency counts as vectors

The bag of letters model is just a list of counts of each letter in some text. An ordered list of numbers is the same as a one-dimensional array, or a vector. As there are 26 letters, the vectors we're interested in live in a 26-dimensional space, one dimension per letter.

For example, the text hello there can be represented as the vector:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
        3     2       2     1     1   1

(there are zeros in the blank spaces, which I've left out for clarity) and the text the cat sat on the mat can be represented as:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
3   1   2     2         1 1 1       1 5
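As a sketch of how such a vector might be built in Python (the 26-element list layout, one slot per letter a–z, is my choice of representation):

```python
from collections import Counter
from string import ascii_lowercase

def letter_vector(text):
    """Count each letter a-z in text, returning a 26-element list of counts."""
    counts = Counter(c for c in text.lower() if c in ascii_lowercase)
    return [counts[letter] for letter in ascii_lowercase]

letter_vector("hello there")
# the 'e' slot (index 4) holds 3, the 'h' slot (index 7) holds 2, and so on
```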

Mathematicians know a lot about vectors, and aren't fazed by high-dimensional spaces. We shouldn't be either.

Vectors can be thought of as being points in a space. If we're trying to find the similarity between two vectors, we can think about that as finding the distance between those two points. Mathematicians have thought a lot about vectors and distances, so we can use some of their techniques here.

The first thing to consider is that texts of different lengths will have different total letter counts, and so represent points at different distances from the origin. That means we need to scale the vectors so they all have the same length; this kind of scaling is called normalising.
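As a sketch, scaling a vector to length 1 could look like this (I've used the Euclidean length here as an assumption; the same idea works with any of the norms discussed below):

```python
import math

def normalise(vector):
    """Scale a vector to Euclidean length 1, leaving its direction unchanged."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

normalise([3, 4])   # [0.6, 0.8]
```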

Once we have all the vectors the same length, we have to find the distance between two points: one representing the text we're examining, the other representing typical English. The closer the two points, the more like English the text is.

Mathematicians have many ways of finding these distances, and use the term norm to generalise the idea of distance. The norm of a vector \(\mathbf{x}\) is written \(|\mathbf{x}|\) or \( \lVert \mathbf{x}\rVert \).

If we're taking a geometric view of vectors (which we could if we're thinking about texts in alphabets with two or three letters, corresponding to vectors in two or three dimensional space), an obvious way to find lengths and distances is with Pythagoras's theorem; this is also called the Euclidean norm.

\[ |\mathbf{x}| = \sqrt{x_a^2 + x_b^2 + x_c^2} \]

where \(x_a\) is the number of as, \(x_b\) is the number of bs, and \(x_c\) is the number of cs (see the figure to the right). If the alphabet has more letters, the formula remains the same, but just with more terms in the square root.

Another way of finding the length of a vector is the Manhattan distance or taxicab distance, so named because it's the distance you'd travel going from one place to another by taxi across the grid structure of Manhattan (see left).

\[ |\mathbf{x}| = x_a + x_b + x_c \]

But, you can rewrite that equation as

\[ |\mathbf{x}| = \sqrt[1]{x_a^1 + x_b^1 + x_c^1} \]

That suggests you could measure distance with another variation:

\[ |\mathbf{x}| = \sqrt[3]{x_a^3 + x_b^3 + x_c^3} \]

and so on. And indeed you can. These collectively are known as the \( L_p \) norms, where p is the power used. The Euclidean norm is the \( L_2 \) norm; the Manhattan distance is the \( L_1 \) norm.
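A quick worked example, with some numbers of my own, for the two-dimensional vector \((3, 4)\):

\[ L_1 = 3 + 4 = 7, \qquad L_2 = \sqrt{3^2 + 4^2} = 5, \qquad L_3 = \sqrt[3]{3^3 + 4^3} = \sqrt[3]{91} \approx 4.5 \]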

As the power p increases, the largest component of \(\mathbf{x}\) contributes more and more of the final norm value. In the limit, that gives us the \( L_\infty \) norm:

\[ |\mathbf{x}|_\infty = \max_i x_i \]

The \( L_0 \) norm just counts how many dimensions are non-zero. The corresponding distance is also called the Hamming distance; it's generally only interesting with binary-valued dimensions, when it's a count of how many bits are different.
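A quick sketch of the Hamming distance between two equal-length sequences (the examples are my own, not from the cipher code):

```python
def hamming(xs, ys):
    """Count the positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(xs, ys) if x != y)

hamming([1, 0, 1, 1], [1, 1, 0, 1])   # 2
hamming("karolin", "kathrin")         # 3
```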

Implementing norms

Now that we understand norms, how do we implement them in Python?

First of all, we want to use the same basic function to do the two related jobs:

find the "length" of a vector, which we can use to scale it to the correct length

find the distance between two vectors.

The core of all this is the \(L_p\) norm, which we'll implement as a Python function lp. If we pass it one vector, it finds the length of that vector. If we pass it two vectors, it finds the distance between them. If p is unspecified, we'll use a value of p = 2.
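A sketch of what that lp function could look like, going from the description above (this is my reading of it, not necessarily the original implementation):

```python
def lp(v1, v2=None, p=2):
    """L_p norm of v1, or the L_p distance between v1 and v2 if v2 is given."""
    if v2 is None:
        # The length of a vector is its distance from the origin
        v2 = [0] * len(v1)
    return sum(abs(x - y) ** p for x, y in zip(v1, v2)) ** (1 / p)

lp([3, 4])            # 5.0  (Euclidean length)
lp([3, 4], p=1)       # 7.0  (Manhattan length)
lp([1, 2], [4, 6])    # 5.0  (Euclidean distance between two points)
```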

Cosine distances

The final way of finding the difference between two vectors doesn't depend on normalising their lengths. Instead, the cosine distance looks at the angle between two vectors. The cosine is used because it's easy to calculate from the components of the two vectors: