Zipf's Law

Dr. Richard S. Wallace

Before we get to ALICE, we need to visit another unusual figure in the history of computer science: Professor George Kingsley Zipf. Although he was a contemporary of Turing, there is no evidence the two ever met. Zipf died young too, at the age of 48, in 1950, only four years before Turing, but of natural causes.

There are many ways to state Zipf's Law, but the simplest is procedural: take all the words in a body of text, for example today's issue of the New York Times, and count the number of times each word appears. If the resulting histogram is sorted by rank, with the most frequently appearing word first, and so on ("a", "the", "for", "by", "and"...), then the shape of the curve is the "Zipf curve" for that text. If the Zipf curve is plotted on a log-log scale, it appears as a straight line with a slope of approximately -1.
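The counting procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names rank_frequency and log_log_slope are made up for this example, the tokenizer is a crude regular expression, and the slope is estimated with an ordinary least-squares fit, which is one of several ways to measure it.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def log_log_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    `counts` must be sorted in descending order and hold at least two values.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic, idealized counts (not real data): count is proportional to 1/rank.
ideal = [1000 / rank for rank in range(1, 101)]
print(round(log_log_slope(ideal), 3))
```

For the idealized counts the fitted slope comes out very close to -1. On real text one would pass the counts from rank_frequency, for example log_log_slope([c for _, c in rank_frequency(text)]), and expect only a rough match.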

The Zipf curve is a characteristic of human languages, and of many other natural and human phenomena as well. Zipf noticed that the populations of cities followed a similar distribution. There are a few very large cities, a larger number of medium-sized ones, and a very large number of small cities. If the cities, or the words of natural language, were uniformly distributed, with every word about equally likely, then the Zipf curve would be a flat horizontal line.

The Zipf curve was known even in the 19th century. The economist Pareto also noticed the log-rank property in studies of corporate wealth. One need only consider the distribution of wealth among present-day computer companies to see the pattern. There is only one giant, Microsoft, followed by a number of large and medium-sized firms, and then a long tail of small and very small firms.

Zipf was independently wealthy. This is how he could afford to hire a room full of human "computers" to count words in newspapers and periodicals. Each "computer" would arrive at work and begin tallying the words and phrases as directed by Zipf. These human computers found that Zipf's Law applies not only to words but also to phrases and whole sentences of language.

8024 YES
5184 NO
2268 OK
2006 WHY
1145 BYE
1101 HOW OLD ARE YOU
946 HI
934 HOW ARE YOU
846 WHAT
840 HELLO
663 GOOD
645 WHY NOT
584 OH
553 REALLY
544 YOU
531 WHAT IS YOUR NAME
525 COOL
516 I DO NOT KNOW
488 FUCK YOU
486 THANK YOU
416 SO
414 ME TOO
403 LOL
403 THANKS
381 NICE TO MEET YOU TOO
375 SORRY
374 ALICE
368 HI ALICE
366 OKAY
353 WELL
352 WHAT IS MY NAME
349 WHERE DO YOU LIVE
340 NOTHING
309 I KNOW
303 WHO ARE YOU
300 NOPE
297 SHUT UP
296 I LOVE YOU
288 SURE
286 HELLO ALICE
277 HOW
262 WHAT DO YOU MEAN
261 MAN
251 WOW
239 SMILE
233 ME
227 WHAT DO YOU LOOK LIKE
224 I SEE
223 HA
218 HOW ARE YOU TODAY
217 GOODBYE
214 NO YOU DO NOT
203 DO YOU
201 WHERE ARE YOU
.
.
.

The human input histogram, ranking the number of times ALICE receives each
input phrase over a period of time, shows that human language is not random.
The most common inputs are "YES" and "NO". The most common multiple-word input
is "HOW OLD ARE YOU". This type of analysis, which cost Dr. Zipf many hours
of labor, is now accomplished in a few milliseconds of computer time.
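A histogram of whole inputs, in the same "count PHRASE" layout as the listing above, could be produced along these lines. This is a sketch only: the normalization step (uppercasing and dropping punctuation) is an assumption made for illustration rather than a description of how ALICE prepares its input, and sample_log is invented data.

```python
import re
from collections import Counter


def normalize(line):
    """Uppercase the input and drop punctuation (an assumed normalization)."""
    cleaned = re.sub(r"[^A-Z0-9+ ]", " ", line.upper())
    return " ".join(cleaned.split())


def input_histogram(lines):
    """Count identical normalized inputs, most frequent first."""
    counts = Counter(normalize(line) for line in lines)
    counts.pop("", None)  # ignore inputs that were empty after cleaning
    return counts.most_common()


# Hypothetical log lines, invented for this example.
sample_log = ["Yes.", "no", "yes", "How old are you?", "YES!", "how old are you"]
for phrase, count in input_histogram(sample_log):
    print(count, phrase)
# prints:
# 3 YES
# 2 HOW OLD ARE YOU
# 1 NO
```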

Considering the vast size of the set of things people could possibly say that are grammatically correct or semantically meaningful, the number of things people actually do say is surprisingly small. Steven Pinker, in his book How the Mind Works, wrote:

Say you have ten choices for the first word to begin a sentence, ten choices for the second word (yielding 100 two-word beginnings), ten choices for the third word (yielding a thousand three-word beginnings), and so on. (Ten is in fact the approximate geometric mean of the number of word choices available at each point in assembling a grammatical and sensible sentence). A little arithmetic shows that the number of sentences of 20 words or less (not an unusual length) is about 10^20.

Fortunately for chat robot programmers, Pinker's combinatorics are way off. Our experiments with ALICE indicate that the number of choices for the "first word" is more than ten, but it is only about two thousand. Specifically, 1800 words cover 95% of all the first words input to ALICE. The number of choices for the second word is only about two. To be sure, there are some first words ("I" and "You" for example) that have many possible second words, but the overall average is just under two words. The average branching factor decreases with each successive word.
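The two measurements mentioned above, first-word coverage and average branching factor, might be computed from an input log roughly as follows. The text does not say exactly how the averages were taken, so the definitions here are assumptions: words_to_cover counts how many of the most frequent first words are needed to reach a given share of inputs, and mean_branching averages the number of distinct next words over all distinct prefixes of a given length, ignoring inputs that end at the prefix. Both function names and the sample inputs are invented for this sketch.

```python
from collections import Counter, defaultdict


def words_to_cover(inputs, fraction=0.95):
    """Smallest number of distinct first words covering `fraction` of inputs."""
    firsts = Counter(line.split()[0] for line in inputs if line.split())
    target = fraction * sum(firsts.values())
    covered = 0
    for needed, (_, count) in enumerate(firsts.most_common(), start=1):
        covered += count
        if covered >= target:
            return needed
    return 0


def mean_branching(inputs, depth=1):
    """Average number of distinct next words after each `depth`-word prefix."""
    followers = defaultdict(set)
    for line in inputs:
        words = line.split()
        if len(words) > depth:  # inputs that stop at the prefix are ignored
            followers[tuple(words[:depth])].add(words[depth])
    if not followers:
        return 0.0
    return sum(len(nxt) for nxt in followers.values()) / len(followers)


# Hypothetical inputs, invented for this example.
inputs = [
    "WHAT IS YOUR NAME", "WHAT IS MY NAME", "WHAT DO YOU MEAN",
    "HOW ARE YOU", "HOW OLD ARE YOU", "YES", "HI ALICE",
]
print(words_to_cover(inputs, 0.5))      # first words needed for half the inputs
print(mean_branching(inputs, depth=1))  # choices for the second word
print(mean_branching(inputs, depth=2))  # choices for the third word
```

On this toy sample the branching factor drops from about 1.67 at the second word to 1.25 at the third, the same direction of change the paragraph describes, though seven invented inputs prove nothing about real logs.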

531 WHAT IS YOUR NAME
352 WHAT IS MY NAME
171 WHAT IS UP
137 WHAT IS YOUR FAVORITE COLOR
126 WHAT IS THE MEANING OF LIFE
122 WHAT IS THAT
102 WHAT IS YOUR FAVORITE MOVIE
92 WHAT IS IT
75 WHAT IS A BOTMASTER
70 WHAT IS YOUR IQ
59 WHAT IS REDUCTIONISM
53 WHAT IS YOUR FAVORITE FOOD
46 WHAT IS AIML
38 WHAT IS YOUR FAVORITE BOOK
37 WHAT IS THE TIME
37 WHAT IS YOUR JOB
34 WHAT IS YOUR FAVORITE SONG
34 WHAT IS YOUR SIGN
33 WHAT IS SEX
32 WHAT IS YOUR REAL NAME
30 WHAT IS NEW
30 WHAT IS YOUR AGE
30 WHAT IS YOUR GENDER
28 WHAT IS YOUR LAST NAME
27 WHAT IS HIS NAME
27 WHAT IS YOUR SEX
26 WHAT IS 2+2
26 WHAT IS MY IP
25 WHAT IS YOURS
24 WHAT IS YOUR PURPOSE
21 WHAT IS YOUR FAVORITE ANIMAL
20 WHAT IS 1+1
20 WHAT IS YOUR HOBBY
19 WHAT IS THE WEATHER LIKE
19 WHAT IS YOUR PHONE NUMBER
18 WHAT IS ALICE
18 WHAT IS GOING ON
18 WHAT IS THAT SUPPOSED TO MEAN
18 WHAT IS WHAT
17 WHAT IS A SEEKER
17 WHAT IS LOVE
17 WHAT IS THE OPEN DIRECTORY
17 WHAT IS YOUR FAVORITE TV SHOW
16 WHAT IS JAVA
16 WHAT IS THE ANSWER
16 WHAT IS YOUR ANSWER
16 WHAT IS YOUR FULL NAME
15 WHAT IS AI
15 WHAT IS THAT MEAN
15 WHAT IS THE WEATHER LIKE WHERE YOU ARE
15 WHAT IS TWO PLUS TWO
15 WHAT IS YOUR FAVORITE BAND
14 WHAT IS CBR
14 WHAT IS ELIZA
14 WHAT IS GOD
14 WHAT IS PI
14 WHAT IS THE TURING GAME
13 WHAT IS 2 + 2
13 WHAT IS A COMPUTER YEAR
13 WHAT IS IT LIKE
13 WHAT IS MY FAVORITE COLOR
12 WHAT IS 2 PLUS 2
12 WHAT IS A CAR
12 WHAT IS A DOG
12 WHAT IS ARTIFICIAL INTELLIGENCE
12 WHAT IS IT ABOUT
12 WHAT IS LIFE
12 WHAT IS SEEKER
12 WHAT IS YOU NAME
12 WHAT IS YOUR FAVORITE
12 WHAT IS YOUR SURNAME
11 WHAT IS 1 + 1
11 WHAT IS A CHATTERBOT
11 WHAT IS A PRIORI
11 WHAT IS SETL
11 WHAT IS THE TIME IN USA
11 WHAT IS THE WEATHER LIKE THERE
11 WHAT IS YOUR FAVORITE FILM
10 WHAT IS A CATEGORY C CLIENT
10 WHAT IS A PENIS
10 WHAT IS BOTMASTER
10 WHAT IS MY IP ADDRESS
10 WHAT IS THE DATE
10 WHAT IS THIS
10 WHAT IS YOUR ADDRESS
10 WHAT IS YOUR FAVORITE MUSIC
10 WHAT IS YOUR FAVORITE OPERA
10 WHAT IS YOUR GOAL
10 WHAT IS YOUR IP ADDRESS

Even subsets of natural language, like the example shown here of sentences
starting with "WHAT IS", tend to have Zipf-like distributions. Natural language
search bots like Ask Jeeves are based on pre-programmed responses to the most
common types of search questions people ask.