In our preprocessing stage, we run a screening algorithm over the
dictionary so that the search never wastes time on words that cannot
be played. All invalid words are removed from the dictionary before
the game begins.

Formal definition: An invalid
word is a word that cannot be legally played in the game of Upwords.

Invalid words can be organized into six categories:

1. Words whose length is one

2. Words whose length is greater than eight

3. Words which have a ‘q’ followed by a letter which is not ‘u’

4. Words which end in ‘q’

5. Words which contain characters other than letters of the alphabet

6. Words which use more tiles of a certain letter than there are
available
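The six categories above can be sketched as a single classification function. This is a minimal illustration, not the screening code used in the project: the function name `invalid_category`, the order in which the checks run, and the `tile_counts` parameter are all assumptions. In particular, the tile supply used in the example is a placeholder and is not the real Upwords tile distribution, which the text does not give.

```python
import string
from collections import Counter
from typing import Mapping, Optional

MAX_WORD_LENGTH = 8  # from the text: words longer than eight letters are invalid


def invalid_category(word: str, tile_counts: Mapping[str, int]) -> Optional[int]:
    """Return the category (1-6) that makes `word` invalid, or None if it is legal.

    `tile_counts` maps each lowercase letter to the number of tiles available.
    The checks run in the order the categories are listed; that order is an
    assumption, and a word that fails several tests is reported only once.
    """
    w = word.lower()
    if len(w) == 1:
        return 1                      # category 1: single-letter word
    if len(w) > MAX_WORD_LENGTH:
        return 2                      # category 2: longer than eight letters
    for i, ch in enumerate(w):
        if ch == "q":
            if i == len(w) - 1:
                return 4              # category 4: word ends in 'q'
            if w[i + 1] != "u":
                return 3              # category 3: 'q' not followed by 'u'
    if any(ch not in string.ascii_lowercase for ch in w):
        return 5                      # category 5: non-alphabetic character
    for letter, needed in Counter(w).items():
        if needed > tile_counts.get(letter, 0):
            return 6                  # category 6: not enough tiles of a letter
    return None                       # the word passes every test


# Hypothetical tile supply: three of every letter (NOT the real distribution).
example_tiles = {letter: 3 for letter in string.ascii_lowercase}

for sample in ["a", "dictionary", "qat", "can't", "sassiest", "quiet"]:
    print(sample, invalid_category(sample, example_tiles))
```

Returning the category number rather than a plain true/false makes it possible to count how many words fall into each category while screening.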

ANALYSIS

The UNIX dictionary contains 25133 words. When we removed the illegal
words from it, we found that 7836 of them were illegal. Here are the
statistics, by category:

Category    Number of words    Percent of dictionary
1                        25     0.0994
2                      7189    28.604
3 and 4                   9     0.0358
5                       111     0.4417
6                       502     1.9974
Total                  7836    31.178

So if we can assume that the UNIX dictionary is representative of a
typical dictionary, we would expect that approximately 30% of the words
in any dictionary will turn out to be illegal.

Whenever a legal word is found, it is written to the file called
“upwords.dict”, which is our preprocessed dictionary. Preprocessing
also converts every character to uppercase.
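The filtering and capitalization step might look like the sketch below. It is an illustration under stated assumptions, not the project's code: the function names `preprocess` and `write_dictionary` are hypothetical, and the legality test is passed in as a predicate so the example stays independent of any particular set of rules. Only the output file name “upwords.dict” comes from the text.

```python
from typing import Callable, Iterable, List


def preprocess(words: Iterable[str], is_legal: Callable[[str], bool]) -> List[str]:
    """Keep the legal words and convert them to uppercase.

    `is_legal` is a caller-supplied predicate (hypothetical); blank lines
    are skipped so a raw word list can be passed in line by line.
    """
    kept = []
    for raw in words:
        word = raw.strip()
        if word and is_legal(word):
            kept.append(word.upper())
    return kept


def write_dictionary(words: Iterable[str], path: str = "upwords.dict") -> None:
    """Write one word per line to the preprocessed dictionary file."""
    with open(path, "w", encoding="utf-8") as out:
        for word in words:
            out.write(word + "\n")


# Example with a stand-in predicate that only rejects one-letter words.
legal = preprocess(["cat\n", "a\n", "", "Quiet"], lambda w: len(w) > 1)
print(legal)
```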

COMMENTS ON ALTERNATIVE PREPROCESSING METHODS

Instead of just taking out illegal words, we could have used more
elaborate techniques in the preprocessing step. One popular method
would be to create many separate files, each containing words which
share some characteristic. The use of a trie makes this unnecessary.
Looking up words stored in a trie is very efficient, so coming up with
clever ways to store the dictionary would not be helpful at all.
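To show why trie lookups are cheap, here is a minimal trie sketch. It is a generic textbook version, not the trie used in this project, and the class and method names are assumptions. Each lookup follows one child link per letter, so its cost depends on the length of the word and not on the number of words stored.

```python
class Trie:
    """A minimal trie node: child links keyed by letter, plus an end-of-word flag."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        """Add `word`, creating one node per letter where none exists yet."""
        node = self
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Trie()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, prefix):
        """Follow `prefix` letter by letter; return the final node or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True only if `word` was inserted as a complete word."""
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        """True if at least one stored word starts with `prefix`."""
        return self._walk(prefix) is not None


trie = Trie()
for w in ["CAT", "CATS", "DOG"]:
    trie.insert(w)
print(trie.contains("CAT"), trie.contains("CA"), trie.has_prefix("CA"))
```

Because words with a common prefix share nodes, a prefix test such as `has_prefix` lets a search abandon a partial word as soon as no stored word can complete it.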