Letterpress Cheater Algorithm

Sep 2013

A few months ago when Letterpress was hot, I thought about an efficient
algorithm for a cheater application. A trivial solution in Ruby is quite slow
for this problem, which motivated me to check out alternative approaches.
Abbreviated, the problem is the following:

Given 25 random letters (letters), find every string in an array of strings (words) that
consists of only those letters.

I’ll start by walking through a naive solution before presenting a data
structure to solve this problem efficiently.

Naive algorithm

The naive approach for this problem is to simply loop through all elements of
words and check whether these words can be formed from the characters in
letters with a frequency map:

For an average word length m and n words in words this runs in O(n m)
time. In my tests, this algorithm runs in about 2-4s on my Macbook on the
dictionary in /usr/share/dict/words. That is not fast enough if this was to be
used for a web application for example, so we dig deeper.

A simple iteration on this naive algorithm, is the observation that if a
character is not in letters, then we do not have to loop through all the words
that start with this letter in words. E.g. if there’s no letter c in
letters, we can skip to the words beginning with d when we encounter the first
word which begins with c in words and so on.

The speed of this iteration depends on how many words in letters start with
each character in words. Say that k is the highest number of words that
starts with any character in words, and d is the number of distinct
characters from which we can form words, then this runs in worst-case O(d k)
time. In my tests, this was about twice as fast as the previous algorithm.
Another iteration is threading the processing of each group.

Using a Trie

The idea of grouping is similar to the original solution I had in mind, but
instead of just grouping on the first character, group on all characters! This
creates a neat, recursive structure called a Trie.

For instance, if we put in the word “band”, “ban” and “boo” in the data
structure, it will look like this:

{b: {o: {o: {}}a: {n: {d: {}}}}}

That way we can check that “boo” is in the structure with something like:
map[:b][:o][:o]. However, that would also imply that “bo” is in the structure,
which it is not. We need a state on each of the maps that tells whether a word
ends at this letter.

With that comes the interesting part. The problem is now, given this data
structure, how do we find all the entries in the data structure, which is the
end of a word that we can get to with letters?

Once again we make use of the frequency map explained in the previous section,
and then we recursively visit nodes in the data structure. The frequency map is
updated as we go down the recursion, so invalid paths can be detected.

The full implementation of the data structure can be seen in this
Gist.

Benchmarks

I made a few small optimizations to the
trie showed here to make it faster.
There’s still plenty of things that could be done to make it faster, such as
concurrent searching, and there’s probably a bunch of things that can be found
profiling. Either way, it’s clear that it is much faster to use a trie over the
naive methods presented earlier. The trie uses a lot of memory, this could be
optimized by converting it into a radix
tree, which could also yield small
performance benefits.

About 0.1-0.2s in the worst case on actual Letterpress games is absolutely fine
for use in a service. The second case, which takes the most time, is a stress
test with every letter of the alphabet. It would never happen in Letterpress.