Code WorkOut of the Day -- Daily exercises for programmers looking for jobs, or just to keep sharp with skills and have fun.
I give talks, like this: https://youtu.be/NpvTE7GlXSM for people looking for jobs, or groups of programmers preparing for M&A tech HR due diligence.
Follow us on twitter: @codewod

How to

When solving these problems, learning by doing is much better than learning by reading. I encourage to you read only as far in the solution as you need, then trying to solve the problem. If you get stuck, try reading a little further. And of course, let me know if you find a better solution!

Thursday, February 23, 2012

Data structures and DAWGS

Problem: (Thank you to Stanford’s CS 106X class). Discuss a data structure, under 64k bytes,
which can in near constant time determine whether a string is a valid word in
the English language, or is invalid.

Solution:

Let’s start by asking a few more assumptions about this
problem. First, do we have to build the
data structure? No – we just need to
figure out what sort of data structure would be able to do this.

Well, let’s start out with one of the parts of the questions
– ‘near constant time.’ Well, there are
very few common data structures which can achieve this. Some data structures that fail immediately
are binary search trees (log n), sorted arrays (log n), linked lists (n or log
n).

One common data structure that is nearly constant time is a hash
table. And, at first this seems like a
great data structure. We could hash
based upon the string, end up in a hash bucket, and so long as we keep the hash
buckets relatively small, we could get a near constant time data structure. Well, this is true. This would work as being nearly constant
time. Pic 1 shows such a hash table.

Now, let’s see if this would work on the space
constraint. Let’s make an assumption
that there are 50k words in English (especially with variations like sing,
sang, sung, singing, singer, sings, etc.).
This seems about right. Maybe
it’s higher, but let’s see if we can create a hash table with a dataset of this
size in the space that we have.

Let’s also assume that we use ascii and the average word is
5 letters (seems low, but let’s go with it).
So, the average word is 6 bytes long, assuming that it’s null
terminated. Storing all of these words,
not include the over head of the hash table, we would have a data structure of
6 * 50k = 300k. This is much larger than
we want, and probably it would be larger as our estimate of words is probably
low. With the overhead of the hash
table, we would have an even bigger amount.

So, a hash table can’t work as its data store will be too
large.

So, this creates a problem that any data structure would
have – we cannot uniquely store each word and meet the space requirements.

Well, where to go from here?

Well, let’s think about what we know about the English
language to look for optimizations. We
know that the words are not random character strings, but words with many
common parts.

Again, a good example
is: sing, sang, sung, sings, singer,
singing. Here, we have six distinct
words. In a hash table, we would have
six distinct entries and we know that this approach will not work.

However, looking at the words, we can see that they have
both common prefix and common suffix.
Four of the words have the common prefix ‘sing’ and four of them end
with ‘ng’ and 3 with ‘ing.’

Let’s think about what we could do here. Well, this screams for a tree approach where
individual characters are nodes. We
previously rejected a tree because of the time required, but let’s go with this
for now. We can make ach node only branches
when it has a new letter.

By doing this, we have a tree with s as the root and then
the structure as shown in pic 2.

We
could put a bit on each bit indicating whether stopping at the node is a valid
word or not.

This tree would be fairly quick to search to see if a word
were in it, but the structure isn’t quite good enough for searching. For example, we would have to search all of
the lower nodes to see if a descendant node with the character we’re looking
for.

One quick optimization is to create nodes with all 26
characters as pointers and a ‘true’ bit.
We can then easily go down the tree and see if we have a valid word or
not. And, this is almost constant time
as we never have to look up more elements the number of characters in the
word. This type of structure is shown in
pic 3, and is called a Directed Acyclical Word Graph (DAWG).

If we think of the size of this structure, it probably
reduces the size of the words by a factor of 10 given all of the common
prefixes in words (probably a greater
reduction as over time, most prefixes are pretty common, but seems pretty close). This will get us close to the size we’re
looking for, but let’s think about how we can do better.

So far, we have done a good job constructing a data
structure that optimizes around prefixes, but if we look at the example of the
words like, ‘sing’, we see that we also have lots of common suffixes. As simple optimizations, we could imagine
uniting common prefixes. For example, many words in English end in ‘ing’; we
could have all of these words point to a common ‘ing’ ending.

This would be a further optimization, but would obviously
require balancing and make the function which inserts words longer.

This would give us a much optimized and fast data structure.

This is a good example where it’s good to try the first
thought of what could work (e.g., a hash table), but then move on and look at
optimizations to get to the correct answer.

2 comments:

Nicely done. I think this is a question I was asked during a Microsoft interview.

Regarding the suffix optimization, I think you're attempting to "zip" the data from both the head of the word and the tail of the word. It's potentially problematic (I can't vividly see the solution yet) because either (1) you zip from tail during insertion time, in which case you're not sure where to let the variable body length remain unoptimized, or (2) you zip as a post-tree-creation processing step, in which case it's expensive to decide how to cluster them.

Regardless, I would propose this heuristic: Depending on whether the language is "head-optimizable" or "tail-optimizable", I would calculate both corresponding trees (maintain two trees during insertion time) and pick the one with a smaller size.

For example, hypothetically, English language might be tail-optimizable, because as you mentioned, a lot of words end with "+ing" (sing, wing, bring), or "+ination" (procastination, imagination, combination) so a tail-optimized tree probably takes less space than the head-optimized tree.

Thx Amin. It does make me remember some late nights! One easy way to do a suffix-optimized (tail optimized) would be just to compare putting the words in backwards and see if that makes a smaller data structure.

30 minutes, once a week, free

To help the community, I do one 30 minute interview every week for free for candidates (virtual, over some online coding tools). Email me if you're interested. It's good practice for the kinds of questions you'll get, and I'll give honest feedback as to your skills, likelihood of getting a job, and tips for improving/practicing for future interviews.

I especially like to work with minority candidates, and those looking to make career changes, so please shoot me an email, provide some background, and we'll set up a time.

noah@codewod.com

Also, if you find alternative or better answers to the questions, please let me know and I'll post/credit!

About Me

About me: I'm currently the founder of Workhood (www.workhood.com) and have spent over 10 years in the technology industry. Prior to Workhood, I was the co-founder of SocialShield, acquired by Avira (A/V security company with over 100m users), where I was subsequently the VP Technology/Operations. I worked previously in McKinsey's technology practice in London and New York and also worked at several start-ups as an engineer and product manager, founding my first venture backed company when I was 22 -- and have raised over $35M in my career in financing. My focus is always on delivering great products quickly and I am a huge proponent of agile programming/product development and the lean start-up techniques. I'm a Stanford CS grad, always interested in coding problems -- co-author of the best-selling book: Programming Interviews Exposed: Secrets to Landing Your Next Job. I also went to HBS and am an avid college football and soccer fan.

I blog about coding problems at: www.codewod.com and about start-ups at www.noahkindler.com. I live in the Bay Area and can often be found at Crossfit or on the slopes when I manage to sneak out.