Dictionary Lookups in JavaScript

I’ve been working on a browser-based word game, naturally written in JavaScript, and have been encountering some interesting technical challenges along the way. I’ve written up my thought process here for others to learn from (note that most of this happened over the course of a month or so). I’ve often found that while a final solution to a problem may be rather elegant and “make perfect sense” when looking at it, it was only arrived at through much trial and error.

To start, in my game the user is frequently re-arranging letters – causing the game to look up words in a dictionary to see if they are valid or not.

I’ve taken multiple passes at implementing a solution to this problem, ranging all the way from “I don’t care about performance, I just want it to work” all the way up to “thousands of people could be playing simultaneously, how do I scale?”

For my seed dictionary I used a variation of the Scrabble dictionary, which can be found via some creative Googling. The full dictionary ends up being around 916KB (with words separated by an endline).

Server-Side Solution

The first pass was stupid simple. It worked, but just barely. I took the dictionary and split it up into 26 files – one for each letter of the alphabet – and put all the words that started with the corresponding letter in the text file.

I then made a little PHP script to handle user requests – reading in the relevant portion of the dictionary and returning “pass” if the word was found or “fail” if not.

<?php
# Get the word to be checked from the user
$word = $_GET['word'];

# Get the first letter of that word
$first = substr( $word, 0, 1 );

# Open the corresponding file
$handle = fopen( "words/" . $first . ".txt", "r" );

if ( $handle ) {
    # Keep going until the end of the file
    while ( !feof( $handle ) ) {
        # Get the word in the dictionary
        # (removing the endline)
        $line = trim( fgets( $handle ) );

        # And see if the word matches
        if ( $line == $word ) {
            # If so return "pass" to the client
            echo "pass";
            exit();
        }
    }

    fclose( $handle );
}

# If we made it to the end, then we failed
echo "fail";
?>

(Please don’t use the above code, it’s terrible.)

Thus you would call the PHP script like so:

/words.php?word=test

And it would return either “pass” or “fail” (meaning that it would only work on the same domain).

So while the above worked, it wasn’t nearly fast enough and it consumed a ton of memory on the server – each request read in large portions of potentially 100KB+ files. Time for a better solution.

A Better Server-Side Solution

My next attempt was to write a simple little web server (in Perl, this time) that pre-loaded the entire dictionary into memory, and handled all the requests via JSONP.

There are a bunch of things that I don’t like about the above code (like using a hash to look up the words, when more memory-efficient solutions exist, and instantiating a CGI object on every request – things like that). Also if I were to do this today I’d probably write it in Node.js, which I’m becoming more familiar with.

Compared with the previous solution, though, there are a number of advantages. Since the dictionary is loaded into memory only once, and into a hash, lookups are very, very fast. I was timing entire HTTP requests at only a handful of milliseconds, which is great. Additionally, since this solution utilized JSONP it was possible to set up a server (or multiple servers) dedicated to looking up words and have the clients connect to them cross-domain.

A Client-Side Solution

It was at this point that I realized a couple of problems with the previous server-side solutions. If my game was going to work offline (or as a distributable mobile application) then a constant connection to a dictionary server wasn’t going to be possible. (And once slow mobile connections were taken into account, the responsiveness of the server just wasn’t going to be sufficient.)

The dictionary was going to have to live on the client.

Thus I wrote a simple Ajax request to load the text dictionary and make a lookup object for later use.

// The dictionary lookup object
var dict = {};

// Do a jQuery Ajax request for the text dictionary
$.get( "dict/dict.txt", function( txt ) {
    // Get an array of all the words
    var words = txt.split( "\n" );

    // And add them as properties to the dictionary lookup
    // This will allow for fast lookups later
    for ( var i = 0; i < words.length; i++ ) {
        dict[ words[i] ] = true;
    }

    // The game would start after the dictionary was loaded
    // startGame();
});

// Takes in an array of letters and finds the longest
// possible word at the front of the letters
function findWord( letters ) {
    // Clone the array for manipulation
    var curLetters = letters.slice( 0 ), word = "";

    // Make sure the word is at least 3 letters long
    while ( curLetters.length > 2 ) {
        // Get a word out of the existing letters
        word = curLetters.join("");

        // And see if it's in the dictionary
        if ( dict[ word ] ) {
            // If it is, return that word
            return word;
        }

        // Otherwise remove another letter from the end
        curLetters.pop();
    }
}

Note that having the dictionary on the client actually allowed for an interesting new form of game play. Previously the player had to select a specific word and send it to the server – and only then would the server say if that word was valid or not. Since the dictionary now lives on the client (making lookups instantaneous, in comparison) we can change the logic a bit: The game now looks for the longest word at the start of the user’s letters. For example, if the user had the letters “rategk” then the function would work like so:
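For example, with a tiny stand-in dictionary (and findWord repeated from above so the snippet stands alone), the successive checks look like this:

```javascript
// A tiny stand-in for the real dictionary lookup object.
var dict = { rat: true, rate: true, grate: true };

// Same findWord as above, repeated so this snippet is self-contained.
function findWord( letters ) {
    var curLetters = letters.slice( 0 ), word = "";
    while ( curLetters.length > 2 ) {
        word = curLetters.join("");
        if ( dict[ word ] ) {
            return word;
        }
        curLetters.pop();
    }
}

// Checks "rategk" (no), "rateg" (no), then "rate" -- a valid word.
findWord( "rategk".split("") ); // "rate"
```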

Naturally, something similar could’ve been done on the server side – but that wasn’t readily apparent when working on the server. This is a case where a performance optimization actually created a new, emergent form of gameplay.

But here’s the rub: We’re now sending a massive dictionary down to a client. It’s 916KB – and that’s absolutely massive.

Optimizing the Client-Side Solution

Compression and Caching

Turning on Gzip compression on the server reduces the dictionary file size from 916KB to a much-more-sane 276KB. Additionally configuring the cache-control settings of your server will ensure that the dictionary won’t be requested again for a very long time (assuming that the cache in the browser isn’t cleared).

There are some excellent articles already written on these techniques.

Of course, all of this is a bit of a given when you use a Content Delivery Network – which I most certainly do. Right now I’m using Amazon Cloudfront, due to its relatively simple API, but I’m open to other solutions. This means that the dictionary file will be positioned on a large number of servers around the globe and served to the user in the fastest manner possible (using both gzip and proper cache headers).

Cross-Domain Requests

There’s a problem, though: We can’t load our dictionary from a CDN! Since the CDN is located on another server (or on another sub-domain, as is the case here) we’re at the mercy of the browser’s same-origin policy, which prohibits those types of requests. All is not lost though – with a simple tweak to the dictionary file we can load it across domains.

First, we replace all endlines in the dictionary file with a space. Second, we wrap the entire line with a JSONP statement. Thus the final result looks something like this:

dictLoaded('aah aahed aahing aahs aal... zyzzyvas zzz');

This allows us to do an Ajax request for the file and have it work as we would expect it to – while still benefiting from all the caching and compression provided by the browser.

I alluded to another problem already: If the browser doesn’t cache the file for some reason, or if the cache runs out of space and the file is expunged, it’ll be downloaded again. I want to try to reduce the number of times this 276KB file is downloaded – if not for the users then for reducing my bandwidth bill.

Local Storage

This is where we can use a great feature of HTML 5: Local Storage. Mark Pilgrim has a great tutorial written up on the subject that I recommend to all. In a crude nutshell: You now have an object that you can stuff strings into and they’ll be persisted by the browser. Most browsers give you around 5MB to play with – which is more than enough for our dictionary file. It also has great cross-browser support – with the simple API working across all modern browsers.

With a little tweak to our Ajax logic (taking into account the JSONP request, the CDN, and Local Storage) we now end up with a revised solution:
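A sketch of what that revised logic might look like – it assumes a dictReady( txt ) callback elsewhere that builds the lookup and starts the game, and the CDN URL and gameDict storage key are illustrative:

```javascript
// Called by the JSONP-wrapped dictionary file when it executes:
// dictLoaded('aah aahed ... zzz');
function dictLoaded( txt ) {
    // Cache the raw dictionary string for future visits, if we can.
    try {
        window.localStorage.gameDict = txt;
    } catch ( e ) {
        // No localStorage, or the quota was exceeded -- skip caching.
    }
    dictReady( txt );
}

function loadDict() {
    // If a cached copy exists, use it and skip the request entirely.
    if ( window.localStorage && window.localStorage.gameDict ) {
        dictReady( window.localStorage.gameDict );
    } else {
        // Otherwise pull the JSONP file from the CDN via a script tag.
        var script = document.createElement( "script" );
        script.src = "http://cdn.example.com/dict/dict.js";
        document.documentElement.appendChild( script );
    }
}
```

On the first visit the script tag fires one request; on every later visit the string comes straight out of Local Storage.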

This gives us an incredibly efficient solution, allowing us to load the dictionary from a CDN (gzipped and with proper cache headers) – and avoiding subsequent requests to get the file if we already have it cached.

Improving Memory Usage

There’s one final tweak that we can make. If you remember, the previous dictionary lookup loaded the entire dictionary into an object and then checked to see if a specific property existed. While this works, and is fast, it also ends up consuming a lot of memory (more so than the existing 916KB, at least).

To avoid this we can be a little bit tricky. Instead of putting the words into an object we can just leave the entire dictionary string intact and then do searches using a JavaScript String’s indexOf method.

Thus inside the dictReady callback we’ll have something like the following:
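Something along these lines (with a tiny inline string standing in for the real dictionary text):

```javascript
// txt stands in for the space-separated dictionary string that the
// dictReady callback receives.
var txt = "aah aahed aahing aahs aal";

// Pad the string so the first and last words are also surrounded
// by spaces; then a whole-word check is a single indexOf call.
var dict = " " + txt + " ";

function isWord( word ) {
    // Searching for " word " keeps us from matching inside a
    // longer word.
    return dict.indexOf( " " + word + " " ) > -1;
}

isWord( "aahs" ); // true
isWord( "aa" );   // false -- a substring of "aah", but not a word
```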

All together, as of this moment, this is the best solution to this particular problem that I can think of. It’s likely that I’ll find some additional tweaks (or ways of improving memory usage) in the future – but at the very least this solution keeps HTTP requests to a minimum, bandwidth usage to a minimum, memory usage to a minimum, and lookups fast.

Have you looked into bloom filters? They seem to be a great answer for your problem. Very compact, quick lookup, 0% false negatives. The downsides: false positives are possible (you can control the odds of that happening by sizing your filter), and you can’t traverse the list of known words (only test for their presence). Implementing one in JavaScript shouldn’t be hard, using an array of 32-bit values or a string of Unicode characters, for instance.
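A minimal sketch of the idea (the hash functions, hash count, and filter size here are illustrative, not tuned):

```javascript
// Create a filter of numBits bits, stored in 32-bit integer words.
function makeFilter( numBits ) {
    var words = [];
    for ( var i = 0; i < Math.ceil( numBits / 32 ); i++ ) {
        words.push( 0 );
    }
    return words;
}

// Derive k bit positions for a word via two simple string hashes
// combined with double hashing.
function bitsFor( word, numBits, k ) {
    var h1 = 0, h2 = 5381;
    for ( var i = 0; i < word.length; i++ ) {
        h1 = ( h1 * 31 + word.charCodeAt( i ) ) >>> 0;
        h2 = ( ( h2 * 33 ) ^ word.charCodeAt( i ) ) >>> 0;
    }
    var out = [];
    for ( var j = 0; j < k; j++ ) {
        out.push( ( h1 + j * h2 ) % numBits );
    }
    return out;
}

function addWord( filter, word, numBits ) {
    bitsFor( word, numBits, 3 ).forEach(function ( bit ) {
        filter[ bit >> 5 ] |= 1 << ( bit & 31 );
    });
}

// May return a false positive, but never a false negative.
function mightContain( filter, word, numBits ) {
    return bitsFor( word, numBits, 3 ).every(function ( bit ) {
        return ( filter[ bit >> 5 ] & ( 1 << ( bit & 31 ) ) ) !== 0;
    });
}
```

Every word that was added is guaranteed to test positive; the filter size controls how often an absent word does too.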

Also, an (indexed) database should be very efficient for lookups and memory usage, both on the server side and in Web SQL (more so than localStorage, since SQLite will know how to manage things without necessarily keeping the whole database in memory). I’m not familiar with IndexedDB, but since it’s also (presumably) backed by SQLite, it should be good.

Another option is to hash the words into a number. Store those numbers in an ordered array. To look up a word, hash the word into a number and do a binary search on the array. Not sure how much memory that would take, but it would be fast as heck!
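A sketch of that idea, using FNV-1a purely as an example 32-bit hash; as with a Bloom filter, hash collisions make rare false positives possible:

```javascript
// Hash a word to an unsigned 32-bit integer (FNV-1a).
function hashWord( word ) {
    var h = 2166136261;
    for ( var i = 0; i < word.length; i++ ) {
        h = Math.imul( h ^ word.charCodeAt( i ), 16777619 ) >>> 0;
    }
    return h;
}

// Hash every word once, then sort the numbers for binary search.
function buildIndex( words ) {
    return words.map( hashWord ).sort(function ( a, b ) {
        return a - b;
    });
}

function lookup( index, word ) {
    var h = hashWord( word ), lo = 0, hi = index.length - 1;
    while ( lo <= hi ) {
        var mid = ( lo + hi ) >> 1;
        if ( index[ mid ] === h ) { return true; }
        if ( index[ mid ] < h ) { lo = mid + 1; } else { hi = mid - 1; }
    }
    return false;
}
```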

There’s definitely a risk of going way too far down the rabbit hole of premature optimization. But you could have 26 dictionary strings, one for words starting with each letter, or 676 for words starting with (or consisting of) each pair of letters. If you do one of those, you could save a tiny bit of RAM by leaving the first letter (or two) off each entry.

And I’m sure it’s possible to do all sorts of algorithms — binary searches, radix tries represented as one long string, who knows what — but, honestly, scanning a reasonable-sized string is so dang fast the fancy stuff probably isn’t worth it.

Two fun ideas from that “probably not useful” zone: if the lookup delay is still noticeable, move the lookup to a time the user won’t notice it — after a tile is placed instead of after the play’s submitted. If the browser supports workers, it could even be in the background. And if the load time is a problem (I think it’s not), you could use a server-side dict until the client dict is all loaded.

Hmm, you could’ve simply loaded a JavaScript file with the dict in it through a script tag, instead of using an Ajax request. The script tag could be inserted dynamically, so your logic stays the same, but this avoids the security issues with JSONP that have been rumoured around the web (I don’t know the details myself) and the request isn’t prohibited by the browser’s cross-origin policy.

“window.localStorage !== null”. Shouldn’t that be “!=”? When localStorage is unavailable, it will be undefined, which == null but !== null so I’m thinking it’ll go on to crash at “window.localStorage.gameDict”.

The indexOf solution is definitely not algorithmically optimal.
The complexity of indexOf is linear in the number of characters in the string.
Dictionary lookups can (and should) be done in logarithmic time or better.
A Trie is definitely an excellent structure to play around with.
Its complexity (number of comparisons) is bounded: O(min(a, log(b))) – Where a is the length of the lookup word and b is the size (in words) of the dictionary.

I played around with dictionary games before, and last year I wrote some Node.js code that takes a wordlist and generates a JS library or JSON. It also performs very fast when searching for correct words, and can be enhanced to contain word descriptions (after another creative googling session).

The JS library also gzips better than the original wordlist – and is also smaller in the end.

I was going to respond with “use a trie!” earlier too, but I figured I’d put some code where my mouth was. I’ve been implementing something very similar to this in C#, so I figured it would be fun to convert to JS.

Say you had a wordlist of [‘bar’, ‘bars’, ‘foo’, ‘rat’, ‘rate’], you might boil that down to a trie structure like this to save space:

// A trie built from the wordlist above; "end: true" marks nodes
// that terminate a valid word.
var trie = {
    b: { a: { r: { end: true, s: { end: true } } } },
    f: { o: { o: { end: true } } },
    r: { a: { t: { end: true, e: { end: true } } } }
};

function findLongestWord(letters, trieContext) {
    // Accept either a string or an array of letters, and start at
    // the root of the trie when no context is passed in.
    trieContext = trieContext || trie;

    // If the trie contains a key matching the next letter,
    // descend into that level of the trie and do it again.
    if (trieContext[letters[0]]) {
        // However, don't return this back up until we know
        // there's a match farther down there somewhere.
        var whatLiesBelow = findLongestWord(letters.slice(1), trieContext[letters[0]]);

        // But if there is, it's the longest result; concatenate
        // and send it back up the stack.
        if (whatLiesBelow)
            return letters[0] + whatLiesBelow;
    }

    // If none of that worked, check to see if we're at the end
    // of any valid words and return this terminal letter if so.
    if (trieContext[letters[0]] && trieContext[letters[0]].end) {
        return letters[0];
    }

    // Else, there's just nothing interesting at this point.
    // Return false back up and hope we had better luck earlier.
    return false;
}

// Returns 'rat'
findLongestWord('ratt');

// Returns false
findLongestWord('unrate');

// Returns false
findLongestWord('abc');

// Returns 'rate'
findLongestWord(['r', 'a', 't', 'e']);

Apart from the performance/memory optimization, one nice thing about the trie is that it can handle wildcards easily (e.g. blank tiles in Scrabble) if you decide that you want them in the future. Adding wildcard support to an indexOf() approach massacres lookup performance.

It only does anagrams (and doesn’t support the “use all letters” option, either.) I didn’t get the crosswords in there. (There wasn’t any point. I was testing performance and anagrams are the hard part.)

Don’t bother looking at the code. I was just scrapping about trying to figure out how to make it work and I didn’t know anything about javascript at the time. It’s pretty ugly.

Everyone is yelling to use a trie. Sure. It’s a great solution, and pretty dang perfect in this case. But just as a thought exercise you could consider your dataset as a perfectly balanced tree (because it’s never changing, your distinction for balancing can be arbitrary). If all you need to do is look up a word that you know you are looking for, then a binary search will get you there hella-fast (pick the “middle” word in your set, check if it’s greater than – alphabetically in this case – or less than your target word, then split the remaining half and recurse). If you make each word a fixed length (which will probably add a few KB to your dataset), then you can even do it with byte offsets into a string, and all your comparisons are integers. I’ve done it that way with a simple database I built, and it could look up a fixed-length record in a 100MB dataset in about 20 milliseconds (if the file was in memory…)
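A small sketch of the fixed-width variant (the width and word list are illustrative):

```javascript
// Pad every word to a fixed width so entry i always starts at
// character offset i * WIDTH.
var WIDTH = 8;

function pad( word ) {
    while ( word.length < WIDTH ) { word += " "; }
    return word;
}

// The words must be sorted for the binary search to work.
var data = [ "aah", "aahed", "rate", "zebra" ].sort().map( pad ).join("");

function lookup( word ) {
    var target = pad( word ),
        lo = 0,
        hi = data.length / WIDTH - 1;

    while ( lo <= hi ) {
        var mid = ( lo + hi ) >> 1,
            entry = data.slice( mid * WIDTH, mid * WIDTH + WIDTH );

        if ( entry === target ) { return true; }
        if ( entry < target ) { lo = mid + 1; } else { hi = mid - 1; }
    }
    return false;
}
```

Since the space used for padding sorts before every letter, the padded entries stay in the same order as the plain sorted words.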

@All: Thanks for the Trie, DAWG, and Bloom filter suggestions! I’ll do a follow-up post and talk about those specifically. Note that this post was mostly written to get people thinking about practical forms of optimization (that hopefully transcend dictionaries themselves – which only have limited applications in JavaScript applications). I’ll definitely do another post digging deeper into dictionary lookups themselves. Thanks!

@Thomas I quite like your suggestion about the perfectly balanced binary tree (in an array) and I too have experienced it to be the fastest solution. The “keeping-all-words-fixed-length-in-a-string” trick is very useful if you are working with files, but I guess for in-memory stuff, an array should be just fine.

Having said that though, I guess performance doesn’t matter so much (as memory usage) for a browser app. As John Resig noticed, perf. was an issue on the server since one server would be serving multiple requests (memory could be compromised since it would be amortized across all the requests) whereas in the browser, memory is a problem because you are eating up everyone’s RAM (which comes at a slight premium).

Before I looked at the comment, I was also thinking trie, and made a few pseudo code examples.

Starting with the string you had in your post “aah aahed aahing aahs aal” (25 characters), it would result in a trie that looks like this (the first column is the level of the node):

0 a
1 a
2 h#
3 e
4 d#
3 i
4 n
5 g#
3 s#
2 l#

That can be encoded for transmission from server to client as follows. A simple convention: “+” goes up one level, letters start a new level, “#” indicates a valid end of word. “-” is often a valid letter in words, so don’t use it instead of “+”.

aah.ed#+ing#++s#+l# (19 characters)

I will leave it as an exercise to verify that this can indeed be decoded into the initial trie tree.

Now, the client needs to decode this, since the string as such cannot be used for easy lookup. I do not think using JavaScript objects is a good approach, since they don’t really save memory (lots of dictionaries and lookups in there). I suggest instead using a simple array of tuples: each tuple is a letter and the index of the next letter at the same level (and with the same parent). The letter should be encoded as a Unicode integer value, not a String, to save space (say 2 bytes for the letter and perhaps 2 for the index; that’s four bytes per letter in the trie tree, less than a hashtable or n-ary tree would use, but more than the encoded trie). We also need to indicate which letters mark a valid end of word. That might require an additional byte, unless we encode it in the index itself as a special bit.

Here is some pseudo-code that verifies whether a word is valid. I haven’t checked this in detail.

current = 0;
parent = -1;
for each letter in word:
    // current points to the first candidate entry at this level
    while (current != -1) {
        if (trie[current][0] == letter) {
            // We found the entry for the current letter, move to
            // its first child node
            parent = current;
            current++;
            break;
        }
        // Otherwise move to the next sibling entry at this level
        current = trie[current][1];
    }
    // No entry matched this letter: the word is not in the trie
    if (current == -1) return false;
// The word is valid if the last matched entry is marked as a
// valid end of word
return isWordEnd(trie[parent]);

A good tip is to not parse strings with JavaScript code if all you’re doing is simple splitting. Instead wrap the string with JSON and use JSON.parse(). The difference can be HUGE depending on the browser; I’ve seen a 20x decrease in time.

So instead of "1 2 3 4 5".split(" ")
do JSON.parse("[1,2,3,4,5]").
For strings you will of course have to transfer two extra quote marks per word, but gzip should handle that quite well since it’s a consistent pattern of ","

For the lookup I guess a sorted list of the words with binary search would have quite fast lookups and pretty low memory usage.

In addition to that, I don’t know if it’s such a great idea to use localStorage as a cache. Browsers have a dedicated cache for a reason.

@Joe – I think what John is looking for here is for his design to also work on mobile devices if and when he ports it to those platforms. Your suggestion is still valid, but redirecting mobile users would require network data connectivity – which isn’t cheap.

Are there any memory size optimizations, knowing that 26 letters (a-z) can fit in 5 bits but a character in JavaScript takes 16 (?). Maybe put three characters into a 3-gram (http://en.wikipedia.org/wiki/N-gram) since 3*5 = 15 < 16? Of course, this is a new problem space for me, so I’m just brainstorming; I haven’t done much bit shifting in JavaScript.
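A quick sketch of that packing – three a-z letters at 5 bits each is 15 bits, which fits in a single 16-bit character code:

```javascript
// Pack three a-z letters into a single 16-bit character code.
function pack3( s ) {
    var code = 0;
    for ( var i = 0; i < 3; i++ ) {
        // "a" becomes 0, "z" becomes 25; shift 5 bits per letter.
        code = ( code << 5 ) | ( s.charCodeAt( i ) - 97 );
    }
    return String.fromCharCode( code );
}

// Recover the three letters from the packed character.
function unpack3( ch ) {
    var code = ch.charCodeAt( 0 ), out = "";
    for ( var i = 2; i >= 0; i-- ) {
        out += String.fromCharCode( 97 + ( ( code >> ( 5 * i ) ) & 31 ) );
    }
    return out;
}

pack3( "cat" ).length; // 1 -- three letters stored in one character
```

The largest possible code is 26425 (for “zzz”), safely below the surrogate range, so the packed characters survive as ordinary strings.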

Funny timing on this post. I’ve been working on a word-finder in Node.js for the last weekend or so. It uses the trie structure along with some clever prime factorization logic for the node filtering (plus a few trade secrets ;)), and the results are insanely fast.

For example, using the Words With Friends dictionary, I can locate 974 possible matches from the input string “qewfjvdslejsdfjuamcm” (20 chars) with an average look up time of 1.3ms.

Another advantage of this data structure is that you can do efficient (and algorithmically sane) “blank tile” searches. For example, searching for “abcd**” (indicating two blank tiles) returns 1234 matches in only 1.7ms.

Protip: sort the words before creating the trie, and then store the sorted word on the end node of a given path. If you sort alphabetically, you get space optimization, and if you sort based on character frequency, you get early pruning :)

I’m so happy to find so many other people who immediately started hashing out datastructures and pseudocode! I agree with the trie solution. Precompile the dictionary into a JSON trie and send that (gzipped, cached) to the client.

@Jeremy – loving the idea of sorting the letters before creating/using the trie. Having the entire word present at the end of a trie path seems to defeat the space-saving benefit of a trie, though.

@Milo
It’s definitely a space/cpu tradeoff (the characteristics of which are both relative to the dictionary input). If you’re doing a “pure trie”, then storing the words at the end doesn’t make any sense, since you already have the characters present in each node (and they’ll be in order if you haven’t sorted them). In my solution each node actually stores a prime number, based on the character it represents. That way I can do super fast node matching based on the result of modding it against a product derived from the input string. In JavaScript this requires some trickery since that product quickly grows too large for a long to store accurately, but it can be made to work. Anyway, storing the word on the node was necessary in my case, but definitely not for every implementation.

This allowed me to save some new-line characters.
After loading those packed patterns, the client ‘unpacks’ the patterns by splitting the lines and stores them in a more verbose object.
If available, the unpacked object is stored in localStorage, thus the unpacking has only to be done once.

Since the original hyphenation algorithm by F. M. Liang uses a trie, I experimented with tries, too. But the format proposed by Dave Ward above turned out to be too verbose for transmission. The format proposed by Emmanuel Briot looks very promising, though.

@Ryan Fair enough, I suppose (albeit unwarranted). The point wasn’t to be self-congratulatory though, so much as to put some numbers behind the efficiency of using a trie. I didn’t invent the structure, and the idea of using prime factorizations for anagram-style problems isn’t original either. Rather than simply chiming in with “ya, tries are fast”, it seems more useful to provide some numbers around it.

Can’t help but think we’re partly missing the point talking about algorithms. John’s post is about a practical all-around approach (simple, compact, fast enough), not squeezing every bit of speed he can out of lookup.

Nonetheless, I’ll probably come back later with my own entry in the algorithm-golf tournament :)

@Mathias: I like this idea but I don’t know if it uses less memory than the trie structure. Btw, you don’t have to split the strings if the words are sorted. You can use good old binary search with index access.

I agree Emmanuel Briot’s compression is good but you might consider using the shared prefix format used for Unix ISpell (I think). Words are sorted and compressed to a shared prefix count and suffix. So,
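The idea, sketched in JavaScript (the exact on-disk format varies between tools; this “front coding” is just an illustration):

```javascript
// Encode a sorted word list as "<shared prefix length><suffix>"
// entries, each relative to the previous word.
function frontEncode( words ) {
    var prev = "", out = [];
    words.forEach(function ( word ) {
        var n = 0;
        while ( n < word.length && n < prev.length &&
                word.charAt( n ) === prev.charAt( n ) ) {
            n++;
        }
        out.push( n + word.slice( n ) );
        prev = word;
    });
    return out.join( " " );
}

// Rebuild the word list by reusing the prefix of the previous word.
function frontDecode( encoded ) {
    var prev = "", words = [];
    encoded.split( " " ).forEach(function ( entry ) {
        var m = /^(\d+)(.*)$/.exec( entry );
        prev = prev.slice( 0, +m[ 1 ] ) + m[ 2 ];
        words.push( prev );
    });
    return words;
}

// "aah aahed aahing aahs aal" becomes "0aah 3ed 3ing 3s 2l"
frontEncode( [ "aah", "aahed", "aahing", "aahs", "aal" ] );
```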

@Thomas in regards to a static, balanced tree for binary search purposes, I’d say it’s still not as good as a trie for this specific case of looking up words in a dictionary.

The reason being that the computational complexity for a binary search is O(log N) where N is the # of words in the dictionary
A trie search’s complexity is O(M) where M is the # of letters in the word

In this scenario the length of a word is typically MUCH smaller than the number of words in a dictionary

I think the trie can itself be compacted into a pure string format, yielding a file much smaller than the dictionary itself (with all the JSON overhead, the JSON version comes in at about 20% larger for the full dictionary I found online).

Interesting post! On another topic, a different approach to JSONP would be to save your data-structure-du-jour as a set of pixel values in a PNG (using its inbuilt compression, and perhaps some tweaking to use 5/6/7-bit values to represent your letters).

Load this up on the client’s CANVAS and extract the pixel values, optionally saving the result to localStorage. The disadvantage would be no cross-domain/CDN access without perhaps remapping the DNS of a subdomain.

1. Each Trie node is represented as one line of the file.
2. If the node is “terminal” (i.e., the string leading to that node is a word in the dictionary), then the line begins with a ‘!’ character.
3. The rest of the line is a list of character strings.
4. Strings that are “terminal” are comma-separated.
5. Strings that reference other nodes are appended with the numeric LINE OFFSET to the node that they reference.

Using this encoding format (and without suffix sharing), the ospd3.txt file compresses from 625,324 bytes down to 299,924 bytes.

This format is also very efficient to use “as-is” in a trie traversal (it does not have to be reconstituted into a collection of JavaScript objects). After loading, I would split the lines, and then the relative node references can be traversed in constant time.

Very basic benchmarking on my machine says that the original Net::Server::HTTP version got around 350 reqs/sec, and that the PSGI version gets 2500 reqs/sec or more (with both starman and twiggy). Maybe I’ll do a blog entry with some graphs.

I’d LOVE to see the Node.js equivalent! Node.js uses libev, which can also be used by the PSGI/twiggy/AnyEvent stack, so in theory they should be pretty comparable technologies. I might also do Mojolicious version, all in the name of self-education on the various technologies and techniques.

Hi John, what about making the entire word list a tree, branching on each letter? Each branch can be a sub-object. This may save some space in the JSON representation, and should be quite fast in searches. I’m not sure whether this tree search will be faster than the native implementation of indexOf, but the tree search only has to take n steps (n = length of word), while indexOf has to go through the entire word file.

I have a dictionary application in JavaScript at artimap.com and I am open-sourcing the refactored code at ogbuzz.com. I found that in a real dict app you check for far more than the existence of a word; you also want to make suggestions to the user, and for that an unsorted array is ideal because you want to search through ALL the words in the dictionary anyway. @Thomas Some operations could be done faster if the array was sorted, though. @John JavaScript dictionaries have wide usage possibilities, e.g. as part of web-site search engines and spreadsheet analyzers.