Sunday, August 28, 2011

Thesaurus

Steve Hanov blogged about building a thesaurus using a "zero load time" file format. Below, we translate his implementation into Factor.

You can download the 11 MB thesaurus data file we will be using (containing over 100,000 words and their lists of related words). It is implemented as a single file with a custom binary file format that looks like this:

[ header ]
    4 bytes: number of words

[ index section ]
    # The words are listed in alphabetical order, so you
    # can look one up using binary search.
    for each word:
        4 byte pointer to word record

[ word section ]
    for each word:
        null terminated text
        4 bytes: number of related words
        for each link:
            pointer to linked word record
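To make the layout concrete, here is a rough sketch in Python (not Factor, and not Steve Hanov's code) that packs a tiny word list into this format. The function name `build_thesaurus` is made up for illustration. The sketch assumes details the layout does not spell out: integers are unsigned 32-bit little-endian, pointers are absolute byte offsets from the start of the file, and text is UTF-8.

```python
import struct

def build_thesaurus(entries):
    """Pack {word: [related words]} into the layout described above.

    Assumptions (not stated in the post): integers are unsigned 32-bit
    little-endian, pointers are absolute byte offsets, text is UTF-8.
    Every related word must itself be a key of `entries`.
    """
    words = sorted(entries)          # index must be in alphabetical order
    header_size = 4                  # the word count
    index_size = 4 * len(words)      # one pointer per word

    # First pass: work out where each word record will start.
    offsets = {}
    pos = header_size + index_size
    for word in words:
        offsets[word] = pos
        # text + terminator + link count + one pointer per link
        pos += len(word.encode("utf-8")) + 1 + 4 + 4 * len(entries[word])

    # Second pass: emit the header, the index, then the word records.
    out = bytearray(struct.pack("<I", len(words)))
    for word in words:
        out += struct.pack("<I", offsets[word])
    for word in words:
        out += word.encode("utf-8") + b"\0"
        out += struct.pack("<I", len(entries[word]))
        for link in entries[word]:
            out += struct.pack("<I", offsets[link])
    return bytes(out)

# A three-word example file image, built in memory.
data = build_thesaurus({"big": ["large"], "large": ["big"], "tiny": []})
```

Because every link is stored as an offset to another word record, a reader can follow it with a single seek, which is what lets the file be used without any parsing step at load time.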

Build It

The data file consists of 4-byte "pointers" and null-terminated strings. We can build words to read an integer or a string from a particular location in the file:
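As a rough illustration of the same idea in Python rather than Factor, the sketch below seeks to an offset and reads either a 4-byte integer or a null-terminated string, then uses those two helpers for a binary search over the index. All names here (`read_int`, `read_string`, `find_word`, `related`) are hypothetical, and the little-endian byte order, absolute offsets, and UTF-8 text are assumptions. The sample bytes are a hand-built two-word file, not the real data file.

```python
import io
import struct

def read_int(f, ptr):
    """Read the 4-byte integer at byte offset ptr (assumed little-endian)."""
    f.seek(ptr)
    return struct.unpack("<I", f.read(4))[0]

def read_string(f, ptr):
    """Read the null-terminated string starting at byte offset ptr."""
    f.seek(ptr)
    text = bytearray()
    while True:
        b = f.read(1)
        if b in (b"", b"\0"):    # stop at the terminator or end of file
            break
        text += b
    return text.decode("utf-8")

def find_word(f, word):
    """Binary-search the index; return the word record offset or None."""
    lo, hi = 0, read_int(f, 0)   # header holds the number of words
    while lo < hi:
        mid = (lo + hi) // 2
        ptr = read_int(f, 4 + 4 * mid)
        candidate = read_string(f, ptr)
        if candidate == word:
            return ptr
        if candidate < word:
            lo = mid + 1
        else:
            hi = mid
    return None

def related(f, ptr):
    """List the related words stored in the record at offset ptr."""
    word = read_string(f, ptr)
    count_at = ptr + len(word.encode("utf-8")) + 1   # skip text and terminator
    n = read_int(f, count_at)
    return [read_string(f, read_int(f, count_at + 4 + 4 * i)) for i in range(n)]

# Hand-built sample: 2 words, index at offsets 4 and 8, records at 12 and 24.
sample = (struct.pack("<III", 2, 12, 24)
          + b"big\0" + struct.pack("<II", 1, 24)
          + b"large\0" + struct.pack("<II", 1, 12))
f = io.BytesIO(sample)
synonyms = related(f, find_word(f, "big"))
```

Any seekable binary file object works in place of the `io.BytesIO` wrapper, so the same helpers could be pointed at a file opened with `open(path, "rb")`. Reading one byte at a time keeps `read_string` simple; a real reader would probably read in larger chunks.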

1 comment:

Seems like a nice solution. FWIW, my old Aiksaurus program had a somewhat similar approach, navigating its two data files using fseek to avoid loading everything into memory. Of course, its data files were smaller than what you're dealing with here (about 500 KB together), but at the time this seemed like too much overhead. Ah, how things have come along. :)