I have a problem: I need space-efficient lookup of file-system data based on file-path prefix. Prefix searching of sorted text, in other words. Use a trie, you say, and I thought the same thing. Trouble is, tries are not space-efficient enough, not without other tricks.

I have a fair amount of data:

- about 450M in a plain-text Unix-format listing on disk
- about 8 million lines
- gzip default compresses to 31M
- bzip2 default compresses to 21M

I don't want to be eating anywhere close to 450M in memory. At this point I'd be happy to be using somewhere around 100M, since there's lots of redundancy in the form of prefixes.

I'm using C# for this job, and a straightforward implementation of a trie will still require one leaf node for every line in the file. Given that every leaf node will require some kind of reference to the final chunk of text (32 bits, say an index into an array of string data to minimize string duplication), and CLR object overhead is 8 bytes (verified using windbg / SOS), I'll be spending >96,000,000 bytes in structural overhead with no text storage at all.
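As a sanity check on that figure, here's the arithmetic as a sketch; the node shape is hypothetical, only the numbers come from the measurements above:

```csharp
using System;

// Back-of-the-envelope check of the structural overhead claim.
static class TrieOverhead
{
    // 8 bytes of CLR object header plus a 32-bit chunk index per leaf node.
    public static long LeafBytes(long leafCount) => leafCount * (8 + 4);

    static void Main()
    {
        // 8 million lines => one leaf each => 96,000,000 bytes before storing any text.
        Console.WriteLine(LeafBytes(8_000_000));
    }
}
```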

Let's look at some of the statistical attributes of the data. When stuffed in a trie:

- total unique "chunks" of text: about 1.1 million
- total unique chunks: about 16M on disk in a text file
- average chunk length: 5.5 characters, max 136
- not counting duplicates: about 52 million characters total in chunks
- internal trie nodes average about 6.5 children, with a max of 44
- about 1.8M interior nodes

The excess rate of leaf creation is about 15%, and of interior-node creation about 22% - by excess creation, I mean leaves and interior nodes created during trie construction but not present in the final trie, as a proportion of the final number of nodes of each type.

Here's a heap analysis from SOS, indicating where the most memory is getting used:

The Dictionary<string,int> is used to map string chunks to indexes into a List<string>, and can be discarded after trie construction, though the GC doesn't seem to be removing it (a couple of explicit collections were done before this dump). !gcroot in SOS doesn't indicate any roots, so I anticipate that a later GC would free it.

MiniList<T> is a replacement for List<T> that uses a precisely-sized T[] (i.e. linear growth per Add, hence O(n^2) cumulative addition cost) to avoid space wastage; it's a value type, and is used by InteriorNode to track children. Its T[] is what gets added to the System.Object[] pile.
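For concreteness, a MiniList<T> along these lines might look like this; it's a sketch reconstructed from the description, not the actual code:

```csharp
using System;

// Value-type list whose backing array is always exactly Count long,
// so there is never spare capacity - every Add pays an O(n) copy.
struct MiniList<T>
{
    private T[] _items;   // exactly-sized; null until first Add

    public int Count => _items?.Length ?? 0;

    public void Add(T item)
    {
        int n = Count;
        Array.Resize(ref _items, n + 1);  // reallocate-and-copy on every Add
        _items[n] = item;
    }

    public T this[int i] => _items[i];
}
```

Being a mutable struct, it only behaves as expected when held in a field or local, not when copied around; that's the usual price of avoiding a second object header per list.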

So, if I tot up the "interesting" items (marked with *), I get about 270M, which is better than raw text on disk, but still not close enough to my goal. I figured that .NET object overhead was too much, and created a new "slim" trie, using just value-type arrays to store data:

This structure has brought down the amount of data to 139M, and is still an efficiently traversable trie for read-only operations. And because it's so simple, I can trivially save it to disk and restore it to avoid the cost of recreating the trie every time.
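The flat layout might look roughly like this - field names are illustrative, since the real SlimTrie code isn't shown in the post:

```csharp
using System;

// Read-only trie as parallel value-type arrays: no per-node objects,
// trivially serializable by dumping the arrays to disk.
class SlimTrie
{
    public int[] FirstChild;   // index of a node's first child slot (children are contiguous)
    public int[] ChildCount;   // number of children of each node
    public int[] ChunkIndex;   // per-node index into the shared chunk table
    public string[] Chunks;    // unique text chunks, each stored once

    // Find the child of `node` whose chunk matches `s` starting at `pos`, or -1.
    public int FindChild(int node, string s, int pos)
    {
        for (int i = 0; i < ChildCount[node]; i++)
        {
            int child = FirstChild[node] + i;
            string chunk = Chunks[ChunkIndex[child]];
            if (string.CompareOrdinal(s, pos, chunk, 0, chunk.Length) == 0)
                return child;
        }
        return -1;
    }
}
```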

So, any suggestions for more efficient structures for prefix search than trie? Alternative approaches I should consider?

What sort of use are you going to make of the data? Lots of processing, or just a few lookups? Can you give some idea of what trade-off between efficient storage and processing is acceptable?
– Jackson Aug 30 '09 at 21:11

It's basically to cache file-system lookup operations, so that things like getting all files in a directory, or all files recursively under a directory, don't need to consult the physical disk - which is invariably not in memory and is in fact across the network => far too many roundtrips. The performance expectation is that doing 150 prefix lookups (i.e. finding all lines with a given prefix), returning an average of 100 lines each, shouldn't take more than, say, 100ms. As it is, my SlimTrie approach takes 10 seconds to load from disk and list 8,000,000 lines => ~18ms.
– Barry Kelly Aug 30 '09 at 21:21

And that's with optimization turned off; with it on, 8.5 seconds, and that includes app startup. 140M isn't too bad, but considering the redundancy in this data, I'm sure it can be improved.
– Barry Kelly Aug 30 '09 at 21:23

3 Answers

Since there are only 1.1 million chunks, you can index a chunk using 24 bits instead of 32 bits and save space there.
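A minimal sketch of that 3-bytes-per-index packing (the names are hypothetical); 1.1 million chunks need only 21 bits, so 24 fits comfortably and saves a quarter of the index space over int[]:

```csharp
using System;

// Store 24-bit chunk indexes back-to-back in a byte[], little-endian.
static class Index24
{
    public static void Write(byte[] buf, int slot, int value)
    {
        int o = slot * 3;
        buf[o]     = (byte)value;
        buf[o + 1] = (byte)(value >> 8);
        buf[o + 2] = (byte)(value >> 16);
    }

    public static int Read(byte[] buf, int slot)
    {
        int o = slot * 3;
        return buf[o] | (buf[o + 1] << 8) | (buf[o + 2] << 16);
    }
}
```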

You could also compress the chunks. Perhaps Huffman coding is a good choice. I would also try the following strategy: instead of using a single character as the symbol to encode, encode character transitions. That is, instead of looking at the probability of a character appearing, look at the probability of each transition in a Markov chain whose state is the current character.
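The first step of that scheme would be counting the transitions, so that each "current character" state gets its own code table; a sketch (building the per-state Huffman trees is omitted):

```csharp
using System;
using System.Collections.Generic;

// Count character transitions: Freq[prev][next] = how often `next` followed `prev`.
// These per-state frequency tables would feed one Huffman tree per state.
class TransitionModel
{
    public Dictionary<char, Dictionary<char, int>> Freq =
        new Dictionary<char, Dictionary<char, int>>();

    public void AddSample(string s)
    {
        for (int i = 1; i < s.Length; i++)
        {
            if (!Freq.TryGetValue(s[i - 1], out var row))
                Freq[s[i - 1]] = row = new Dictionary<char, int>();
            row.TryGetValue(s[i], out int n);
            row[s[i]] = n + 1;
        }
    }
}
```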

A Huffman tree is the first thing I wrote after I saw the chunks in the trie - I was thinking of trying to encode lines as bit strings, one string for each chunk, concatenated - but while I was writing the bit-packing logic, I thought about using flat value-typed arrays for the trie encoding instead. Implementing Huffman encoding correctly and efficiently, and decoding in particular, gets pretty tedious pretty quickly. I may pick it back up and perhaps encode based on character frequency instead.
– Barry Kelly Aug 31 '09 at 9:10

Yes, indexing using fewer than 32 bits is something I've thought about. Other things: 16M of character data is cutting it close for 24 bits, but if I aligned character data to word boundaries, costing on average 0.5 bytes per chunk, I could use 24 bits to index up to 32M positions, for half the saving. And the bit-packing logic I was writing for the Huffman tree encoding may come in useful for storing indexes in less than a whole number of bytes. My next step will probably be writing a "bitfield array" class.
– Barry Kelly Aug 31 '09 at 9:20

I'll award this one the win. I wrote up a bit-packed array class that can index signed or unsigned integers of constant bit width, and I determine the maximum width required when converting from my mutable loading-time StringTrie to my immutable SlimTrie. Storing the SlimTrie on disk and reloading later saves time and memory, avoiding stale GC garbage hanging around. Now down to 75M!
– Barry Kelly Aug 31 '09 at 22:56

I missed your Markov-chain character-transition encoding the first time - an interesting idea that would probably reduce text storage quite substantially, as there is a degree of self-similarity even after the prefixes and duplicated chunks have been taken care of.
– Barry Kelly Aug 31 '09 at 23:03

You can find a scientific paper connected to your problem here (quoting the authors: "Experiments show that our index supports fast queries within a space occupancy that is close to the one achievable by compressing the string dictionary via gzip, bzip or ppmdi." - unfortunately the paper is paywalled). I'm not sure how difficult these ideas are to implement. The paper's authors have a website where you can also find implementations (under "Index Collection") of various compressed-index algorithms.

Actually, the radix tree, or Patricia trie, is how I'm storing my trie data already - storing only a single character per edge/node would clearly be insane for space consumption.
– Barry Kelly Aug 31 '09 at 18:08

Off-the-wall idea: instead of a trie, a hash table. You'd have in memory just the hash and the string data, perhaps compressed.

Or can you afford one page read? Keep only the hash and a file position in memory, and retrieve the "page" of lines matching that hash: presumably a small number of sorted lines, hence very quick to search in the event of collisions.
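A minimal sketch of that scheme; the names are hypothetical, and the on-disk page store is faked with an in-memory dictionary. A stable hash (FNV-1a here, rather than string.GetHashCode, which can vary between processes) matters once pages live in a file:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hash the entire prefix; each hash bucket is a "page" of candidate lines.
class HashPageIndex
{
    private readonly Dictionary<uint, List<string>> _pages =
        new Dictionary<uint, List<string>>();

    // FNV-1a: a simple, process-stable string hash.
    private static uint Fnv1a(string s)
    {
        unchecked
        {
            uint h = 2166136261;
            foreach (char c in s) h = (h ^ c) * 16777619;
            return h;
        }
    }

    public void Add(string prefix, string line)
    {
        uint h = Fnv1a(prefix);
        if (!_pages.TryGetValue(h, out var page))
            _pages[h] = page = new List<string>();
        page.Add(line);
    }

    public IEnumerable<string> Lookup(string prefix)
    {
        if (!_pages.TryGetValue(Fnv1a(prefix), out var page))
            return Enumerable.Empty<string>();
        // Discard hash collisions: keep only lines that truly match the prefix.
        return page.Where(l => l.StartsWith(prefix, StringComparison.Ordinal));
    }
}
```

Note the limitation raised in the comments: this only answers prefixes that were hashed at build time, since a hash of "a/b" tells you nothing about "a/b/c".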

Doing 150 seeks to read 100 lines from each location isn't as fast as one might wish - that's how I was doing it before I took up the trie approach. I was using a line index into the text file, i.e. a file containing a flat array of 32-bit offsets to the start of each line, with the main file in sorted order. The random seeks over a 450M file kill you.
– Barry Kelly Aug 31 '09 at 9:13

For the hash table idea - I don't quite understand you. The prefix being searched for isn't a fixed-length key; it could be a/b, a/b/c, a/b/c/d, etc. In the first trie I create - not the slim one - I'm already storing character data only once, using indexes.
– Barry Kelly Aug 31 '09 at 9:14

The idea was to hash the entire prefix, no matter how long. The hash is the index of a "page", and the page contains all the lines matching that hash. Hence you only do one logical read and get back some lines. [That might actually be a few physical reads, but hopefully way fewer than 150 seeks.] You then just discard any hash collisions that you don't want.
– djna Aug 31 '09 at 12:47

The problem is I'd be trying to do 150 separate prefix lookups in less than 100ms, so even if I had a hash that mapped the prefix to an exact location in the file, it would likely still be too slow.
– Barry Kelly Aug 31 '09 at 18:01