Oracle Blog

Interesting search topics and random musings

Everything is Dictionary and Postings

When we started working on Minion, we decided to take a philosophy that we call EIDAP: Everything Is Dictionary And Postings.

The previous version of the engine had a number of issues (the queries were slower than I would have liked, for example), and while figuring out what needed to be fixed, it was clear that there were several places in the engine where I had developed a sorta-dictionary that had entries that used kinda-postings to store information. This, of course, led to a lot of code duplication with the utterly expected (and annoying) failure mode of having to debug and fix the same problem (or, even worse, similar problems) multiple times.

So, I decided that in Minion (although it wasn't called that yet) we would use the dictionary and postings paradigm wherever we could. This turned out to be a lot of places.

So here's how it works: a dictionary has entries. Each entry has a name and some associated information. Most of the entries encode information about the number of IDs associated with the entry, the size of the postings associated with the entry and the offset (or offsets) in a postings file where those entries might be found, but there's a general mechanism that a particular entry type can use to encode more information, if necessary.

The postings encode information about the IDs that are relevant to that entry. This information may just be the IDs or it might include frequency information or information about the fields in which the entry occurred and the positions where the entry occurred.

If this all sounds a bit generic, that's because it is pretty generic. We use this paradigm for pretty much everything in the engine (remember: EIDAP!) The entry types, the names of the entries, the postings types, and where the postings are written is all configurable at runtime.

Perhaps some concrete examples will help. By default, the "main" dictionary for a partition contains entries whose names are strings. The name of the entry is a term from a document. The postings associated with an entry contain the IDs of the documents that contain that term, the fields in which the term occurs in each document and the positions where the term occurs in each document.

The document dictionary contains entries whose names are strings. The name of an entry is the unique key for a document. The postings associated with an entry contain the IDs of the terms that the document contains and the frequencies of the terms in the document. There may be multiple postings per document, one per vectored field.

A saved field of type INTEGER has a dictionary whose names are integers. The names are the saved values. The postings associated with the entry contain the IDs of the documents that have that value in the field. You can imagine what the dictionaries are like for the other saved types.

We use bigram dictionaries to speed up wildcard processing for querying. A wildcard dictionary has entries whose names are character bigrams. These bigrams are extracted from the terms in an associated dictionary. The postings associated with a bigram entry contain the IDs of the terms containing the bigrams.

A single partition in the index might have more than a dozen dictionaries and associated sets of postings.

Where the EIDAP philosophy really pays off for us is when partitions are merged, which happens all the time in a live index. In the previous version, there were about four different places where we needed to implement merges (come to think of it, they were pretty much the four places that correspond to the examples above!), which meant that debugging a partition merge meant fixing the bug in one place and then looking in all the other places to figure out if the same bug was evident. When everything is a dictionary, a partition merge boils down to merging all of the dictionaries in the partitions, which means we only need to have merge code for dictionaries.

The code for merging dictionaries is fewer than 200 lines, including comments. Admittedly, we use a couple of support classes to do the merge (they're shared with the code that writes the dictionaries in the first place), but the resulting code size is substantially smaller than it was pre-EIDAP.

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.