Here is an extremely simple implementation of a statistical full-text
substring search whose runtime is independent of the corpus size. Recall is
100% and precision is adjustable.

We use the sample index from http://www.dotnetdotcom.org/ for the test corpus
(2MB in size) and break the source into lowercased n-grams, in this case of
length 5. Given the string "The dog ran.", we'd get eight 5-grams: "the d",
"he do", "e dog", " dog ", "dog r", "og ra", "g ran", and " ran."

After all grams have been enumerated, the amount of space required increases
roughly 5 fold (or n fold for n-grams), since nearly every character of the
source now appears in n overlapping grams. Then we just iterate through each
gram and insert it into a hashmap where the key is the gram and the value is
the set of URLs whose content contains that gram.
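
The indexing step might look roughly like the sketch below. It assumes the
corpus has already been parsed into (url, content) pairs; the two example
pages, their URLs, and the name build_index are made up for illustration.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return every lowercased n-gram of text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(pages, n):
    """Map each gram to the set of URLs whose content contains it."""
    index = defaultdict(set)
    for url, content in pages:
        for gram in ngrams(content, n):
            index[gram].add(url)
    return index


# Hypothetical two-page corpus, standing in for the real web index.
pages = [
    ("http://example.com/a", "The dog ran."),
    ("http://example.com/b", "A dog sat."),
]
index = build_index(pages, 5)
print(sorted(index[" dog "]))  # both pages contain the gram " dog "
print(sorted(index["e dog"]))  # only the first page contains "e dog"
```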

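Searching consists of breaking the search string into grams, looking up the
set of URLs for each gram, and intersecting all the sets. Every URL whose
content contains the search string is guaranteed to be in the resultant set,
and any URL in the set contains the search string with a high probability
(which will be defined in another post). A false positive is a page that
contains every gram of the search string, just not contiguously.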
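
Here is a sketch of the search step under the same assumptions as before
(pages already parsed into (url, content) pairs; the example pages and the
names build_index and search are hypothetical). The third page is contrived
to show why a match is only probable: it contains both grams of "abcdef"
without containing "abcdef" itself.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return every lowercased n-gram of text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(pages, n):
    """Map each gram to the set of URLs whose content contains it."""
    index = defaultdict(set)
    for url, content in pages:
        for gram in ngrams(content, n):
            index[gram].add(url)
    return index


def search(index, query, n):
    """Intersect the URL sets of every gram in the query."""
    grams = ngrams(query, n)
    if not grams:
        raise ValueError("search string can't be shorter than the gram length")
    result = None
    for gram in grams:
        urls = index.get(gram, set())  # .get avoids creating empty entries
        result = set(urls) if result is None else result & urls
        if not result:
            break  # no URL can match once the intersection is empty
    return result


pages = [
    ("http://example.com/a", "The dog ran."),
    ("http://example.com/b", "A dog sat."),
    ("http://example.com/c", "abcde xbcdef"),
]
index = build_index(pages, 5)

print(sorted(search(index, "dog ran", 5)))  # ['http://example.com/a']
# False positive: page c has the grams "abcde" and "bcdef", but not "abcdef".
print(sorted(search(index, "abcdef", 5)))   # ['http://example.com/c']
```

Note that the work done per query depends on the number of grams in the search
string and the sizes of the sets being intersected, not on scanning the corpus.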

The following code implements what we've described here. The script accepts
two arguments: the file with the web content (from http://www.dotnetdotcom.org)
and the gram length. You can run it with "python script.py web_index 5". The
first loop does the indexing, and the second does the searching.

There are two big drawbacks with this approach: 1) the search string can't be
shorter than the gram length, and 2) the index becomes huge, in this case 13MB
from 2MB. But the benefit is that you can do full-text searching extremely
quickly, even over very large corpora, and recall is always 100%. You can
also increase or decrease the size of the index by adjusting the size of the
grams and the frequency of hash collisions. By hash collisions, I'm referring
to two different grams pointing to the same set of URLs, even though not every
URL in that set contains both grams. By adjusting these parameters, we can
finely tune the precision of the search and the size of the index. I plan on
visiting these tradeoffs in a follow-up post.
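
One way to picture the hash collision knob is a fixed number of buckets, with
each gram hashed to a bucket and colliding grams sharing one URL set. This is
only a sketch of the idea, not necessarily how the follow-up will do it: the
bucket count, the use of zlib.crc32 as the hash, the example pages, and all
function names are my assumptions.

```python
import zlib


def ngrams(text, n):
    """Return every lowercased n-gram of text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bucket_of(gram, num_buckets):
    # crc32 is stable across runs, unlike Python's built-in hash() on strings.
    return zlib.crc32(gram.encode("utf-8")) % num_buckets


def build_bucketed_index(pages, n, num_buckets):
    """Grams that hash to the same bucket share a single URL set."""
    buckets = [set() for _ in range(num_buckets)]
    for url, content in pages:
        for gram in ngrams(content, n):
            buckets[bucket_of(gram, num_buckets)].add(url)
    return buckets


def search(buckets, query, n):
    """Intersect the bucket sets of every gram in the query."""
    grams = ngrams(query, n)
    if not grams:
        raise ValueError("search string can't be shorter than the gram length")
    result = None
    for gram in grams:
        urls = buckets[bucket_of(gram, len(buckets))]
        result = set(urls) if result is None else result & urls
    return result


# Hypothetical two-page corpus.
pages = [
    ("http://example.com/a", "The dog ran."),
    ("http://example.com/b", "A dog sat."),
]
coarse = build_bucketed_index(pages, 5, 1)      # every gram collides
fine = build_bucketed_index(pages, 5, 1 << 16)  # collisions are rare

# With a single bucket every page matches every query: recall is still 100%,
# but precision is as low as it can get.
print(sorted(search(coarse, "dog ran", 5)))
print(sorted(search(fine, "dog ran", 5)))
```

Collisions can only add URLs to a gram's set, never remove them, so a page
that really contains the search string always survives the intersection.
Fewer buckets means fewer keys to store and more false positives.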

About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse.
Contact me at steve@thinkfuse.com.