Introduction

hOOt is an extremely small and fast embedded full-text search engine for .NET, built from scratch using an inverted WAH bitmap index. Most people are familiar with the Apache project Lucene.net, a port of the original Java version. Many people have complained in the past that the .NET version of Lucene is not maintained, and many unsupported ports of the original exist. To address this I have created this project, which does the same job but is smaller, simpler and faster. hOOt is part of my upcoming RaptorDB document store database, and was so successful that I decided to release it as a separate entity in the meantime.

Based on the response and reaction of users to this project, I
will upgrade and enhance hOOt to
full feature compatibility with lucene.net, so show your love.

Why the name 'hOOt'?

The name came from the famous Google search footer, spotted while I was looking for owl logos (gooooogle ~ hOOOOOOOt).

What is full text searching?

Full text searching is the process of searching for words in a block of text. There are 2 aspects to full text indexers / searchers :

Existence : finding the words that exist in the many blocks of text stored (e.g. 'bob' is in the PDF documents stored).

Relevance : the text blocks returned are delivered by a ranking system which sorts the most relevant first.

The first part is easy; the second is difficult, and there is much contention as to the ranking formula to use. hOOt only implements the first part in this version, as most of us use and need existence more than relevance in our applications, especially database applications.

Why Another Full Text Indexer?

I was always fascinated by how Google search works in general, and by Lucene's indexing technique and internal algorithms, but it was just too difficult to follow, and anyone who has worked with lucene.net will attest that it is a complicated and convoluted piece of code. While some people are trying to create a more .NET-optimized version, the fact of the matter is that it is not easy to do with that code base. What amazes me is that nobody has rewritten it from scratch. hOOt is much simpler, smaller and faster than lucene.net.

One of the reasons for creating hOOt was to implement full text search on string columns in RaptorDB - the document store version. Hopefully more people will be able to use and extend hOOt instead of lucene.net, as it is much easier to understand and change.

Features

hOOt has been built with the following features
in mind:

Blazing fast operating speed (see performance test
section)

Incredibly small code size.

Uses WAH compressed BitArrays to store information.

Multi-threaded implementation meaning you can query while
indexing.

Tiny size : only a 38KB DLL (lucene.net is ~300KB).

Highly optimized storage, typically ~60% smaller than lucene.net (the more there is in the index, the greater the difference).

Query strings are parsed on spaces with the AND operator (i.e. all words must exist).

Wildcard characters (*, ?) are supported in queries.

OR operations are done by default (like lucene).

AND operations require a (+) prefix (like lucene).

NOT operations require a (-) prefix (like lucene); see the query sketch after this list.
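To make the query syntax above concrete, here is a hedged sketch. The Hoot constructor arguments and the Query method name are assumptions for illustration (only FindDocumentNames and RemoveDocument are named elsewhere in this article), so check the source for the real signatures.

// Sketch only : the constructor and Query method shown here are
// assumed for illustration, not the verified hOOt API.
var hoot = new Hoot(@"c:\index", "myindex", true); // path, index name, doc mode (assumed)

var a = hoot.Query("alice wonderland", 100);   // OR by default
var b = hoot.Query("+alice +wonderland", 100); // (+) : both words must exist
var c = hoot.Query("+alice -rabbit", 100);     // (-) : exclude 'rabbit'
var d = hoot.Query("wonder*", 100);            // wildcard : words starting with 'wonder'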

Limitations

The following limitations are in this release:

File paths in document mode are limited to 255 bytes or the equivalent in UTF-8.

Exact strings are not currently supported (e.g. "alice in wonderland").

Wildcards (*,?) are not currently supported.

OR is not currently supported in queries (e.g. bob OR alice).

NOT is not currently supported in queries (e.g. bob NOT alice).

Parentheses are not currently supported in queries.

Ranking and relevance are not currently supported.

Searching user defined document fields is not currently supported.

Performance Tests

Tests were performed on my notebook computer with the following specifications: AMD K625 1.5GHz, 4GB DDR2 RAM, Windows 7 Home Premium 64-bit, Windows Experience Index 3.9. hOOt generates a lot of log information which you can use to see what is going on inside the engine. Using the sample application I performed a crawl over a collection of around 4,600 files, about 5.8GB of data in various formats. Here are the results, obtained by running grep on the output files (bear in mind that the application was compiled under .NET 4 and running 64-bit, and 64-bit IFilters were installed) :

Document count : 4,490

Total time : ~22 min

Index file size : 29 MB

Total text size : 632,827,458 characters

Total hOOt indexing time : 56,767.2471 ms ~ 56.7 secs

Total hOOt document info writing time (fastJSON serialize) : 110,632.3282 ms ~ 110 secs

Total words in hOOt : ~290,000

As you can see, the indexing engine is blazing fast, going through 632MB of text in 57 secs; the difference from the total of 22 minutes is down to the IFilter extraction time. On the query side, for the above document count all queries execute in about 1 ms on the engine side, and as you can see in the sample picture a search for "microsoft" gives back 208 items in 0.151 seconds; again, the difference is down to the document deserialization time.

By comparison lucene.net has the following :

Index file size : 70.5 MB

Total time : ~28 min

Performance Tests v1.2

Using a better word extractor, the index size is reduced by about 19%, to 24MB.

Using the Sample Application

The picture above is the obligatory desktop search application built on hOOt. To use the application do the following :

Set the index storage directory in the (1) group box; this will store all the index information.

Choose a folder where you want to crawl for content in the
(2) group box.

In the (3) group box you can do the following :

Load hOOt : This will load hOOt
so you can query an existing index.

Start Indexing : This will load hOOt
and start indexing the directory you
specified.

Stop Indexing : This will become active after you have started indexing so you can stop the process.

Count Words : This will show the number of words in hOOt's dictionary.

Save Index : This will save anything in memory to disk.

Free memory : This will call the internal free memory
method on hOOt (this
will only free the bitmap storage and not release the
cache).

You can search for content in the (4) group box, the label
will show the count and time taken.

To open the file just double click on the file path in the
list box.

This is just a demo that showcases hOOt; although it works as is, it needs some bells and whistles for a real application. To use this sample effectively please install the following IFilter handlers beforehand :

You can inherit from the Document object and store any extra information your application needs as properties (Lucene uses dictionary values, which isn't nice at compile time and requires debugging at run time if you misspell something, etc.).
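For illustration, here is a minimal sketch of inheriting from Document; the derived class and its properties are made up, and the exact shape of the Document base class is assumed from the article rather than verified against the source.

// Sketch only : MyDoc and its properties are hypothetical, and the
// Document base class shape is assumed, not verified.
public class MyDoc : Document
{
    // strongly typed extra fields instead of Lucene-style dictionary values
    public string Author { get; set; }
    public DateTime Published { get; set; }
}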

Because hOOt uses fastJSON to store the document you created, it gives you back whatever you saved, as a first class object.

New in v1.2

foreach (string filename in hoot.FindDocumentNames("microsoft"))
{
    // a faster way to get just the file name instead of a Document object
    Console.WriteLine(filename);
}

How It Works

hOOt operates
in the following 2 modes :

Database mode :
where you want to index columns in a database and you supply
the row number yourself from the database.

Document mode :
where you give hOOt a document file, which will be serialized and stored, with document numbers generated for it.
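The following hedged sketch contrasts the two modes. The Index overloads shown are assumptions based on the mode descriptions above (only FindDocumentNames and RemoveDocument are named in this article), so treat them as illustrative.

// Sketch only : these Index overloads and the Document constructor are
// assumed from the mode descriptions, not verified against the real API.

// Database mode : you supply the row number from your database
hoot.Index(42, "text from a string column of row 42");

// Document mode : hOOt serializes and stores the document and
// generates a document number for it
hoot.Index(new Document(new FileInfo(@"c:\docs\report.txt"), text), true);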

hOOt has 2 main
parts which are the following :

Indexer Engine : is
responsible for updating and handling the index generated.

Query Engine : is
responsible for parsing the query text and generating an
execution plan.

To optimize memory usage the indexer can free up memory in the
following 2 stages :

Compress bitmap indexes in memory and free the BitArray
storage.

Unload cached word bitmap indexes completely.

For the most part you would use stage one, but if you are going to index hundreds of millions of documents the second stage will be necessary; a short sketch follows.
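As a rough, hedged illustration of the two stages (the sample application only refers to an internal "free memory" method; both method names below are assumptions, not the verified API):

// Sketch only : both method names are assumed from the description
// above, not taken from the real hOOt API.
hoot.FreeMemory();        // stage 1 : compress in-memory bitmaps, free the BitArray storage
hoot.UnloadWordBitmaps(); // stage 2 : unload cached word bitmap indexes completely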

Some Statistics...

A quick search in the internet reveals the following statistics
about the English language:

There are about 250,000 English words of which around
175,000 are in use (Oxford English Dictionary).

The maximum length of a word in English is 28 characters.

After going through the words extracted from the IFilters it is easy to see that there are problems with some document formatting, as some sentences don't have spaces and the words are stuck together. Also, in programming documents CamelCase words are prevalent.

For this reason, as of version 1.2 the word extractor in hOOt does the following (a sketch of such an extractor follows the list):

CamelCase words are broken up.

Words longer than 60 characters are ignored.

Words shorter than 2 characters are ignored.
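Here is a minimal sketch of a word extractor implementing the three rules above (and the break-on-non-letter behaviour mentioned later in this article); it illustrates those stated rules only, and is not hOOt's actual extractor.

using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch only : illustrates the v1.2 extraction rules, not hOOt's code.
static IEnumerable<string> ExtractWords(string text)
{
    // break on non-letter characters (as described in this article)
    foreach (string token in Regex.Split(text, @"[^\p{L}]+"))
    {
        if (token.Length == 0) continue;
        // break up CamelCase words, e.g. "BitArray" -> "Bit", "Array"
        foreach (string word in Regex.Split(token, @"(?<=[a-z])(?=[A-Z])"))
        {
            // ignore words longer than 60 or shorter than 2 characters
            if (word.Length >= 2 && word.Length <= 60)
                yield return word.ToLowerInvariant();
        }
    }
}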

Index Engine

The index engine is responsible for the following :

Load words into memory.

Load bitmap index data on demand.

Extract word statistics from the input text.

Update the word bitmap index.

Free memory and bitmap cache.

Write the words to disk.

Write the bitmap index to disk.

Query Engine

The query engine will do the following (a sketch follows the list) :

Parse the input string of what to search for into words

Extract the bitmap index for all those words used in the
search string

Execute the bitmap arithmetic logic

AND the result with the NOT of the deleted documents bitmap,
to filter out deleted documents.

Enumerate the resulting bitmap

in the case of database mode, give you a list of record numbers

in the case of document mode, seek the document number and give you a list of documents from the document storage.
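A hedged sketch of the steps above, using plain BitArrays where hOOt really uses WAH-compressed bitmaps; the getBitmap and deletedDocs names are assumptions, not the real internals, and all bitmaps are assumed to have the same length.

using System;
using System.Collections;
using System.Collections.Generic;

// Sketch only : plain BitArrays stand in for hOOt's WAH bitmaps.
static IEnumerable<int> ExecuteQuery(string query,
                                     Func<string, BitArray> getBitmap,
                                     BitArray deletedDocs)
{
    BitArray result = null;
    // 1. parse the input string into words
    foreach (string word in query.ToLowerInvariant()
             .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    {
        // 2. extract the bitmap index for each word
        BitArray bits = getBitmap(word);
        // 3. execute the bitmap arithmetic logic (AND in this sketch)
        result = result == null ? new BitArray(bits) : result.And(bits);
    }
    if (result == null) yield break;
    // 4. AND with the NOT of the deleted-documents bitmap
    result = result.And(new BitArray(deletedDocs).Not());
    // 5. enumerate the resulting bitmap into record / document numbers
    for (int i = 0; i < result.Length; i++)
        if (result[i]) yield return i;
}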

What's an inverted index?

An inverted index is a special index which stores word-to-bitmap mapping data. In a normal index you would store which words are in a certain document; an inverted index stores the opposite: given a word 'x', which documents contain it.

For example if you have the following :

Document number 1 : "to be or not to be, that is the
question ..."

Document number 2 : "Romeo! Romeo! where art thou Romeo!
..."

These would generate the following when parsed by the GenerateWordFreq method (word : frequency):

Document 1 : to : 2, be : 2, or : 1, not : 1, that : 1, is : 1, the : 1, question : 1

Document 2 : romeo : 3, where : 1, art : 1, thou : 1

And the following word bitmaps would be updated (1 meaning the word exists at the bit position of that document number, 0 meaning not found; a sketch of updating such bitmaps follows the list):

to : 1000000...

be : 1000000...

or : 1000000...

romeo : 0100000...

where : 0100000...
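A hedged sketch of how such word bitmaps could be maintained, with a Dictionary of growable BitArrays standing in for hOOt's WAH-compressed index; the class and names are illustrative, not hOOt's code.

using System.Collections;
using System.Collections.Generic;

// Sketch only : a Dictionary of BitArrays stands in for hOOt's
// WAH-compressed bitmap index.
class InvertedIndex
{
    readonly Dictionary<string, BitArray> _index = new Dictionary<string, BitArray>();

    public void Add(string word, int docNumber)
    {
        if (!_index.TryGetValue(word, out BitArray bits))
            _index[word] = bits = new BitArray(docNumber + 1);
        if (bits.Length <= docNumber)
            bits.Length = docNumber + 1;   // grow to cover this document
        bits[docNumber] = true;            // mark : word exists in this document
    }
}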

The power of bitmap indexes

To see the power of bitmap indexes take a look at the
following diagram :

As you can see for the query string "alice in wonderland"
(without the quotes) hOOt will do the following
:

The filter parser extracts the words "alice", "in",
"wonderland".

Seek the associated bitmap indexes for the words.

Execute the query arithmetic logic, in this case a logical AND operation, to yield a resulting bitmap index.

The resulting bitmap gives the index-based record numbers of the documents; for example, if the result is 0001100... then documents 4 and 5 (starting from index number 1) contain all the words "alice", "in", "wonderland". This is in marked contrast to Lucene's indexing scheme, which saves the record numbers literally, albeit in an optimized notation. hOOt offers huge space savings in index size, especially with the WAH compression used, and as an added bonus performance is extremely fast. A sketch of the WAH idea follows.
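Since WAH compression is central to hOOt, here is a minimal sketch of the general word-aligned hybrid idea; it is an illustration only and not hOOt's actual on-disk encoding.

using System.Collections.Generic;

// Sketch of the word-aligned hybrid (WAH) idea, not hOOt's format.
// Each 32-bit output word is either a literal (top bit 0) carrying 31
// bitmap bits, or a fill (top bit 1) whose next bit is the fill value
// and whose low 30 bits count consecutive all-0 or all-1 31-bit groups.
// Run-length overflow is ignored for brevity.
static List<uint> WahCompress(bool[] bits)
{
    var output = new List<uint>();
    int groups = (bits.Length + 30) / 31;
    for (int g = 0; g < groups; g++)
    {
        uint literal = 0;
        for (int i = 0; i < 31; i++)   // pack the next 31 bits
        {
            int idx = g * 31 + i;
            if (idx < bits.Length && bits[idx])
                literal |= 1u << (30 - i);
        }
        if (literal == 0u || literal == 0x7FFFFFFFu)
        {
            // homogeneous group : extend the previous fill word if it
            // has the same fill value, otherwise start a new one
            uint fillBit = literal == 0u ? 0u : 1u;
            uint header = 0x80000000u | (fillBit << 30);
            if (output.Count > 0 && (output[output.Count - 1] & 0xC0000000u) == header)
                output[output.Count - 1]++;   // bump the run count
            else
                output.Add(header | 1u);
        }
        else
        {
            output.Add(literal);       // mixed group : store as a literal
        }
    }
    return output;
}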

File Formats : *.WORDS

Words are stored in the following format on disk. The bitmap index offset points to a record in the *.bitmap file.

File Formats : *.DOCS

Documents are stored in JSON format in the following structure on disk. This format came straight from RaptorDB and is being used there; the only modification in hOOt is that the MaxKeyLength is set to 1. Also, because the records are variable in length, the *.REC file maps the record number to an offset in this file for data retrieval.

File Formats : *.BITMAP

Bitmap indexes are stored in the following format on disk. The bitmap row is variable in length and will be reused if the new data fits in the record size on disk; if not, another record will be created. For this reason a periodic index compaction might be needed to remove unused records left over from previous updates.

File Formats : *.DELETED

The deleted index file is just a series of int values that represent deleted files, with no special formatting.

File Formats : *.REC

The REC file is a series of long values written to disk with no special formatting. These values map the record number to an offset in the DOCS storage file; a reading sketch follows.
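A hedged sketch of reading such a file, under the stated assumption that it contains nothing but consecutive long values where the i-th value is the byte offset of record i in the *.DOCS file.

using System;
using System.IO;

// Sketch only : assumes raw consecutive long values; endianness follows
// whatever BitConverter used on the writing machine.
static long[] ReadRecFile(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    var offsets = new long[bytes.Length / 8];
    for (int i = 0; i < offsets.Length; i++)
        offsets[i] = BitConverter.ToInt64(bytes, i * 8);
    return offsets;
}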

File Formats : *.IDX

The IDX file is a MurmurHash2-based dictionary that maps document file names to document numbers. It is used in document mode.

Appendix v2.0

Finally I got round to updating hOOt; most of the updates are back-ports from RaptorDB covering stability, bug fixes and performance improvements, things like locks and multi-threading issues.

The bitmap file format has changed to support index offsets, and the storage file has been updated to support deleted items, so files from previous versions will not work in this release.

hOOt now supports incremental indexing of documents and will check whether a file already exists in the index before indexing it. You can now call RemoveDocument(filename) and the file will be removed from the index results; a usage example follows.
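For example (hedged: only RemoveDocument itself is named in the text above; the path is made up for illustration):

// RemoveDocument(filename) is named above; the path is illustrative only.
hoot.RemoveDocument(@"c:\docs\old-report.pdf"); // dropped from index results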

File information such as dates and size is now stored in the storage file, so in the future it will be possible to check for changed files and re-index updated documents.

History

Initial release v1.0:
12th July 2011

Update v1.1 : 16th July 2011

Tweaked parameters and reduced the index size by 46% (now less than half of lucene.net's).


About the Author

Mehdi first started programming at the age of 8 on a BBC+ 128k machine in 6512 processor assembly; after various hardware and software changes he eventually came across .NET and C#, which he has been using since v1.0.
He is formally educated as a systems analyst / industrial engineer, but his programming passion continues.

* Mehdi is the 5th person to get 6 out of 7 Platinums on CodeProject (13th Jan'12)

With the test application, "Free Memory" causes an exception in hOOt. It's not clear why you use the "Load hOOt" button. Couldn't you load everything you need on demand, lazy-pattern style? Or am I missing something?

Also, the test application is not accurate enough: TAB navigation goes in an unexpected order, there are no keyboard shortcuts on controls, and sizing is not accurate (at least set a correct MinimumSize to avoid hiding controls). You see, this is important: users judge the quality of the library by your test applications.

Also, I would recommend including fastJSON in the solution as source code. Not everyone is ready to trust a pre-compiled DLL (I don't), and adding the source from the different article is a little bit of a hassle.

I will look into the free memory issue (I changed the underlying WAH handler a lot in RaptorDB and will put the newer, faster version in hOOt; I think the problem is fixed as I ran into it in RaptorDB), and fix the tabs and minimum size of the form.

The Load hOOt button uses the selected directories and loads the indexes (you can load different index files for testing).

I will include the fastJSON source code in the next release.

Thanks again, that's why you should never test your own code!


How hard would it be to add relevance to the search? I think that instead of storing the document, what can be stored is an array of "word positions" (along with the document filename/document ID, so you always have access to it), recording which word from the word index is at which position in the document, and then use that information to do a fast scan by relevance.

Not sure if that would be efficient, as you would have to scan every document returned by the search and then sort based on relevance. Could this be done directly in the query? Maybe by creating another type of index?

I developed such an engine in 1996-1998 in C++, and I understand every part of your work. But I changed my whole approach after reading this document: http://infolab.stanford.edu/~backrub/google.html
My engine was very fast with the bitmap approach, but it was suitable only for small scale search engines. Reasons for those limitations include:
- The headache of decompressing large bitmaps and doing Boolean operations while some of the bitmaps may include only one or two hits.
- The need to store other values alongside the bit of each document, such as the rank of the word's relevance to the document and the positions of the word in that document.
- You have to parse each document to get the relevancy of your query to each document, which is not feasible for large scale engines. So we kept the initial positions of each word in the document and did some fast intersections during the first document filtering. Anyway, you can check that in the previous link.
...

The magic solution is to use vectors instead of bitmaps. Bitmaps should be used only in cases where we need to keep a single piece of information about the key, and we expect a homogeneous distribution of bits between the bitmaps of the Boolean operation. A good example of that is the wildcard indexing of the words lexicon.

Anyway, I gave you my 5, as I understand from your introduction that it is small scale:
"hOOt is an extremely small size and fast embedded full text search engine"
I don't know if by "small size" you mean "small scale" or not.
If not, please let me know so I can change my 5 to a 4.

Like I said in the article, hOOt was primarily developed for my upcoming RaptorDB document store database. You could be right that decompression is a mitigating factor for really large bitmaps, although there are several companies with huge databases using bitmap indexes as column store engines (Oracle, Palo, ... to name a few).

I like the link to the ubergeeks' article you sent.

Cheers man.


Is it just me or does everyone else have a problem visiting the RaptorDB page? I've been trying for a couple of days now and CP says "Page not Found". I thought it was just a temporary problem, but this is the only one of Mehdi's articles that I can't see.

I could imagine this being very useful for a trace parser which creates an index for each line containing the process, thread id, method name and the actual payload.
What I am not sure about is whether it is possible to handle time with a BitArray. If I want to search for a time range I would need to create a filter representing all possible times, which quickly becomes a problem. Do you know of any way around this, or should I simply keep the 64 bit DateTime value for each line?

I think that binning is the answer. If I want to search for time ranges but cannot create a query with all possible time values (there might be billions) for the bitmap index, I can create a bitmap index with time RANGES, e.g. 5s. Then I can create a query which is not so large and filter out the rest for an exact match later.

Hi Mehdi,
the project is very impressive, congratulations. I am thinking about using it for the full text search feature of the project I am working on now. The problems that I face are as follows:
1. I need stemmers for European languages: German, Italian, Romanian. How can this be accomplished?
2. There are some special text constructs like code/name/year, let's say identifiable with a regex, which should not be separated but found as a single entity. Is this possible by using hOOt, or by extending it? This is more of a nice to have feature, but very useful.

Absolutely yes, I'm working on an interface so you can plug in your own word extractor logic.

For European languages it shouldn't be a problem, but someone wanted a far-eastern extractor; those languages apparently have no notion of a space separator, which is hard and would require language knowledge.

Currently hOOt breaks on non-letter characters; I will have to see how the second point can be done efficiently.

Cheers


Once the word-break interface is in, implementing stemmers should not be an issue. You can plug the stemming code right into the word-break interface (so you break the text into "words" and then, instead of providing the indexer with those words, you first stem them).

Then you will have to do the same when searching (so, break the search text into words, stem them and use that result as keywords for the search).

You can check the Lucene.net source code for some ready-to-use stemmers.

As for word breaking algorithms, you can find some in the ICU project (com.ibm.icu.text.BreakIterator), which is available in C and Java; it should not be too hard to convert the Java version to C#.

I am looking for a Lucene.net replacement and hOOt brings some hope for me.
However, my project involves dealing with CJK (Chinese, Japanese and Korean) documents. You may know that those far-east languages do not use spaces to separate words. Indexing those documents must surmount the word-separation issue, and noise words should be detected with special knowledge. Does hOOt support CJK-specific word-separation algorithms or allow plug-in interfaces?

I have studied a little about Chinese sentence segmentation, and it seems possible to find segmentation libraries on the web. Perhaps it is possible to provide an interface to plug in the algorithm and let the user separate and filter words.