Saturday, 21 April 2012

Following up on Mathias's great post on Full Text Search engines, I decided to take a look at the memory usage of some of the engines while performing queries. Mathias looked at Lucene++, SQLite, QtCLucene, Tracker and Xapian, I focused only on three of them - Lucene++, SQLite and Xapian (version numbers match those that Mathias used as I'm also testing on Ubuntu 12.04).

The procedure was simple - I grabbed the benchmark repo (https://gitorious.org/openismus-playground/fts-benchmark), used it to built two sets of the databases with 17251 and 121587 movies and then just ran valgrind's massif while only performing queries on the already built databases. Here are the values of peak memory usage:

Lucene++

SQLite

Xapian

17251

1.4 MiB

2.5 MiB

1.2 MiB

121587

3.1 MiB

2.6 MiB

5.2 MiB

Of course, the peak memory usage by itself isn't terribly interesting value, what matters is also how does the engine work with memory over time, so let's look at that as well (images are courtesy of Milian's fantastic massif-visualizer, note that their scale is not relative to each other):

Lucene++

SQLite

Xapian

17251

121587

We can see that both Lucene and SQLite seem to build a cache on the first query and then use it, Xapian on the other hand doesn't seem to be keeping a cache, as the rapid drops in mem usage suggest, but maybe there's different explanation to that.

So that's it for memory usage while performing standard queries, but what I was particularly interested in was memory usage when performing wildcard queries (as I saw some strange behaviour here and there with Xapian). Therefore I added one simple wildcard query "T*" to the list of executed queries and ran it on the largest DBs (ones with 121587 movies). As you can imagine the "T*" query is really generic and it matches around 110 thousand documents from the dataset, that's why I also added a limit of 10k results per query to each backend (although that shouldn't make much difference).

Let's look at the results:

Lucene++

SQLite

Xapian

Peak mem usage

4.6 MiB

7.3 MiB

442.2 MiB

Visualization

Now we can clearly see that Xapian uses huge amount of memory during expansion of the wildcard query (can this be considered a bug report? :)), SQLite has a couple of peaks there, but nothing to be worried too much about, and Lucene++ shines with its fairly constant (and really low) mem usage.

You saw the data, so I'll leave any conclusions up to you. ;)

The small number of changes I had to do to the original benchmark repository is available as a simple diff here.