Lucandra / Solandra: A Cassandra-based Lucene backend

In this guest post Jake Luciani (@tjake) introduces Lucandra (Update: now known as Solandra; see our State of Solandra post), a Cassandra-based backend for Lucene (Update: now integrated with Solr instead).

For most users, the trickiest part of deploying a Lucene-based solution is managing and scaling storage, reads, writes, and index optimization. Solr and Katta (among others) offer ways to address these, but still require quite a lot of administration and maintenance. This problem is not specific to Lucene. In fact, most data management applications require a significant amount of administration.

In response to this problem of managing and scaling large amounts of data, the “nosql” movement has become more popular. One of the most popular and widely used “nosql” systems is Cassandra, an Apache Software Foundation project originally developed at Facebook.

What is Cassandra?

Cassandra is a scalable and easy-to-administer column-oriented data store, modeled after Google’s BigTable, but built by designers of Amazon’s Dynamo. One of the big differentiators of Cassandra is that it does not rely on a global file system as HBase and BigTable do. Rather, Cassandra uses decentralized peer-to-peer “Gossip”, which means two things:

1. It has no single point of failure, and
2. Adding a node to the cluster is as simple as pointing it at any one live node.

Cassandra also has built-in multi-master writes, replication, rack awareness, and can handle downed nodes gracefully. Cassandra has a thriving community and is currently being used at companies like Facebook, Digg and Twitter to name a few.

Enter Lucandra

Lucandra is a Cassandra backend for Lucene. Since Cassandra’s original use within Facebook was for search, integrating Lucene with Cassandra seemed like a “no brainer”. Lucene’s core design makes it fairly simple to strip away and plug in custom Analyzer, Writer, Reader, etc. implementations. Rather than building a Lucene Directory implementation on top of the data store, as some backends do (DbDirectory, for example), our approach was to implement an IndexReader and IndexWriter directly on top of Cassandra.

Here’s how Terms and Documents are stored in Cassandra. A Term is a composite key made up of the index, field, and term, with the document id as the column name and the position vector as the column value.
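As a rough illustration of that layout (the “/” delimiter and plain-string encoding below are assumptions for clarity, not Lucandra’s actual key encoding):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the term layout described above: one Cassandra row per term,
// keyed by index/field/term, with one column per document that contains
// the term. The delimiter and encoding here are illustrative only.
public class TermLayout {

    // Composite row key built from the index name, field name, and term text
    static String termKey(String index, String field, String term) {
        return index + "/" + field + "/" + term;
    }

    public static void main(String[] args) {
        // Each column maps a document id (column name) to the term's
        // positions within that document (column value).
        Map<String, int[]> row = new LinkedHashMap<>();
        row.put("doc42", new int[]{3, 17}); // term occurs at positions 3 and 17

        System.out.println(termKey("bookmarks", "title", "cassandra"));
        // prints bookmarks/title/cassandra
    }
}
```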

Cassandra allows us to pull ranges of keys and groups of columns so we can really tune the performance of reads as well as minimize network IO for each query. Also, since writes are indexed and replicated by Cassandra we don’t need to worry about optimizing the indexes or reopening the index to see new writes. This means we get a soft real-time distributed search engine.

There is an impact on Lucandra searches when compared to native Lucene searches. In our testing, Lucandra’s IndexReader is ~10% slower than the default IndexReader. However, this is still quite acceptable to us given what you get in return.

For writes, Lucandra is comparatively slow next to regular Lucene, since every term is effectively written under its own key. Luckily, this will be fixed in the next version of Cassandra, which will allow batched writes across keys.

One other major caveat: there is no term scoring in the current code. We simply haven’t needed it yet, and adding it is relatively trivial via another column.

To see Lucandra in action, try the Twitter search app http://sparse.ly, which is built on Lucandra. This service uses the Lucandra store exclusively and does not use any relational or other type of database.

Lucandra in Action

Using Lucandra is extremely simple, and switching a regular Lucene search application to Lucandra takes just a few lines of code. Let’s have a look.
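The snippet below sketches the switch. The lucandra.* class names and constructor arguments are recalled from Lucandra’s example code and may not match the current API exactly; check the project source (e.g. the BookmarksDemo) for the real signatures.

```java
// Wiring Lucene to Lucandra instead of a filesystem Directory (sketch).
// CassandraUtils, and the IndexWriter/IndexReader constructors shown here,
// come from the lucandra package, not from org.apache.lucene.
Cassandra.Client client = CassandraUtils.createConnection();

IndexWriter indexWriter = new IndexWriter("bookmarks", client);  // lucandra.IndexWriter
IndexReader indexReader = new IndexReader("bookmarks", client);  // lucandra.IndexReader
IndexSearcher indexSearcher = new IndexSearcher(indexReader);    // vanilla Lucene
```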

From here on, you work with IndexWriter and IndexSearcher just like you would in vanilla Lucene. Look at the BookmarksDemo for the complete class.

What’s next? Solandra!

Now that we have Lucandra, we can use it with anything built on Lucene. For example, we can integrate Lucandra with Solr and simplify our Solr administration. In fact, this has already been attempted, and we plan to support this in our code soon.


29 Responses to Lucandra / Solandra: A Cassandra-based Lucene backend

“Cassandra allows us to pull ranges of keys and groups of columns so we can really tune the performance of reads as well as minimize network IO for each query.”

The idea here being similar to Lucene’s FieldSelector, right?

“Also, since writes are indexed and replicated by Cassandra we don’t need to worry about optimizing the indexes or reopening the index to see new writes. This means we get a soft real-time distributed search engine.”

Could you elaborate more on these 2 points?

Regarding no need for optimization – is that simply because the Directory is different, so with Cassandra as the Lucene Directory, there is no such thing as index files, segments, and such?
What about deletes? I skimmed a recent post about distributed deletes and saw a mention of tombstones, which I think is similar to what Lucene does with deleted docs: they are first just marked as deleted, then later truly removed during segment merging (“organic” or triggered by index optimization). Who controls this expunging in Lucandra?

And what do you mean by seeing writes immediately and not needing to reopen the searcher? What makes that work automatically? Again simply the fact that the storage is Cassandra? Lucandra’s IndexReader doesn’t do anything under the hood to re-read any data or some such? And what about something like FieldCache, which is used for sorting and which needs to be populated once, typically when IndexReader is first opened – how does that work in Lucandra?

Oh, and one more question: what about Cassandra’s eventual consistency? How does that work with Lucandra seeing writes and documents immediately and consistently?

Regarding key ranges: The performance gain is that we can pick and choose just the keys we want in the case of a simple search with a single term. Or we can fetch a range in one step with only the columns we need, for example (field1:(+book*) field2:-another). Once we add term scoring we can fetch term positions and/or scores depending on the call.

Regarding optimization/deletes: Yes, since there are no segments, index files, etc., there is no need to optimize. Cassandra takes care of this for us and has pre-indexed the data based on the Cassandra config included with Lucandra.
For deletes, Cassandra removes the tombstones every N seconds, based on a config setting.
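For reference, the tombstone grace period is a Cassandra-level setting; in the storage-conf.xml of Cassandra of that era it looked roughly like this (the value is in seconds; 864000, i.e. ten days, was a common default):

```xml
<!-- storage-conf.xml (Cassandra 0.6-era): how long tombstones are kept
     before being garbage-collected during compaction -->
<GCGraceSeconds>864000</GCGraceSeconds>
```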

Regarding seeing writes immediately: Yes, IndexReader does cache, but just for the length of any one query. If you call IndexReader.reopen() it will flush its internal query cache (for sparse.ly we do this on every call). This means the next query sees any new info in Cassandra for this index. If you use a Lucene FieldCache, I imagine you will see issues. I’ll need to test this case, but Lucene may drop any new items, since they won’t match any docs in the field cache.

Last point: We currently don’t worry about eventual consistency, so different readers may see different terms, though as long as each reader is consistently connected to the same Cassandra instance it will see a consistent view. There are ways to force consistency across the cluster, but that obviously affects performance.

I think where you keep your game/SN data can be independent of where your search indices live. You can store your game/SN data in Cassandra (or any other database), but it doesn’t mean you also have to use Cassandra to store your (Lucene) search indices.

Question: What do you think about having one column family for every field instead of putting the field in front of the term? For HBase this means, of course, that you can’t easily add additional fields.

You mean index data already in Cassandra? No. If you want data in Cassandra that you can access via the Cassandra APIs, you need to submit it to Cassandra and also to Solandra. The Solandra keyspace isn’t meant to be accessed directly.