Fulltext Search with MongoDB and Solr

Berlin, Germany

Tuesday, June 5th 2012, 18:03 CEST

In this article I explain how you can tie MongoDB and Solr together. With a little helper script in PHP that uses MongoDB's replication features, we automatically generate updates in Solr. We will first look at the MongoDB side and then at the Solr side.

Replication

MongoDB's replication works by recording all operations done on a database in a log, called the oplog. The local database contains a collection called oplog.rs that stores all those operations. The oplog is a capped collection, which means it has a fixed maximum size. The operations that are stored in the oplog can be read quite easily, as it is just a normal collection in a normal database.
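A minimal sketch of such a script, assuming the legacy mongo PECL extension (the Mongo class) and a server on localhost:27017. The host, the port and the helper name oplogNamespaceQuery() are assumptions made for illustration; the replSet option and the seta name are taken from the description in this article.

```php
<?php
// Sketch only: assumes the legacy "mongo" PECL extension.
// Host, port and the helper name are illustrative assumptions.

/**
 * Builds the query that limits oplog entries to one collection.
 * The oplog stores the namespace as "<database>.<collection>" in "ns".
 */
function oplogNamespaceQuery($database, $collection)
{
    return array('ns' => $database . '.' . $collection);
}

$query = oplogNamespaceQuery('demo', 'article');

// Only talk to a server when the legacy driver is actually loaded.
if (class_exists('Mongo')) {
    $m = new Mongo('mongodb://localhost:27017', array('replSet' => 'seta'));
    $c = $m->selectDB('local')->selectCollection('oplog.rs');

    // Fetch a single recorded operation for demo.article and show it.
    var_dump($c->findOne($query));
}
```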

This script connects to MongoDB and enables a replica set connection (array( 'replSet' => 'seta' )). It then selects the local database and the oplog.rs collection. We then find one document with findOne().

The ns field can be used to restrict the operations you see to specific databases and collections. In this case, the findOne() call limits the operations to the demo database and the article collection.

The op field tells you what sort of operation was recorded. In some cases, the records in the oplog are not exactly what you send through a driver, as they are broken up into parts. A remove() call, for example, will generate one delete operation in the oplog for each document that it removes. In any case, the op types are: i for inserts, u for updates and d for deletes. n is used for notices such as the reconfiguration of the replica set.
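As a small illustration, the op codes listed above can be turned into readable names with a lookup table. The function name describeOplogOp() is hypothetical, and any code that is not in the list above is simply reported as unknown.

```php
<?php
/**
 * Translates the one-letter "op" field of an oplog entry into a name.
 * Only the four codes mentioned in the text are mapped.
 */
function describeOplogOp($op)
{
    $names = array(
        'i' => 'insert',
        'u' => 'update',
        'd' => 'delete',
        'n' => 'notice',
    );

    return isset($names[$op]) ? $names[$op] : 'unknown';
}

// Hypothetical oplog entry, reduced to the two fields discussed here.
$entry = array('op' => 'd', 'ns' => 'demo.article');
echo $entry['ns'], ': ', describeOplogOp($entry['op']), "\n";
```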

One of the cool things about capped collections is that they support something called a tailable cursor. A tailable cursor does not use an index and can only return documents in natural order, that is, the order in which documents were inserted into the collection. Hence, we can write a PHP script that connects to the oplog and waits until new operations are made.
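A sketch of such a polling loop, again assuming the legacy mongo PECL extension and a server on localhost:27017. The helper followCursor() and its parameters are assumptions for illustration and not part of the driver; it only relies on the cursor having hasNext() and getNext().

```php
<?php
// Sketch only: assumes the legacy "mongo" PECL extension.
// followCursor() and its parameters are illustrative names.

/**
 * Polls a cursor and passes every new document to $handler.
 * Stops after $maxIdlePolls empty polls in a row; 0 means run forever.
 */
function followCursor($cursor, $handler, $maxIdlePolls = 0, $sleepSeconds = 1)
{
    $idle = 0;
    while ($maxIdlePolls === 0 || $idle < $maxIdlePolls) {
        if ($cursor->hasNext()) {
            $handler($cursor->getNext());
            $idle = 0;
        } else {
            // hasNext() returns at once when nothing is there, so pause
            // before polling again instead of spinning.
            $idle++;
            sleep($sleepSeconds);
        }
    }
}

// Only talk to a server when the legacy driver is actually loaded.
if (class_exists('Mongo')) {
    $m = new Mongo('mongodb://localhost:27017', array('replSet' => 'seta'));
    $c = $m->selectDB('local')->selectCollection('oplog.rs');
    $cursor = $c->find(array('ns' => 'demo.article'))->tailable(true);

    followCursor($cursor, function ($entry) {
        var_dump($entry);
    });
}
```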

In the script that we run on the command line, we use the tailable() method to configure the cursor as tailable. On its own, the script would just loop and use CPU time, because hasNext() doesn't actually wait for new data to come in; the sleep(1) prevents the script from using 100% CPU time. From the upcoming driver releases (1.2.11/1.3.0) there will be a new awaitData() method that makes hasNext() wait until there is actually new data available for the cursor, so the script no longer has to rely on sleep(1).
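A sketch of how the awaitData() variant could look, assuming the legacy mongo PECL extension and a server on localhost:27017. The helper makeTailable() is a hypothetical name; it checks whether the cursor has an awaitData() method, so the same script keeps working on driver versions that do not have it yet.

```php
<?php
// Sketch only: assumes the legacy "mongo" PECL extension.
// makeTailable() is an illustrative helper, not part of the driver.

/**
 * Marks $cursor as tailable and, when the driver offers awaitData(),
 * asks for hasNext() to wait for new data. Returns true when awaitData()
 * was available, false when the caller still has to sleep between polls.
 */
function makeTailable($cursor)
{
    $cursor->tailable(true);
    if (method_exists($cursor, 'awaitData')) {
        $cursor->awaitData(true);
        return true;
    }

    return false;
}

// Only talk to a server when the legacy driver is actually loaded.
if (class_exists('Mongo')) {
    $m = new Mongo('mongodb://localhost:27017', array('replSet' => 'seta'));
    $c = $m->selectDB('local')->selectCollection('oplog.rs');
    $cursor = $c->find(array('ns' => 'demo.article'));
    $waits = makeTailable($cursor);

    while (true) {
        if ($cursor->hasNext()) {
            var_dump($cursor->getNext());
        } elseif (!$waits) {
            // Older drivers: fall back to sleeping between polls.
            sleep(1);
        }
    }
}
```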

When we run this script, it just sits there and waits for new information to become available for the cursor. In our case, that happens when the oplog has a new item for our article collection in the demo database.

Solr

Now that we have a way to wait for updates on the MongoDB side, we need to tie this into Solr. To talk to Solr I use the Solr PECL extension from http://pecl.php.net/solr, which I've installed by running pecl install solr and then adding extension=solr.so to my php.ini file.

There is a configuration file in solr/conf/schema.xml. For now we will keep the defaults, but an interesting feature is that Solr supports dynamic fields. Basically, this allows you to store any field, with the type being selected by a suffix. For example, the field name name_t will automatically be treated as text, and addr_housenumber_l as an integer (long).
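To illustrate the suffix idea, here is a small sketch that picks a suffix from the PHP type of a value. The function name solrFieldName() is hypothetical. The _t and _l suffixes come from the examples above; _b (boolean) and _d (double) are assumed to be defined in the stock example schema, so check your own schema.xml before relying on them.

```php
<?php
/**
 * Appends a dynamic-field suffix to $name, based on the PHP type of $value.
 * Anything that is not a boolean, integer or float is stored as text.
 */
function solrFieldName($name, $value)
{
    if (is_bool($value)) {
        $suffix = '_b';   // assumption: *_b is a boolean dynamic field
    } elseif (is_int($value)) {
        $suffix = '_l';   // long, as in addr_housenumber_l
    } elseif (is_float($value)) {
        $suffix = '_d';   // assumption: *_d is a double dynamic field
    } else {
        $suffix = '_t';   // text, as in name_t
    }

    return $name . $suffix;
}

echo solrFieldName('name', 'some text'), "\n";
echo solrFieldName('addr_housenumber', 12), "\n";
```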

In order to store things in Solr for easy searching, we need to map the fields in our document to fields that Solr understands. Solr does not support nested arrays or embedded documents, so we need some translation. We will also need to add the correct type suffix. As an example we use the document from earlier.
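A minimal sketch of such a translation, using a small hypothetical document whose fields are assumptions, chosen to match the name_t and addr_housenumber_l examples. The function flattenForSolr() is an illustrative name: it joins nested keys with an underscore and adds _l for integers and _t for everything else. The last part only runs when the Solr PECL extension is loaded, and it builds a SolrInputDocument without sending it anywhere.

```php
<?php
/**
 * Flattens a nested document into "name_suffix => value" pairs.
 * Embedded documents become prefixes: addr.housenumber -> addr_housenumber_l.
 */
function flattenForSolr(array $document, $prefix = '')
{
    $fields = array();
    foreach ($document as $key => $value) {
        $name = $prefix === '' ? (string) $key : $prefix . '_' . $key;
        if (is_array($value)) {
            // Recurse into the embedded document, carrying the prefix along.
            $fields = array_merge($fields, flattenForSolr($value, $name));
        } else {
            $suffix = is_int($value) ? '_l' : '_t';
            $fields[$name . $suffix] = $value;
        }
    }

    return $fields;
}

// Hypothetical document; not the one stored in the demo database.
$document = array(
    'name' => 'Example',
    'addr' => array('housenumber' => 12, 'street' => 'Example Road'),
);
$fields = flattenForSolr($document);
print_r($fields);

// Build a Solr input document when the Solr PECL extension is available.
if (class_exists('SolrInputDocument')) {
    $solrDoc = new SolrInputDocument();
    foreach ($fields as $name => $value) {
        $solrDoc->addField($name, (string) $value);
    }
}
```

Lists of values would end up with their numeric index in the field name here, so a real script would probably want to treat those separately.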

Comments

Bret R. Zaun

Tuesday, June 5th 2012, 20:57 UTC

Hi Derick, thanks for sharing this useful information. I was not aware of the possibilities MongoDB offers to fill a Solr index in such a nice way. This is an efficient alternative to completely rebuilding the Solr index every time.

Best regards Bret

Sam Hennessy

Friday, June 8th 2012, 22:25 UTC

Thanks Derick,

Been meaning to look into how to do this. This will be very useful to adapt to my needs.

Sam @SamHennessy

gustavo galvis

Friday, August 24th 2012, 22:00 UTC

thank you for sharing this interesting post, really it has been very helpful.

Robert

Monday, March 25th 2013, 23:25 UTC

We also use both MongoDB and Elastic Search (which has Solr in its core). We found Elastic Search easy to set up but a nightmare to develop against with the C# NEST client.

Thankfully MongoDB 2.4 has been released which has a new Free Text Search feature.