Solr demystified

As I mentioned in this post, we’ve decided to set aside some of our weekly brown bags to spread around some knowledge on different technologies via a relatively informal presentation/discussion format. This past week we talked a bit about Solr.

This post covers much of what we discussed, ranging from the introductory to the somewhat arcane. If you’re a seasoned Solr user, this may not have much for you. But, you never know.

Solr: wtf?

For people who have never used Solr (me, for instance), I’ll start with the obvious question: what is it? At its most basic, Solr simply provides a web interface to the Lucene search engine. It’s written in Java and runs as a servlet inside a servlet container such as Tomcat or Jetty. The example application included in the distribution package includes Jetty, so you can get up and running relatively easily. You use Solr by sending your requests in the form of XML over HTTP; the responses also contain XML.

For those of you looking for sense in the world, I’m sorry: Solr isn’t an acronym, and to our knowledge doesn’t stand for anything in particular. It’s just a name with a vowel shortage.

You can find the home page for Solr here, a wiki for discussion of all things Solr here, and tutorial to get you started here. Finally, you can download the distribution (the current release version is 1.3.0) here.

But, why?

Let’s say you’re working on a site to help people find a physician. Users of this site might care about location, age, or gender of each physician. Your site might include how many pending malpractice suits each physician has, how patients have rated their bedside manner, or what magazines they stock in their waiting rooms. As a good citizen of the web community, you want to provide your users the ability to search for any combination of these criteria. You have all the information sitting in your database, so you should be able to search it, right?

Sure, no problem, but in order to ensure quick response times you’ll want to add indices on the columns in your physicians table. But, which indices to add? If your table has columns for age, gender, and rating, and you want to allow users to search on any combination of fields, then you need three indices to match all searches:

age, gender, rating

rating, age, gender

gender, rating, age

Keep in mind that indices match from left to right, and will only match on columns included in the query. Thus, if you allow searching on another column you’ll need to have eight indices:

age, gender, rating, mortality rate

age, gender, mortality rate, rating

age, rating, mortality rate, gender

age, mortality rate, gender, rating

gender, rating, mortality rate, age

gender, mortality rate, age, rating

rating, mortality rate, age, gender

mortality rate, age, gender rating

So, we quickly discover that we need n! / (n – 1) indices to search n columns, and this doesn’t take into account range queries. This could quickly get out of hand; Solr to the rescue.

Solr will build your indices for you based on the columns you tell it you want to search, it will keep these indices up to date as you add or change records, and it will do it fast.

More accurately, Lucene will do these things for you. However, Solr allows you to put Lucene on its own server that your application talks to via HTTP. This way all of your production servers can share the same Solr server, keeping searches consistent for all instances of your application.

6 Comments

Mark Wilden says:

You want to allow your users to search for any combination of columns; why would you need indexes for every permutation?

September 29, 2008 at 7:55 pm

Mark Wilden says:

To answer my own question, you could use all these indexes to perform prefix matches: e.g., on just gender and age. However, you would probably index each column individually and make sure the query optimizer performed a bitmapped intersection of matching keys. Solr may in fact do just that.

September 29, 2008 at 10:50 pm

Adam Milligan says:

Your searches will only match on indices that include all the columns you’re searching for (from left to right, remember). So, a search for age and rating will match any of these indices:

* age, rating
* rating, age
* rating, age, gender, foo, bar, wibble

But that same search will not match on any of these indices:

* age
* rating
* age, gender, rating

Thus, indexing each column individually will only speed up single column searches.

This is how MySql indices work; I imagine many other databases perform similarly, although I can’t say categorically.

September 30, 2008 at 3:40 pm

Mark Wilden says:

The way a DBMS could use single indexes is to construct a bitmap of matching rows for each search column, then AND them together to get the final bitmap of rows that match all criteria. This is common in data warehousing. I’m not surprised that MySQL doesn’t support it.

It should also not be necessary that all of the search columns be present in the index to use it. If you’re searching for age + rating, it might be better to use the age index to limit the scanned rows to ones that match the age, rather than scanning the entire table. It would depend on the selectivity of the index. Take the extreme case where there was only one row for the requested age. Looking up that one row from the age index, then checking its rating value, would be a lot better than reading a million rows.

I believe Oracle supports these kind of operations, but since I use PostgreSQL, I’m not positive.

Regarding my previous comment, I was incorrect that a search for age and rating won’t match a single column index. It won’t match *completely*, but MySql will perform a partial match. It can then do a table scan on the rows returned by that match.

David also tells me that for a search on age and rating, MySql can use single column indices for both age and rating, then merge the results of the two partial matches. This hurts my head slightly.

October 1, 2008 at 5:21 am

Mark Wilden says:

Cool – I didn’t know MySQL could do that. Thanks for digging into it.

I might just add that the methods described will not necessarily be adopted by the query analyzer. You can index your tables up the wing-wang, but if the DBMS doesn’t think they’re selective enough, it’ll just do a table scan anyway, rather than performing lookups from the index. Hence, it’s important to keep the database’s statistics up to date so it can choose the right query plan.

It makes my head hurt slightly to think that the database cares not only about the structure of our data, but also the contents.