Large-scale Search

To scale up from 500,000 volumes of full-text to 5 million, we decided to use Solr’s distributed search feature, which lets us split an index into a number of separate indexes (called “shards”). The shards are searched in parallel and the results then aggregated, so performance is better than with a single very large index.
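As a sketch of how the scatter-gather pattern fits together, the snippet below builds a distributed-search request using Solr’s `shards` parameter (one node receives the query and fans it out to the listed indexes). The host and core names are made up for illustration; they are not our actual deployment.

```python
from urllib.parse import urlencode

# Hypothetical shard locations; the actual hosts and core names
# in a real deployment would differ.
SHARDS = [
    "solr1.example.org:8983/solr/shard1",
    "solr2.example.org:8983/solr/shard2",
    "solr3.example.org:8983/solr/shard3",
]

def distributed_query_url(base: str, q: str) -> str:
    """Build a Solr distributed-search URL: the request goes to one
    node, and the shards parameter tells it which indexes to search
    in parallel before merging the results."""
    params = {"q": q, "shards": ",".join(SHARDS)}
    return f"{base}/select?{urlencode(params)}"

url = distributed_query_url("http://solr1.example.org:8983/solr/shard1",
                            "beat generation")
```

From the client’s point of view this looks like a single query; the coordinating node handles the fan-out and the merge.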

On November 19, 2009, we put new hardware into production to provide full-text searching against about 4.6 million volumes. Currently we have about 5.3 million volumes indexed. Below is a brief description of our current production hardware. Future posts will give details about performance and background on our experiments with different system architectures and configurations.

With the standard index, our slowest query was “the lives and literature of the beat generation,” which took about 2 minutes against the 500,000-volume index. After we implemented the CommonGrams index, the same query took only 3.6 seconds.
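The idea behind CommonGrams is to index a bigram whenever a word pair contains a common word, so a phrase query never has to walk the enormous positions list for a word like “the” on its own. Below is a minimal Python sketch of the index-time tokenization; the real Solr CommonGramsFilter also emits the original unigrams, and our common-word list was longer than this sample.

```python
# Sample common-word list for illustration; the production list was larger.
COMMON = {"the", "and", "of", "a", "in", "to"}

def common_grams(tokens):
    """Join a pair into a single '_'-separated token whenever either
    word is common, so phrase matching can look up the much rarer
    bigram instead of the very frequent single word."""
    grams = []
    for left, right in zip(tokens, tokens[1:]):
        if left in COMMON or right in COMMON:
            grams.append(f"{left}_{right}")
    return grams

grams = common_grams("lives and literature of the beat generation".split())
```

For the slow query above, pairs such as “of_the” and “the_beat” stand in for lookups on “the” itself, which is what turns a two-minute phrase query into a few seconds.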

In part 1 we talked about why some queries are slow and the effect of these slow queries on overall performance. The slowest queries are phrase queries containing common words. These queries are slow because the on-disk positions lists for common terms are very large, and disk seeks are slow. These long positions index entries cause three problems relating to overall response time:
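To make the size of the problem concrete, here is a back-of-envelope calculation. The counts and byte costs are assumptions chosen for illustration, not measured figures from our index.

```python
# Assumed numbers for illustration only: suppose "the" occurs
# 2 billion times across the corpus, and each recorded position
# costs about 2 bytes on average as a delta-encoded vInt.
occurrences = 2_000_000_000
bytes_per_position = 2

positions_bytes = occurrences * bytes_per_position
gib = positions_bytes / 1024**3  # several gibibytes for one term
```

Even at generous disk throughput, scanning a positions list of that size for every common word in a phrase dominates the query time, which is why a query like “the lives and literature of the beat generation” is pathological.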

Since we finished the work described in the Large Scale Search Report, we have made some changes to our test protocol and upgraded our Solr implementations to Solr 1.3. We have completed some testing with increased memory and some preliminary load testing.

A recent blog post pointed out that search is hard when there are many indexes to search, because results must be combined. Search is hard for us in DLPS for a different reason: our problem is the size of the data.