Advanced Filter Caching is a relatively new feature in Solr, available in version 3.4 and above. It allows precise control over how Solr handles filter queries in order to maximize performance, including the ability to specify whether a filter is cached, the order in which filters are evaluated, and whether a filter runs as a post filter.

Filter Queries in Solr

Adding a filter expressed as a query to a Solr request is a snap: simply add an fq parameter for each filter query.
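As a minimal sketch of what that looks like from client code, the snippet below builds a query string carrying two fq parameters. The field names and values (inStock, year) are illustrative assumptions, not taken from any particular schema.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical field names and values, chosen only for illustration.
params = [
    ("q", "solr"),
    ("fq", "inStock:true"),
    ("fq", "year:[2005 TO *]"),
]

# A list of pairs keeps the repeated fq key; a dict would collapse it to one.
query_string = urlencode(params)

# Decoding again shows that both filter queries survive as separate values.
decoded = parse_qs(query_string)
print(query_string)
print(decoded["fq"])
```

Each fq value travels as its own parameter, which is what lets Solr treat every filter independently.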

By default, Solr resolves all of the filters *before* the main query. Each filter query is looked up individually in Solr’s filterCache (which is pretty advanced itself, supporting concurrent lookups, different eviction policies such as LRU or LFU, and auto-warming). Caching each filter query separately accelerates Solr’s query throughput by greatly improving cache hit rates since many types of filters tend to be reused across different requests.
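To see why caching each filter separately helps, here is a toy model in Python. It is only a sketch: ToyFilterCache, INDEX, and apply_filters are hypothetical names, the "index" is a hard-coded dictionary, and nothing here reflects how Solr's filterCache is actually implemented.

```python
from collections import OrderedDict


class ToyFilterCache:
    """Tiny LRU cache: filter query string -> set of matching doc ids."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, fq, compute):
        if fq in self.entries:
            self.hits += 1
            self.entries.move_to_end(fq)  # mark as most recently used
            return self.entries[fq]
        self.misses += 1
        docs = compute(fq)
        self.entries[fq] = docs
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict least recently used
        return docs


# Hypothetical index: each filter query maps to the doc ids it matches.
INDEX = {
    "inStock:true": {1, 2, 3, 5, 8},
    "year:2010": {2, 3, 4},
    "year:2011": {5, 6, 8},
}


def apply_filters(cache, fqs):
    """Look up each filter on its own, then intersect the doc sets."""
    doc_sets = [cache.get(fq, INDEX.__getitem__) for fq in fqs]
    return set.intersection(*doc_sets)


cache = ToyFilterCache(max_size=10)
first = apply_filters(cache, ["inStock:true", "year:2010"])
second = apply_filters(cache, ["inStock:true", "year:2011"])
print(first, second, cache.hits, cache.misses)
```

The second request reuses the cached inStock:true entry even though its combination of filters has never been seen before. Caching the combination as a single key would have missed.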

To Cache or not to Cache

The new advanced filter control API adds the ability to *not* cache a filter. Some filters may see almost no reuse across different requests, and not attempting to cache them can lead to a smaller, more effective filterCache with a higher hit rate.

To tell Solr not to cache a filter, we use the same powerful local params DSL that adds metadata to query parameters and is used to specify different types of query syntaxes and query parsers. For a normal query that does not have any localParam metadata, simply prepend a local param of cache=false. For example:

&fq={!cache=false}year:[2005 TO *]

To add cache=false to a filter query that already has local params, simply add it right in with the rest of the params. For example, if we want to use Solr’s native spatial abilities to restrict our matches to locations within 50 km of Stanford, our filter query would look like:

&fq={!geofilt sfield=location pt=37.42,-122.17 d=50}

It’s easy to modify this filter to tell Solr not to cache it by adding cache=false in with the rest of the local parameters:

&fq={!geofilt sfield=location pt=37.42,-122.17 d=50 cache=false}
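If you build filter queries in client code, both cases can be handled by one small helper. The function below is a sketch under simple assumptions: without_caching is a hypothetical name, and it assumes the local params block contains no quoted "}" and no existing cache parameter.

```python
def without_caching(fq):
    """Return fq with cache=false added to its local params.

    Assumes the local params block (if any) has no quoted '}' inside it
    and does not already set cache.
    """
    if fq.startswith("{!"):
        end = fq.index("}")
        # Slip cache=false in just before the closing brace.
        return fq[:end] + " cache=false" + fq[end:]
    # No local params yet, so prepend a new block.
    return "{!cache=false}" + fq


print(without_caching("year:[2005 TO *]"))
print(without_caching("{!geofilt sfield=location pt=37.42,-122.17 d=50}"))
```

The two calls reproduce the year and geofilt examples above.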

Leapfrog anyone?

When a filter isn’t generated up front and cached, it’s executed in parallel with the main query. First, the filter is asked about the first document id that it matches. The query is then asked about the first document that is equal to or greater than that document. The filter is then asked about the first document that is equal to or greater than that. The filter and the query play this game of leapfrog until they land on the same document and it’s declared a match, after which the document is collected and scored.
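The game of leapfrog is easy to sketch over two sorted lists of document ids. This is an illustration of the idea only, not Solr's or Lucene's code: advance and leapfrog are hypothetical names, and doc ids are assumed to be non-negative integers in ascending order.

```python
from bisect import bisect_left


def advance(doc_ids, target):
    """Return the first doc id >= target in a sorted list, or None."""
    i = bisect_left(doc_ids, target)
    return doc_ids[i] if i < len(doc_ids) else None


def leapfrog(filter_docs, query_docs):
    """Intersect two sorted doc id lists by skipping past each other."""
    matches = []
    candidate = advance(filter_docs, 0)  # the filter goes first
    while candidate is not None:
        q = advance(query_docs, candidate)
        if q is None:
            break  # the query has no more documents
        if q == candidate:
            matches.append(candidate)  # both landed on the same doc
            candidate = advance(filter_docs, candidate + 1)
        else:
            # The query jumped ahead, so the filter leaps to catch up.
            candidate = advance(filter_docs, q)
    return matches


print(leapfrog([1, 4, 7, 9, 12], [2, 4, 5, 9, 10, 12]))
```

Neither side ever enumerates documents the other has already skipped past, which is why this works well when the filter is cheap to advance.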

How much is that filter?

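Advanced filtering adds even more fine-grained control by introducing the notion of cost. If there are multiple non-cached filters in a request, filters with a lower cost will be checked before those with a higher cost. For example:

&sfield=location&pt=37.42,-122.17&d=50
&fq={!cache=false cost=1}year:[2005 TO *]
&fq={!geofilt cache=false cost=50}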

In the example above, the filter based on year has a lower cost and will thus always be checked before the spatial filter.

As an aside, notice how spatial queries will use global spatial request parameters if they are not specified locally. This can make it even easier to construct requests containing spatial functions.

Expensive Filters

Some filters are slow enough that you don’t even want to run them in parallel with the query and other filters, even if they are consulted last, since asking them “what is the next doc you match on or after this given doc” is so expensive. For these types of filters, you really want to ask them “do you match this doc” only after the query and all other filters have been consulted. Solr has special support for this called “post filtering”.

Post filtering is triggered by filters that have a cost>=100 and have explicit support for it. If there are multiple post filters in a single request, they will be ordered by cost.
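Putting the rules so far together, the sketch below sorts a request's filters into three groups: cached filters, non-cached filters checked alongside the query in cost order, and post filters in cost order. FilterSpec, is_post_filter, and plan_filters are hypothetical names, the sample filters are made up, and the sketch assumes that only non-cached filters are candidates for post filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterSpec:
    fq: str
    cache: bool = True
    cost: int = 0
    supports_post_filter: bool = False


def is_post_filter(f):
    # Assumption: a post filter is non-cached, costs 100 or more,
    # and comes from a query type that explicitly supports it.
    return (not f.cache) and f.cost >= 100 and f.supports_post_filter


def plan_filters(filters):
    """Split filters into (cached, checked with the query, post filters)."""
    cached = [f for f in filters if f.cache]
    with_query = sorted(
        (f for f in filters if not f.cache and not is_post_filter(f)),
        key=lambda f: f.cost,
    )
    post = sorted((f for f in filters if is_post_filter(f)), key=lambda f: f.cost)
    return cached, with_query, post


# Made-up filters for illustration only.
filters = [
    FilterSpec("inStock:true"),
    FilterSpec("{!geofilt}", cache=False, cost=50, supports_post_filter=True),
    FilterSpec("year:[2005 TO *]", cache=False, cost=1),
    FilterSpec("{!expensive}", cache=False, cost=200, supports_post_filter=True),
    FilterSpec("{!acl}", cache=False, cost=150, supports_post_filter=True),
]

cached, with_query, post = plan_filters(filters)
print([f.fq for f in cached])
print([f.fq for f in with_query])
print([f.fq for f in post])
```

Note that the geofilt entry stays in the middle group: it supports post filtering, but its cost of 50 is below the threshold.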

For example, if we wanted to take the log of popularity, divide it by the square root of the distance, and filter out documents with a result less than 5, we could run this as a post filter using frange:

&fq={!frange l=5 cache=false cost=200}div(log(popularity),sqrt(geodist()))

Post filtering support for the spatial filter queries bbox and geofilt has also just been added to Solr 4.0. To execute our previous un-cached spatial filter as a post filter, simply set its cost to 100 or more:

&fq={!geofilt sfield=location pt=37.42,-122.17 d=50 cache=false cost=150}

Custom Filters

If you have expensive custom logic you’d like to add as a post filter (say per-document custom security ACLs), you can implement your own QParserPlugin that returns Query objects that implement Solr’s PostFilter interface. You can set the default cost, or hardcode a cost of 100 or higher if you want to support only post filtering. Then, you can use your custom parser as you would any other built-in query type via fq={!myqueryparser} and Solr will handle the rest!
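A real post filter is written in Java against Solr's PostFilter interface, but the underlying pattern is small enough to sketch in Python: the filter wraps the collector that gathers results and only passes along documents it approves. Everything below (ListCollector, AclPostFilterCollector, run_search, the doc ids and the ACL set) is a hypothetical stand-in for illustration, not Solr's API.

```python
class ListCollector:
    """Stand-in for the collector that gathers the final results."""

    def __init__(self):
        self.docs = []

    def collect(self, doc_id):
        self.docs.append(doc_id)


class AclPostFilterCollector:
    """Wraps another collector and forwards only docs the user may see."""

    def __init__(self, delegate, allowed_docs):
        self.delegate = delegate
        self.allowed_docs = allowed_docs
        self.checked = 0  # how many times the expensive check ran

    def collect(self, doc_id):
        self.checked += 1
        if doc_id in self.allowed_docs:  # imagine a costly ACL lookup here
            self.delegate.collect(doc_id)


def run_search(matching_docs, collector):
    """Feed docs that already passed the query and other filters."""
    for doc_id in matching_docs:
        collector.collect(doc_id)


# Docs that survived the main query and the cheaper filters (made up).
survivors = [3, 8, 15, 21]
results = ListCollector()
acl = AclPostFilterCollector(results, allowed_docs={8, 21, 99})
run_search(survivors, acl)
print(results.docs, acl.checked)
```

The ACL check runs four times, once per surviving document, no matter how many documents are in the index. That is the whole appeal of post filtering for expensive logic.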

Try it out!

In conclusion, hopefully this gives more insight into just one of many factors working under the hood to make Solr so fast. To try out the latest functionality, you can always get a nightly build of trunk. Feedback is always appreciated!