Monday, January 28, 2008

Incremental caching for web search

Puppin et al. have a 2007 paper, "Load-Balancing and Caching for Collection Selection Architectures" (PDF), with a curious idea for optimizing large-scale distributed search engines that they call incremental caching.

The basic concept is to do a bit of additional work each time the cache is accessed, adding the results from that additional work to the cache. In this way, cached search results improve each time they are accessed.

From the paper:

[We] discuss a novel class of caching strategies that we call incremental caching.

When a query is submitted to our system, its results are looked for in the cache. In the case of a hit, some results previously retrieved from a subset of servers will be available in the cache. The incremental cache however will try to poll more servers for each subsequent hit, and will update the top-scoring results stored in the cache.

Over time, the cached queries will get perfect coverage because results from all the servers will be available. This is true, in particular, for common or bursty queries: the system will show great performance in answering them.
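To make the idea concrete, here is a minimal sketch of how I read the passage above. It is not the authors' implementation. The class name `IncrementalCache`, the `servers_per_hit` and `top_k` parameters, and the convention that each server is a callable returning `(score, doc)` pairs are all my own assumptions. Servers are also polled in a fixed order here, whereas a real collection selection system would presumably rank servers per query.

```python
import heapq


class IncrementalCache:
    """Toy incremental cache: each access polls a few more servers.

    Hypothetical sketch; names and parameters are illustrative only.
    """

    def __init__(self, servers, servers_per_hit=1, top_k=10):
        # Each server is assumed to be a callable: query -> [(score, doc), ...]
        self.servers = servers
        self.servers_per_hit = servers_per_hit
        self.top_k = top_k
        # query -> {"polled": number of servers polled so far, "results": top results}
        self.entries = {}

    def search(self, query):
        # A miss starts an empty entry; a hit reuses what earlier accesses found.
        entry = self.entries.setdefault(query, {"polled": 0, "results": []})

        # Poll the next few servers that have not yet been asked about this query.
        start = entry["polled"]
        end = min(start + self.servers_per_hit, len(self.servers))
        fresh = []
        for server in self.servers[start:end]:
            fresh.extend(server(query))

        # Merge new results into the cached ones and keep only the top scorers.
        merged = entry["results"] + fresh
        entry["results"] = heapq.nlargest(self.top_k, merged, key=lambda r: r[0])
        entry["polled"] = end
        return entry["results"]

    def coverage(self, query):
        # Fraction of servers whose results are reflected in the cached entry.
        entry = self.entries.get(query)
        return 0.0 if entry is None else entry["polled"] / len(self.servers)
```

With three servers and `servers_per_hit=1`, the first request for a query sees results from one server, the second from two, and the third reaches full coverage, after which further hits poll nothing new. Note that every hit is also a write to the cache entry.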

This seems to make things a lot more complicated, however: users of the cache have to do more work to figure out what to do after a cache hit, the cache suffers more write contention, and debugging the system (determining why you got the results you did) becomes much more difficult.

Moreover, it is not clear to me that this offers a huge amount of value over simpler schemes, such as a cache that combines a static cache computed from access patterns with a smaller, normal run-time cache to catch new or bursty behavior.
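For comparison, the simpler alternative might look something like the sketch below. The details are my assumptions: the static part is filled from the most frequent queries in a historical log, the run-time part is a plain LRU, and `TwoLevelCache`, `query_log`, and `backend` are hypothetical names, with `backend` standing in for a full search across all servers.

```python
from collections import Counter, OrderedDict


class TwoLevelCache:
    """Static cache built from past access patterns plus a small LRU cache.

    Hypothetical sketch; names and parameters are illustrative only.
    """

    def __init__(self, query_log, static_size, dynamic_size, backend):
        # backend is assumed to be a callable: query -> full search results
        self.backend = backend
        # Precompute results for the most frequent queries in the log.
        top = Counter(query_log).most_common(static_size)
        self.static = {query: backend(query) for query, _count in top}
        # Small LRU cache for new or bursty queries seen at run time.
        self.dynamic = OrderedDict()
        self.dynamic_size = dynamic_size

    def search(self, query):
        if query in self.static:
            return self.static[query]
        if query in self.dynamic:
            # Mark as most recently used.
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        # Miss in both levels: do the full search and remember the answer.
        results = self.backend(query)
        self.dynamic[query] = results
        if len(self.dynamic) > self.dynamic_size:
            # Evict the least recently used entry.
            self.dynamic.popitem(last=False)
        return results
```

Here a hit is a pure read and always returns complete results, which is much of why this design is easier to reason about and debug.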

Even so, it seems valuable for things like federated search where the data sources are not under your control, each data source access is expensive, and there may be limits on the number of data sources you reasonably can query simultaneously. In that case, more traditional caching might cache the results from each of the data sources independently, but incremental caching probably would allow a much more compact and efficient cache.

Please see also my Jan 2007 post, "Yahoo Research on distributed web search", which discusses another 2007 paper, "Challenges in Distributed Information Retrieval" (PDF). That paper shares authors with this one and also discusses some ideas around distributed search, including caching.

Update: About a year later, Puppin et al. published another paper, "Tuning the Capacity of Search Engines: Load-driven Routing and Incremental Caching to Reduce and Balance the Load" (ACM), that proposes another, quite clever scheme. It learns from searcher behavior (the query-click graph) to cluster index chunks, maximize the likelihood of cache hits, and adapt to load. Nice work there with several good ideas.