Archive for May, 2012

We recently overhauled the search functionality for the UK government’s e-petitions site, run by the Government Digital Service, a new team within the Cabinet Office. Search has an important function on the site; users are forced to search for existing petitions which cover their area of concern before creating a new one. This cuts down on the number of near-duplicate petitions, and makes petitions more effective.

The website is implemented in Ruby on Rails, using the Sunspot Solr client library. There are currently only 22,000 petitions, of no more than a few kilobytes each – easily small enough to fit into the cache of a standard server. Despite this, the previous configuration was performing badly, maxing out 8 CPU cores on a virtual machine under a load of a few hundred queries per second. Retrieval was also poor, with no results at all found for queries like “EU”.

The first thing we did was to install Solr 3.6 (the previous version was the rather elderly 1.4) running in Jetty on Ubuntu. Then we looked at the schema and search implementation. The former was using the standard Sunspot field mappings, which are fine for many applications but in this case did not allow flexible weighting. Searches used the standard query parser to parse a hand-constructed query string with different field weightings and frequent use of the fuzzy match operator (e.g. “leasehold~0.8”). This seemed to be the most likely cause of poor performance under load.
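To make the shape of such a query concrete, here is a minimal sketch in Python (the site itself was Ruby, and this is not the original client code). The function name, the field names and the boosts are all hypothetical; only the `term~0.8` fuzzy syntax and the `^boost` weighting come from the standard query parser.

```python
def build_fuzzy_query(user_query, field_boosts, similarity=0.8):
    """Build a Lucene-syntax query string with a fuzzy operator on every term.

    Hypothetical helper for illustration only. It does not escape special
    query characters, which a real client would need to do.
    """
    terms = user_query.split()
    clauses = []
    for field, boost in field_boosts.items():
        # Every term gets the fuzzy operator, e.g. leasehold~0.8
        fuzzy_terms = " ".join(f"{term}~{similarity}" for term in terms)
        clauses.append(f"{field}:({fuzzy_terms})^{boost}")
    return " OR ".join(clauses)


# Field names and boosts below are assumptions, not the site's real schema.
print(build_fuzzy_query("leasehold reform", {"title": 2, "description": 1}))
# title:(leasehold~0.8 reform~0.8)^2 OR description:(leasehold~0.8 reform~0.8)^1
```

Every fuzzy term in a string like this has to be expanded against the index's term dictionary, once per field, which is consistent with the CPU load we were seeing.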

Fuzzy matching had been used because of the frequent misspellings in petition text entered by users (e.g. “marraige” instead of “marriage”). Solr spelling correction on the query is not appropriate here, as correctly-spelled queries may not find misspelled content. But since fuzzy matching was performing badly on a relatively small index, we needed a new approach.

What we came up with was two levels of fields: the first being normalised with lowercasing and KStem but otherwise matching exactly, the second using a PhoneticFilterFactory to perform a Double Metaphone encoding on terms. We hoped that the misspellings in the corpus would transform to the same terms under this filter (e.g. “marriage” and “marraige” both yielding “MRJ”). The exact fields should provide precision; the phonetic fields, recall. Fields were populated using the copyField directive, without changing the client indexing code. We configured an eDisMax query handler to provide a simple interface and removed the custom query string construction from the client code.
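The two-level idea can be illustrated outside Solr with a short Python sketch. Note the substitutions: it uses classic Soundex rather than Double Metaphone (which is too long to reproduce here), it skips stemming entirely, and the `analyse` and `score` helpers and their weights are hypothetical stand-ins for what the Solr analysis chain and eDisMax do.

```python
import re

# Standard American Soundex digit groups.
_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _letter in _letters:
        _SOUNDEX_CODES[_letter] = _digit


def soundex(word):
    """Return the 4-character Soundex key (a stand-in for Double Metaphone)."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return ""
    key = letters[0].upper()
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        code = _SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            key += code
        # 'h' and 'w' do not separate letters with the same code; vowels do.
        if letter not in "hw":
            previous = code
    return (key + "000")[:4]


def analyse(text):
    """Return the (exact, phonetic) term sets for one piece of text."""
    words = re.findall(r"[a-z]+", text.lower())
    return set(words), {soundex(word) for word in words}


def score(query, document, exact_weight=3.0, phonetic_weight=1.0):
    """Weight exact matches above phonetic ones (the weights are assumptions)."""
    q_exact, q_phonetic = analyse(query)
    d_exact, d_phonetic = analyse(document)
    return (exact_weight * len(q_exact & d_exact)
            + phonetic_weight * len(q_phonetic & d_phonetic))


print(soundex("marriage"), soundex("marraige"))      # M620 M620
print(score("marriage", "Equal marriage for all"))   # 4.0
print(score("marriage", "Equal marraige for all"))   # 1.0
print(score("marriage", "Leasehold reform"))         # 0.0
```

The misspelled petition still matches through the phonetic level, while the correctly spelled one ranks higher because it matches at both levels. In Solr the same weighting would be expressed through the eDisMax `qf` parameter, along the lines of `qf=title^3 title_phonetic` (field names again hypothetical).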

In practice, this worked very well – the new server can handle search loads at least five times greater than the previous one, and the CPUs are never maxed out (despite the server having only 4 cores compared with the previous 8). Ranking and retrieval are also greatly improved, and searches for “EU” return relevant petitions!

Phonetic algorithms are never going to catch all misspellings, and had Solr 4.0 been released at this time (with its very fast fuzzy engine) then it would have been the obvious approach to try. However, for now the search is much better, for less than two days of effort.

There’s been a recent flurry of activity from search vendors (and those larger companies that have been buying them) around the theme of Big Data, which has become the fashionable marketing term for a sheaf of technologies including search, machine learning, MapReduce and scalability in general. If anyone impertinently asks why company X bought company Y the answer seems to be ‘because they have capability in Big Data and our customers will need this’.

Search companies like ours have been working with large datasets since the beginning – back in 1999/2000 the founders of Flax led a team to build a half-billion-page Web search engine, which as I recall ran on a cluster of 30 or so servers. Since then we’ve worked with other collections of tens or hundreds of millions of items. Even a relatively small company can have a few million files on its intranet, if you count all those emails, customer records and PowerPoint presentations. So yes, you could say we can do Big Data – we certainly know how to design and build systems that scale.

However, it makes me nervous when a set of technologies that could (in theory) be used together are simply lumped together for marketing purposes as the Next Big Thing. The devil is, as always, in the detail (and the integration), and it’s important to remember that just because you can fit all your data into a system doesn’t mean that system will help you make any sense of it. A recent term for unstructured data (which of course we search developers have been working with for decades) is Dark Data, which implies that it is mysterious and hidden – but that doesn’t mean it has any actual value. Those considering a Big Data project should be aware that in any computer system GIGO (garbage in, garbage out) is still an issue.