Search/Old

This page is obsolete. It is kept for historical interest only. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.

This page describes the Wikimedia Foundation's activities surrounding our sites' search functionality. Our current project is to replace our legacy lsearchd system with a new system based on Elasticsearch (using a new extension called CirrusSearch). This project started in June 2013, with the migration slated to last until 2014.

The BetaFeature for this project is called New Search. Enabling it changes how your searches are executed but shouldn't substantially change the search experience. The ordering of results will change, hopefully always for the better. Please file bugs against the CirrusSearch project in Phabricator if the ordering seems worse or if any of your search-related tasks stop working or degrade.

The Wikimedia search infrastructure hasn't had significant development work for many years. The current system is based on a homegrown layer (named "lsearchd") on top of Lucene. The problem lsearchd solves has since been tackled by much larger projects such as Solr and Elasticsearch. The lsearchd system frequently breaks in ways that are difficult to diagnose, and generally makes our Operations staff sad.

Goals for our current effort:

Make our existing tools more robust

Improve logging in our existing tools to make problems easier to diagnose

Migrate away from lsearchd to Solr or something similar

Our current search infrastructure is highly outdated and difficult to manage due to tons of custom code. We are now replacing lsearchd with Elasticsearch (which is also a layer on Lucene), as it's very stable, contains many of the features we need, and doesn't require nearly as much custom code to support. What custom code we write will be incorporated in a MediaWiki extension called CirrusSearch.

This page is a timeline for future deployments of Cirrus as the primary search backend for wikis. Our general goal is to deploy CirrusSearch (backed by Elasticsearch) as the primary search backend for all wikis by September 2014. The original target was the end of 2013 (ha!).

This table shows the current plan, which we expect to change frequently. Historically we've been pretty bad at keeping it up to date when the plan slips, but we'll try to do better in the future.

We spent some time looking at search systems we could use, and it quickly became apparent that the leaders in open-source search are Solr and Elasticsearch. We spent a few weeks with each and decided to build on Elasticsearch because of its wonderful suggester, easily composable queries and good documentation. We are also happy with the process of submitting changes upstream to Elasticsearch.

We've just started looking at how to move GeoData to Elasticsearch. For now, it'll remain in Solr with plans to migrate it to Elasticsearch when time permits. Some considerations:

The index is relatively small (so there is no need to make it distributed), but it requires a lot of computational power to work with.

Full-text search is not currently used.

Currently, data from all the wikis is stored in the same core; in the future we will need to split the data across many cores (the Puppet changes for using multiple cores with a shared configuration/schema exist but need more work).

Load expectations: unclear, but load will be high if we start using it heavily, e.g. for map display.

Backups: not really needed. If the master is down, just switch to a slave. If all servers are down, reindexing from scratch is quick.

Note: because GeoData's schema is very stripped-down, /admin/ping doesn't work. Keep this in mind if anyone rewrites the current monitoring.
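Since /admin/ping is unavailable, a monitoring check could fall back on a cheap query instead. The following is a minimal sketch, assuming the Solr core still answers the standard /select handler with JSON output. The base URL and the core name `geodata` are placeholders, and the helper names `build_check_url` and `is_healthy` are hypothetical; they are not part of any existing monitoring script.

```python
import json
from urllib.parse import urlencode


def build_check_url(base_url, core):
    """Build a match-all query URL that asks for zero documents (cheap to serve)."""
    # Assumption: the core exposes the standard /select request handler.
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def is_healthy(response_body):
    """Return True when the JSON response header reports status 0 (success)."""
    try:
        payload = json.loads(response_body)
    except ValueError:
        # Not valid JSON, e.g. an HTML error page from a proxy.
        return False
    if not isinstance(payload, dict):
        return False
    header = payload.get("responseHeader")
    return isinstance(header, dict) and header.get("status") == 0
```

A check would fetch the URL (for example with `urllib.request.urlopen(url, timeout=5).read()`) and pass the body to `is_healthy`, treating any network error or timeout as a failure.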

The feature originally existed for all wikis around 2009 but was later disabled; as of now there isn't a timeline for re-enabling it by default. You can however request it for your wiki with the usual process.