There are really two big names in open source search that provide replication, sharding, and the degree of customization that we need: Solr and Elasticsearch. After a few weeks of evaluating each tool, we chose Elasticsearch because of its simpler replication, more easily composable queries, and very nice suggester.

While Elasticsearch is nominally schemaless, it doesn't make ideal analysis or scaling decisions without some hints. CirrusSearch includes a maintenance script that reads the $wg configuration globals and configures Elasticsearch with index-specific parameters.
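As a rough sketch of what that configuration step does, the script might translate globals into an index-creation body like the one below. The variable names and analyzer chain here are illustrative, not CirrusSearch's actual configuration:

```python
import json

# Stand-ins for values the maintenance script would read from $wg globals;
# these names are hypothetical, not real CirrusSearch settings.
wg_shard_count = 4
wg_replica_count = 1

def build_index_settings(shards, replicas):
    """Build the settings body sent to Elasticsearch when creating the index."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
            "analysis": {
                "analyzer": {
                    # The analysis hints Elasticsearch can't guess on its own.
                    "text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "asciifolding"],
                    }
                }
            },
        }
    }

body = build_index_settings(wg_shard_count, wg_replica_count)
print(json.dumps(body, indent=2))
```

The body would then be sent as the payload of the index-creation request.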

Yes, but sometimes reindexes are required. The maintenance script can (theoretically) perform these reindexes, and they should be quite quick because they can be done by streaming documents from Elasticsearch back into itself.
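The "stream Elasticsearch back into itself" idea is classically done by scrolling the old index and bulk-indexing each page of hits into the new one. A minimal sketch of the batch-building half, with illustrative index names rather than the script's real interface:

```python
import json

def hits_to_bulk(hits, dest_index):
    """Turn one page of scroll hits into newline-delimited bulk-index actions."""
    lines = []
    for hit in hits:
        # One action line plus one source line per document, per the bulk format.
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    return "\n".join(lines) + "\n"

# A full reindex loop would scroll the old index and POST each batch to _bulk:
#   while there are hits: send hits_to_bulk(hits, "wiki_content_v2"), fetch next page
hits = [{"_id": "1", "_source": {"title": "Main Page"}}]
print(hits_to_bulk(hits, "wiki_content_v2"))
```

Because documents never leave the cluster's neighborhood, the copy is bounded by bulk-indexing speed rather than by regenerating every document from MediaWiki.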

With the flick of a global, you can engage in-process updates to the search index that happen right after the user makes the edit. With Elasticsearch's near-real-time indexing and push updates, these should be replicated and searchable within two seconds. What that does to the cache hit rate remains to be seen, but it is certainly possible.

If you have to turn off in-process indexing for any reason, you'll have to rebuild the gap in time. The same maintenance script used for bootstrapping accepts a time window for document production, but the query it needs to identify the documents in that window is less efficient. It should still be a fair sight better than rebuilding the whole index, though.
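Conceptually, the time-window catch-up just selects the distinct pages edited inside the gap and feeds them back through indexing. The sketch below uses made-up revision data and hypothetical names to show the shape of that selection:

```python
from datetime import datetime

def pages_edited_between(revisions, start, end):
    """Return the distinct page IDs touched in [start, end), in edit order."""
    seen = []
    for rev in revisions:
        ts = datetime.strptime(rev["timestamp"], "%Y-%m-%dT%H:%M:%S")
        if start <= ts < end and rev["page_id"] not in seen:
            seen.append(rev["page_id"])
    return seen

# Made-up revisions spanning the indexing gap.
revs = [
    {"page_id": 10, "timestamp": "2013-07-01T12:00:00"},
    {"page_id": 11, "timestamp": "2013-07-01T12:05:00"},
    {"page_id": 10, "timestamp": "2013-07-02T09:00:00"},  # outside the window
]
gap = pages_edited_between(revs,
                           datetime(2013, 7, 1, 0, 0),
                           datetime(2013, 7, 2, 0, 0))
print(gap)  # the page IDs to re-send to the indexer
```

In MediaWiki terms this filter runs against revision timestamps, which is the less efficient query the paragraph above alludes to; it still touches far fewer pages than a full rebuild.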

We're entertaining replacing in-process updates and time-window rebuilds with the job queue

This has the advantage of offering a much simpler path for reindexing documents after a template update. It might also allow us to remove our custom reindex script and use the one built into MediaWiki for bootstrapping. We'd probably have to expand it a bit to get nice batching and such, but that shouldn't be too hard because we'd mostly be porting it from our custom script.
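The "nice batching" above is mostly about keeping each queued job small: after a template edit touches many pages, chunk the affected page IDs so no single job reindexes an unbounded set. A minimal sketch, with hypothetical names rather than MediaWiki's actual job queue API:

```python
def batch(page_ids, size):
    """Split the affected page IDs into fixed-size chunks, one per queued job."""
    return [page_ids[i:i + size] for i in range(0, len(page_ids), size)]

# Pretend a template edit touched pages 0..9; each sublist becomes one job.
jobs = batch(list(range(10)), 4)
print(jobs)
```

Small, uniform jobs keep individual queue workers fast and make retries after a failure cheap.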

We can play with full-text search, but we shouldn't expect the same results: we're not making an effort to match the current behavior exactly because, so far as we know, the current behavior isn't really what our users want.

I think what we do is get performance to an acceptable level in labs, going ahead and tuning things for performance now (while remembering that it is labs, so performance will never match production). The only way to get a real idea of load is a gradual rollout with proper logging and monitoring. I assume this will be iterative, and we'll learn lessons (and tweak things) as we move forward.

Do we actually want all of the wikis on the same cloud? Or would we split them into a couple of clouds, like the DB clusters (s1-s7)?

We probably want to split into multiple clouds because every member of a cloud could grab any shard at any time, even becoming the master. Being the master for one of enwiki's shards will be a lot more work than being the master for one of mediawiki.org's.

Counter argument: you can move nodes around if you need to.

Will transcluded pages be indexed in situ, especially where the pages are transcluded cross-namespace, or would this be part of a future build?

This is the plan for the first iteration, yes.

Will an oversighting action cause the indexing action to trip on the immediate?

What do you mean by "immediate"? As in faster than a few minutes (which seems to be the current target), with a special queue for revert+oversight?