A First Exploration Of SolrCloud

Update: this article was published in August 2012, before the very first release of SolrCloud. Meanwhile SolrCloud has evolved, please refer to the Solr website and community for up-to-date information.

SolrCloud has recently been in the news and was merged into Solr trunk, so it was high time to have a fresh look at it. The SolrCloud wiki page gives various examples, but it left a few things unclear for me: the examples only show Solr instances hosting one core/shard, and the page doesn't go deep into the relationship between cores, collections and shards, or into how to manage configurations.

In this blog post, we will look at an example where we host multiple shards per instance, and explain some details along the way.

The setup we are going to create is shown in this diagram.

SolrCloud terminology

In SolrCloud you can have multiple collections. Collections can be divided into partitions; these partitions are called slices. Each slice can exist in multiple copies; these copies of the same slice are called shards. So the word shard has a somewhat confusing meaning in SolrCloud: it is a replica rather than a partition. One of the shards within a slice is the leader, though this is not fixed: any shard can become the leader through a leader-election process.

Each shard is a physical index, so one shard corresponds to one Solr core.

If you look at the SolrCloud wiki page, you won’t find the word slice [anymore]. It seems like the idea is to hide the use of this word, though once you start looking a bit deeper you will encounter it anyway so it’s good to know about it. It’s also good to know that the words shard and slice are often used in ambiguous ways, switching one for the other (even in the sources). Once you know this, things become more comprehensible. An interesting quote in this regard: “removing that ambiguity by introducing another term seemed to add more perceived complexity”. In this article I’ll use the words slice and shard as defined above, so that we can distinguish the two concepts.

In SolrCloud, the Solr configuration files like schema.xml and solrconfig.xml are stored in ZooKeeper. You can upload multiple configurations to ZooKeeper, each collection can be associated with one configuration. The Solr instances hence don’t need the configuration files to be on the file system, they will read them from ZooKeeper.

Running ZooKeeper

Let's start by launching a ZooKeeper instance. While Solr allows running an embedded ZooKeeper instance, I find that this rather complicates things. ZooKeeper is responsible for storing coordination and configuration information for the cluster, and should be highly available. By running it separately, we can start and stop Solr instances without having to think about which one(s) embed ZooKeeper.
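A minimal way to run a standalone ZooKeeper, assuming a ZooKeeper distribution has been downloaded and unpacked locally (the data directory and port below are illustrative):

```shell
# Create a minimal zoo.cfg (data directory and client port are illustrative)
cat > conf/zoo.cfg <<'EOF'
tickTime=2000
dataDir=/tmp/zookeeper-data
clientPort=2181
EOF

# Start a standalone ZooKeeper server from the distribution directory
./bin/zkServer.sh start
```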

We need to specify the hostPort attribute since Solr can't detect the port itself; when it is not specified, Solr falls back to the default 8983.

This is all we need: the actual core configuration will be uploaded to ZooKeeper in the next section.

Creating a Solr configuration in ZooKeeper

As explained before, the Solr configuration needs to be available in ZooKeeper rather than on the file system.

Currently, you can upload a configuration directory from the file system to ZooKeeper as part of the Solr startup. It is also possible to run ZkController’s main method for this purpose (SOLR-2805), but as there’s no script to launch it, the easiest way right now to upload a configuration is by starting Solr:
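For example, a configuration directory can be uploaded by starting Solr once with the bootstrap properties (the directory and configuration name below are illustrative):

```shell
# Upload the ./config1 directory to ZooKeeper under the name 'config1'
# while starting Solr, by setting the bootstrap properties
java -Dbootstrap_confdir=./config1 \
     -Dcollection.configName=config1 \
     -DzkHost=localhost:2181 \
     -jar start.jar
```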

If you want to change a configuration later on, you essentially have to upload it again in the same way. Note, however, that the various Solr cores making use of that configuration won't be reloaded automatically (SOLR-3071).
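After re-uploading a configuration, each core that uses it can be reloaded by hand through the CoreAdmin API, along these lines (host, port and core name are illustrative):

```shell
# Reload a core so it picks up the re-uploaded configuration
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=core_slice1_shard1"
```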

Starting the Solr servers

All SolrCloud magic is activated by specifying the zkHost parameter. Without this parameter you run 'classic' Solr; with it, you run SolrCloud. If you look into the source code, you will see that this parameter causes the creation of a ZkController, and at various places checks of the kind 'zkController != null' are done to change behavior when in cloud mode.

Note that now we don't have to specify the bootstrap_confdir and collection.configName properties anymore (though the latter can still be useful as a default sometimes, but not with the way we will create collections & shards below).

Nor have we added the -DnumShards parameter, which you might have encountered elsewhere. When you manually assign cores to slices, as we will do below, I don't think it serves any purpose.
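Under these assumptions, the two Solr instances can be started as follows (the ports and the use of two separate Solr home directories are illustrative):

```shell
# First Solr instance on port 8983, connecting to the standalone ZooKeeper
java -Djetty.port=8983 -DhostPort=8983 -DzkHost=localhost:2181 -jar start.jar

# Second Solr instance on port 8984, started from a second Solr home directory
java -Djetty.port=8984 -DhostPort=8984 -DzkHost=localhost:2181 -jar start.jar
```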

So the situation now is that we have two Solr instances running, both with 0 cores.

Define the cores, collections, slices, shards

We are now going to create cores, and assign each core to a specific collection and slice. It is not necessary to define collections & shards anywhere; they are implicitly defined by the fact that there are cores that refer to them.

In our example, the collection is called 'collectionOne' and the slices are called 'slice1' and 'slice2'.
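A sketch of creating the four cores through the CoreAdmin API, assuming the two instances run on ports 8983 and 8984 and the configuration was uploaded under the name 'config1' (the core names are illustrative):

```shell
# Two cores on the first instance: one shard of each slice
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=core_slice1_shard1&collection=collectionOne&shard=slice1&collection.configName=config1"
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=core_slice2_shard1&collection=collectionOne&shard=slice2&collection.configName=config1"

# Two more cores on the second instance, replicating both slices
curl "http://localhost:8984/solr/admin/cores?action=CREATE&name=core_slice1_shard2&collection=collectionOne&shard=slice1&collection.configName=config1"
curl "http://localhost:8984/solr/admin/cores?action=CREATE&name=core_slice2_shard2&collection=collectionOne&shard=slice2&collection.configName=config1"
```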

As far as I understand, the information in ZooKeeper takes precedence, so the collection and shard attributes on the core above serve more as documentation. They are of course also relevant if you create cores by listing them in solr.xml rather than using the CoreAdmin API. Actually, listing them in solr.xml might be simpler than doing a bunch of API calls, but there is currently one limitation: you can't specify the configName this way.
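The solr.xml-based alternative would look something like the following sketch for the first instance (attribute names follow the pre-discovery solr.xml format; note there is no way to state the configName here, which is the limitation mentioned above):

```shell
# Hypothetical solr.xml for the first instance, listing its two cores;
# the hostPort attribute makes the published port explicit
cat > solr.xml <<'EOF'
<solr persistent="true">
  <cores adminPath="/admin/cores" hostPort="8983">
    <core name="core_slice1_shard1" instanceDir="core_slice1_shard1"
          collection="collectionOne" shard="slice1"/>
    <core name="core_slice2_shard1" instanceDir="core_slice2_shard1"
          collection="collectionOne" shard="slice2"/>
  </cores>
</solr>
EOF
```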

In ZooKeeper, you can verify this collection is associated with the config1 configuration:
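With ZooKeeper's command-line client, this can be checked along these lines (the node layout follows what early SolrCloud used; exact paths may differ per version):

```shell
# Inspect the collection node; its data should mention configName=config1
./bin/zkCli.sh -server localhost:2181 get /collections/collectionOne
```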

Adding some documents
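For example, a document can be posted to the update handler of any one of the cores (the core name and document fields below are illustrative):

```shell
# Add one document and commit; SolrCloud forwards the request
# to the leader shard of the appropriate slice
curl "http://localhost:8983/solr/core_slice1_shard1/update?commit=true" \
     -H "Content-Type: text/xml" \
     --data-binary '<add><doc><field name="id">doc1</field><field name="name">A test document</field></doc></add>'
```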

We sent the request to one specific core, but we could have picked any other core and the end result would be the same: the request is forwarded automatically to the leader shard of the appropriate slice. The slice is selected based on the hash of the id of the document.

What happens internally is that when you send a query to a core in SolrCloud mode, Solr looks up which collection the core is associated with and performs a distributed query across all slices of that collection (it picks one shard for each slice).
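So a query sent to any one core is turned into a distributed query over its whole collection, for example (core name illustrative):

```shell
# Query one core; Solr fans the request out to one shard per slice
# and merges the results
curl "http://localhost:8983/solr/core_slice1_shard1/select?q=*:*&wt=json"
```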

The SolrCloud wiki page suggests that you can use the collection name in the URL (like /solr/collection1/select); in our example, that would be /solr/collectionOne/select. This is not a general feature, however, but a particularity of that example, where the core was simply named after the collection. As long as you don't host more than one slice or shard of the same collection in one Solr server, such a core naming strategy can make sense.

Starting from scratch

When playing around with this stuff, you might want to start from scratch sometimes. In that case, don't forget that you have to remove data in three places: (1) the state stored in ZooKeeper, (2) the cores defined in solr.xml, and (3) the instance directories of those cores.
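A cleanup sketch under the assumptions of the example above (paths and node names are illustrative, and the recursive deletes should be used with care):

```shell
# 1. Clear the SolrCloud state stored in ZooKeeper
./bin/zkCli.sh -server localhost:2181 rmr /collections
./bin/zkCli.sh -server localhost:2181 rmr /clusterstate.json

# 2. Remove the core definitions from solr.xml
#    (edit the file by hand, or restore a pristine copy)

# 3. Remove the instance directories of those cores
rm -rf solr/core_slice1_shard1 solr/core_slice2_shard1
```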

When writing the first draft of this article, I was using just one Solr instance and tried to put all four cores (including replicas) in one Solr instance. It turns out there was a bug that prevented this from working correctly (SOLR-3108).

Managing slices & shards

Once you have defined a collection, you cannot (or rather, should not) add new slices to it, since documents won't automatically be moved to the new slice to fit the hash-based partitioning (SOLR-2595).

Adding more replica shards should be no problem though. While above we used a very explicit way of assigning each core to a particular slice, you can actually leave that parameter off and Solr will automatically assign the new core to some slice within the collection. (I guess this is where the -DnumShards parameter kicks in, to decide whether the new core should become a new slice or a replica shard of an existing one.)
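For instance, leaving off the shard parameter lets Solr pick a slice itself (core name illustrative):

```shell
# Create a core without a shard parameter;
# Solr assigns it to a slice of the collection automatically
curl "http://localhost:8984/solr/admin/cores?action=CREATE&name=extra_replica&collection=collectionOne&collection.configName=config1"
```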

How about removing replicas? It can be done, but only manually: you have to unload the core and remove the related state in ZooKeeper. This is an area that will be improved upon later (SOLR-3080).
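The manual removal would look roughly like this (the core name is illustrative, and the ZooKeeper node layout may differ per version):

```shell
# Unload the core on the Solr instance that hosts it
curl "http://localhost:8984/solr/admin/cores?action=UNLOAD&core=extra_replica"

# Then inspect the cluster state in ZooKeeper and remove
# any leftover references to the core by hand
./bin/zkCli.sh -server localhost:2181 get /clusterstate.json
```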

Another interesting thing to note is that when you run in SolrCloud mode, all cores automatically take part in the cloud. If you add a core without specifying a collection, a collection named after that core will be created. You can't mix 'classic' cores and 'cloud' cores in one Solr instance.

Conclusion

In this article we have barely scratched the surface of everything SolrCloud is: there's the update log for durability and recovery, the sync'ing between replicas, the details of distributed querying and updating, the special _version_ field that helps with some of these points, the coordination (election of the overseer & shard leaders), ... Much interesting stuff to explore!

As this article makes clear, SolrCloud isn't yet as easy to use as ElasticSearch. It still needs polishing, and there's more manual work involved in setting it up. To some extent this has its advantages, as long as it's clear what you can expect from the system and what you have to take care of yourself. Anyway, it's great to see that the Solr developers were able to catch up with the cloud world.

What would be the proper way to start this with the embedded ZooKeeper? I ask because I am working on Windows, and there is no ZooKeeper release for Windows (though I was able to build the .jar, I am not sure what to do with it, so I'd prefer to stick to the embedded one).

Great article! But I found the shard/slice distinction confusing. In the Solr terminology, typically, a “shard” and a “slice” are logically considered the same thing. See http://wiki.apache.org/solr/SolrCloud for details.

If I got it right, in this article, the author was using the term “shard” to designate a “physical shard” (leader or replica), as opposed to a “logical shard” (slice).

I have verified using zkCli.sh that 2 slices and 4 shards are up and running, and the inherited solrconfig.xml uses the default UpdateRequestHandler. I can’t tell from the Solr log why the cloud is not automatically distributing the inserted data into both slices.

Has anybody experienced a similar issue, or does anyone know the reason behind it? I'd appreciate it.

This article is rather old, it dates from the very first iteration of SolrCloud. I guess SolrCloud has evolved quite a bit in the meantime. Please use the Solr user mailing list to get help on the current version.

Very nice article! I see no date or versions you used. I found this page by googling for a problem I am having with SolrCloud. Maybe you can give me a hint.

I have a separate zookeeper installation (3 nodes) and two solr-4.6.0 as solrcloud.

I already have some cores as I am migrating and all I did was just migrate the configuration to suit the new discovery style.

So, with ZooKeeper running (fresh, empty), I started the first Solr with bootstrap_conf set to true. By observing the Solr log and the ZooKeeper registry I can confirm that the configuration was uploaded to ZooKeeper. The cores are visible in the Solr web console, and search and insert both work.

The other solr server was stopped up to this point.

It only had a solr.xml with no cores (directories). I started this Solr hoping the cores would be replicated, but the Solr web console says there are no cores. I see no errors in the logs.

Am I missing something? What should I look for to make the cores appear on the second Solr?

Hi.
This article was published in August 2012, before the very first release of SolrCloud. Meanwhile SolrCloud has evolved, please refer to the Solr website and community for up-to-date information. [http://lucene.apache.org/solr/]