Special Note

The term “replica” in SolrCloud has caused me confusion. This paper attempts to clarify the terminology.

Here is my attempt to clarify these terms:

Collection - a complete logical index.

Collections are made up of one or more shards and have a replication factor.

There is always at least one instance of a Shard (a replication factor of one). There can be more than one instance of a Shard for redundancy (a replication factor greater than one).

An instance of a shard is called a core.

Collection

A SolrCloud Collection is a complete logical index.

A Collection can be divided into shards. This allows the data to be distributed.

The above picture represents a Collection with one shard and a replication factor of one. This results in a collection with one shard, and that one shard IS the one replica. Herein lies the confusion when trying to describe SolrCloud: when I read the word replica I immediately imagine an original and a copy, an original and a replica.

Since there is always at least one replica, the terminology can be confusing. When I first started trying to build a mental picture of SolrCloud I erroneously started with the idea that there was a master with replicas, and therefore that a replication factor of one would mean a master and a replica. But that is not the case. What I thought of as a master is in reality “replica one”. Therefore, if you want your original index with one backup / failover copy, what you need to say is “I want a replication factor of two.”

Therefore I feel the best way to describe it is like this:

The above picture represents a Collection with one shard and a replication factor of one. This results in a collection with one shard, and that one shard is the only copy / instance of the data. Each instance of a shard is called a Core.
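As a concrete sketch, a collection's shard count and replication factor are both set when the collection is created through the Collections API. The collection name here is hypothetical, and a SolrCloud node is assumed to be running on localhost:8983:

```shell
# Create a hypothetical collection with one shard and two copies of that
# shard: "one original plus one failover copy" means replicationFactor=2.
SOLR=http://localhost:8983/solr
CREATE_URL="$SOLR/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=2"
curl "$CREATE_URL" || echo "(no Solr node reachable at $SOLR)"
```

Asking for replicationFactor=2 is how you get “an original plus one failover copy”; replicationFactor=1 gives exactly one core per shard.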

Shard

A Shard is a division of a Collection (the complete logical index). In other words, a Shard is a portion or slice of a Collection.

The above picture represents a Collection sliced, or divided, into eight Shards.

Why would you want more than one shard? One reason would be if the total size of the collection is too large to fit on one computer.

In a Collection with one Shard all of the data will be in that single shard. For example, if you are indexing a dictionary, then with one shard the words from A to Z all go into the single shard. If you have two shards then the data for shard one could be A to M and the data for shard two could be N to Z.
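The A-to-M / N-to-Z split above can be sketched as a toy shell loop. This is purely illustrative: SolrCloud's compositeId router actually assigns documents to shards by hashing their IDs, not by alphabetical ranges.

```shell
# Toy illustration of range-partitioning dictionary words across two shards.
for word in apple mango night zebra; do
  first=$(printf '%s' "$word" | cut -c1)
  if [[ "$first" < "n" ]]; then
    echo "$word -> shard1 (A to M)"
  else
    echo "$word -> shard2 (N to Z)"
  fi
done
```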

Replica

Shards can be duplicated by using a replication factor. The replication factor specifies how many instances of each shard should exist. Each instance of a shard is called a Core. The confusion lies in that a Core is also called a Replica.

From the Solr documentation:

“Collections can be divided into shards. Each shard can exist in multiple copies; these copies of the same shard are called replicas. One of the replicas within a shard is the leader, designated by a leader-election process. Each replica is a physical index, so one replica corresponds to one core.”

The number of shards multiplied by the replication factor gives the total number of shard instances, or, better said, the total number of cores.

Shard instances show up in the Solr dashboard as “Cores”. In SolrCloud a Replica and a Core are the same thing.

The picture above shows the “gettingstarted” collection with two shards and a replication factor of one, which results in two shards, each with one core / replica. Since there are two shards, each with one core / replica, there are a total of two cores / replicas. That is why you see two “cores” in the Solr Dashboard.

It is interesting to see the state.json for the “gettingstarted” collection.

"gettingstarted": {
  "maxShardsPerNode": "2",
  "router": {
    "name": "compositeId"
  },
  "replicationFactor": "1",
  "autoAddReplicas": "false",
  "shards": {
    "shard1": {
      "range": "80000000-ffffffff",
      "state": "active",
      "replicas": {
        "core_node2": {
          "state": "active",
          "core": "gettingstarted_shard1_replica1",
          "node_name": "10.211.1.126:8983_solr",
          "base_url": "http://10.211.1.126:8983/solr",
          "leader": "true"
        }
      }
    },
    "shard2": {
      "range": "0-7fffffff",
      "state": "active",
      "replicas": {
        "core_node1": {
          "state": "active",
          "core": "gettingstarted_shard2_replica1",
          "node_name": "10.211.1.126:8983_solr",
          "base_url": "http://10.211.1.126:8983/solr",
          "leader": "true"
        }
      }
    }
  }
}

Below is a collection that has eight shards with a replication factor of three. What is the total number of cores / replicas? There are 24 cores / replicas.
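The core count above is just multiplication; as a quick shell sketch (the numbers match the eight-shard, replication-factor-three example):

```shell
# Total cores (replicas) = number of shards x replication factor.
shards=8
replication_factor=3
total_cores=$((shards * replication_factor))
echo "$total_cores"   # prints 24
```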

Just remember, if you prefer the term Replica to the term Core, that “replica 1” is just the first instantiation of a shard and “replica 2” is the second instantiation of that shard.

ZooKeeper

If everything is running correctly, we are going to check what is in ZooKeeper. If it isn’t running, delete everything and start over. If you used the -noprompt option to start Solr, follow the steps on the webpage and include the -V option with the command.

The first way to examine part of what is in ZooKeeper is through the Solr Dashboard.

Click on the left panel as shown here:

In Solr’s install directory, go to:

$ cd server/scripts/cloud-scripts

Run:

$ ./zkcli.sh -zkhost localhost:9983 -cmd list | less

You will see how the Solr Dashboard is showing what is in ZooKeeper.

Now download ZooKeeper and install it.

Go to the ZooKeeper bin directory and run:

$ ./zkCli.sh -server localhost:9983

Even though Solr is running the embedded ZooKeeper, you can still connect to it.

Note that zkCli.sh (from ZooKeeper) is completely different from the script named zkcli.sh that ships with Solr.

In the Solr install directory go to example/films and read README.txt. You will see that you need to update the Solr schema, either by running the command given in the README or by adding the fields through the Solr Dashboard.
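From memory, the README's schema command looks roughly like the following. The field names and types are the ones the films example uses, but treat this as a sketch and verify the exact payload against README.txt:

```shell
# Add the two fields the films example needs via the Schema API (field names
# and types per the films README; verify against your Solr version).
payload='{
  "add-field": {"name": "name", "type": "text_general", "multiValued": false, "stored": true},
  "add-field": {"name": "initial_release_date", "type": "pdate", "stored": true}
}'
curl http://localhost:8983/solr/gettingstarted/schema -X POST \
  -H 'Content-type:application/json' --data-binary "$payload" \
  || echo "(no Solr node reachable on localhost:8983)"
```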

After updating the schema I reloaded the cores through the Solr Dashboard.

On the original (first) instance of Solr (in my case the instance running on CentOS) run the command:

$ bin/post -c gettingstarted example/films/films.json

Next go to the Solr Dashboard of the original instance, select the “gettingstarted” collection, and execute the default query.

It looks like the post put 1,100 records into the index.

NOTE:

If you don’t update the schema before running the post command you will see all kinds of exceptions and errors. I think this is because post tries to auto-detect field types and update the schema at runtime, and it sometimes picks the wrong field type.

Checking Replication

Now I am wondering whether the second instance of Solr (for me, the instance started on Windows) actually has the data. To check this I am going to remove the cores / replicas from my first instance of Solr (running on CentOS) through the Solr Dashboard.

Just click the red X next to each core / replica running on the first box.

Now go to the Cloud panel and see if the original box has been removed.

Everything looks as expected.

Now select the collection “gettingstarted” and execute the default query.
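The same check can be made from the command line (a sketch using the collection and port from this walkthrough):

```shell
# Ask for the hit count only; "numFound" in the response should still be 1100.
URL='http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=0'
curl -s "$URL" || echo "(no Solr node reachable on localhost:8983)"
```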

There are still 1,100 records. It looks like everything is working correctly.

Just to double-check, go into the index directory on the original instance and see if there are any files.