Wednesday, February 27, 2013

Intro

I recently returned from Solr training by LucidWorks, and the training was excellent. I had been using Solr for a couple of months, experimenting with various queries to improve recall on my particular data. The clarification of the various caches was especially valuable. But enough of that; this post is about setting up SolrCloud on Windows.

This tutorial uses:

Solr 4.1.0

Java 1.7.0_13

Java

Install Java.

After installing Java I added this to the system Path:
C:\Program Files\Java\jdk1.7.0_13;C:\Program Files\Java\jre7\bin

I also added this environment variable:
JAVA_HOME

with the value:
C:\Program Files\Java\jdk1.7.0_13

Of course those values need to reflect where you installed Java.

Check your install of Java by going to some directory other than where you installed Java and running the following:

Here I want to simulate setting up a custom collection. So, rename the directory "collection1" to "junk".

Still in the same directory, edit the "solr.xml" file. We are interested in the portion at the bottom that specifies the cores.

Here are the original contents of the solr.xml file core entry:

Change each instance of collection1 in the xml file to junk. The results are shown here:

Make sure you have the host port set to the jetty port in the solr.xml file.
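If you would rather script the rename of "collection1" to "junk" inside solr.xml than edit the file by hand, here is a minimal Python sketch. It assumes a plain text replacement is enough, which matches "change each instance" above. The function name rename_collection is my own and is not part of Solr.

```python
from pathlib import Path


def rename_collection(solr_xml, old="collection1", new="junk"):
    """Replace every occurrence of `old` with `new` in the given solr.xml.

    Returns the number of occurrences that were replaced. This is a plain
    text substitution, so it assumes `old` only appears where the core
    (collection) is named.
    """
    path = Path(solr_xml)
    text = path.read_text(encoding="utf-8")
    count = text.count(old)
    if count:
        # Only rewrite the file when there is something to change.
        path.write_text(text.replace(old, new), encoding="utf-8")
    return count
```

You would call it with the full path to your own solr.xml. Make a backup of the file first, since the sketch overwrites it in place.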

In the SOLRHOME directory I duplicated the "example" directory. I want to set up two shards, each with a replica, so I need four directories total. So I made three copies of the example directory and renamed the four directories example1, example2, example3, and example4. (By the way, I did this in Windows Explorer. You could do the same, or use a Command window; it doesn't matter.)

Notice that this command specifies the number of shards. Here is something to remember: re-sharding means re-indexing. If you set up two shards, load data into them, and then decide you want three shards, at the time of writing this blog you have to re-index (re-import) all of your data.

Also, the command specifies to launch an instance of ZooKeeper with the -DzkRun parameter. ZooKeeper, you say. What is ZooKeeper? It is a coordination service for distributed applications, and SolrCloud uses it to manage the cluster's configuration and state. ZooKeeper comes with the Solr install (as does Jetty) and is launched for you. This is a convenience. In a production system you would not want ZooKeeper running on the same box as Solr, because that creates a single point of failure. ZooKeeper should also be run as an ensemble of at least three instances. You can look up ZooKeeper if you want more details. The command runs ZooKeeper at the Solr port + 1000. The default Solr port is 8983, therefore ZooKeeper is at 9983.
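The port arithmetic is worth spelling out, because the other Solr instances need this value for -DzkHost. A tiny Python sketch, where the function names are mine and only the "Solr port + 1000" rule comes from the text above:

```python
def embedded_zk_port(solr_port=8983):
    """Port of the ZooKeeper started with -DzkRun: the Solr port + 1000."""
    return solr_port + 1000


def zk_host(solr_port=8983, host="localhost"):
    """Value to pass as -DzkHost when starting the other Solr instances."""
    return "{}:{}".format(host, embedded_zk_port(solr_port))
```

With the default Solr port, zk_host() gives localhost:9983.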

Open a browser (I use Firefox; I have experienced problems with IE) and go to this URL:

http://localhost:8983/solr/#/

You should see this:

Click on the "Cloud" item on the left.

The page will show you the graph of the cloud. Remember, our collection is named "junk" and we set up two shards and two replicas by making four "example" directories.

Starting the Second Shard

Starting the second shard is very simple.

Launch another command window and go to SOLRHOME\example2.

Run this command:

java -Djetty.port=8984 -DzkHost=localhost:9983 -jar start.jar

You need to run this next instance of Solr on a different port than the first. The first defaulted to port 8983, so run this new instance on port 8984 by telling Jetty which port to use. (Jetty, like Tomcat, is a Java web server and servlet container.)

The parameter -DzkHost specifies where ZooKeeper is running.

After running the command you should see on the Solr Cloud page that shard2 is now running.

Starting the Replicas

Starting the replicas is like starting the second shard above: go into each remaining directory (example3 and example4) and run the same command, specifying a different port for each instance.

Replica 1:

Launch another command window and go to SOLRHOME\example3.

Run this command:

java -Djetty.port=8985 -DzkHost=localhost:9983 -jar start.jar

Replica 2:

Launch another command window and go to SOLRHOME\example4.

Run this command:

java -Djetty.port=8986 -DzkHost=localhost:9983 -jar start.jar
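The three commands for the additional instances differ only in the port, so they are easy to generate. Here is a small Python sketch that builds them; the function names are my own, and the directory names and ports are the ones used in this post. It only builds the command strings and does not launch anything.

```python
def launch_command(jetty_port, zk_host="localhost:9983"):
    """Build the command line used to start one additional Solr instance."""
    return "java -Djetty.port={} -DzkHost={} -jar start.jar".format(
        jetty_port, zk_host
    )


def additional_nodes(first_port=8984, dirs=("example2", "example3", "example4")):
    """Pair each remaining directory with the next port and its command."""
    return [(d, launch_command(first_port + i)) for i, d in enumerate(dirs)]
```

Printing each pair from additional_nodes() gives you the directory to change into and the command to run there.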

Miscellaneous

In the Solr Dashboard, if you select "Tree" under "Cloud", it shows you the information that ZooKeeper uses to configure the shards and replicas.

Mistakes I Made Trying to Figure this Out

Before I tried to set up my first SolrCloud configuration I had been running Solr for about two months. During that time I was experimenting with various schemas, field types, and Lucene queries. My problem set is one of "recall" based on "scoring".

Sometime during this experimentation I had altered many of the files, and I must have messed up the solr.xml file. I would follow the same steps I have above and no other shards would ever appear. Finally I just reinstalled Solr and everything started working.

I suspect one culprit was that I had experimented with DistributedSearch, where you manually set up shards. In the solr.xml file you can specify the core information with shard details, and that may have been "floating" around somewhere.

Another mistake I made was forgetting to specify the ZooKeeper parameter when launching what I thought would be a new shard or replica. So, make sure you don't forget to tell Solr where ZooKeeper is with the -DzkHost parameter.

If things don't seem to be working you can go to "example1" and delete the zoo_data directory and try launching things again.
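That cleanup can be scripted too. The Python sketch below looks for any directory named zoo_data under the directory you give it and deletes it. The recursive search is an assumption on my part, since the exact location of zoo_data inside example1 may differ between installs, and the function name is mine. Stop all Solr instances before running something like this.

```python
import os
import shutil


def remove_zoo_data(node_dir):
    """Delete every directory named zoo_data under node_dir.

    Returns the list of paths that were removed. The search is recursive
    because the exact location of zoo_data is an assumption here.
    """
    removed = []
    for root, dirs, _files in os.walk(node_dir):
        if "zoo_data" in dirs:
            target = os.path.join(root, "zoo_data")
            shutil.rmtree(target)
            dirs.remove("zoo_data")  # do not descend into the deleted directory
            removed.append(target)
    return removed
```

An empty list back means nothing named zoo_data was found, so nothing was deleted.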