Saturday, 20 April 2013

I headed over to the Royal Ontario Museum this Friday to check out the NASA Space Apps Hackathon, with a clear goal in mind: to play around with the Leap Motion controller. This being my fourth hackathon, I noticed many familiar faces. Some folks recognized me from the LinkedIn space apps hackathon, where I had fared pretty well, and I rather enjoyed the attention.

The mandatory goody bag was replete with pens and mementos (a Google puzzle and, jarringly enough, a "Rogers"-mobile branded t-shirt). They had ethernet cables on every table, a vast improvement over the spotty wifi typically granted to attendees. The inexorable banalities soon followed - presentations, sponsors, etc.

A bit of background: I had applied for a developer version of the Leap Motion device many months ago. This little piece of hardware is essentially an "Xbox Kinect"-type infrared device that plugs into a USB port and enables gesture control on your machine. Unfortunately, I wasn't sent the device then. I was thus pleased to hear that these (not yet released) devices were being made available to hackathon attendees.

It was only around 8pm that I actually got hold of these babies. No SDK access though. I lingered around till about 11pm, when I received an e-mail granting me SDK access. By then, I was thoroughly exhausted. Ever the masochist, I fired up the SDK and, after a few hiccups, got the damned thing working. The sample apps accompanying the SDK were interesting, and I rather enjoyed playing around with my "miniaturized Kinect". The Java example was rather bare-bones; in contrast, the C++ examples were well integrated with OpenGL and looked pretty cool. It was at this crucial point that the organizers jettisoned everybody out of the venue. Unlike other hackathons, there was no overnight component.

I left the Leap Motion with the organizers (they had held my ID hostage) and decided not to return for the subsequent evenings. It was clear that two days would be insufficient to build an application on an unfamiliar SDK (Leap Motion had not yet made the SDK public). Moreover, integrating the Leap Motion with a NASA challenge would be difficult: barely any other developers were using it, which ruled out any opportunity for collaboration.

From my interaction with the Leap Motion, I felt that it would make a suitable gaming accompaniment. However, it is difficult for me to envision any other scenario where this device, at its current stage, would be a useful tool for activities such as web browsing, typing, etc. It's an interesting concept, and very portable, but I would rather rest my hands on my laptop than strain them atop the Leap Motion for most things.

Saturday, 13 April 2013

The Solr distribution comes with a couple of sample applications. I will be focusing on two of those - one is found in the example directory (solr-4.2.0/example) and the other in the example-DIH directory (solr-4.2.0/example/example-DIH). The Data Import Handler (DIH) is used to index database contents.

Next comes the Solr deployment. The /opt/solr-4.2.0/dist/solr-4.2.0.war file contains the files and libraries necessary to run Solr; it has to be copied to the Tomcat webapps directory and renamed to solr.war.
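Concretely, the deployment boils down to a single copy. The Tomcat webapps path below is an assumption - adjust it to wherever your distribution keeps it:

```shell
# Copy the Solr WAR into Tomcat's webapps directory, renaming it to solr.war.
# /var/lib/tomcat/webapps is an assumed path; on some systems it is
# /var/lib/tomcat6/webapps or /usr/share/tomcat/webapps instead.
sudo cp /opt/solr-4.2.0/dist/solr-4.2.0.war /var/lib/tomcat/webapps/solr.war
```

Tomcat also needs to know where the Solr home directory is, typically via the solr/home JNDI property or the -Dsolr.solr.home system property.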

Start the Solr example:

sudo service tomcat restart

If you want to run the Data Import Handler (DIH) Solr example instead:

The /opt/solr-4.2.0/example/solr/db/conf/schema.xml file doesn't need to be changed for my example, but you will likely have to change it if you're going to index other stuff in your DB.
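Alongside schema.xml, the DIH example reads its database settings from a data-config file in the same conf directory. A minimal sketch for PostgreSQL might look like this - the connection URL, credentials, table and column names below are all made-up placeholders:

```xml
<!-- conf/db-data-config.xml: all names below are illustrative placeholders -->
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/mydb"
              user="solr" password="secret"/>
  <document>
    <!-- each row returned by the query becomes one Solr document -->
    <entity name="item" query="SELECT id, name FROM item">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>
```

Each field column must map onto a field declared in schema.xml, which is why you'll need to edit the schema when indexing different tables.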

You will also need to include your JDBC driver:

cd /opt/solr-4.2.0/example/solr/db/lib

then copy your postgresql-9.1-902.jdbc4.jar into that folder.

Restart Tomcat:

sudo service tomcat restart
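Once Tomcat is back up, the import itself is triggered over HTTP through the DIH request handler. Assuming Solr ended up at localhost:8080 under the db core (adjust host, port, and core name to your setup):

```shell
# Kick off a full import through the DIH request handler
curl "http://localhost:8080/solr/db/dataimport?command=full-import"

# Poll the status to see how many documents have been fetched and indexed
curl "http://localhost:8080/solr/db/dataimport?command=status"
```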

When you begin indexing things, you might encounter permission problems creating the db/data/index folder. In that case, I just changed the permissions of the db folder and all of its files:

sudo chmod -R 777 /opt/solr-4.2.0/example/solr/db

This might not be a great idea security-wise.
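A less heavy-handed alternative is to hand the directory over to the user Tomcat runs as. The tomcat user and group below are assumptions - check what your Tomcat actually runs as first:

```shell
# Give ownership to the Tomcat user instead of opening the directory to everyone.
# "tomcat:tomcat" is an assumed user/group; verify with: ps -o user= -C java
sudo chown -R tomcat:tomcat /opt/solr-4.2.0/example/solr/db
```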

Zookeeper

In order to run SolrCloud (the distributed Solr installation) you need to have Apache ZooKeeper installed. ZooKeeper is a centralized service for maintaining configuration, naming, and distributed synchronization. SolrCloud uses ZooKeeper to synchronize configuration and cluster state (such as elected shard leaders), which is why it is crucial to have a highly available and fault-tolerant ZooKeeper installation: if you have a single ZooKeeper instance and it fails, your SolrCloud cluster will fail with it. ZooKeeper needs a majority of its nodes (a quorum) to keep operating, so an ensemble of 2F+1 instances can survive the failure of F of them.

Regarding the server.N line added to zoo.cfg (for example, server.1=192.168.1.1:2888:3888): the first part is the IP address of the ZooKeeper server, the second is the port followers use to connect to the leader, and the third is the port used for leader election.
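For reference, a minimal three-node zoo.cfg might look like the following; all addresses, ports, and paths here are illustrative:

```
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888
```

Each server additionally needs a myid file in its dataDir containing its own number (1, 2, or 3).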

Filter cache: This is used for storing the document sets matched by filter queries

Document cache: This is used for storing Lucene documents which hold stored fields

Query result cache: This is used for storing ordered sets of document IDs returned by queries

There is a fourth cache - Lucene's internal field cache - but you can't control its behaviour. It is managed by Lucene and created when it is first used by the Searcher object.

With the help of these caches we can tune the behaviour of the Solr searcher instance. Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.
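These caches are configured in solrconfig.xml. A sketch with illustrative sizes follows; the numbers are starting points to tune against your index and query patterns, not recommendations:

```xml
<!-- solrconfig.xml cache settings: sizes below are purely illustrative -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
<documentCache class="solr.LRUCache" size="1024" initialSize="1024" autowarmCount="0"/>
```

The autowarmCount controls how many entries are copied into a new searcher's cache after a commit, trading warm-up time for fewer cold queries.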

On Solr Directory Implementations:

One of the most crucial properties of Apache Lucene, and thus Solr, is the Lucene directory implementation.

The directory interface provides an abstraction layer for Lucene over all I/O operations, and the implementation you choose can affect the performance of your Solr setup drastically.

If you want Solr to make the decision for you, you should use solr.StandardDirectoryFactory. This is a filesystem-based directory factory that tries to choose the best implementation based on your current operating system and Java virtual machine used.
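The factory is selected in solrconfig.xml with a single element, for example:

```xml
<!-- Let Solr pick the best implementation for the current OS and JVM -->
<directoryFactory name="DirectoryFactory" class="solr.StandardDirectoryFactory"/>
```

Swapping in any of the factories discussed below is just a matter of changing the class attribute.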

If you are implementing a small application which won't use many threads, you can use solr.SimpleFSDirectoryFactory. It stores the index files on your local filesystem, but it doesn't scale well with a high number of threads.

solr.NIOFSDirectoryFactory scales well with many threads, but because of a long-standing JVM bug it performs much more slowly on Microsoft Windows platforms, so keep that in mind.

solr.MMapDirectoryFactory was the default directory factory for 64-bit Linux systems from Solr 3.1 till 4.0. This implementation uses virtual memory and a kernel feature called mmap to access index files stored on disk, which allows Lucene (and thus Solr) to read directly from the operating system's I/O cache. This is desirable, and you should stick with this directory if near real-time searching is not needed.

If you need near real-time indexing and searching, you should use solr.NRTCachingDirectoryFactory. It is designed to store some parts of the index in memory (small chunks) and thus speed up some near real-time operations greatly.

solr.RAMDirectoryFactory is the only one that is not persistent: the whole index is stored in RAM, so you'll lose it after a restart or server crash. Also remember that replication won't work when using solr.RAMDirectoryFactory. One might ask, why use that factory at all? Imagine a volatile index for autocomplete functionality, or for unit tests of your queries' relevancy - anything where you don't need persistent, replicated data. However, please remember that this directory is not designed to hold large amounts of data.