Thursday, July 19, 2012

Setting up Apache Solr on Mac OS X

Apache Solr is an incredibly flexible and capable text search engine that you can hook into other applications.

For a beginner it can appear daunting because of the number of configuration options, and reading the documentation tends not to improve that feeling. But if you are using a recent version of Mac OS X and have the homebrew package manager installed, then the process is not too bad, at least to get you up and running.

These steps will help you get Solr installed and then load in data extracted from HTML, PDF and Word files.

In my configuration I have Mac OS X Lion and current versions of Java and homebrew.

1: Install Solr using homebrew
$ brew install solr
This installs the software in /usr/local/Cellar/solr/3.6.0 (or whatever your version is).

2: Go to the example directory and start up Solr
$ cd /usr/local/Cellar/solr/3.6.0/libexec/example
$ java -jar start.jar
This will start up the Solr server using Jetty as the servlet container, which is just fine for our testing. Do not worry about using Tomcat etc. until you are comfortable with a working Solr set up.
It is usually not good form to work inside a distribution directory - but for initial testing you should do this.
You will see a load of verbose output from Java in your terminal window. Aside from any real errors, just ignore this.

3: Verify that the server is running
Browse to http://localhost:8983/solr/admin
You should see the Administration interface with a gray background. There is not a lot you can do as you have not yet indexed any data, but this shows that you are running.
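
If you would rather check from the terminal, something like the following should do. This is only a minimal sketch: the function name solr_is_up is made up for illustration, and it assumes that the example server is on the default port 8983 and that the admin page answers with HTTP 200 once redirects are followed.

```shell
# Hypothetical helper: succeed (exit status 0) only if the URL answers
# with HTTP 200. The default URL assumes the example server on port 8983.
solr_is_up() {
  local url="${1:-http://localhost:8983/solr/admin/}"
  local code
  # -s: silent, -L: follow redirects, -o /dev/null: discard the page body,
  # -w: print only the HTTP status code, --max-time: give up after 5 seconds
  code=$(curl -s -L -o /dev/null --max-time 5 -w '%{http_code}' "$url" 2>/dev/null) || true
  [ "$code" = "200" ]
}

# Usage:
#   if solr_is_up; then echo "Solr is running"; else echo "Solr is not responding"; fi
```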

4: Load some data into Solr
$ cd exampledocs
$ java -jar post.jar *.xml
Note that the Solr server MUST be running before you try and load the documents.

5: Run your first query from the admin page
Enter 'video' in the Query form and hit 'Search'. You should see the contents of an XML file returned to you with 3 documents.
This demonstrates that the server is working, that you can index XML documents and query them. Solr does not provide you with a nice search interface. The intent is that your application (Rails, etc.) sends queries and then parses the XML results for display back to the user.
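
To get a feel for what an application would send, you can run the same query from the terminal. The sketch below builds the query URL; the function name solr_select_url is hypothetical, and it assumes the example server on port 8983 and a query term that needs no URL-encoding (a single plain word like 'video').

```shell
# Hypothetical helper: build the URL for a simple Solr query.
# $1 is the query term, $2 is the number of rows to return (default 10).
solr_select_url() {
  local query="$1"
  local rows="${2:-10}"
  printf 'http://localhost:8983/solr/select?q=%s&rows=%s\n' "$query" "$rows"
}

# Usage (needs the running server from step 2):
#   curl "$(solr_select_url video)"
```

The XML that comes back is the same as what the admin page shows you.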

6: Extract text from other document types
What I want to use Solr for is to search text extracted from web pages, Word documents, etc. But this is where the Solr documentation started to really let me down.
This parsing is done by calling Apache Tika which is a complex software package with the sole aim of text extraction. The interface between Solr and Tika used to be called SolrCell and is now called the ExtractingRequestHandler. Don't worry about any of that for now! The Solr distribution has everything you need already in place - you just need to know how to use it...

Do not set up a custom Solr home directory for now. That is what the documentation suggests and it makes perfect sense, but nothing will work if you do.

Save a few HTML files that contain a good amount of text into a temporary directory somewhere.

To load the contents of a file into the server and have it parse the text you need to use 'curl' to POST the data to a specific URL on the server. In this example my file is called index.html and I have cd'ed to the directory containing the file.
$ curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&uprefix=attr_&fmap.content=attr_content" -F "myfile=@index.html"

There are several things to note with this URL...
The server URI is http://localhost:8983/solr/update/extract
The parameters are
literal.id=doc1
commit=true
uprefix=attr_
fmap.content=attr_content

literal.id provides a unique identifier for this document in the index (doc1 in this case)
uprefix=attr_ adds the prefix 'attr_' to the name of each tag extracted from the source document that is not already defined as a field in the Solr schema
fmap.content=attr_content specifies that the main text content of the page should be given the tag attr_content
commit=true actually commits the parsed data to the index

And then note how the file to be parsed is specified... as a quoted string passed with the -F flag to curl. The quoted string consists of a name for the uploaded file (myfile) followed by the path to the file preceded by the at sign (@).

You can ignore what these mean for now, except that each document you load must have a unique document ID.
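
If you have several files to load, a small loop saves typing and takes care of the unique ID. This is a rough sketch, not the only way to do it: the function names are hypothetical, the ID is simply derived from the file name, and the URL is the one used above with the ID swapped in.

```shell
# Hypothetical helper: turn a file path into a document ID by taking the
# file name and replacing every non-alphanumeric character with '_'.
doc_id_for() {
  printf '%s' "$(basename "$1")" | tr -c 'A-Za-z0-9' '_'
}

# Build the extract URL with the given document ID filled in.
extract_url() {
  printf 'http://localhost:8983/solr/update/extract?literal.id=%s&commit=true&uprefix=attr_&fmap.content=attr_content' "$1"
}

# POST one file to the extract handler (needs the running server).
index_file() {
  curl "$(extract_url "$(doc_id_for "$1")")" -F "myfile=@$1"
}

# Usage: index every HTML file in the current directory
#   for f in *.html; do index_file "$f"; done
```

Be aware that two files with the same name in different directories would end up with the same ID under this scheme, so the second would replace the first in the index.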

7: Verify that the text has been indexed
Go to the Admin interface and click 'Full Interface'. Set the number of rows returned to a large number (50) and then search with the query *:* - this will return all the records in the index.

You should get back an XML page with the example data plus the text derived from your HTML file at the bottom.
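
A quick way to confirm that the index grew is to look at the numFound attribute on the result element of the XML response. The sketch below pulls that number out with sed; the function name num_found is hypothetical, and treating XML with sed is only reasonable for a rough check like this.

```shell
# Hypothetical helper: read a Solr XML response on standard input and
# print the value of the first numFound="..." attribute it contains.
num_found() {
  sed -n 's/.*numFound="\([0-9]*\)".*/\1/p' | head -n 1
}

# Usage (needs the running server; rows=0 skips the documents themselves):
#   curl -s "http://localhost:8983/solr/select?q=*:*&rows=0" | num_found
```

Run it before and after loading a file and the count should go up by one.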

8: Try loading other data types
Tika knows how to parse Word, Excel, PDF and other files.
Be aware that there may not be much text in certain files. This is especially true of PDF files which, although they appear to have text when displayed, may have that text stored as an image.
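
To see how much text Tika can actually get out of a file before you index it, the extract handler accepts an extractOnly=true parameter, which returns the extracted content instead of adding it to the index. The sketch below assumes the example server on port 8983; the function names are hypothetical.

```shell
# Hypothetical helper: build a URL that asks the extract handler to return
# the extracted text rather than index it. $1 is an optional base URL.
extract_only_url() {
  local base="${1:-http://localhost:8983/solr}"
  printf '%s/update/extract?extractOnly=true\n' "$base"
}

# Show what Tika extracts from one file (needs the running server).
preview_text() {
  curl "$(extract_only_url)" -F "myfile=@$1"
}

# Usage:
#   preview_text report.pdf
```

If a PDF comes back with little or no content, its text is probably stored as an image.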

9: Explore the example directory

Everything I've shown here should 'just work' as long as you worked in the example directory. In there you will see a 'solr' subdirectory which contains conf and data directories. The data directory is where the index files reside. In the conf directory you will see a solrconfig.xml file. This is the main location for specifying the various components that your Solr installation uses. The default values happen to work fine until you create your own 'Solr Home' directory elsewhere.

The problem is that the solrconfig.xml file contains various relative paths (like ../../lib) which will not work if you create a solr directory in an arbitrary location. Knowing this you can update those paths pretty easily, but if you created a custom solr home then you are in for a lot of frustration trying to figure out why nothing works. I went through that process - you shouldn't have to...
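
When you do get around to creating your own Solr home, it helps to list the relative paths first so you know what needs updating. A minimal sketch, assuming the relative paths are written as dir="../..." attributes; the function name relative_dirs is hypothetical.

```shell
# Hypothetical helper: print every line of a solrconfig.xml that contains
# a dir="../..." attribute, with line numbers, so the paths can be reviewed.
relative_dirs() {
  grep -n 'dir="\.\./' "$1"
}

# Usage (from the example directory):
#   relative_dirs solr/conf/solrconfig.xml
```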

That should be enough to get you started with Solr. The next steps are to link it to your web application, to load in a lot more data and to move it to a production servlet container like Tomcat.