Sphinx: Getting Practical

Last week we looked at the Sphinx search engine from a high-level point of view. Now let’s look at getting and building the code, some basic setup, and how to build and query indexes.

Get, Build, Install

In order to build Sphinx, you will need the MySQL client libraries (libmysql) and header files (libmysql-dev) as well as the expat library (libexpat) and header files (libexpat-dev) installed in standard locations. Once you have those, you can grab the latest version of the Sphinx trunk from the Google Code Subversion repository. If you prefer, you can use a tarball from the Sphinx site, but new releases come out only every few months, while the code itself evolves more rapidly as bugs are fixed and features are added.

Building and installing Sphinx gives you several command-line tools:

indexer: reads data from MySQL or an XML import file and produces the full-text indexes. This can also be used to merge and rotate indexes.

searchd: the Sphinx search daemon, which listens on TCP port 3312 for connections and handles queries.

search: used for running ad-hoc queries from the command-line directly against the indexes (does not use searchd).

spelldump: tool for extracting ispell dictionary data

indextool: tool used for producing information about the indexes, such as the index header, list of document ids, and the “hit list” for a given keyword.

We’ll see how to use several of those tools shortly and next week.

General Sphinx Configuration

Installing Sphinx also deposits a few files in /usr/local/etc that serve as starting points for configuration. sphinx-min.conf.dist is an excellent minimal configuration file to start with. I won’t reproduce it entirely here (you can see the pre-build version in the Subversion repository), but it defines a few things to get you started.

There is a source called src1 and a corresponding index called test1. In Sphinx, you specify various information about an index separately from the data source definition. That helps to separate the index definition from the mechanics of how to get the data. Sphinx has back ends that know how to extract structured data from MySQL, Drizzle (coming soon), ODBC data sources, and arbitrary XML files or command pipelines (known as xmlpipe or xmlpipe2 in recent versions).

There’s an indexer section that specifies how much memory the indexer should limit itself to. The default is 32MB but you can go as high as 2GB, which is quite useful if you have very large document sets to index.

Next we need to tell Sphinx what our data looks like and how it should be indexed. The SQL Data Sources section of the Sphinx manual explains this fairly well if your data lives in MySQL and can easily be queried to get at the documents you’d like to index. However, things are often more complicated. The documents may need some sort of cleanup or pre-index processing before Sphinx should read them. That means telling the Sphinx indexer to use the xmlpipe2 “driver”, which reads full XML documents from a pipe. That XML input stream also contains a short header that specifies the “schema” for the index; it does not live in the Sphinx configuration file as it does for MySQL-based sources.

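To use it, you would define a second source, say src1_pipe, of type xmlpipe2 whose xmlpipe_command setting points at a script such as xmlbuilder.pl, and then change the index’s source from src1 to src1_pipe. Then, when you run the indexer, it executes the xmlbuilder.pl Perl script and reads its stdout as XML.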

While there’s a bit more overhead involved in the initial setup of an index using xmlpipe2, it’s a lot more powerful. It allows you to perform arbitrary conversions and cleanup on the data before Sphinx sees it, does not tie you to a particular database table or schema, and so on.
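As a rough illustration of what a script like xmlbuilder.pl has to print, here is a minimal sketch in Python rather than Perl. It assumes an index with two full-text fields, title and body, and uses made-up sample documents. The function name and the data are hypothetical, and a real script would pull documents from wherever they actually live.

```python
#!/usr/bin/env python3
"""Sketch of an xmlpipe2 producer: prints a schema header, then documents."""
import sys
from xml.sax.saxutils import escape, quoteattr

# Hypothetical sample data; a real script would read files, a database, etc.
DOCUMENTS = [
    (1, "First post", "Hello <world> & welcome"),
    (2, "Second post", "More text to index"),
]

# Assumed full-text fields for this sketch.
FIELDS = ("title", "body")


def build_xmlpipe2(documents):
    """Return an xmlpipe2 stream (as a string) for (id, title, body) tuples."""
    lines = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>"]
    # The schema header declares the full-text fields for the index.
    lines.append("<sphinx:schema>")
    for name in FIELDS:
        lines.append("<sphinx:field name=%s/>" % quoteattr(name))
    lines.append("</sphinx:schema>")
    # One sphinx:document element per document, with escaped field text.
    for doc_id, title, body in documents:
        lines.append('<sphinx:document id="%d">' % doc_id)
        lines.append("<title>%s</title>" % escape(title))
        lines.append("<body>%s</body>" % escape(body))
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sys.stdout.write(build_xmlpipe2(DOCUMENTS))
```

The point of doing the escaping and cleanup in the script is that the indexer only ever sees well-formed, already-normalized text on the pipe.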

Building an Index

Assuming you’ve created a working xmlbuilder.pl, indexing your documents is very straightforward. You simply run indexer and tell it which index to build (or all indexes).

/usr/local/bin/indexer test1

Or:

/usr/local/bin/indexer --all

The amount of time required to build an index depends on the size of the data, complexity of the index, CPU speed, and several other variables. But indexer will produce some status output while running by default and it will provide some summary data when it finishes. It’s not unusual to see indexing rates as high as 10,000 documents per second on modern hardware.

Query the Index

Once the index is built, you should run a few queries against it to make sure that it is finding documents and nothing unusual is happening. The search command-line utility is helpful for this. You can search for all documents matching one or more keywords:

/usr/local/bin/search keyword1 keyword2

If you have multiple indexes configured, you can restrict the search to a single index using the -i command-line argument:

/usr/local/bin/search -i test1 keyword1 keyword2

In either case, search will produce a list of the documents that contain the search term(s). By default it performs an “and” query, meaning that it will only match documents that contain all the terms (if you specify more than one). However, the -a or --any option will tell search to match documents that contain any of the keywords.
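The difference between the default match-all behavior and -a/--any can be sketched in a few lines of Python. This only illustrates the semantics on a toy in-memory collection; it is not how Sphinx evaluates queries internally, and the sample documents and function name are made up.

```python
def matching_ids(documents, keywords, any_mode=False):
    """Return sorted ids of documents matching all keywords (or any, if any_mode)."""
    wanted = {k.lower() for k in keywords}
    # all() models the default behavior, any() models -a / --any.
    combine = any if any_mode else all
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if combine(k in words for k in wanted):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical toy collection: id -> text.
docs = {
    1: "sphinx builds full-text indexes",
    2: "mysql stores the documents",
    3: "sphinx reads documents from mysql",
}

print(matching_ids(docs, ["sphinx", "mysql"]))                 # all keywords
print(matching_ids(docs, ["sphinx", "mysql"], any_mode=True))  # any keyword
```

With this toy data, only document 3 contains both keywords, while all three documents contain at least one of them.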

In all of those cases, we’re searching all fields in the index: title and body. To specify a single field, you can use some fancier syntax as part of an extended mode match. You simply prefix the search term(s) with @fieldname like this:

/usr/local/bin/search -i test1 -e @title keyword1

That would find all documents which contain keyword1 in the title.
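If you want to script these sanity checks, the command lines above are easy to assemble programmatically. The sketch below builds the argument list using only the options shown in this article (-i, -a, -e and the @fieldname prefix). The helper names are hypothetical, refusing to combine -a with a field is a choice made for this sketch, and actually running a query requires an installed search binary and a built index.

```python
import subprocess

SEARCH_BIN = "/usr/local/bin/search"  # path used in the article's examples


def search_argv(keywords, index=None, any_mode=False, field=None):
    """Build an argument list for the search utility from the options above."""
    if any_mode and field:
        # Sketch-level choice: keep the two match modes separate.
        raise ValueError("use either any_mode or field, not both")
    argv = [SEARCH_BIN]
    if index:
        argv += ["-i", index]
    if any_mode:
        argv.append("-a")
    if field:
        # Extended mode is needed for the @fieldname syntax.
        argv += ["-e", "@" + field]
    argv += list(keywords)
    return argv


def run_search(*args, **kwargs):
    """Run search and return its text output (needs Sphinx installed)."""
    result = subprocess.run(search_argv(*args, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Prints the argument list without executing anything.
print(search_argv(["keyword1"], index="test1", field="title"))
```

Passing the arguments as a list avoids shell quoting issues with characters such as @ in the query.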

More to Come…

Now you’ve seen the basics of setting up a simple document index using Sphinx. Next week we’ll look at running the searchd daemon, running queries against it from PHP and Perl, and dig into some of the advanced features that may be interesting.
