Archive: Jan 2014

I’ve been working on a project to use the open source Sphinx search engine to index some websites. The intention is that the user will be able to type in the URL of the website they want to index, and the system will then (a) download the content of the site in question using a bespoke crawler and (b) automatically pipe the relevant parts of that content into Sphinx.

This is a development version of the indexing system in operation

The system will provide a front-end search engine for users to query the indexed content. The results will give a summary of each page of content, including generated keywords, and will report on the status of each page, including any relevant metadata. There will also be an API so that, if required, the content can be used in other systems.

Generating the content to index

Crawling any web content is something of an art form. Many sites do not conform to good HTTP practices; it amazes me, for example, that so many sites don’t send the Last-Modified HTTP header to ease the burden on their servers. However, for the purposes of this article we’ll assume that the content we want can be fetched easily and that every page fetched can be processed to generate a nice XML fragment with all of the information we want for each page.

So, the web crawler fetches the content and each URL encountered generates an XML fragment stored with the following structure:

What we need to do then is to:

turn the XML documents into a format suitable for importing into a Sphinx index;

create an index for the site in Sphinx; and

feed the information stored across the multiple XML files into the relevant index.

Processing the XML files for Sphinx

To get the content in the static XML files that have been generated for each URL into Sphinx, we need to generate an XML document that is compatible with Sphinx’s xmlpipe2 document format.

The first step is to concatenate all of the individual XML files into one large XML file (making sure to top and tail the concatenated files with some extra XML so that the final file is valid XML). This is handled by a simple ‘cat’ command.
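As a rough illustration of the concatenation step, here is a minimal shell sketch. It is not the exact command from my script: the function name build_combined_xml, the .xml fragment extension and the <documents> wrapper element are all assumptions made for the example, and it presumes that the fragments carry no XML declaration of their own.

```shell
# Concatenate per-URL XML fragments into one well-formed document.
# Assumptions (hypothetical, for illustration only):
#   - fragments end in .xml and sit directly inside the site directory
#   - fragments have no <?xml ...?> declaration of their own
#   - the wrapper root element is called <documents>
#   - the output file lives outside the site directory, so it is never
#     picked up as one of its own inputs
build_combined_xml() {
    site_dir=$1
    out_file=$2
    {
        # "Top" the file with a declaration and an opening root element
        printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>'
        printf '%s\n' '<documents>'
        for fragment in "$site_dir"/*.xml; do
            # Skip the unexpanded pattern when the directory is empty
            [ -f "$fragment" ] || continue
            cat "$fragment"
        done
        # "Tail" the file with the closing root element
        printf '%s\n' '</documents>'
    } > "$out_file"
}
```

It would be called along the lines of build_combined_xml crawler/www.example.com crawler/www.example.com.sphinx.xml, where the site name is just a placeholder.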

Then a separate XML file is generated in memory using a fairly straightforward XSLT transformation, using:

<sphinx:document id="{1 + count(preceding-sibling::document)}">

…to give each entry a unique sequential id.

This XML file is only generated when it is called by the Sphinx configuration file. However, before that we need to create a configuration file that knows about each new site as it is added for indexing.

Creating a Sphinx configuration file on the fly

Fortunately, Sphinx configuration files can be written in more or less any scripting language: if the file starts with a shebang line, Sphinx executes it and reads the configuration from its output. The solution here is written in Perl, partly because that’s what much of the other code for the system I am developing is written in, and partly because Perl has the rather useful Sphinx::Config module, which can parse, update and output Sphinx configuration files.

Using Sphinx::Config the script reads in a standard Sphinx configuration file which I’ve set up with all of the correct connection details for my instance of Sphinx, but which doesn’t include any source or index information. [For more on configuring Sphinx, see the official documentation]

I then iterate through each of the file-based folders the crawling process has created ($dir) to get the names of the indexes I need to create ($index), and then set the appropriate source and index entries using Sphinx::Config->set():

It is the xmlpipe_command that generates the transformed XML document that is to be fed into Sphinx. (The Perl file sphinx.xslt.pl just carries out a standard XSLT transformation and outputs the result as a string.)

Finally I output the updated configuration as a string so that it can be read by Sphinx:

print $c->as_string();

The complete configuration script looks like this:

#!/usr/bin/env perl
use strict;
use warnings;
use Sphinx::Config;
use Cwd;

# The name of the base configuration file
# - this needs to include all of the connection
#   information for indexer and searchd
my $filename = "sphinx.conf";

# Load in the default configuration file
my $c = Sphinx::Config->new();
$c->parse($filename);

# Set the location of all the crawled content
my $root = cwd() . "/crawler";

# Open the directory
opendir my $dh, $root
    or die "$0: opendir $root: $!";

# And read in each of the folders (excluding any starting with .)
my @dirs = grep { -d "$root/$_" && !/^\./ } readdir($dh);
closedir $dh;

# Iterate through the folders and create a source
# and index entry for each
foreach my $dir (@dirs) {
    # The index name is the folder name with the dots removed
    my $index = $dir;
    $index =~ s/\.//g;

    $c->set('source', $index, 'type', 'xmlpipe2');
    $c->set('source', $index, 'xmlpipe_command', 'perl sphinx.xslt.pl xslt_file crawler/' . $dir . '.sphinx.xml');

    $c->set('index', $index, 'source', $index);
    $c->set('index', $index, 'path', '/path/to/sphinx/' . $index);
    $c->set('index', $index, 'morphology', 'stem_enru');
    $c->set('index', $index, 'docinfo', 'extern');
    $c->set('index', $index, 'charset_type', 'utf-8');
    $c->set('index', $index, 'min_word_len', '1');
}

# Output the updated configuration so that
# it can be read by Sphinx
print $c->as_string();

So now I have a configuration file that reflects the content that has been crawled for indexing. However, as yet I haven’t provided a mechanism for letting Sphinx know that it needs to re-read this configuration file when a new site has been added.

That’s not quite as straightforward as it might be.

Updating Sphinx

To step back slightly, I should say that the whole crawler/indexer process is being managed by a bash script. This is the easiest way to tie together the different processes and languages involved in getting everything to work. So it is a bash script which starts the crawling process and which, when it has finished, creates the standalone XML document (using the cat command).

The next step in the bash process is to tell Sphinx to index. This is seemingly just a question of calling the indexer using the dynamic configuration file devised above along with the name of the new index and asking it to rotate:

indexer --config sphinx.config.pl "$INDEX" --rotate
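One detail worth spelling out is that the value of $INDEX has to match the index name the configuration script derives from the folder name, which is the folder name with the dots removed. A minimal sketch of how the bash side might do that follows; the function names index_name_for and index_site are hypothetical and not taken from my actual script.

```shell
# Derive the Sphinx index name from a crawl folder name by removing the
# dots, mirroring the s/\.//g substitution in the Perl configuration script.
index_name_for() {
    printf '%s\n' "$1" | tr -d '.'
}

# Hypothetical wrapper: index a single site and ask for a rotation.
# The exit status is whatever indexer returns.
index_site() {
    site=$1
    INDEX=$(index_name_for "$site")
    indexer --config sphinx.config.pl "$INDEX" --rotate
}
```

So index_site www.example.com would run the indexer against an index called wwwexamplecom.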

If you’ve set your permissions correctly this will report a successful rotation AND that it has sent a SIGHUP to the searchd process that handles web-based queries so that it too can re-read the configuration file and learn about the new index.

Except this is a lie. The index is rotated properly but searchd ends up knowing nothing about it. In fact, you have to send that signal independently:

kill -SIGHUP `cat /path/to/your/searchd.pid`

Except actually that doesn’t work either.

That’s because although the Sphinx configuration file is scriptable, and hence dynamic in terms of its output, the process that SIGHUP triggers to re-read the configuration file first checks the file’s modification date, and if that hasn’t changed it won’t bother re-reading the file at all. So before calling kill -SIGHUP on the searchd.pid file you need to touch the configuration file first:

touch /path/to/your/sphinx.config.pl
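Putting the two steps together, the reload might be wrapped up something like the sketch below. The function name reload_searchd is made up for the example, both arguments are placeholders for your own paths, and the check on the pid file contents is an extra safeguard I have added for illustration rather than something Sphinx requires.

```shell
# Ask searchd to pick up a regenerated configuration.
# $1 = path to the scripted configuration file (placeholder)
# $2 = path to the searchd pid file (placeholder)
reload_searchd() {
    config_file=$1
    pid_file=$2

    # Bump the modification date first, otherwise the re-read is skipped
    touch "$config_file" || return 1

    # Read the pid and check that it looks like a number before signalling
    pid=$(cat "$pid_file") || return 1
    case $pid in
        ''|*[!0-9]*)
            echo "unexpected contents in pid file: $pid_file" >&2
            return 1
            ;;
    esac

    # Send the hangup signal that triggers the re-read
    kill -HUP "$pid"
}
```

It would be called as reload_searchd /path/to/your/sphinx.config.pl /path/to/your/searchd.pid.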

Of course, it probably still won’t work, because the process running your bash script won’t have permission to execute the kill command on the searchd process. To get that to work you need to grant that permission through sudo, editing the sudoers file with visudo. That’s mostly outside the scope of this article, but you’ll need something that looks a bit like this: