Introduction

This post is about the Search All Sites module for Wikidot Open Source. It is mainly for those of you who run your own Wikidot services. More information on getting the Wikidot software running can be found on Ed's site.

Fresh install

After installing the current version of Wikidot Open Source, you can test the new Search All Sites module right away. Just navigate to the /search:all page of your main wiki. For example, if your wiki farm runs on the domain mydomain.com and your main wiki is www.mydomain.com, navigate to http://www.mydomain.com/search:all.
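If you prefer to check this from a shell instead of a browser, a minimal sketch could look like the one below. The function names are made up for illustration, and it assumes curl is installed and the main wiki is served over plain http, as in the example URL.

```shell
# Hypothetical helpers, not part of Wikidot.

# Build the Search All Sites URL for the given main wiki host.
search_all_url() {
    printf 'http://%s/search:all\n' "$1"
}

# Print the HTTP status code the search page responds with.
check_search_page() {
    curl -s -o /dev/null -w '%{http_code}\n' "$(search_all_url "$1")"
}

# Usage: check_search_page www.mydomain.com
```

A status code in the 200 range suggests the page is being served; it does not prove the index has anything in it, so a real search in the browser is still worth doing.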

Existing install

Updating an existing installation to the new search engine is a bit tricky, because you need to pre-populate the search index (the actual data the search engine looks through when searching for the term the user entered).

You can try the following commands:

obtain root privileges

sudo su

navigate to your Wikidot directory (it is /var/www/wikidot by default), update the code and run the lucene_bootstrap.php script as your lighttpd user

The lucene_bootstrap.php script adds every page to the index (normally located at /var/www/wikidot/tmp/lucene_index). Once indexed, a page can be searched (if the site containing the page is public or you're a member of the site). The script prints a dot for every 10 indexed items (an item is a page or a forum/comments thread).

If this runs smoothly (that is, without errors; a Segmentation fault at the very end is OK, but a memory exhausted error is not), all your sites are indexed and ready to be searched.

If it fails, you can increase the memory limit (the memory_limit setting) in the corresponding php.ini file and re-run the command. There is no harm in running this command more than once, because indexing a page always deletes the page from the index before adding it again.
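If you want to script the re-run, here is a rough sketch rather than the official procedure. It assumes the web server user is www-data (as in the crontab entry further down), that PHP's memory limit directive is memory_limit and can be overridden for a single run with php -d, and that the shell reports a segfaulted process as status 139 (128 + SIGSEGV). The helper names and the script path argument are made up for illustration.

```shell
# Hypothetical helpers; the names and the script path are assumptions.

# Print (without running it) a bootstrap command with a one-off memory limit.
# $1 = limit, e.g. 512M   $2 = path to lucene_bootstrap.php in your install
bootstrap_cmd() {
    printf 'sudo -u www-data php -d memory_limit=%s %s\n' "$1" "$2"
}

# Translate the exit status of the run into the cases described above.
bootstrap_status() {
    case "$1" in
        0)   echo "ok" ;;
        139) echo "ok, segfault tolerated" ;;   # 128 + 11 (SIGSEGV)
        *)   echo "failed, check for a memory exhausted error" ;;
    esac
}
```

For example, bootstrap_cmd 512M lucene_bootstrap.php prints the command to run, and bootstrap_status $? right after the run tells you which case you hit. The status alone cannot tell when a segfault happened, so still check that the dots kept coming until the end.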

Then go to the /search:all page of your main wiki and search for some content.

You also need to update your crontab file. Add:

* * * * * www-data /var/www/wikidot/bin/job.sh UpdateLuceneIndexJob

to your /etc/crontab (assuming you have Wikidot in /var/www/wikidot/). This adds a job that runs every minute and indexes the pages and threads that were queued for indexing when they were saved or when a site's public/private state changed.
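Because it is easy to paste the same entry twice, a small sketch like the one below appends the line only when it is missing. The function name add_lucene_cron is made up; the cron line itself is the one from this post, and you need to run it as root to write to /etc/crontab.

```shell
CRON_LINE='* * * * * www-data /var/www/wikidot/bin/job.sh UpdateLuceneIndexJob'

# Append the job to a crontab file unless the exact line is already present.
# $1 = crontab file (defaults to /etc/crontab)
add_lucene_cron() {
    file="${1:-/etc/crontab}"
    # -x matches whole lines, -F treats the asterisks literally
    grep -qxF -- "$CRON_LINE" "$file" 2>/dev/null ||
        printf '%s\n' "$CRON_LINE" >> "$file"
}

# Usage: add_lucene_cron            (edits /etc/crontab)
#        add_lucene_cron /tmp/test  (try it on a scratch file first)
```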

Features

First of all, the new search applies only to Search All Sites; Search This Site still works the old way.

Search uses titles and tags intelligently:

pages with the exact search phrase in the title are placed higher in the result list

pages with tags matching the search phrase are quite high in the result list

pages with a title matching the search phrase are quite high in the result list

pages with content matching the search phrase are fairly low in the result list

pages where parts of the search phrase match titles and tags can appear higher in the result list than pages whose content matches even the exact phrase

all this means that tags and titles matter more to the search engine than content

The search includes public sites plus sites you are a member of. Results from your own sites are also generally treated as more relevant by the search engine (i.e. they appear higher than results from other sites).

The search results for a given phrase and a given user are cached for a few minutes (if memcached is used). This makes searching even smoother (there is no need to search the index again when the user only switches from result page 2 to page 3, for example).

Test it!

If you don't run and don't want to run your own Wikidot installation, you can try the new features on the following site:

Need more performance or memory limit exhausted

We experienced low performance when searching through the 2 million pages and threads of Wikidot.com: the search results took about 3 seconds to generate. This was not fast enough for us, so we managed to speed things up by using the native Java Lucene implementation for searching the index. This works because the PHP Lucene implementation we use is compatible with the Java one, which means we can index pages with PHP and search with Java. And we do! If you want to do this too (because you are experiencing low search performance or getting memory exhausted error messages), just add the following lines to your conf/wikidot.ini file:

[search]
; enables the use of Java for searching
use_java = true

Notes

if you already have a [search] section in the conf/wikidot.ini file, just add the use_java = true line to that section

enabling Java for searching requires a java executable installed on your system. You should know how to do this (try sudo aptitude install openjdk-6-jre).

you don't need any Java libraries, as we have already bundled everything needed in the .jar file. The Java source and Ant build script are normally located in the /var/www/wikidot/java directory (assuming you installed Wikidot in /var/www/wikidot).
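Before relying on the Java search, you may want to double-check both prerequisites from the shell. This is only a sketch: the function names are made up, and the ini check is a simple line-based scan that assumes the setting is written as use_java = true on its own line, as in the snippet above.

```shell
# Hypothetical helpers, not part of Wikidot.

# Succeed if a java executable is on the PATH.
java_available() {
    command -v java >/dev/null 2>&1
}

# Succeed if use_java = true appears inside the [search] section.
# $1 = path to the ini file (defaults to conf/wikidot.ini)
search_uses_java() {
    awk '
        # Any section header switches the flag on or off.
        /^[ \t]*\[/ { in_search = ($0 ~ /^[ \t]*\[search\]/) }
        in_search && /^[ \t]*use_java[ \t]*=[ \t]*true/ { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "${1:-conf/wikidot.ini}"
}

# Usage, run from the Wikidot directory:
#   java_available && search_uses_java && echo "Java search is configured"
```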

Summary

Once we are sure the search is stable and gives relevant results, we'll introduce it to the Wikidot.com service. I calculated that indexing all the sites would take about 3 days! But searching takes less than 1 second (using the Java program).

I'm looking forward to your comments on the features, especially if you've tried them yourself!