In Progress

Indices are stored in a database; supported databases include MySQL and PostgreSQL (SQLite is not advertised, but is listed in the documentation)

Support for all single-byte character sets (SBCS) and some double-byte character sets (DBCS)

Search tool supports Booleans (AND, OR, NOT, NEAR)

Evaluation

Evaluated after mnoGoSearch.

Database modes include single and multi, as in mnoGoSearch, plus hashed modes, which do not support string and substring searches, and cached mode, which stores only URI indices in the database and keeps word data in disk files managed by an additional daemon. The absence of mnoGoSearch's blob mode and the presence of cached mode appear to be the major differences between the two.
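As a sketch, the database mode is selected with the dbmode parameter on the DBAddr line in indexer.conf; the user, password, and database names below are illustrative, and the exact URI syntax should be verified against the DataparkSearch documentation.

```
# indexer.conf (illustrative): select cached mode via the dbmode parameter
DBAddr pgsql://dp:search@localhost/dpsearch/?dbmode=cache
```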

As written, the setup expects that the application process logs in to the database as the schema owner; with additional manual steps, however, it can be made to work as a non-owner. The call handler setup and custom stored-procedure language definition present in mnoGoSearch are commented out in the DataparkSearch setup, so a PostgreSQL superuser is not required as written. (Their presence in mnoGoSearch is questionable anyway.) In general, DataparkSearch does appear to be a more slowly developed version of mnoGoSearch.

Multiple character set support is not the default; it must be specified explicitly at compile time.

Bugs in create.multi.sql

185: ERROR: relation "cachedchk2" already exists

186: ERROR: column "url_id" does not exist

Caused by duplicate CREATE statements in the script. Comment out one set to resolve the bug.
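A quick way to apply that fix, assuming the duplicates are the statements at the lines reported above (verify against your copy of the script), is to prefix them with SQL comment markers:

```shell
# Comment out the duplicate CREATE statements at lines 185-186 by
# prefixing them with the SQL comment marker "--"; this writes a fixed
# copy rather than editing create.multi.sql in place.
sed '185,186 s/^/-- /' create.multi.sql > create.multi.fixed.sql
```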

Bugs in drop.multi.sql

16: ERROR: sequence "url_rec_id_seq" does not exist

17: ERROR: sequence "categories_rec_id_seq" does not exist

18: ERROR: sequence "qtrack_rec_id_seq" does not exist

19: ERROR: sequence "server" does not exist

Caused by extraneous DROP statements. Comment them out or ignore them. They're harmless.

Extended search mode appears broken. Only the first result is returned. All other results are lost. This might be only an error in the search form, in which case it can be easily debugged and fixed.

Requirements

buildrequires

gcc make

postgresql-devel (for PostgreSQL support)

zlib-devel

requires

postgresql-libs (for PostgreSQL support)

httpd

zlib

others as desired to index documents (pdf, etc.)

Setup Notes

The tarball was compiled into /opt/dpsearch, although it should be possible to create an RPM for more conventional locations (/bin, /etc, /sbin, etc.)

The code compiles and runs fine on x86_64 (compare to mnoGoSearch)

Create a database user (dp/search) distinct from the database owner (dbowner/dbowner), itself distinct from the superuser (postgres)
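A sketch of that role separation as createuser/psql commands, run as the postgres superuser; the role names and passwords are the sample credentials from these notes, and the database name dpsearch is an assumption.

```shell
# Create the schema owner and the unprivileged application user
# (passwords here are the sample credentials from the notes above).
createuser --no-superuser --no-createdb --no-createrole dbowner
createuser --no-superuser --no-createdb --no-createrole dp
psql -c "ALTER ROLE dbowner PASSWORD 'dbowner';"
psql -c "ALTER ROLE dp PASSWORD 'search';"
# The application database is owned by dbowner, not by dp or postgres.
createdb --owner=dbowner dpsearch
```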

A search engine library with a sample indexer and search page rather than a fully functional application. Stores indices in Berkeley DB files with JSON interfaces. Allows custom-designed indices, including categories (exact match) to fulfill the "programmable keywords" requirement. Each document index on each document source is a single write-once file collection (BDB and JSON) in a unique directory. Rerunning the indexer creates a new directory, obsoleting the old directory if all the old documents are included; the old directory then needs to be cleaned up manually. Postings can, however, be deleted from an index. Alternatively, only the new documents can be indexed, but that's not efficient.

The supplied install.pl script generates a configure command, but does not support SQLite.

Adding --with-sqlite3 to the generated command adds SQLite support. An empty database must be created manually. A URI in the indexer.conf file specifies the location of the database. According to the documentation, sqlite:/path/to/db/file should work, but doesn't. According to the message boards on mnoGoSearch.org, sqlite://localhost/path/to/db/file should work, but doesn't.
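For reference, the full configure invocation used here might look like the following; --with-sqlite3 is from the notes above, while the --with-pgsql flag spelling and the prefix are assumptions to verify against ./configure --help.

```shell
# Illustrative build: PostgreSQL plus SQLite support, installed under
# /opt/mnoGoSearch as in these notes.
./configure --prefix=/opt/mnoGoSearch --with-pgsql --with-sqlite3
make
make install
```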

No problems compiling with PostgreSQL support. Configure and Makefile may need significant rewriting to make good RPMs. There is no support for cross-compiling to 32-bit architectures from 64-bit machines.

As written, the setup expects that the application process logs in to the database as a PostgreSQL superuser and schema owner; with additional manual steps, however, it can be made to work as neither the superuser nor the schema owner. Several schemata are available (single, multi, and blob); LavTech recommends blob for sites indexing more than 50k documents. The crawler is very flexible, with quite a complex configuration file. The CGI search page has nice features for "advanced" searching, and it can be customized to suit each site. Tags are labels configured within the crawler, usually by URI server component. Categories are numerical hierarchies, up to 6 levels deep, also specified in the crawler configuration.

Bugs in pgsql/drop.blob.sql

1. drop function clean_srvinfo(); (the () is omitted, but needs to be included)

2. DROP LANGUAGE plpgsql; (missing)

3. DROP FUNCTION plpgsql_call_handler(); (missing, has to be run twice, once for postgres, once for dbowner?)

Does this plpgsql definition in the postgres database (dangerously) short-circuit anything inherent for other databases on the same server?

Requirements

buildrequires

gcc make

sqlite-devel (for SQLite support)

postgresql-devel (for PostgreSQL support)

zlib-devel

requires

sqlite (for SQLite support)

postgresql-libs (for PostgreSQL support)

httpd

zlib

others as desired to index documents (pdf, etc.)

Setup Notes

The tarball was compiled into /opt/mnoGoSearch, although it should be possible to create an RPM for more conventional locations (/bin, /etc, /sbin, etc.)

On x86_64 architecture, the x86_64 binary fails when indexing a crawl with the error "indexer[21272]: PQexecPrepared: ERROR: incorrect binary data format in bind parameter 2." The tarball refuses to cross-compile to 32-bit architecture, despite tweaking the ./configure options. Compiling on a 32-bit machine and moving the binaries to the 64-bit machine works.

Create a database user (mno/search) distinct from the database owner (dbowner/dbowner), itself distinct from the superuser (postgres)

/opt/mnoGoSearch/sbin/indexer -Ecreate (to create the tables, etc.) and /opt/mnoGoSearch/sbin/indexer -Edrop (to drop them) each just run a script from /opt/mnoGoSearch/share/<db-type>/, but need to be run as postgres

Run create.blob.sql as postgres, change owners to dbowner, and grant privileges to mno
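The ownership and privilege changes can be sketched as follows; the database name (mnogosearch), the pgsql script path, and the url table are illustrative — the ALTER/GRANT statements need to be repeated for every object the script creates.

```shell
# Run the schema script as the postgres superuser, then hand the
# objects to dbowner and let the application user read and write them.
sudo -u postgres psql mnogosearch -f /opt/mnoGoSearch/share/pgsql/create.blob.sql
sudo -u postgres psql mnogosearch -c "ALTER TABLE url OWNER TO dbowner;"
sudo -u postgres psql mnogosearch -c "GRANT SELECT, INSERT, UPDATE, DELETE ON url TO mno;"
```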

the crawler/indexer is a Java command line application; the default depth is 5; the default number of threads is 10

the search tool runs in a Java servlet container, e.g., Tomcat
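A crawl can be kicked off with the one-shot crawl command; the seed directory urls and output directory crawl are illustrative, and the flags simply make the defaults noted above explicit.

```shell
# One-shot Nutch crawl: read seed URLs from ./urls, write segments and
# indexes under ./crawl, spelling out the default depth (5) and number
# of fetcher threads (10) from the notes above.
bin/nutch crawl urls -dir crawl -depth 5 -threads 10
```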

Evaluation

There's nothing to build. Simply configure the crawler (which is actually the indexer, too) and deploy/configure the searcher. The crawler caches the pages it indexes, making the cache available to the search tool. The search interface is extremely simple and is multi-lingual, but is almost entirely an advertisement for the Nutch project. It doesn't look particularly easy to rebrand. Overall, the polish of the finished product means it's less flexible to custom modifications, like programmable keywords. After creating a new index (i.e., after a new crawl), the search application must be reloaded in Tomcat manager. The crawler is more flexible than a brief investigation could reveal. The official documentation leaves a lot to be desired. Searches are for single terms only, no multiple terms or +/- Booleans.

Search engine with a built-in web crawler, a built-in file system crawler, and an interface for an external crawler. The distribution includes sample search pages which use the Perl API. There is also a C API. The index is not customizable, but does include a facility for including metawords (exact match) and the path in the index for each document. The documentation acknowledges that the software only supports ASCII, but some MBCS may also work.

xapian-core, xapian-bindings, and perl-Search-Xapian are already in Fedora; xapian-omega is not

additional bindings to PHP, Java, and more (?)

Omega provides a Xapian front-end for indexing (via script) and searching (command line or CGI)

Omega provides glue scripts for ht://Dig, mbox files, and perl DBI

Flax [13] is another search engine built on top of Xapian and CherryPy
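As a sketch of the indexing front-end, Omega's omindex tool can build a database from a document tree; the paths below are illustrative, and the flag spellings should be checked against omindex --help.

```shell
# Index the files under /var/www/html into the Omega database named
# "default", mapping them to the site root URL.
omindex --db /var/lib/omega/data/default --url / /var/www/html
```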

Evaluation

Xapian is a search engine library. Omega adds functionality on top of Xapian. The Xapian database is very flexible, supporting an entirely user-designed schema. Usage through Omega loses very little, if any, of that flexibility; however, the supplied Omega CGI is extremely rudimentary. The supplied Omega CGI also requires the database to be named "default," although that can be changed. Database columns are of type field or index. Fields are stored verbatim (e.g., URL, date, MIME type, keywords). Indices are input as blocks of text or other content to be indexed, but not stored (e.g., the corpus of a file or web page). The Omega scriptindex utility can be combined with an external web crawler for HTML. Making Omega work with Apache requires relabeling /var/lib/omega as httpd_sys_content_t, or moving /var/lib/omega to /var/www/omega and using the default context there. In this evaluation, /var/lib/omega was moved to /var/www/omega. Xapian only works with UTF-8.
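The SELinux step can be sketched either way; this evaluation used the move. The chcon type name is the standard Apache content type, but verify it against the local policy.

```shell
# Option 1: relabel the Omega data in place so httpd may read it.
chcon -R -t httpd_sys_content_t /var/lib/omega
# Option 2 (used here): move it under the web root, which already
# carries the right default context, then reset labels to match.
mv /var/lib/omega /var/www/omega
restorecon -R /var/www/omega
```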

It has no crawling/spidering facility. It has no user query interface. There are no samples.

Description

written in Java

based on Lucene

Evaluation

The documentation describes installing Sun Java to run Solr, but OpenJDK 1.5 or later is fine. Solr needs a Java servlet container in which to run. It comes with Jetty, but other containers should work, as well (e.g., Tomcat). Currently only supports UTF-8 characters.

Basically Solr provides an HTTP admin GUI for a search engine that uses a superset of the Lucene query syntax. The schema is very flexible. Set-up is essentially entirely through XML files. Applications can query the servlet port and get XML or JSON responses.
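A minimal round trip against the bundled Jetty container might look like this; port 8983 is the example default, and the URL layout may differ in other setups.

```shell
# Start the bundled Jetty container from the Solr example directory...
cd example && java -jar start.jar &
# ...then query the servlet port over HTTP, asking for a JSON response
# instead of the default XML.
curl 'http://localhost:8983/solr/select?q=solr&wt=json'
```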

Requirements

buildrequires

ant (note that ant currently pulls in java-gcj-compat, too, but it appears not to be a problem)

Apache Configuration Notes

CGI for Xapian Omega and mnoGoSearch

In /etc/httpd/conf.d/cgi-bin.conf, just use the default configuration normally commented out in httpd.conf.

# ScriptAlias: This controls which directories contain server scripts.
# ScriptAliases are essentially the same as Aliases, except that
# documents in the realname directory are treated as applications and
# run by the server when requested rather than as documents sent to the client.
# The same rules about trailing "/" apply to ScriptAlias directives as to
# Alias.
#
ScriptAlias /cgi-bin/ "/var/www/cgi-bin/"
#
# "/var/www/cgi-bin" should be changed to whatever your ScriptAliased
# CGI directory exists, if you have that configured.
#
<Directory "/var/www/cgi-bin">
AllowOverride None
Options None
Order allow,deny
Allow from all
</Directory>