Four indexer.conf commands implement HTDB support:
HTDBAddr, HTDBList, HTDBLimit and HTDBDoc.

HTDBAddr specifies a database connection. Its syntax is identical to that of the DBAddr command.
If no HTDBAddr command is given, the data is
fetched using the same connection specified in the DBAddr command.
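
For example, a connection might be specified like this (the connection parameters are illustrative):

```
HTDBAddr mysql://foo:bar@localhost/database/
```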

HTDBList
specifies the SQL query that
generates the list of all URLs corresponding to records in the table,
using the PRIMARY KEY field. You may use either absolute or relative URLs
in the HTDBList command:
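
For example, assuming a table with an integer PRIMARY KEY column named "id" (these queries are illustrative, not the original samples):

```
# Absolute URLs:
HTDBList "SELECT concat('htdb:/',id) FROM messages"
# Relative URLs, resolved against the htdb:/ base:
HTDBList "SELECT id FROM messages"
```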

HTDBLimit may be used to specify the maximum number of records fetched in one SELECT operation.
It reduces memory usage when indexing big tables. For example:

HTDBLimit 512

HTDBDoc
specifies the query that fetches a single record from the database by its PRIMARY KEY value.

The HTDBList SQL query is used for all URLs which end with the '/' sign. For other URLs, the SQL
query given in HTDBDoc is used.

Note: the HTDBDoc query must return a FULL HTTP
response, including headers. You can therefore build a very flexible
indexing system by returning different HTTP statuses from the query.
Take a look at the HTTP response codes section of the documentation to
understand how the indexer behaves when it receives each status.
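
A minimal sketch of such a query (the concat() call and the '$1' placeholder for the PRIMARY KEY value are assumptions based on common HTDB setups, not the original sample):

```
HTDBDoc "SELECT concat('HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n', msg) FROM messages WHERE id='$1'"
```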

If HTDBDoc returns no result, or the query
returns several records, the HTDB retrieval system generates "HTTP 404 Not
Found". This may happen at reindex time if the record was deleted from
your table since the last reindexing. You may use HoldBadHrefs 0 to
delete such records from the mnoGoSearch tables as well.

You may use several HTDBDoc/HTDBList commands in one
indexer.conf, together with the corresponding Server
commands.

Using the htdb:/ scheme you can create a full-text
index and use it from your application. Imagine you have a
big SQL table which stores web board messages in plain text format,
and you want to add a message search facility to your application.
Say the messages are stored in a "messages" table with
two fields, "id" and "msg": "id" is an integer PRIMARY KEY and "msg"
is a big text field containing the messages themselves. An ordinary SQL LIKE
search may take a long time to answer:

SELECT id, msg FROM messages WHERE msg LIKE '%someword%'

With the mnoGoSearch htdb: scheme you can
create a full-text index on the "messages" table. Install
mnoGoSearch in the usual way, then edit your
indexer.conf:
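
A minimal sketch of such an indexer.conf fragment (the connection strings and exact query text are assumptions; doc/samples/htdb.conf contains the canonical version):

```
DBAddr    mysql://user:password@localhost/search/
HTDBAddr  mysql://user:password@localhost/board/
HTDBList  "SELECT id FROM messages"
HTDBDoc   "SELECT concat('HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n', msg) FROM messages WHERE id='$1'"
Server    htdb:/
```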

When started, indexer inserts the 'htdb:/' URL
into the database and runs the SQL query given in HTDBList. The query
produces the values 1, 2, 3, ..., N. These values are
treated as links relative to the 'htdb:/' URL, and new URLs of
the form htdb:/1, htdb:/2, ..., htdb:/N are added to the
database. The HTDBDoc SQL query is then executed for each new
URL, producing an HTTP document for each record in the
form:
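
Such a generated document might look like this (illustrative; the body comes from the "msg" field):

```
HTTP/1.0 200 OK
Content-Type: text/plain

<contents of the "msg" field for this id>
```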

You can also use the htdb:/ scheme to index your
database-driven WWW server. It allows you to create indexes without having
to invoke the web server while indexing, so it is much faster and
requires fewer CPU resources than indexing directly from the WWW
server.

The main idea of indexing a database-driven web
server is to build the full-text index in the usual way. The only
requirement is that search must produce real URLs instead of URLs in the
'htdb:/...' form. This can be achieved using the mnoGoSearch aliasing tools.

Take a look at the sample
indexer.conf in
doc/samples/htdb.conf. It is the
indexer.conf used to index our webboard.

The HTDBList command generates URLs in the form:

http://search.mnogo.ru/board/message.php?id=XXX

where XXX is a "messages" table PRIMARY KEY value.

For each PRIMARY KEY value, the HTDBDoc command generates a text/html document with HTTP headers and content like this:
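
The sample document itself is not reproduced here; the following small shell sketch (hypothetical names, plain LF line endings for brevity; a real response uses CRLF) shows the shape of such a document:

```shell
#!/bin/sh
# Hypothetical sketch, not the original sample: assembling the kind of
# text/html document the HTDBDoc query returns for one message.
# "msg_id" and "msg_text" stand in for the "id" and "msg" fields.
msg_id=42
msg_text='Hello from the webboard'
# Status line and headers first, then an empty line, then the body.
printf 'HTTP/1.0 200 OK\n'
printf 'Content-Type: text/html\n'
printf '\n'
printf '<html><head><title>Message %s</title></head>\n' "$msg_id"
printf '<body>%s</body></html>\n' "$msg_text"
```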

The first command tells the indexer to execute the HTDBList query, which will generate a list of messages in the form:

http://search.mnogo.ru/board/message.php?id=XXX

The second command tells the indexer to accept such message URLs, using a string match with a '*' wildcard at the end.

The third command replaces the
"http://search.mnogo.ru/board/message.php?id=" substring in the URL with
"htdb:/" when the indexer retrieves documents with messages. This means that
"http://search.mnogo.ru/board/message.php?id=xxx" URLs will be shown
in search results, but "htdb:/xxx" URLs will be indexed instead, where
xxx is the PRIMARY KEY value, the ID of the record in the "messages"
table.
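
Put together, the three commands described above might look like this (a sketch only; the directive forms are assumptions, and the canonical versions are in doc/samples/htdb.conf):

```
HTDBList "SELECT id FROM messages"
Server   http://search.mnogo.ru/board/message.php?id=*
Alias    http://search.mnogo.ru/board/message.php?id=  htdb:/
```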

mnoGoSearch supports the exec: and cgi: virtual URL
schemes. They allow running an external program, which must
write its result to stdout. The result must be a standard HTTP response,
i.e. an HTTP response header followed by the document's content.

For example, when indexing either
cgi:/usr/local/bin/myprog or
exec:/usr/local/bin/myprog, the indexer executes
the /usr/local/bin/myprog program.

When executing a program given with the cgi: virtual
scheme, the indexer emulates the program running under an HTTP server. It
creates a REQUEST_METHOD environment variable with the value "GET" and
a QUERY_STRING variable according to the HTTP standards. For example, if
cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e
is being indexed, the indexer creates QUERY_STRING with the
value a=b&d=e. The cgi: virtual URL scheme allows
indexing your site without having to invoke the web server, even if you
want to index CGI scripts. For example, suppose you have a web site with
static documents under /usr/local/apache/htdocs/
and CGI scripts under
/usr/local/apache/cgi-bin/. Use the following
configuration:
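
One possible configuration (the Alias mappings are assumptions; adjust hostnames and paths to your setup, and keep the more specific Alias first):

```
Server http://localhost/
Alias  http://localhost/cgi-bin/  cgi:/usr/local/apache/cgi-bin/
Alias  http://localhost/          file:/usr/local/apache/htdocs/
```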

With the exec: scheme, the indexer does not create a QUERY_STRING variable
as it does for the cgi: scheme. Instead, it builds a command line from the arguments given in the
URL after the ? sign. For example, when indexing
exec:/usr/local/bin/myprog?a=b&d=e, this
command will be executed:
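
Presumably along these lines (whether the arguments are passed as a single word is an assumption):

```
/usr/local/bin/myprog "a=b&d=e"
```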

The exec: virtual scheme can be used as an
external retrieval system. It allows using protocols which are not
supported natively by mnoGoSearch. For example, you can use the curl
program, available from http://curl.haxx.se/, to index HTTPS sites.

Put this short script into
/usr/local/mnogosearch/bin/ under
the name curl.sh:

#!/bin/sh
/usr/local/bin/curl -i "$1" 2>/dev/null

This script takes a URL given as a command-line
argument and executes the curl program to download it. The -i option tells
curl to output the result together with the HTTP headers.
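
The script can then be wired in with an Alias that maps https:// URLs to the exec: scheme (a sketch under stated assumptions; the hostname is a placeholder and the paths follow the example above):

```
Server https://secure.example.com/
Alias  https://  exec:/usr/local/mnogosearch/bin/curl.sh?https://
```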

You may specify a path to the root directory to enable site mirroring:

MirrorRoot /path/to/mirror

You may also specify a root directory for the mirrored documents' headers; the indexer will then store the HTTP headers on local disk too:

MirrorHeadersRoot /path/to/headers

You may specify the period during which previously mirrored files are used while indexing, instead of being downloaded again:

MirrorPeriod <time>

This is very useful when you are experimenting with
mnoGoSearch, indexing the same hosts repeatedly without wanting much traffic
to and from the Internet. If MirrorHeadersRoot is not specified and headers
are not stored on local disk, then the default Content-Type given in
the AddType commands will be used. The default value of MirrorPeriod is
-1, which means mirrored files are never used.

<time> is given in the form
xxxA[yyyB[zzzC]] (spaces are allowed between xxx
and A, between yyy and B, and so on), where xxx, yyy, zzz are numbers (they can be
negative!) and A, B, C is one of the following:

If you specify only numbers without any letter, the
time is assumed to be given in seconds (this behavior is kept for
compatibility with versions prior to 3.1.7).

The following command will force using local copies for one day:

MirrorPeriod 1d

If your pages are already indexed, then when you re-index
with -a, the indexer checks the headers and downloads only the files that
have been modified since the last indexing. Pages that have not been
modified are neither downloaded nor mirrored.
To create a full mirror you therefore need to either (a) start again with a
clean database or (b) use the -m switch.

You can actually use the created files as a full-featured
mirror of your site. Be careful, however: indexer will not
download a document that is larger than MaxDocSize; such a document
will only be partially downloaded. If your site has no large
documents, everything will be fine.