indexer is a part of
UdmSearch, a web search engine. The purpose of
indexer is to walk through HTTP, FTP and NEWS servers as well as the local file system,
recursively fetching all documents and storing metadata about them
in an SQL or built-in database. Since every
document is referenced by its URL, the metadata collected by
indexer is later used in the search process.

The behaviour of
indexer is controlled mainly via the configuration file
indexer.conf(5), which it reads on startup. There is a compiled-in default for the configuration
file name and location, so you don't need to specify it every time you run
indexer, but you can specify an alternative configuration file as the last argument.

indexer supports HTML-formatted (text/html MIME type) and plain text
(text/plain MIME type) documents. Support for other data types is provided
by external programs, which are called "parsers". A parser should read
data of some type from stdin and write text/html or text/plain data to stdout.
See
indexer.conf(5) for details.
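
As a rough illustration of the parser contract described above (data of some type in, text/plain out), here is a minimal Python sketch. The extraction rule, which keeps runs of printable ASCII much like strings(1) does, and the names extract_text, filter_stream and MIN_RUN are assumptions made for this example and are not part of UdmSearch. How a parser is registered is covered in indexer.conf(5).

```python
import re
import sys

MIN_RUN = 4  # hypothetical threshold: shortest printable run worth keeping


def extract_text(data: bytes, min_run: int = MIN_RUN) -> str:
    """Return the printable ASCII runs found in data, one per line."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_run)
    return "\n".join(run.decode("ascii") for run in pattern.findall(data))


def filter_stream(src, dst) -> None:
    """Read raw bytes from src and write text/plain to dst."""
    dst.write(extract_text(src.read()) + "\n")
```

To use the sketch as a real filter, call filter_stream(sys.stdin.buffer, sys.stdout) at the end of the script.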

You may run
indexer regularly from
cron(8) to keep the metadata up to date.

indexer is also used to manipulate the database. It can clear some data
from the database, output statistics and load ispell data into the database.

-a

By default indexer reindexes only those documents that are "expired", i.e.
the time since their last reindexing is greater than "Period" from the
indexer.conf(5) file. This option disables that check, so all documents will be reindexed
regardless of their state.
To achieve this,
indexer simply marks all URLs as "expired" first. This has the
following side effect: if you start
indexer -a and then terminate it (for example, by pressing
Ctrl-C) and start it again, all URLs will still be considered "expired" and will be
reindexed again.
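
The expiry rule and the side effect of marking everything as expired can be sketched as follows. This is only a model under stated assumptions: the Doc record, its last_indexed field and the function names are hypothetical, and the way UdmSearch actually stores expiry state is not shown in this page.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    last_indexed: float  # hypothetical field: seconds since the epoch


def expired(docs, now: float, period: float):
    """Documents whose age since the last reindexing exceeds period."""
    return [d for d in docs if now - d.last_indexed > period]


def mark_all_expired(docs) -> None:
    """Model of 'reindex everything': reset the stored time of every document.

    The change is stored with the documents, so it survives an interrupted run.
    """
    for d in docs:
        d.last_indexed = 0.0
```

Because mark_all_expired changes stored state rather than a per-run flag, a later run without the option still sees every document as expired until it has been reindexed.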

-n number

Reindex only the given
number of URLs and exit.

-c seconds

Limit indexing time to the given number of
seconds.

-e

Reindex the most expired documents first.
This option forces the list of documents to reindex to be sorted by last
reindexing time, so the most "expired" documents are reindexed
first. You may not notice any delay in practice,
but at least in theory the sorting should slow indexer down a bit.

The combination of
-e and
-n number seems to be of some value. For example, you can use
indexer -e -n100 to reindex just the 100 most expired documents.
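
The effect of combining the two options can be pictured with a short Python sketch: sort by last reindexing time, oldest first, then keep only the first few entries. The (url, last_indexed) tuples and the function name most_expired are assumptions for illustration, not indexer internals.

```python
def most_expired(docs, limit=None):
    """Order (url, last_indexed) pairs oldest first and keep at most limit."""
    ordered = sorted(docs, key=lambda doc: doc[1])
    return ordered if limit is None else ordered[:limit]


docs = [("http://a/", 300), ("http://b/", 100), ("http://c/", 200)]
oldest_two = most_expired(docs, limit=2)
```

Here oldest_two holds the entries for http://b/ and http://c/, in that order.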

-m

This option makes
indexer reindex documents even if their content has not been modified.
This is achieved by disabling the If-Modified-Since HTTP header and the MD5 hash check.
It is useful if you have changed some
Allow,
Disallow,
MaxHops or other directives in your
indexer.conf(5) file. The new directives define a different set of rules for storing document URLs and
therefore a different set of URLs. To find those URLs,
even unchanged documents have to be reindexed.
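
The two checks that this option disables can be sketched in Python as follows. The function names, the force parameter and the idea of a stored hex digest are assumptions made for the example; the page does not describe how indexer implements these checks internally.

```python
import hashlib


def request_headers(last_modified, force=False):
    """Send If-Modified-Since only when a date is known and force is off."""
    if force or last_modified is None:
        return {}
    return {"If-Modified-Since": last_modified}


def needs_reindex(body: bytes, stored_md5, force=False) -> bool:
    """Reindex when forced, or when the MD5 of the body has changed."""
    if force:
        return True
    return hashlib.md5(body).hexdigest() != stored_md5
```

With force set, the server is never asked for a conditional response and the hash comparison is skipped, so every fetched document is processed again.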

-q

Quick startup. This mode is useful if you haven't added or modified any
Server commands.
indexer will not insert the URLs given in Server commands into the database, which
speeds up startup somewhat.

Subsection control

-t tag

-u pattern

-s status

Set URL filters on
tag ,
pattern and
status respectively.

tag is a server tag that you can set arbitrarily in the config file
indexer.conf(5).

pattern is an SQL LIKE wildcard for the URL. In short, underscore (
_ ) means "any single character", per cent (
% ) means "any sequence of characters", and the comparison is case insensitive. For example,
indexer -u %izhcom.ru% will reindex all documents whose URLs contain the string "izhcom.ru".
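
To make the wildcard rules concrete, the following Python sketch translates a LIKE pattern into a regular expression. It is only an approximation for illustration: the names like_to_regex and like_match are hypothetical, escape characters are not handled, and the real comparison is performed by the database rather than by code like this.

```python
import re


def like_to_regex(pattern: str):
    """Translate % and _ wildcards into a case-insensitive regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def like_match(pattern: str, url: str) -> bool:
    """True when the whole url matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(url) is not None
```

Note that the pattern must match the whole URL, which is why a substring search needs % on both sides.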

status is a filter on the HTTP status that a document returned during its last reindexing.
For example,
-s0 is a filter for all documents that have not been indexed before,
-s200 is a filter for all documents that were retrieved with an "HTTP 200 OK" status,
and
-s301 is a filter for all documents that were retrieved with an "HTTP 301 Moved Permanently"
(redirect) status.
See the HTTP protocol specifications
for details on HTTP status codes and their respective meanings.

You can freely combine any number of
-t ,
-u and
-s options. Filters of the same class (tag, pattern, status) are combined
using logical OR, and filters of different classes are combined using
logical AND. That means, if you type
indexer -u %izhcom.ru% -u %udm.net% -t 1 -s 200 the documents to index will be those with tag 1 and HTTP status 200
whose URLs contain the string "izhcom.ru" or "udm.net".
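
The OR-within-a-class, AND-across-classes rule can be written down as a small Python predicate. The document dictionaries, the function names and the regex-based stand-in for SQL LIKE are assumptions for this sketch; indexer itself applies these filters in the database.

```python
import re


def like_match(pattern: str, text: str) -> bool:
    """Rough stand-in for SQL LIKE: % is any run, _ is one character."""
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern
    )
    return re.fullmatch(regex, text, re.IGNORECASE | re.DOTALL) is not None


def selected(doc, tags=(), patterns=(), statuses=()) -> bool:
    """OR inside each filter class, AND between classes; empty class = no filter."""
    tag_ok = not tags or doc["tag"] in tags
    url_ok = not patterns or any(like_match(p, doc["url"]) for p in patterns)
    status_ok = not statuses or doc["status"] in statuses
    return tag_ok and url_ok and status_ok


# Mirrors: indexer -u %izhcom.ru% -u %udm.net% -t 1 -s 200
filters = dict(tags=["1"], patterns=["%izhcom.ru%", "%udm.net%"], statuses=[200])
```

A document passes only if it satisfies every class that has at least one filter set, and within a class any one of the given values is enough.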

Ispell import

-L language

Set the language of the given files.
language is a two-letter ISO 639 language code (en, ru, de etc.). This parameter
is obligatory if you are importing ispell data.

-A affix_file

Import the given
affix_file into the database.

-D dict_file

Import the given
dict_file into the database.

Misc.

-C

Clear databases.

This will erase data previously collected by indexer from the UdmSearch
databases. You can use the options
-t ,
-u and
-s described above to select what you want to delete.

WARNING: Use this option with extreme caution!

-S

Show statistics.

This option outputs brief statistics on how many documents there are in the
database, their HTTP status, and how many documents are expired. You can use
the options
-t ,
-u and
-s described above to select which documents you want statistics on.

-I

Show referrers.

This option shows you the referrers of URLs or, in other words, all hyperlinks
from the document. You can use
the options
-t ,
-u and
-s described above to select which documents you want to show referrers for.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.