Verbose mode: Gives a running commentary on the program's attempts to read
data in various ways. As the amount of verbose output is substantial, the
-v option can now be followed by zero, one, or more of the following
flags (without a space) in order to differentiate the verbose output generated:

a: Anchor relevant information

b: Bindings to local file system

c: Cache trace

g: SGML trace

p: Protocol module information

s: SGML/HTML relevant information

t: Thread trace

u: URI relevant information

The -v option without any appended options shows all trace
messages. For example, "-vpt" shows protocol and thread
trace messages.

-version

Prints out the version number of the robot and the version number of libwww
and exits.

A rule file, a.k.a. configuration file, is a
set of rules and configuration options that can be used to map URLs, and to
set up other aspects of the behavior of the command line tool. Note that the
address must be specified as a URI - and in fact it can be located on HTTP
servers etc. as need be. File URIs are parsed relative to the current folder,
so a rule file address of "rules.conf" will point to a file in the
location where this tool is started. If it is a local file then the file suffix
must be ".conf" - otherwise the media type must be
application/x-www-rules.
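The address rules above can be sketched as follows; this is an illustration of
the resolution and suffix checks, not webbot's own code:

```python
# Sketch of the rule-file address rules (illustration only, not webbot
# source code): a relative file address resolves against the current
# folder, and a local rule file must carry the ".conf" suffix.
from pathlib import Path
from urllib.parse import urljoin

def resolve_rule_address(address):
    # An absolute URI (e.g. one located on an HTTP server) is used as given.
    if "://" in address:
        return address
    # A relative address is parsed relative to the current folder.
    return urljoin(Path.cwd().as_uri() + "/", address)

def acceptable_local_suffix(address):
    # A local rule file must end in ".conf".
    return address.endswith(".conf")
```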

These are some very simple constraints that can always be used when running
the webbot.

-depth [ n ]

Limit jumps to n hops from the start page. Links at level n-1 are checked
using a HEAD request. The default value is 0, which means that only
the start page is searched. A value of 1 causes the start page and
all pages directly linked from that start page to be checked.
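The depth semantics can be modelled like this (an illustrative sketch, not
webbot's implementation; the link map is a made-up example):

```python
# Sketch of the -depth semantics (illustration only): depth 0 searches
# only the start page, depth 1 adds every page the start page links to
# directly, and so on.
def pages_within_depth(start, links, depth):
    # links maps each page to the pages it references
    frontier, seen = {start}, {start}
    for _ in range(depth):
        frontier = {target for page in frontier
                    for target in links.get(page, ())
                    if target not in seen}
        seen |= frontier
    return seen
```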

-prefix [ URI ]

Define a URI prefix for all URIs - URIs that do not match the prefix
are not checked. The rejected URIs can be logged to a
separate file.
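The prefix constraint amounts to a simple string match; a sketch of the logic
(illustration only, with made-up example URIs, not webbot code):

```python
# Sketch of the -prefix constraint: a URI is checked only if it starts
# with the given prefix; the rest are rejected.
def passes_prefix(uri, prefix):
    return uri.startswith(prefix)

prefix = "http://example.org/docs/"            # made-up example prefix
uris = ["http://example.org/docs/a.html",
        "http://other.example.com/b.html"]
checked  = [u for u in uris if passes_prefix(u, prefix)]
rejected = [u for u in uris if not passes_prefix(u, prefix)]  # can be logged separately
```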

Using regular expressions requires that you link against a library handling
regular expressions - see the installation instructions for
details. Regular expressions give much finer control over the constraints -
both for deciding which URIs should be followed and for deciding
whether the webbot should use HEAD or GET when
checking the links.

-exclude [ regex ]

Allows you to define a regular expression matching
the URIs that should be excluded from the traversal. The rejected URIs can be logged to a separate file. This can be used to exclude
specific parts of the URI space, for example all URIs containing "/old/":
-exclude "/old/"

-check [ regex ]

Check all URIs that match this regular expression with a HEAD method
instead of a GET method. This can be used to verify links while
avoiding the download of large distribution files, for example: -check
"\.gz$|\.Z$|\.zip$".

-include [ regex ]

Allows you to define a regular expression matching
the URIs that should be included in the traversal.
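Taken together, the three regex options can be modelled roughly as follows.
This is a sketch of the selection logic described above, not webbot's actual
implementation:

```python
# Sketch of the -include/-exclude/-check logic (illustration only).
import re

def should_traverse(uri, include=None, exclude=None):
    if exclude and re.search(exclude, uri):
        return False            # -exclude: matching URIs are skipped
    if include and not re.search(include, uri):
        return False            # -include: non-matching URIs are skipped
    return True

def method_for(uri, check=None):
    # -check: matching URIs are verified with HEAD instead of GET
    return "HEAD" if check and re.search(check, uri) else "GET"
```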

By default, the webbot doesn't follow HTTP redirections - it only registers
them in the log files. With the
-redir option, however, it follows a redirection if the
redirected address fulfills the traversal
constraints.

-redir [ redirectioncode ]

Follow HTTP redirections. If no redirectioncode is given then follow all known
redirections (301, 302, 303, 307). If you just
want a single type of redirection to be followed then indicate that number as
the redirectioncode, for example -redir 302.
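The decision described above can be sketched as follows (illustration only,
not webbot code):

```python
# Sketch of the -redir decision: with no redirectioncode, all known
# redirections (301, 302, 303, 307) are followed; with one, only that
# status code is followed.
KNOWN_REDIRECTIONS = {301, 302, 303, 307}

def follow_redirection(status, redirectioncode=None):
    if redirectioncode is not None:
        return status == redirectioncode
    return status in KNOWN_REDIRECTIONS
```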

Using SQL-based logging requires that you have linked against a MySQL library - see the installation instructions for details. I like
the Web interface provided by www-sql, which makes it easy
to access the logged data. The data is stored in four tables within the same
database (the default name is "webbot"):

uris

An index that maps URIs to integers so that they are easier to refer to.

requests

Contains information from the request including the request-URI, the method,
and the resulting status code.

resources

Contains information about the resource, such as content-type, content-encoding,
expires, etc.

links

Contains information about which documents point to which documents, the type
of the link, etc. The type can either be implicit, like "referer" or
"image", or it can be explicit, like "stylesheet",
"toc", etc.
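The table layout can be pictured with a toy example. The sketch below uses
sqlite3 purely for illustration (webbot logs to MySQL), and every column name
in it is an assumption made for the example, not webbot's actual schema:

```python
# Hypothetical sketch of the uris/requests layout described above.
# All column names are assumptions for illustration; webbot's real
# schema may differ.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE uris     (id INTEGER PRIMARY KEY, uri TEXT);       -- URI-to-integer index
CREATE TABLE requests (uri_id INTEGER, method TEXT, status INTEGER);
""")
con.execute("INSERT INTO uris VALUES (1, 'http://example.org/')")
con.execute("INSERT INTO requests VALUES (1, 'GET', 200)")

# Join the integer index back to the readable URI for each request.
rows = con.execute("""
    SELECT u.uri, r.method, r.status
    FROM requests r JOIN uris u ON u.id = r.uri_id
""").fetchall()
```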

The command line options for handling the SQL logging are as follows:

-sqlserver [ srvrname ]

Specify the MySQL server. The default is "localhost".

-sqldb [ dbname ]

Specify the database to use. The default is "webbot". Note that
webbot creates its own set of tables for handling the logs.

-sqluser [ usrname ]

Use this to specify the user to connect to the database as. The
default is "webbot".

-sqlpassword [ usrpswd ]

Use this to specify the password needed to connect to the database. The
default is the empty string.

-sqlrelative [ relroot ]

If you want to make the URI entries in the database relative then you can
specify the root to which they should be made relative. This can, for example,
be used to build the database on a machine other than the one normally running
the service. On heavily loaded sites, it is often a good idea to have an
internal test server for building the database, as doing so does take
some resources.

-sqlexternals

Use this flag if you want links that were filtered out because they
didn't fulfill the constraints to be logged as well, in the same table as all
other URIs.

Note that if you are using SQL-based logging then a large set of statistics
can be drawn directly from the database.

-format [ file ]

Specifies a log file of which media types (content types)
were encountered in the run and their distribution.

-charset [ file ]

Specifies a log file of which charsets (content type
parameter) were encountered in the run and their distribution.

-hit [ file ]

Specifies a log file of URIs sorted by how many times they
were referenced in the run.

-lm [ file ]

Specifies a log file of URIs sorted by last modified date.
This gives a good overview of the dynamics of the web site that you are
checking.

-rellog [ file ]

Specifies a log file of any link relationship found in the HTML LINK
tag (either the REL or the REV
attribute) that has the relation specified in the -relation parameter
(all relations are modelled by libwww as "forward"). For example, "-rellog
stylesheets-logfile.txt -relation stylesheet" will produce a log file of
all link relationships of type "stylesheet". The format of the log file is

"<relationship> <media type> <from-URI> -->
<to-URI>"

meaning that the from-URI has the forward relationship
with to-URI.

-title [ file ]

Specifies a log file of URIs sorted by any title found
either as an HTTP header or in the HTML.

Specify the write delay in milliseconds for how long we can wait before
flushing the output buffer when using pipelining. The default value is 50 ms.
The longer the delay, the bigger the TCP packets, but also the longer the
response time.

Any further command line arguments are taken as keywords. Keywords can be
used as search tokens in an HTTP request-URI, encoded so that all spaces are
replaced with "+" and unsafe characters are encoded using the URI
"%xx" escape mechanism. An example of a search query is