SYNOPSIS

linkchecker [options] [file-or-url]...

DESCRIPTION

LinkChecker features:

• recursive and multithreaded checking and site crawling,
• output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats,
• support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links,
• restriction of link checking with URL filters,
• proxy support,
• username/password authorization for HTTP, FTP and Telnet,
• support for the robots.txt exclusion protocol,
• support for Cookies,
• support for HTML5,
• HTML and CSS syntax check,
• antivirus check,
• a command line, GUI and web interface.

EXAMPLES

The most common use checks the given domain recursively, plus any
URL pointing outside of the domain:

  linkchecker http://www.example.net/

Beware that this checks the whole site, which can have thousands of
URLs. Use the -r option to restrict the recursion depth.

Don't check mailto: URLs; all other links are checked as usual:

  linkchecker --ignore-url=^mailto: mysite.example.org

Checking a local HTML file on Unix:

  linkchecker ../bla.html

Checking a local HTML file on Windows:

  linkchecker c:\temp\test.html

You can skip the http:// URL part if the domain starts with www.:

  linkchecker www.example.com

You can skip the ftp:// URL part if the domain starts with ftp.:

  linkchecker -r0 ftp.example.org

Generate a sitemap graph and convert it with the graphviz dot utility:

  linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS

General options

-fFILENAME, --config=FILENAME

Use FILENAME as configuration file. By default LinkChecker
uses ~/.linkchecker/linkcheckerrc.

-h, --help

Help me! Print usage information for this program.

--stdin

Read a list of whitespace-separated URLs to check from stdin.
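
For example, to check a list of URLs stored in a file (the filename
is illustrative):

  cat urls.txt | linkchecker --stdin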

-tNUMBER, --threads=NUMBER

Generate no more than the given number of threads. The default number
of threads is 100. To disable threading, specify a non-positive number.
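
For example, to run with at most ten threads (the URL is illustrative):

  linkchecker -t10 www.example.com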

-V, --version

Print version and exit.

Output options

--check-css

Check syntax of CSS URLs with the W3C online validator.

--check-html

Check syntax of HTML URLs with the W3C online validator.

--complete

Log all URLs, including duplicates. Default is to log duplicate URLs only once.

-DSTRING, --debug=STRING

Print debugging output for the given logger.
Available loggers are cmdline, checking,
cache, gui, dns and all.
Specifying all is an alias for specifying all available loggers.
The option can be given multiple times to debug with more
than one logger.
For accurate results, threading will be disabled during debug runs.
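
For example, to print debug output of the checking logic (the URL is
illustrative):

  linkchecker -Dchecking http://www.example.net/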

-FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]

Output to a file linkchecker-out.TYPE,
$HOME/.linkchecker/blacklist for
blacklist output, or FILENAME if specified.
The ENCODING specifies the output encoding, the default is
that of your locale.
Valid encodings are listed at
http://docs.python.org/library/codecs.html#standard-encodings.
For the none output type, the FILENAME and ENCODING parts are
ignored; otherwise, if the file already exists, it will be overwritten.
You can specify this option more than once. Valid file output types
are text, html, sql,
csv, gml, dot, xml, sitemap, none or
blacklist.
Default is no file output. The various output types are documented
below. Note that you can suppress all console output
with the option -o none.
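
For example, assuming the syntax shown above, a run writing a CSV
report to a custom file (encoding and filename are illustrative):

  linkchecker -Fcsv/utf-8/results.csv www.example.com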

--no-status

Do not print check status messages.

--no-warnings

Don't log warnings. Default is to log warnings.

-oTYPE[/ENCODING], --output=TYPE[/ENCODING]

Specify output type as text, html, sql,
csv, gml, dot, xml, sitemap, none or
blacklist.
Default type is text. The various output types are documented
below.
The ENCODING specifies the output encoding, the default is
that of your locale. Valid encodings are listed at
http://docs.python.org/library/codecs.html#standard-encodings.
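
For example, to print HTML output in UTF-8 encoding and capture it in
a file (the redirect target is illustrative):

  linkchecker -ohtml/utf-8 www.example.com > report.html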

-q, --quiet

Quiet operation, an alias for -o none.
This is only useful with -F.

--scan-virus

Scan content of URLs for viruses with ClamAV.

--trace

Print tracing information.

-v, --verbose

Log all checked URLs. Default is to log only errors and warnings.

-WREGEX, --warning-regex=REGEX

Define a regular expression that causes a warning to be printed if it
matches any content of the checked link.
This applies only to valid pages, so we can get their content.
Use this to check for pages that contain some form of error, for example
"This page has moved" or "Oracle Application error".
Note that multiple values can be combined in the regular expression,
for example "(This page has moved|Oracle Application error)".
See section REGULAR EXPRESSIONS for more info.
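
For example, using the combined pattern from above (the URL is
illustrative):

  linkchecker --warning-regex="(This page has moved|Oracle Application error)" www.example.com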

--warning-size-bytes=NUMBER

Print a warning if content size info is available and exceeds the given
number of bytes.
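
For example, to warn about content larger than 100000 bytes (the
threshold is illustrative):

  linkchecker --warning-size-bytes=100000 www.example.com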

Checking options

-a, --anchors

Check HTTP anchor references. Default is not to check anchors.
This option enables logging of the warning url-anchor-not-found.

-C, --cookies

Accept and send HTTP cookies according to RFC 2109. Only cookies
which are sent back to the originating server are accepted.
Sent and accepted cookies are provided as additional logging
information.

PROXY SUPPORT

To use a proxy on Unix or Windows set the $http_proxy, $https_proxy or
$ftp_proxy environment variables to the proxy URL. The URL should be of
the form http://[user:pass@]host[:port].
LinkChecker also detects manual proxy settings of Internet Explorer on
Windows systems. On a Mac use the Internet Config to select a proxy.
You can also set a comma-separated domain list in the $no_proxy
environment variable to ignore any proxy settings for these domains.
Setting an HTTP proxy on Unix for example looks like this:

  export http_proxy="http://proxy.example.com:8080"

Proxy authentication is also supported:

  export http_proxy="http://user1:mypass@proxy.example.org:8081"

Setting a proxy on the Windows command prompt:

  set http_proxy=http://proxy.example.com:8080
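
Ignoring any proxy for specific domains (the domain list is
illustrative):

  export no_proxy="example.com,internal.example.org"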

PERFORMED CHECKS

All URLs have to pass a preliminary syntax test. Minor quoting
mistakes will trigger a warning; all other invalid syntax issues
are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.

HTTP links (http:, https:)

After connecting to the given HTTP server the given path
or query is requested. All redirections are followed, and
if user/password is given it will be used as authorization
when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.
HTML page contents are checked for recursion.

Local files (file:)

A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device files,
unreadable or non-existing files are errors.
HTML or other parseable file contents are checked for recursion.

Mail links (mailto:)

A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list will fail.
For each mail address we check the following things:
1) Check the address syntax, both of the part before and after
the @ sign.
2) Look up the MX DNS records. If we find no MX record,
print an error.
3) Check if one of the mail hosts accepts an SMTP connection.
Check hosts with higher priority first.
If no host accepts SMTP, we print a warning.
4) Try to verify the address with the VRFY command. If we get
an answer, print the verified address as an info.
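
These steps can be reproduced by hand with standard tools; a rough
sketch, assuming the dig utility is available and that mail.example.org
is the mail host found in step 2 (both illustrative):

  # step 2: look up the MX records for the domain
  dig +short MX example.org
  # steps 3 and 4: connect to the mail host and, after a HELO greeting,
  # type VRFY user@example.org to ask the server to verify the address
  telnet mail.example.org 25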

FTP links (ftp:)

For FTP links we do:

1) connect to the specified host
2) try to log in with the given user and password. The default
user is anonymous, the default password is anonymous@.
3) try to change to the given directory
4) list the file with the NLST command

Telnet links (telnet:)

We try to connect and if user/password are given, login to the
given telnet server.

NNTP links (news:, snews:, nntp:)

We try to connect to the given NNTP server. If a news group or
article is specified, try to request it from the server.

Unsupported links (javascript:, etc.)

An unsupported link will only print a warning. No further checking
will be made.

The complete list of recognized, but unsupported links can be found
in the linkcheck/checker/unknownurl.py source file.
The most prominent of them are JavaScript links.

RECURSION

Before descending recursively into a URL, it has to fulfill several
conditions. They are checked in this order:

1. A URL must be valid.

2. A URL must be parseable. This currently includes HTML files,
Opera bookmarks files, and directories. If a file type cannot
be determined (for example it does not have a common HTML file
extension, and the content does not look like HTML), it is assumed
to be non-parseable.

3. The URL content must be retrievable. This is usually the case
except for example mailto: or unknown URL types.

4. The maximum recursion level must not be exceeded. It is configured
with the --recursion-level option and is unlimited by default.

5. It must not match the ignored URL list. This is controlled with
the --ignore-url option (see the example at the end of this section).

6. The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by searching for a
"nofollow" directive in the HTML header data.

Note that the directory recursion reads all files in that
directory, not just a subset like index.htm*.
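
For example, a run combining conditions 4 and 5 above (domain and
pattern are illustrative):

  linkchecker -r2 --ignore-url=/private/ www.example.com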

NOTES

URLs on the command line starting with ftp. are treated like
ftp://ftp., URLs starting with www. are treated like
http://www..
You can also give local files as arguments.

If you have your system configured to automatically establish a
connection to the internet (e.g. with diald), it will connect when
checking links not pointing to your local host.
Use the --ignore-url option to prevent this.

JavaScript links are not supported.

If your platform does not support threading, LinkChecker disables it
automatically.

You can supply multiple user/password pairs in a configuration file.

When checking news: links the given NNTP host doesn't need to be the
same as the host of the user browsing your pages.