GENUKI Maintainers' Pages

Version 2.1

How the Spider Works

The starting point for analysis is the database file which lists all the
pages that comprise the GENUKI web site: the GENUKI page list. For
this purpose, the GENUKI web site consists of all pages held on the
genuki.org.uk server, and all pages in GENUKI sections held on other
servers and referenced from the genuki.org.uk server.

Analysis is done page by page using the GENUKI page list. It is only
during this stage of analysis that the spider uses web protocols to access
GENUKI pages. If there are any problems locating a GENUKI page at this
point, e.g., if a page in the GENUKI page list cannot be found because it
has been deleted, it will appear as a problem under the "Spider"
heading of the Problems report.

During analysis, the spider detects and checks the links on each page in
the GENUKI page list. For reasons of speed, and to avoid hitting websites
too often, it minimises web access.

For links to GENUKI pages, the spider doesn't use web access protocols.
Instead, it looks at local files for pages held at genuki.org.uk, and
at the copy saved at genuki.org.uk for every GENUKI page hosted
elsewhere (these copies are held for html checks but also used as a backup).
If such a file doesn't exist, a "404 Not Found" error is reported
under the "Spider" heading of the Problems report; a GENUKI page that no
longer exists is therefore only ever reported there.

For links to non-GENUKI pages, the spider uses another database table:
the non-GENUKI page list. Before attempting to locate a
non-GENUKI page, the spider searches the non-GENUKI page list for a
matching page address entry. If a link to this non-GENUKI page has
already been encountered, either on this run of the spider or a previous
run of the spider, the spider uses the entry in the non-GENUKI list and
does not use web access protocols to obtain the page again. If a link
to this page has not been encountered before, its page address entry
will not be in the list. In that case, the spider adds an entry to the list
for each non-GENUKI link, attempts to locate the page using web access
protocols, and records any consequent errors or redirects in the entry,
together with the date the entry was added.
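The cache-style lookup described above can be sketched like this. The real non-GENUKI page list is a database table; the dictionary, field names, and fetch function here are assumptions made purely for illustration.

```python
import datetime

# Hypothetical in-memory stand-in for the non-GENUKI page list (in the real
# spider this is a database table). Keys are page addresses; each entry
# records the result of the web access and the date it was added.
non_genuki_list = {}

def web_fetch_status(url):
    # Stand-in for a real web access; returns a status such as 200 or 404.
    return 200

def check_non_genuki_link(url):
    entry = non_genuki_list.get(url)
    if entry is not None:
        # Already encountered on this run or a previous one:
        # reuse the recorded result, with no further web access.
        return entry["status"]
    status = web_fetch_status(url)  # the one and only web access
    non_genuki_list[url] = {"status": status,
                            "added": datetime.date.today()}
    return status
```

A second call for the same address returns the recorded result without touching the web.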

However, a link to a non-GENUKI page can start failing after the spider
run that added its entry to the table, so the non-GENUKI list has to be
purged regularly. Without purging, links to such pages would be reported
as successes even though, in practice, they now fail. Entries in the
non-GENUKI page list for successful links are purged after 5 days, and
those for failed links after 3 days.
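The purge rule can be sketched as below. The retention periods (5 and 3 days) come from the text; the entry structure and the use of status 200 to mean "successful" are assumptions for illustration.

```python
# Sketch of the purge rule: entries for successful links expire after
# 5 days, entries for failed links after 3 days (entry structure assumed).
SUCCESS_DAYS = 5
FAILURE_DAYS = 3

def purge(entries, today):
    """entries maps url -> {"status": int, "added": date}; keep only
    entries younger than their retention period."""
    kept = {}
    for url, entry in entries.items():
        max_age = SUCCESS_DAYS if entry["status"] == 200 else FAILURE_DAYS
        if (today - entry["added"]).days < max_age:
            kept[url] = entry
    return kept
```

After a purge, the next spider run re-fetches the purged pages and records fresh results.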

There are some links to non-GENUKI pages that should never be checked,
e.g., those to "validator.w3.org" which provides html syntax
checking. Such links are avoided by creating a suitable entry in the non-GENUKI
page list with a date well into the future. This means that these entries
are never purged, the spider thinks they are always successful, and their
web site is not accessed.
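Such a "never check" entry might look like the sketch below: a successful entry dated far in the future, so no retention period ever expires for it. The field names and sentinel date are assumptions, not the spider's actual values.

```python
import datetime

# Sketch of a "never check" entry: recorded as a success with a date well
# into the future, so purging never removes it and the spider never
# contacts the site (field names and sentinel date are assumed).
never_check = {"status": 200, "added": datetime.date(9999, 12, 31)}

# Relative to any real date its age is negative, so any
# "purge after N days" test leaves it in place.
age_days = (datetime.date.today() - never_check["added"]).days
```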

During the analysis phase, when the spider is checking for links to
GENUKI pages, it notes the filename and address of each GENUKI page
referenced, and saves these in a list in the database at the end of
the spider run.

This list of GENUKI page names is used by the discovery process, which
checks whether each page is already known by looking it up in the GENUKI
page list, i.e., the list of pages that comprise the GENUKI web site. For
those that are unknown, i.e., new pages, it creates a page entry in the
GENUKI page list, chooses a section for it, and sets its mediatype and
type. New pages are therefore only checked and analysed on the next run
of the spider. The mediatype is set to one of html (except for cgi),
image, css, js, or other. Link and html checking is performed only on
html pages; for the other mediatypes the spider checks only that the page
exists. The type is set only for html files and is currently unused.
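The discovery step might be sketched as follows. The mediatype categories come from the text; the filename-extension mapping, the helper names, and the entry structure are assumptions (in particular, this sketch ignores the cgi exception and the choice of section).

```python
# Hypothetical sketch of discovery: any referenced page not already in the
# GENUKI page list gets a new entry, with a mediatype guessed from its
# filename (categories from the text; mapping and names are assumed).
def guess_mediatype(filename):
    name = filename.lower()
    if name.endswith((".html", ".htm")):
        return "html"
    if name.endswith((".gif", ".jpg", ".jpeg", ".png")):
        return "image"
    if name.endswith(".css"):
        return "css"
    if name.endswith(".js"):
        return "js"
    return "other"

def discover(referenced, page_list):
    """referenced is a list of (filename, address) pairs noted during
    analysis; unknown addresses get a new entry in page_list."""
    for filename, address in referenced:
        if address not in page_list:
            page_list[address] = {"filename": filename,
                                  "mediatype": guess_mediatype(filename)}
    return page_list
```

The new entries are not analysed immediately; they are picked up on the next spider run.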

The final complication is handling directories, particularly in sections
not hosted at genuki.org.uk. If a url ends in "/", the spider tries the
usual suspects (index.html, index.php, etc.) until one succeeds. For
directories on genuki.org.uk it looks directly at local files, but for
those hosted elsewhere it has to use web access protocols for each
candidate until one succeeds.
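For an externally hosted section, the search for an index file can be sketched like this. The candidate list and the fetch function are assumptions standing in for the spider's actual "usual suspects" and its web access code.

```python
# Sketch of resolving a url ending in "/" for a section hosted away from
# genuki.org.uk: try each candidate index file over the web until one
# succeeds (candidate list and fetch function are assumed).
CANDIDATES = ["index.html", "index.php", "index.htm"]

def resolve_directory(url, fetch):
    """fetch(url) returns True if the page exists; return the first
    candidate url that succeeds, or None if none does."""
    for name in CANDIDATES:
        if fetch(url + name):
            return url + name
    return None
```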

When a link into a section not held at genuki.org.uk lacks a trailing
slash, the spider cannot tell whether the link is to a file or a
directory. It fetches the url using web access protocols and then checks
whether the base address of the returned page differs from the one
requested. For a directory, the returned base address ends in "/", so the
spider can be sure it is dealing with a directory.