Google Python Sitemap Generator - Introduction

The free Python Google sitemap generator can be used to create Google Sitemaps in XML format by walking the file system on the web server and scanning access logs. It requires Python version 2.2 (or compatible newer versions) installed on your server.

It is important to use the latest version, because of bug fixes and improvements. At the moment (9 January 2005) the latest version is 1.4. The sitemap generator is written for Python version 2.2 and does not work with older versions. Python can be downloaded from python.org.

Short description

The sitemap generator collects URLs by walking the file system on the web server and by reading access log files. The resulting sitemap is an XML file, either compressed or uncompressed, in the format specified by the Google Sitemap Protocol, with full XML header.

URLs of dynamically generated pages might not appear in the resulting sitemap if the generator uses only file system walking, since it will find only the URLs of the script files used to generate those pages.

Repeated runs of the sitemap generator that read access log files can be used to update and enlarge the resulting sitemap. If the number of collected URLs exceeds the maximum of 50,000 per sitemap file, the generator creates additional sitemap files and a sitemap index file, which has to be submitted from the Google sitemap account panel. See the Google sitemaps group thread "Sitemap gen apache log technique coupled with already existing sitemap", and the description below of the sitemap node of the config.xml file.

When reading access log entries, the sitemap generator includes in the sitemap only the URLs that return HTTP response status 200 (OK). To avoid the inclusion of non-existent URLs, the website must therefore be set up to return HTTP response status 404 (Not Found) for non-existent URLs, not a redirection to a page that returns HTTP status 200 (OK).
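The status check described above can be sketched as follows. This is only an illustration in modern Python 3, not the code of sitemap_gen.py. It assumes log lines in the Common Log Format, and the function name and the sample log lines are hypothetical.

```python
import re

# Picks the requested path and the status code out of a Common Log Format
# line such as:
# 1.2.3.4 - - [10/Oct/2005:13:55:36 +0000] "GET /a.html HTTP/1.0" 200 2326
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')


def paths_with_status_200(lines):
    """Return the requested paths whose response status was 200 (OK)."""
    paths = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "200":
            paths.append(match.group("path"))
    return paths


sample = [
    '1.2.3.4 - - [10/Oct/2005:13:55:36 +0000] "GET /a.html HTTP/1.0" 200 2326',
    '1.2.3.4 - - [10/Oct/2005:13:55:40 +0000] "GET /gone.html HTTP/1.0" 404 209',
]
print(paths_with_status_200(sample))
```

With a server that answers missing pages with a redirect to a 200 page instead of a 404, the second line would carry a redirect status and the target page would later be logged with 200, which is how non-existent URLs can end up being counted.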

When the generator uses only file system walking, the elements included in the sitemap for each URL are, besides the full URL, lastmod with a value given by the file time stamp (GMT), and priority with a default value of 0.5.
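As a rough sketch of the entry built for a file found during the walk, assuming modern Python 3 and a hypothetical helper name (the exact timestamp format written by sitemap_gen.py is not reproduced here, the example uses plain ISO 8601 in UTC):

```python
import os
import tempfile
import time


def entry_for_file(path, url):
    """Build a sitemap entry for one file found during a file system walk."""
    mtime = os.path.getmtime(path)
    # lastmod comes from the file time stamp, expressed in GMT.
    lastmod = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(mtime))
    # priority gets the default value of 0.5 when only the walk is used.
    return {"loc": url, "lastmod": lastmod, "priority": 0.5}


# Hypothetical usage with a temporary file standing in for a web page.
with tempfile.NamedTemporaryFile(suffix=".html") as tmp:
    print(entry_for_file(tmp.name, "http://www.example.com/index.html"))
```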

If the generator uses access log files, then the priority value is given by the frequency with which a URL appears in the access logs. If the generator uses only access logs, without a file system walk, file time stamps are unavailable and so there are no lastmod elements in the resulting sitemap.
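The exact formula that the generator uses to turn hit counts into priorities is not given here. One plausible scheme, shown purely as an assumption, divides each URL's hit count by the highest count, so that the most requested URL gets priority 1.0. The function name is hypothetical.

```python
from collections import Counter


def priorities_from_hits(paths):
    """Map each path to a priority proportional to how often it was requested."""
    counts = Counter(paths)
    if not counts:
        return {}
    top = max(counts.values())
    # Assumed scheme: scale by the most frequent path, not the real formula.
    return {path: round(count / top, 4) for path, count in counts.items()}


print(priorities_from_hits(["/a.html", "/a.html", "/b.html"]))
```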

The value for the changefreq element can be specified individually for each URL by using the url or urllist nodes in the config.xml file. As far as I know, it cannot be specified at once for all URLs in a website.

The information specific to each website, like the name of the sitemap file, the domain URL necessary for building the canonical URLs in the sitemap, etc. is contained in a configuration file in XML format, usually called config.xml.

The script obtains the name of the config.xml file from the command line. For example, a command to run the generator from the same directory as sitemap_gen.py can be

$ python sitemap_gen.py --config=/path/config.xml

where /path/config.xml is the path name of the configuration file. On a UNIX/Linux server, the path name of the current directory can be found from a command window with the pwd command. A relative path name can also be used, so if sitemap_gen.py and config.xml are in the same directory, config.xml alone can be used in the example above as the path name of the configuration file.

Search engine notification and suppression of it for testing

After creating the sitemap file, the generator notifies Google by default using the ping method (the sitemap has to be submitted from the Google Sitemap Account). It is possible to suppress this search engine notification either from the command line by using the --testing argument, or from the config.xml file by using the suppress_search_engine_notify attribute of the site root node.

An example of suppressing search engine notification from the command line, from the same directory as sitemap_gen.py, is

$ python sitemap_gen.py --config=/path/config.xml --testing

The config.xml file

The distribution package from SourceForge.net contains an example configuration file, example_config.xml, with very good comments and explanations.

The generator script processes the config.xml file using the SAX paradigm. SAX is an acronym for Simple API for XML and refers to sequential, event-based parsing of an XML document: the script processes each XML element as it is encountered in the stream represented by the XML document. The config.xml file has the following nodes with attributes.

The site node is the single root node, which contains all the other nodes, and specifies via its attributes the domain URL and the path name for the resulting sitemap file. The first XML tag in the config.xml file is the opening tag of the site node and the file ends with the closing tag </site> for this root node.

The site node has two required attributes, base_url for the domain URL used in canonicalization of the URLs collected for the sitemap either from the walk of the web server file system or from scanning access log files, and the store_into attribute for the path name of the resulting sitemap XML file. This resulting sitemap file can be uncompressed, with a .xml file name extension, or compressed, with a .xml.gz file name extension.

Note that a bug in generating the compressed sitemap file was fixed in version 1.4 of this Python Google generator, so it is important to check that you are using the latest version.

The site node also has some optional attributes, which specify the level of detail in the diagnostic output that the script gives, suppression of notification to search engines (similar to the --testing command-line argument), and the character encoding to use for URLs and file paths.

The directory nodes specify via attributes the path name of the directory where the walk of the file system on the web server starts. If URLs are dynamically generated by a CGI script file, then only the URL of that script file is added to the sitemap, without the URLs dynamically generated by query strings. In this case it is also necessary to use accesslog nodes to scan access log files, if available.

A directory node has two required attributes, path for the directory path name and url for the URL corresponding to that path name. There is also the optional attribute default_file, which names the index file or default file for directory URLs. Setting a default file, for example

<directory path="/var/www/docroot" url="http://www.example.com/" default_file="index.html" />

causes URLs of the default files of that name in the specified directory and its subdirectories to be suppressed (when URLs are collected by using only file system walking on the server).

URLs to directories will have the lastmod date taken from the default file rather than from the directory itself (as explained by Google Employee in the July 2005 Google Sitemap groups thread "Sitemap_gen.py v1.2").

If default_file is not specified, then both the URL to the directory and to the default file will be included in the sitemap, even though they represent the same document.
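To illustrate the event-based processing of the site and directory nodes described above, here is a minimal sketch using the xml.sax module of a modern Python 3 standard library. It is not the handler used by sitemap_gen.py. The class name, the sample paths and the sample URLs are hypothetical, while the attribute names are those given in this article.

```python
import xml.sax

CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<site base_url="http://www.example.com/"
      store_into="/var/www/docroot/sitemap.xml.gz">
  <directory path="/var/www/docroot"
             url="http://www.example.com/"
             default_file="index.html" />
</site>
"""


class ConfigHandler(xml.sax.ContentHandler):
    """Record every element and its attributes in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def startElement(self, name, attrs):
        # Called once per opening tag, as the parser reaches it in the stream.
        self.nodes.append((name, dict(attrs.items())))


handler = ConfigHandler()
xml.sax.parseString(CONFIG, handler)
for name, attrs in handler.nodes:
    print(name, attrs)
```

The handler sees the site node first and the directory node second, in the order in which they appear in the file, which is the sequential behaviour that the SAX approach implies.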

The sitemap nodes tell the script to scan other Sitemap files. Each has one required attribute, the path to the sitemap file. This makes it possible to repeat readings of the access log files in order to update the resulting sitemap files.

After a first run of the sitemap generator without a sitemap node in the config.xml file, a sitemap node whose attribute is the path to the current sitemap file can be added for further runs that use accesslog nodes to scan the access log files. This creates a feedback loop, and each iteration improves the sitemap. If the collected URLs exceed the maximum number for a sitemap file (50,000), then the sitemap generator script creates new sitemap files and a sitemap index file.
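The splitting at the 50,000 URL limit can be pictured with the short sketch below. It only shows the arithmetic of dividing the collected URLs into chunks. The function name and the sample URLs are hypothetical, and how sitemap_gen.py names the extra files and writes the index is not shown.

```python
MAX_URLS_PER_SITEMAP = 50000


def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a URL list into chunks that each fit into one sitemap file."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


urls = ["http://www.example.com/page%d.html" % n for n in range(120000)]
chunks = split_into_sitemaps(urls)
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```

Whenever there is more than one chunk, a sitemap index file listing the individual sitemap files is needed as well.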

The url and urllist nodes can be used to specify URLs with their lastmod, priority and changefreq attributes for addition to the resulting sitemap file.

The url nodes have one required attribute, the URL itself, and three optional attributes: lastmod, changefreq and priority.

The urllist nodes name text files with lists of URLs and the nodes have one required attribute, the path to the file.

These text files with URL lists contain one URL per line. A line can consist of several space-delimited columns, where after a URL that is mandatory, attributes can follow in the form key=value for lastmod, changefreq and priority.

An example file, example_urllist.txt, is included in the distribution package.
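The line format of these URL list files can be parsed roughly as follows. This is a sketch in modern Python 3 with a hypothetical function name, based only on the description above, and the sample line is invented for illustration.

```python
def parse_urllist_line(line):
    """Split one URL list line into the URL and its key=value attributes."""
    columns = line.split()
    if not columns:
        return None  # blank line, nothing to add
    url, attributes = columns[0], {}
    for column in columns[1:]:
        key, sep, value = column.partition("=")
        # Keep only the three attributes named in this article.
        if sep and key in ("lastmod", "changefreq", "priority"):
            attributes[key] = value
    return url, attributes


line = "http://www.example.com/a.html lastmod=2005-07-01 changefreq=weekly priority=0.8"
print(parse_urllist_line(line))
```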

The generator discards URLs that do not start with the domain's URL, but it does not check if a URL exists on the server.

If the url or urllist nodes specify URLs with the correct base URL, but that have never been on the server, then these URLs are included in the sitemap.

The filter nodes specify patterns that the script compares against all URLs it finds. There are drop filters that cause exclusion of matching URLs and pass filters that cause inclusion of matching URLs.

If no filter at all matches a URL, the URL will be included. Filters are applied in the order specified, and a pass filter short-circuits any later filters that might also match.
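The filter logic can be sketched as below. The sketch assumes that the first matching filter decides, for drop filters as well as for pass filters, and it uses shell-style wildcard patterns from the fnmatch module for brevity. The function name, the (action, pattern) representation and the sample patterns are hypothetical and do not reproduce the syntax of the filter nodes in config.xml.

```python
from fnmatch import fnmatchcase


def is_included(url, filters):
    """Apply (action, pattern) filters in order; the first match decides."""
    for action, pattern in filters:
        if fnmatchcase(url, pattern):
            # "pass" keeps the URL, "drop" excludes it; later filters are skipped.
            return action == "pass"
    return True  # no filter matched, so the URL is included


filters = [
    ("pass", "http://www.example.com/docs/*"),
    ("drop", "*.bak"),
]
for url in (
    "http://www.example.com/docs/old.bak",
    "http://www.example.com/old.bak",
    "http://www.example.com/index.html",
):
    print(url, is_included(url, filters))
```

Because the pass filter comes first, the .bak file under /docs/ is kept even though the later drop filter would also match it, which shows why the order of the filters matters.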

Conclusions

The free Python Google generator is relatively easy to use, and no knowledge of Python is necessary. The information and sitemap requirements specific to a website can be easily included in the configuration file by using the well-commented example_config.xml file which comes with the generator.

There are some things in the current version 1.4 that I think could be improved in future versions. For example, non-existent URLs can be included by mistake in the sitemap, as long as they have the correct base URL, via the url or urllist nodes.

Also, when access logs are used in creating the sitemap, if a URL has been removed during the logged interval, such that it appears in the same access log file at first with HTTP response status 200 (OK) and later with 404 (Not Found), it will still be included in the sitemap.

Another thing is that I cannot see a way of specifying the changefreq at once for all URLs in the sitemap, perhaps with globbing. The changefreq element has to be specified, if used, for individual URLs via the url or urllist nodes.