
HTTrack

Basically, it allows you to download the contents of an internet site to a local directory. It recursively builds the full directory structure, fetching HTML, images, and other files from the server and stashing them on your computer. The result is a static HTML image of the original site, even if the site was built with a database-driven, dynamic page tool.

I find it great for archiving copies of my sites before making major changes, or shutting them down.

Using HTTrack

There are versions of HTTrack for multiple OS environments. The one I use is for a standard Linux system. I have configured it to run from a script as a CRON task. The script reads a series of files that list small collections of web sites. It only processes one site at a time, to prevent overloading remote sites that are on shared servers. It stashes each collection in a designated directory on my local server for local backup and browsing.
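The wrapper described above could be sketched roughly like this. The directory layout, list-file names, and the dry-run `echo` are my assumptions, not the actual script; drop the `echo` to really invoke httrack.

```shell
#!/bin/sh
# Sketch of a cron wrapper: walk a set of link-list files and mirror
# one site at a time, so shared remote servers are not hammered.
mirror_lists() {
    lists_dir=$1; dest=$2
    for list in "$lists_dir"/LinkList-*; do
        [ -e "$list" ] || continue           # skip if no lists match
        while IFS= read -r site; do
            [ -n "$site" ] || continue       # skip blank lines
            # "echo" only prints the command; remove it to mirror for real.
            echo httrack "$site" -O "$dest/$(basename "$list")"
        done < "$list"
    done
}

# Hypothetical cron entry driving the script nightly:
#   30 2 * * * /usr/local/bin/mirror-sites.sh
mirror_lists "${1:-$HOME/httrack-lists}" "${2:-/srv/mirrors}"
```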

One of the nice features of the %L option is that HTTrack automatically builds an index of the site collections in the target folder.
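An invocation using a URL list looks roughly like this (the paths are illustrative, not the ones from my script):

```shell
# -%L reads the target URLs from a text file; -O sets the output directory.
# HTTrack also drops an index.html in the output folder linking each
# mirrored site, which makes local browsing easy.
httrack -%L LinkList-01 -O /srv/mirrors
```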

The file list (LinkList-01) is a simple list of targeted sites. I found that WordPress sites seem to like to be listed as “http://sob.boatswain.us/”, while my Mediawiki sites won’t work with that and need to be listed without the “http://” prefix, simply as “sysadm.equoria.com”.
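So, based on the two examples above, a list file is just one target per line:

```
http://sob.boatswain.us/
sysadm.equoria.com
```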

The user agent (-F) is explained in the next section.

User Agent 403 rejections

Many sites appear to reject requests based on the default User Agent identification.

Like a good boy, HTTrack identifies itself when it connects, and immediately gets rejected.

Using wget as a testing tool, you can see that it is the HTTrack User Agent that triggers the forbidden message.
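A quick way to reproduce this: send the same request twice with different User Agent strings. The UA below is what I believe HTTrack sends by default, and the URL is a placeholder for one of the rejecting hosts.

```shell
# -S prints the server's response headers; --spider skips the download.
# With an HTTrack-style User Agent, affected hosts answer 403 Forbidden:
wget -S --spider --user-agent="Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" http://example.com/

# The same request with a browser-style agent comes back clean:
wget -S --spider --user-agent="Mozilla/5.0" http://example.com/
```

This is what the `-F` option works around: it lets HTTrack present a browser-style User Agent string instead of its own.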