I have been having a problem getting htdig to build a reasonable
database for a particular site. Specifically, the combined database
sizes were ending up on the order of 3 to 4 times larger than the entire
site. I believe I found the cause of this problem, and while not
technically a problem with htdig, I thought I would pass the information
on in the hope that it will save someone else a week of building broken
databases and reading debug output :)

While examining htdig's output using the -vvv option, I discovered that
htdig was issuing a lot of broken GET requests. Toward the end of the
run, they looked something like:

GET /index.html/queries/fyi/hosts/hosts/fyi/queries/hosts/queries/qpost.htm
GET /index.html/queries/fyi/hosts/hosts/fyi/queries/fyi/queries/qpost.htm
GET /index.html/queries/fyi/hosts/hosts/fyi/queries/queries/queries/qpost.htm

From the server's point of view, everything in the URL after index.html
is garbage, so the same page (index.html) is returned over and over.
Every relative link on that page then resolves against the ever-longer
path into a new, unique URL, which in turn causes htdig to fetch the
same index.html file yet again.
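The loop is easy to reproduce with standard URL resolution rules. This is just an illustrative sketch (example.com and the link target are hypothetical, not taken from the actual site): a relative link on a page served under a path-info-style URL resolves against the whole bogus path, so each crawl pass mints a new unique URL.

```python
from urllib.parse import urljoin

# The server happily serves index.html for anything after "/index.html",
# so htdig keeps receiving the same page at ever-deeper "paths".
base = "http://example.com/index.html/queries/fyi/"

# A relative link on that page (hypothetical target) resolves against
# the full bogus base path, producing a brand-new unique URL.
resolved = urljoin(base, "queries/qpost.htm")
print(resolved)
# -> http://example.com/index.html/queries/fyi/queries/qpost.htm
```

Each fetch of the "new" URL returns the same index.html with the same relative links, so the paths grow without bound.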

As far as I can tell, the meltdown originates with a small syntax error
in one user's page. This user had a link that looked like...

I am in the process of trying to crawl the site again with .html/ and
.htm/ added to the exclude_urls attribute. On the off chance that this
doesn't work, does anyone have other ideas about how to avoid this
problem? Well, short of validating thousands of pages contributed by
dozens of people? ;)
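For reference, the workaround I am trying is just an extension of the stock exclude_urls line in the htdig configuration file; patterns there are matched as simple substrings against each candidate URL, so ".html/" and ".htm/" should catch any URL where a page name is being treated as a directory. (The /cgi-bin/ and .cgi entries are htdig's defaults; the rest is my addition.)

```
exclude_urls:   /cgi-bin/ .cgi .html/ .htm/
```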

Thanks.

Jim Cole