OK, I think this makes some sense.
If I index http://www.misma.org/contact.html using the spider the TITLE is
set to "contact.html" in the swish index file.
HTTP Headers:
HTTP/1.1 200 OK
Date: Mon, 31 May 1999 10:22:55 GMT
Server: Apache/1.2.5
X-Server-CGI: PHP/3.0.7
X-Resource-Indicator:
X-Resource-Modified: 923650015
Expires: Tue, 01 Jun 1999 10:22:55 GMT
Cache-Control: post-check=43200,pre-check=86400
Last-Modified: 1999-04-09T09:26:55Z
Connection: close
Content-Type: text/html; charset=iso-8859-1
If I index http://localhost/test/contact.html using the spider the TITLE is
set to "Contacts - MiSMA..."
HTTP Headers:
HTTP/1.1 200 OK
Date: Mon, 31 May 1999 10:21:54 GMT
Server: Apache/1.3.6 (Win32)
Parser: PHP/3.0.6 (Win32)
Connection: close
Content-Type: text/html
If I index /my_documents/test/contact.html using the file system, the TITLE
is set to "Contacts - MiSMA..."
No HTTP Header Equivalents.
This is exactly the same file in all three cases, with Unix LF line endings
in each. I hacked my copy of swishspider to force it to index
text/html; charset=iso-8859-1. That Content-Type value appears to be the
only major difference that could affect the parsing. Something, somewhere,
doesn't recognize that it should be parsing that document with the HTML
parser: there seems to be other code that assumes anything not exactly
text/html isn't HTML. So forcing the spider to index the contents of
text/html; charset=... isn't enough on its own.
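
To illustrate the kind of check that would avoid this, here is a minimal
Python sketch (not swish or swishspider code; the function names are made up
for illustration). It assumes the fix is to compare only the media type and
ignore parameters such as charset, rather than comparing the whole header
value against "text/html":

```python
def media_type(content_type: str) -> str:
    """Return the bare media type, lowercased, with any parameters removed."""
    # Everything after the first ';' is a parameter (e.g. charset=...).
    return content_type.split(";", 1)[0].strip().lower()


def is_html(content_type: str) -> bool:
    """True for text/html regardless of charset or other parameters."""
    return media_type(content_type) == "text/html"


if __name__ == "__main__":
    # The two Content-Type values from the headers quoted above.
    for header in ("text/html; charset=iso-8859-1", "text/html"):
        print(header, "->", is_html(header))
```

An exact string comparison would treat the first value as non-HTML, which
matches the behaviour described above; comparing only the media type treats
both values the same way.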
To test this theory, I changed my Content-Type header on the misma.org
server. Sure enough, the titles are now indexed correctly. So this
appears to be the Content-Type 'feature' of that old Perl module.
I don't know if this helps anyone else, but I can at least hack
something to change my Content-Type header when swishspider visits a
document, until someone figures this out.
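
One way such a workaround might look, sketched in Python rather than in the
PHP the server actually runs: pick the Content-Type value based on the
requesting User-Agent. The "swishspider" token below is an assumption, not a
confirmed value; the string the spider really sends would need to be checked
in the server logs first.

```python
PLAIN_HTML = "text/html"
FULL_HTML = "text/html; charset=iso-8859-1"


def content_type_for(user_agent: str, spider_token: str = "swishspider") -> str:
    """Choose the Content-Type header value for a request.

    spider_token is a hypothetical substring of the spider's User-Agent;
    it is an assumption and should be replaced with the real value.
    """
    if spider_token.lower() in user_agent.lower():
        # The spider gets the bare media type it is known to handle.
        return PLAIN_HTML
    # Everyone else keeps the charset parameter.
    return FULL_HTML
```

Browsers would keep getting the charset parameter, and only requests that
look like the spider would see the bare text/html value.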
David Norris
World Wide Web - http://www.geocities.com/CapeCanaveral/Lab/1652/
Home Computer - http://illusionary.tzo.cc/
Page via mail - 412039@pager.mirabilis.com
ICQ Universal Internet Number - 412039
E-Mail - kg9ae@geocities.com