While waiting for a reply to my first email, we have been running more
tests to isolate the problem with swish-e indexing our pages. Here is the
latest set of facts.
As I said before, with obeyRobotsNoIndex off, swish-e indexes all of the
pages in the directory tree. With obeyRobotsNoIndex set to yes, it still
appears to index all of the pages properly, but a legitimate search does not
return all of them. We have tried -T INDEXED_WORDS and other tests, and the
pages are apparently being indexed, just not returned from the search.
Now we have more clues.
Apparently, on our system, Swish-e suppresses from the results of a search
all valid documents that *precede* a document that has the "noindex" tag.
So even if page1.html does not have a "noindex" tag, if one of its indexed
words is in a document indexed later, say page2.html, which *does* have a
"noindex" tag, then page1.html will not be returned for that indexed word
either. If the very last file indexed by swish-e is one that has the
"noindex" tag, then a search for any word that happens to be in that last
document will return no results at all from any pages.
To lay it out more clearly:
Say the word "testword" is in three documents: sample1.html, sample2.html
and sample3.html. Say that swish-e indexes them in that exact order.
If sample2.html has the tag <meta name="robots" content="noindex">, then in
the INDEXED_WORDS output, both sample1.html and sample3.html appear in the
index under "testword". However, if you run the search:

    swish-e -w testword -f swish_test.index

you will only get sample3.html returned from the search. If, however, the
"noindex" tag appears in sample1.html and not in sample2.html, then you
will get sample2.html and sample3.html returned in the search. If the
"noindex" tag appears in sample3.html only, then you will get no results
from the search for "testword". With obeyRobotsNoIndex off, you will get
all three no matter where the noindex tag is.
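
For anyone who wants to try this, the three-file case above can be set up
roughly as follows. This is only a sketch: the scratch directory name, the
WORK and NOINDEX_FILE variables, the test.conf file name, and the page
contents are all made up for illustration, and the config is a trimmed-down
version of the one below. It also does not force the indexing order, so check
the indexing report to see which order swish-e actually used.

```shell
#!/bin/sh
# Hypothetical reproduction sketch; paths and variable names are assumptions.
set -eu

WORK="${WORK:-./swish_noindex_test}"          # scratch directory (assumed name)
NOINDEX_FILE="${NOINDEX_FILE:-sample2.html}"  # page that carries the noindex tag

mkdir -p "$WORK/docs"

# Write three pages that all contain "testword".
# Only $NOINDEX_FILE gets the robots meta tag.
for name in sample1.html sample2.html sample3.html; do
    if [ "$name" = "$NOINDEX_FILE" ]; then
        robots='<meta name="robots" content="noindex">'
    else
        robots=''
    fi
    cat > "$WORK/docs/$name" <<EOF
<html>
<head>
<title>$name</title>
$robots
</head>
<body><p>testword appears in $name</p></body>
</html>
EOF
done

# Minimal config, using only directives taken from the full config below.
cat > "$WORK/test.conf" <<EOF
IndexDir $WORK/docs
IndexContents HTML2 .html .htm
DefaultContents HTML2
IndexOnly .html .htm
IndexFile $WORK/swish_test.index
obeyRobotsNoIndex yes
EOF

# Index and search only if swish-e is installed, so the fixture step
# can also be run on its own.
if command -v swish-e >/dev/null 2>&1; then
    # Index, printing words as they are indexed.
    swish-e -c "$WORK/test.conf" -T INDEXED_WORDS
    # Search the resulting index for the shared word.
    swish-e -w testword -f "$WORK/swish_test.index"
else
    echo "swish-e not found; fixtures written to $WORK" >&2
fi
```

Setting NOINDEX_FILE=sample1.html or NOINDEX_FILE=sample3.html before running
the script gives the other two cases described above.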
This behavior was present in the Swish-e from July and also in the one
available on Sept. 18th. Our setup is the same as described in my email of
a few days ago (Solaris 2.8, libxml2 2.4.22, etc.). Below is our test
configuration file. Any help would be appreciated.

Vernon

####################################################
#
# Swish-e configuration file
#####################################################
################
#WHAT TO INDEX
################
# DIRECTORIES TO INDEX
IndexDir /usr/local/apache/htdocs/data
IndexDir /usr/local/apache/htdocs/resources
# TYPES OF DOCS TO INDEX
IndexContents HTML2 .html .htm
DefaultContents HTML2
#INDEX ONLY FILES WITH THESE EXTENSIONS
IndexOnly .html .htm
################
# INDEX DETAILS
################
#VERBOSITY LEVEL OF FEEDBACK WHEN INDEXING
IndexReport 3
ParserWarnLevel 2
#WHERE TO PLACE THE INDEX FILE
IndexFile /usr/local/apache/htdocs/swish_test.index
####################
#WHAT *NOT* TO INDEX
####################
obeyRobotsNoIndex yes
FollowSymLinks no
# TYPES OF DOCS NOT TO INDEX
NoContents .doc .gif .js .pdf .php .txt .xml
###########################################################
# MetaNames for both special searching and property set up
###########################################################
MetaNames subject
MetaNames description
###########################################
# Properties to be returned in the results
###########################################
StoreDescription HTML <description> 200
PropertyNameAlias swishdescription description
#########################################################
# Replacement in the URL for purposes of results display
#########################################################