Thanks for the many suggestions from this list. The "hack" I got to work
for my application was to only index the first "paragraph" (loosely
defined), which shortened the description (and the relevant words were
generally near the top). Most descriptions were then the same length, which
evened out the problem of the big ones always winning.
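A minimal sketch of that first-paragraph trimming, assuming the descriptions are plain text with blank lines between paragraphs. The function name and the max_chars limit are made up for illustration; this is not part of swish-e, just a pre-processing step you would run before handing text to the indexer.

```python
import re

def first_paragraph(text: str, max_chars: int = 500) -> str:
    """Return the first blank-line-delimited paragraph, cut to max_chars."""
    # "Paragraph" is loosely defined: everything up to the first blank line.
    parts = re.split(r"\n\s*\n", text.strip(), maxsplit=1)
    # Collapse internal newlines and runs of spaces into single spaces.
    paragraph = " ".join(parts[0].split())
    return paragraph[:max_chars]
```

Capping the length as well as taking only the first paragraph keeps the indexed descriptions close to the same size, which is what evens out the ranking.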
I like the title repetition idea, too; I may try that next round (though the
bias adjustments should do that, and maybe they do, but it's still not in the
merged indexes).
Thanks again, everyone, for the ideas; it's been an interesting discussion!
Tac
-----Original Message-----
From: swish-e@sunsite3.berkeley.edu [mailto:swish-e@sunsite3.berkeley.edu]
On Behalf Of Bill Moseley
Sent: Friday, February 04, 2005 10:13 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Ranking, even with strong bias
On Fri, Feb 04, 2005 at 03:23:38AM -0800, Thomas R. Bruce wrote:
> Peter Karman wrote:
>
> >indexing as html will artificially inflate the number of occurrences
> >whenever a word matches in the <title>.
> >
> >
> This does help, but not enough for some applications. A real problem
> with relevance-ranked searches of collections of judicial opinions is
> that it's hard to force title weight high enough to overcome large
> numbers of term-occurrences in the body text.
Yes, that's the problem with our relatively simple ranking system. Once I
hacked rank.c to just not count word frequency over some reasonably small
number, and that kept the huge docs from always winning.
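To illustrate the idea of capping word frequency, here is a rough sketch in Python. It is not the code from rank.c and not swish-e's actual ranking formula; the cap, the title weight, and the function name are all assumptions chosen only to show the effect.

```python
def capped_rank(term_freq: int, title_hits: int = 0,
                freq_cap: int = 5, title_weight: int = 10) -> int:
    """Toy score: body occurrences stop counting once they reach freq_cap."""
    # A huge document with hundreds of hits contributes at most freq_cap.
    body_score = min(term_freq, freq_cap)
    # Title hits are weighted separately and are not capped here.
    return body_score + title_weight * title_hits
```

With these made-up numbers, a 100 page document with 200 body hits scores 5, while a short document with 3 body hits and one title hit scores 13, so the title match can finally win.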
Sometimes it's not that helpful to search for a term "foo" and be told that,
yes, it is in that 100 page document. So another approach is to split your
docs into smaller chunks and index them separately. And if you can link
into sections of your docs (like with URI #fragments) then your search
results are even more targeted. That can help with ranking a bit, but
doesn't help much if you are searching for a common term.
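A small sketch of the chunking approach, assuming plain text with blank lines between paragraphs. The "#chunk-N" fragment names are hypothetical; in practice you would use whatever anchors your documents really have, and feed each record to the indexer as its own document.

```python
def chunk_document(url: str, text: str, paras_per_chunk: int = 2) -> list:
    """Split text into groups of paragraphs, each with its own fragment URL."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for start in range(0, len(paragraphs), paras_per_chunk):
        number = start // paras_per_chunk + 1
        chunks.append({
            # Hypothetical anchor name; it must match an id in the real page.
            "url": f"{url}#chunk-{number}",
            "text": "\n\n".join(paragraphs[start:start + paras_per_chunk]),
        })
    return chunks
```

Each chunk is then ranked on its own, so a hit points at the relevant section rather than at the top of a very long document.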
Sounds like you need a better ranking system in general -- something that
tries to figure out what a document is *about*.
> Anyway, our cheap kludge for dealing with this is to run a title-only
> search separately and prepend those results to the hit list for
> full-text search. We tried jiggering the rankings as described in
> this thread and it helped, but not enough.
Does that mean if you have a word hit in the title then it will always be on
top of results without a word hit in the title? So a very common word in
the title would still bring it to the top?
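For reference, the title-first kludge as I read it could be sketched like this. It assumes each search returns a list of document paths already sorted by rank; the function name is made up, and this is a guess at the approach, not the actual implementation described above.

```python
def merge_title_first(title_hits: list, fulltext_hits: list) -> list:
    """Put title-only results first, then full-text results, without repeats."""
    seen = set()
    merged = []
    for path in list(title_hits) + list(fulltext_hits):
        # A document found by both searches keeps its earlier (title) position.
        if path not in seen:
            seen.add(path)
            merged.append(path)
    return merged
```

Written this way, every title hit does land above every body-only hit, however common the word is, which is the concern raised in the question.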
One thing I would suggest (not really related to above) is to use -T to dump
your index (of maybe a small set of files) and look over the words swish is
indexing. You might want to filter your queries of common words for your
corpus when they are not used in an explicit phrase search.
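A rough sketch of that kind of query filtering, assuming double quotes mark an explicit phrase. The COMMON_WORDS set is a placeholder; you would build the real list from the index dump for your own corpus, and keep operators such as "and", "or" and "not" out of it.

```python
import re

# Placeholder list; derive the real one from your own index dump.
COMMON_WORDS = {"the", "of", "court"}

def filter_query(query: str, common=COMMON_WORDS) -> str:
    """Drop common words from a query unless they are inside a quoted phrase."""
    # A token is either a double-quoted phrase or a run of non-space characters.
    tokens = re.findall(r'"[^"]*"|\S+', query)
    kept = [t for t in tokens if t.startswith('"') or t.lower() not in common]
    return " ".join(kept)
```

For example, filter_query('the "court of appeals" negligence court') gives '"court of appeals" negligence': the phrase is left alone and the stray common words are dropped.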
--
Bill Moseley
moseley@hank.org