You're so right! I'm totally new to Lucene (and text analysis,
searching, etc.), but now that you showed me, I "get it". Thank you so
much for your reply.
Chad
On Aug 8, 2006, at 12:45 AM, Chris Hostetter wrote:
>
> I've never used MoreLikeThis myself, but based on how I know it works,
> your problem probably has more to do with the size of your test corpus
> and the frequency of the words in your docs than with the size of the
> docs themselves.
>
> : There's still the issue of the queries from MoreLikeThis not
> : returning results for terms I had expected ("bikes").
>
> A quick glance at the source for MoreLikeThis turns up these lines...
>
> /**
> * Ignore terms with less than this frequency in the source doc.
> * @see #getMinTermFreq
> * @see #setMinTermFreq
> */
> public static final int DEFAULT_MIN_TERM_FREQ = 2;
>
> /**
> * Ignore words which do not occur in at least this many docs.
> * @see #getMinDocFreq
> * @see #setMinDocFreq
> */
> public static final int DEFALT_MIN_DOC_FREQ = 5;
>
> ...which I'm guessing means that unless a word appears in a doc at
> least twice, it's ignored for that doc, and unless a word appears in at
> least 5 docs, it's ignored completely. That could easily explain your
> bike examples.
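
To make the two thresholds concrete, here is a minimal sketch in plain Java. It is not the MoreLikeThis implementation: the class TermThresholdSketch, the method selectTerms, and the sample counts are all made up for illustration, and only the two constant values come from the source quoted above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, NOT Lucene's MoreLikeThis code. It only mimics the
// two thresholds described by the javadoc comments quoted above.
class TermThresholdSketch {
    static final int MIN_TERM_FREQ = 2; // same value as DEFAULT_MIN_TERM_FREQ
    static final int MIN_DOC_FREQ = 5;  // same value as DEFALT_MIN_DOC_FREQ

    /**
     * Returns the terms of one source doc that pass both thresholds.
     * termFreqs: term -> occurrences in the source doc.
     * docFreqs:  term -> number of docs in the corpus containing the term.
     */
    static List<String> selectTerms(Map<String, Integer> termFreqs,
                                    Map<String, Integer> docFreqs) {
        List<String> kept = new ArrayList<>();
        // TreeMap gives a stable, alphabetical iteration order.
        for (Map.Entry<String, Integer> e : new TreeMap<>(termFreqs).entrySet()) {
            int tf = e.getValue();
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (tf < MIN_TERM_FREQ) continue; // too rare in this doc
            if (df < MIN_DOC_FREQ) continue;  // too rare in the corpus
            kept.add(e.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up counts for a small test corpus.
        Map<String, Integer> tf = new HashMap<>();
        tf.put("bikes", 3);
        tf.put("from", 4);
        tf.put("helmet", 1);
        Map<String, Integer> df = new HashMap<>();
        df.put("bikes", 2);
        df.put("from", 40);
        df.put("helmet", 9);
        System.out.println(selectTerms(tf, df));
    }
}
```

With these sample counts, "bikes" is dropped because it appears in only 2 docs, "helmet" is dropped because it appears only once in the source doc, and "from" survives both checks. In a small test corpus, lowering the real thresholds through the setMinTermFreq and setMinDocFreq setters referenced in the javadoc above is probably the first thing to try.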
>
> : I then loaded some large (5K+) documents and I noticed that
> : MoreLikeThis's query started to return similar documents, but
> : explain() said they were similar because of words like "from" and
> : "can" rather than the text I expected to be used for similarity in
> : the documents.
>
> Other than a stop words list, one other thing you might consider is
> the notion of a "maxDocFreq" option you could set to ignore words that
> appear in lots of documents -- or a maxDocFreqRatio that would take a
> percentage of the total number of docs ... it should be fairly
> straightforward to add.
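
One possible shape for that idea, as a rough sketch in plain Java. Everything here is an assumption for illustration: MaxDocFreqSketch, dropCommonTerms, and dropCommonTermsByRatio are hypothetical names, and neither option is taken from the MoreLikeThis source quoted above.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the suggested "maxDocFreq" / "maxDocFreqRatio"
// options. These names are illustrative, not part of the quoted source.
class MaxDocFreqSketch {

    /** Keeps only the terms that occur in at most maxDocFreq docs. */
    static Set<String> dropCommonTerms(Map<String, Integer> docFreqs,
                                       int maxDocFreq) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() <= maxDocFreq) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    /**
     * Ratio variant: turns a fraction of the corpus size into an absolute
     * cutoff (rounded down), then applies the same filter.
     */
    static Set<String> dropCommonTermsByRatio(Map<String, Integer> docFreqs,
                                              int numDocs,
                                              double maxDocFreqRatio) {
        int cutoff = (int) Math.floor(maxDocFreqRatio * numDocs);
        return dropCommonTerms(docFreqs, cutoff);
    }
}
```

For example, with 100 docs and a ratio of 0.5, a term such as "from" that appears in 95 docs would be filtered out, while a term that appears in 12 docs would be kept. The ratio form has the advantage that the cutoff scales as the corpus grows, whereas an absolute maxDocFreq has to be retuned.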
>
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>