Date: Wed, 8 Feb 2006 12:58:51 +0000 (GMT)
From: mark harwood
Subject: Re: Preventing "killer" queries
To: java-dev@lucene.apache.org
Thanks for the comments, Chris/Doug.
Chris, although I suggested it initially, I'm now a
little uncomfortable controlling this issue with a
static variable in TermQuery because it doesn't let me
have different settings for different queries, indexes
or fields.
Doug, I'd ideally like to optimize for this condition
in advance rather than get into trouble and throw
exceptions to blow out queries.
I like to think of the ideal solution as a control
which automatically identifies and tunes out what it
sees as stop words but is controllable on a per index,
per field and per query basis, if needs be.
The analyzer seemed a reasonably flexible way to do
this.
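
To make the idea concrete, here is a minimal sketch of the
selection step only, assuming the per-field document
frequencies have already been collected into a map. The class
DfStopWords and its select() method are hypothetical names
used for illustration and are not part of Lucene. The
threshold is passed in by the caller, so it can differ per
index, per field or per query.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not a Lucene class): picks stop words for one field by document frequency. */
public class DfStopWords {

    /**
     * Returns the terms whose document frequency is greater than
     * maxDfFraction * numDocs. The caller supplies the df counts for a
     * single field, so each field can use its own threshold.
     */
    public static Set<String> select(Map<String, Integer> termDocFreqs,
                                     int numDocs,
                                     double maxDfFraction) {
        Set<String> stopWords = new TreeSet<>();
        double limit = numDocs * maxDfFraction;
        for (Map.Entry<String, Integer> entry : termDocFreqs.entrySet()) {
            if (entry.getValue() > limit) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        // Made-up df counts for a free-text field in a 1 million doc index.
        Map<String, Integer> freeText = new HashMap<>();
        freeText.put("the", 900_000);
        freeText.put("query", 450_000);
        freeText.put("lucene", 12_000);

        // Treat anything present in more than 40% of docs as a stop word.
        System.out.println(select(freeText, 1_000_000, 0.4));
    }
}
```

The resulting set could then be handed to whatever analyzer is
used for that field; how the df counts are gathered from the
index is left out of the sketch.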
I tried looking at the performance of Filter vs Query on
a 1 million doc index as per Chris's suggestion and found
that RangeFilter.bits() does improve on
search.search(TermQuery) and that this improvement was
a constant factor as df increases. The filter.bits
call typically took 60% of the equivalent TermQuery
search time for a range of tested DFs. However, both
filter and query response times increase in a linear
fashion with increases in df so I suspect they are
both ultimately heading for trouble as data volumes
increase - just that TermQuery gets there sooner than
filter.
I'd rather head this problem off sooner by
stop-wording very common terms in large indexes using
the analyzer. Obviously this wouldn't catch
Range/Fuzzy queries which expand at rewrite time but
at large levels of data you have to manage those types
of query carefully anyway.
I did come across a bizarre anomaly I would be
interested to have explained. A RangeFilter based on a
single term with 50% df responds in the same time as a
RangeFilter on a different field for a term with the
same df.
When it comes to TermQuerys though, not all fields are
equal. Using a TermQuery on a "free text" field with
many values for a single term with 50% df takes half
the time of a TermQuery on a constrained field
("doctype") for a single term with similar df. The
doctype field only ever has one of 6 possible values.
Both queries run against the same index, on terms with
similar df values. The relative performance difference was
the same for other DFs I tested across the 2 fields.
What is going on here? If anything, I might have
expected the open-ended field to be slower.
Cheers,
Mark