Message-ID: <4ECD43BE.5060008@ifactory.com>
Date: Wed, 23 Nov 2011 14:04:30 -0500
From: Michael Sokolov
To: java-user@lucene.apache.org
CC: "E. van Chastelet"
Subject: Re: Spell check on a subset of an index ('namespace' aware spell checker)
References: <4EBBC08E.4020902@gmail.com> <4ECD031A.7060309@gmail.com>
In-Reply-To: <4ECD031A.7060309@gmail.com>
Could you simply index every term with a namespace prefix, like:
Q::term
where Q is the namespace and term is the term?
Then, when you do spell corrections, submit each candidate term with the
namespace prefix prepended.
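
A minimal sketch of the prefixing idea, in plain Java rather than against
the Lucene SpellChecker API. The class NamespacedDictionary, its methods,
and the edit-distance check are hypothetical stand-ins for the spellcheck
index; the sketch also assumes that namespace names never contain "::".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical stand-in for a spellcheck dictionary keyed by "namespace::term". */
public class NamespacedDictionary {
    private static final String SEP = "::";

    // Sorted set, so all keys sharing one namespace prefix are contiguous.
    private final TreeSet<String> keys = new TreeSet<>();

    /** Index a term under its namespace as "namespace::term". */
    public void add(String namespace, String term) {
        keys.add(namespace + SEP + term);
    }

    /** Check a candidate by prepending the namespace prefix before the lookup. */
    public boolean contains(String namespace, String candidate) {
        return keys.contains(namespace + SEP + candidate);
    }

    /** Return terms of one namespace that are within maxEdits of the given word. */
    public List<String> suggest(String namespace, String word, int maxEdits) {
        String prefix = namespace + SEP;
        List<String> result = new ArrayList<>();
        for (String key : keys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // left the namespace's contiguous key range
            }
            String term = key.substring(prefix.length());
            if (editDistance(word, term) <= maxEdits) {
                result.add(term);
            }
        }
        return result;
    }

    /** Plain Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }
}
```

With a real spellcheck index the prefix would also take part in the
similarity comparison, so it would probably have to be stripped or
compensated for when ranking suggestions; the sketch sidesteps this by
comparing only the part after the prefix.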
-Mike
On 11/23/2011 9:28 AM, E. van Chastelet wrote:
> I currently have an idea to get it done, but it's not a nice solution.
>
> If we have an index Q with all documents for all namespaces, we first
> extract the list of all terms that appear for the field namespace in Q
> (this field indicates the namespace of the document).
>
> Then, for each namespace n in the terms list:
> - Get all docs from Q that match +namespace:n
> - Construct a temporary index from these docs
> - Use this temporary index to construct the dictionary, which the
> SpellChecker can use as input.
> - Call indexDictionary on SpellChecker to create spellcheck index for
> current namespace.
> - Delete temporary index
>
> We now have separate spell check indexes for each namespace.
>
> Any suggestions for a cleaner solution?
>
> Regards,
> Elmer van Chastelet
>
>
>
> On 11/10/2011 01:16 PM, E. van Chastelet wrote:
>> Hi all,
>>
>> In our project we would like to have the ability to get search results
>> scoped to one 'namespace' (as we call it). This can easily be
>> achieved by using a filter or just an additional must-clause.
>> For the spellchecker (and our autocompletion, which is a modified
>> spellchecker), the story seems different. The spell checker index is
>> created using a LuceneDictionary, which has an IndexReader as source.
>> We would like to get (spellcheck/autocomplete) suggestions that are
>> scoped to one namespace (i.e. field 'namespace' should have a
>> particular value).
>> With a single source index containing docs for all namespaces, it
>> does not seem possible to create a spellcheck index for each
>> namespace the ordinary way.
>> Q1: Is there a way to construct a LuceneDictionary from a subset of a
>> single source index (all terms where namespace = %value%) ?
>>
>> Another, maybe better solution is to customize the spellchecker by
>> adding an additional namespace field to the spellchecker index. At
>> query-time, an additional must-clause is added, scoping the
>> suggestions to one (or more) namespace(s). The advantage of this is
>> to have a singleton spellchecker (or at least the index reader) for
>> all namespaces. This also means fewer open files for our application
>> (imagine if there are over 1000 namespaces).
>> Q2: Will there be a significant penalty (say more than 50% slower)
>> for the additional must-clause at query time?
>>
>> Q3: Or can you think of a better solution for this problem? :)
>>
>> How we currently do it: we currently use Lucene 3.1 with Hibernate
>> Search and we actually already have auto completion and spell
>> checking scoped to one namespace. This is currently achieved by using
>> index sharding, so each namespace has its own index and reader, and
>> another for spell check and auto completion. Unfortunately there are
>> some downsides to this:
>> - Our faceting engine has no good support for multiple indexes, so
>> faceting only works on a single namespace
>> - Needs administration for mapping namespace identifier (String) to
>> index number (integer)
>> - The number of shards (and thus namespaces) is currently hardcoded.
>> At this moment it is set to 100, and this means Hibernate Search
>> opens up 100 index readers/writers, while only n<100 are in use, and
>> therefore:
>> - Many open file descriptors
>> - Hard limit on number of namespaces
>>
>> Therefore it seems better to switch back to having a single index for
>> all namespaces.
>>
>> Thanks!
>>
>> Regards,
>> Elmer van Chastelet
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>