lucene-java-user mailing list archives

Well, my main goal is to use a ShingleFilter that only produces
shingles whose words are not separated by commas etc.
For example, the phrase:
"red apples, green tomatoes, and brown potatoes"
should yield the shingles "red apples", "green tomatoes", "and brown",
"brown potatoes"; but not "apples green" and not "tomatoes and", as those
are separated by commas.
The problem with the common tokenizers is that they discard the
commas, so if I use a ShingleFilter after them there's no way to tell
whether there was a comma there or not.
(Another option I'm considering is to add an Attribute that records
whether there was a comma before or after a token.)
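
To make the expected output concrete, here is a rough sketch in plain Java.
It does not use any Lucene classes and is not a TokenFilter; the class name
CommaAwareShingles and the method shingles() are made up for illustration,
and the set of punctuation that blocks a shingle (, ; . ! ?) is my own
assumption for what "commas etc." should cover. It only shows the behavior
I'm after: split on the blocking punctuation first, then build two-word
shingles inside each segment.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, no Lucene. Names are hypothetical.
class CommaAwareShingles {

    // Returns two-word shingles that never span a comma
    // (or any of the other punctuation marks assumed to be boundaries).
    static List<String> shingles(String text) {
        List<String> result = new ArrayList<>();
        // Cut the text into segments at the boundary punctuation.
        for (String segment : text.split("[,;.!?]+")) {
            String trimmed = segment.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing between two boundaries
            }
            String[] words = trimmed.split("\\s+");
            // Pair each word with its right-hand neighbor inside the segment.
            for (int i = 0; i + 1 < words.length; i++) {
                result.add(words[i] + " " + words[i + 1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String phrase = "red apples, green tomatoes, and brown potatoes";
        for (String shingle : shingles(phrase)) {
            System.out.println(shingle);
        }
    }
}
```

For the phrase above this gives "red apples", "green tomatoes", "and brown"
and "brown potatoes", and nothing that crosses a comma. Inside an analysis
chain the boundary information would still have to survive tokenization
somehow, which is the part I'm stuck on.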
If there's a better way, I'm open to suggestions,
Igal
On 11/3/2012 8:10 PM, Erick Erickson wrote:
> So I've gotta ask... _why_ do you want to inject the spaces?
> If it's just to break this up into tokens, wouldn't something like
> LetterTokenizer do? Assuming you aren't interested in
> leaving in numbers.... Or even StandardTokenizer unless you have
> e-mail & etc.
>
> Or what about PatternReplaceCharFilter?
>
> FWIW,
> Erick
>
>
>
> On Sat, Nov 3, 2012 at 9:22 PM, Igal Sapir <igal@getrailo.org> wrote:
>
>> You're right. I'm not sure what I was thinking.
>>
>> Thanks for all your help,
>>
>> Igal
>> On Nov 3, 2012 5:44 PM, "Robert Muir" <rcmuir@gmail.com> wrote:
>>
>>> On Sat, Nov 3, 2012 at 8:32 PM, Igal @ getRailo.org <igal@getrailo.org>
>>> wrote:
>>>> hi Robert,
>>>>
>>>> thank you for your replies.
>>>>
>>>> I couldn't find much documentation/examples of this, but this is what I
>>>> came up with (below). is that the way I'm supposed to use the
>>>> MappingCharFilter?
>>>
>>> You don't need to extend anything.
>>> You also don't want to create a NormalizeCharMap for each reader
>>> (that's way too heavy)
>>>
>>> Just build the NormalizeCharMap once, and pass it to
>>> MappingCharFilter's constructor.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org