I think Mark's idea is the better fit for this. I seem to recall
some caveats with multiple tokens at the same position, though I
don't remember the details. I _think_ term vectors don't handle that
case well, so if you need them you might run into trouble. A search
of the mailing lists and JIRA might turn up something, or maybe
someone else remembers. At any rate, it may not affect you, so I
would try Mark's suggestion and see if it works.
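
To make the same-position idea concrete, here is a minimal sketch in
plain Java. It is not Lucene API. It assumes Mark's suggestion amounts
to indexing the POS tag as an extra token at the same position as the
word, which this thread does not spell out. The class PosIndexSketch,
the "POS:" prefix, and the sample sentence are all made up for
illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory positional index (hypothetical; not Lucene API).
class PosIndexSketch {
    // term -> ascending list of positions at which the term occurs
    private final Map<String, List<Integer>> postings = new HashMap<>();

    private List<Integer> positions(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    // Index the word and its POS tag at the SAME position, the analogue
    // of emitting a second token with a position increment of 0.
    void add(int position, String word, String posTag) {
        postings.computeIfAbsent(word, k -> new ArrayList<>()).add(position);
        postings.computeIfAbsent("POS:" + posTag, k -> new ArrayList<>()).add(position);
    }

    // Frequency of the bare word; no summing over foo_NN, foo_ADJ, etc.
    int termFreq(String word) {
        return positions(word).size();
    }

    // Positions where the word occurs with the given tag.
    List<Integer> wordWithTag(String word, String posTag) {
        List<Integer> hits = new ArrayList<>(positions(word));
        hits.retainAll(positions("POS:" + posTag));
        return hits;
    }

    // Positions of `word` that have a token tagged `posTag` at most
    // `slop` positions away on either side (the word itself excluded).
    List<Integer> near(String word, String posTag, int slop) {
        List<Integer> hits = new ArrayList<>();
        for (int wordPos : positions(word)) {
            for (int tagPos : positions("POS:" + posTag)) {
                int distance = Math.abs(wordPos - tagPos);
                if (distance >= 1 && distance <= slop) {
                    hits.add(wordPos);
                    break;
                }
            }
        }
        return hits;
    }

    // Start positions where the tags occur at consecutive positions.
    List<Integer> tagSequence(String... tags) {
        List<Integer> starts = new ArrayList<>();
        if (tags.length == 0) {
            return starts;
        }
        for (int start : positions("POS:" + tags[0])) {
            boolean matches = true;
            for (int i = 1; i < tags.length && matches; i++) {
                matches = positions("POS:" + tags[i]).contains(start + i);
            }
            if (matches) {
                starts.add(start);
            }
        }
        return starts;
    }

    // Made-up sample: "old light warm light rain falls"
    static PosIndexSketch sample() {
        PosIndexSketch index = new PosIndexSketch();
        index.add(0, "old", "ADJ");
        index.add(1, "light", "NN");
        index.add(2, "warm", "ADJ");
        index.add(3, "light", "ADJ");
        index.add(4, "rain", "NN");
        index.add(5, "falls", "VB");
        return index;
    }

    public static void main(String[] args) {
        PosIndexSketch index = sample();
        System.out.println("tf(light) = " + index.termFreq("light"));
        System.out.println("light as NN at " + index.wordWithTag("light", "NN"));
        System.out.println("rain near ADJ (3) at " + index.near("rain", "ADJ", 3));
        System.out.println("ADJ NN ADJ starts at " + index.tagSequence("ADJ", "NN", "ADJ"));
    }
}
```

Because the word and its tag share a position but stay separate
terms, the word's frequency needs no post-processing, and positional
queries can mix words and tags. Whether term vectors cope with this
in a real index is exactly the caveat above, so treat this only as an
illustration of the idea.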
-Grant
On Jul 12, 2006, at 11:15 AM, Amit Kumar wrote:
> We need to be able to search by word and POS, and also have the POS
> available for each occurrence. Appending the POS to the terms would
> create a post-processing nightmare when retrieving term
> frequencies, right? (I would have to add up all the foo_NN,
> foo_ADJ, etc.)
>
> I can store the POS in a parallel field and access it via term
> vectors, but that wouldn't allow any kind of search on POS-related
> fields, right? For example, if I wanted to search for any
> adjective within 3 words of a given term, or if I wanted to get
> all the patterns that follow the sequence ADJ NN ADJ.
>
> Let me look in the developer archives for the payload discussions,
> perhaps implementing that might satisfy my use cases.
>
> Comments?
>
> -Thanks
> Amit
>
>
>
> On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:
>
>> Hi Amit,
>>
>> This is definitely something you can do. What are your goals for
>> it? Do you want to search by word and POS or do you just want POS
>> available for post processing?
>>
>> You could just append the POS tag onto the end of your token as it
>> gets indexed, something like foo_NN or foo_ADJ. This approach may
>> mean you have to use a prefix query when you want to search against
>> just "foo". You could also have a field parallel to your main
>> field that stores the POS. Then you could access it via the term
>> vectors array.
>>
>> Also, we have been discussing on the developers list how to add
>> payloads to a posting (i.e. store related information at a
>> position in the index), similar to what Google discusses in their
>> original paper. Unfortunately, this isn't implemented yet, but if
>> you feel like helping out, check out the discussion on the
>> developer's list (see Flexible Indexing).
>>
>> -Grant
>>
>> On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:
>>
>>> Hi,
>>>
>>> A new project that I am investigating Lucene for needs
>>> part-of-speech information for the tokens. I can get that
>>> information using NLP techniques (GATE etc.) by preprocessing
>>> the documents, but I would like to store that
>>> information in the indices. Something along the lines of
>>>
>>> TermVectorOffsetInfo[?].getPartofSpeech();
>>>
>>> I am writing to ask for your advice; you can tell me I am
>>> b o n k e r s or let me know where I should start digging :).
>>> Is that a good idea? Or would it just be less trouble for me to
>>> store the offset information along with the parts of speech
>>> outside Lucene?
>>>
>>> Has anyone else done that?
>>>
>>> Best,
>>> Amit
>>>
>>>
>>> ps: Thank you for putting the LuceneInAction source online, it
>>> was a great help to see the CategorizerTest.java.
>>> I am ordering my copy of the book tomorrow :)
>>>
>>> ---------------------------------------------------------
>>> Amit Kumar
>>> Research Programmer
>>> The Graduate School of Library and Information Science
>>> University of Illinois, Urbana Champaign IL, 61820
>>> phone: 217-333-4118 fax: 217-244-3302
>>> ---------------------------------------------------------
>>>
>> --------------------------
>> Grant Ingersoll
>> Sr. Software Engineer
>> Center for Natural Language Processing
>> Syracuse University
>> 335 Hinds Hall
>> Syracuse, NY 13244
>> http://www.cnlp.org
>>
>> Voice: 315-443-5484
>> Fax: 315-443-6886
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886