If I'm understanding correctly...
What about a SinkTokenizer that is backed by a Reader/Field instead of
the current one that stores it all in a List? This is more or less
the use case for the Tee/Sink implementations, with the exception that
we didn't plan for the Sink being too large, but that is easily
overcome, IMO.
That is, you use a TeeTokenFilter that adds to your Sink, which
serializes to some storage, and then your SinkTokenizer just
deserializes. No need to change Fieldable or anything else.
Or maybe just a Tokenizer that is backed by a Field would work,
using a TermEnum on the Field to serve up next() for the TokenStream.
Just thinking out loud...
-Grant
On Aug 27, 2008, at 10:47 AM, Andrzej Bialecki wrote:
> Hi all,
>
> I recently had a situation where I had to pass some metadata
> information to Analyzer. This metadata was specific to a Document
> instance (short story is that the analysis of some fields depended
> on data coming from other fields, and the number of possible values
> was too big to use separate fields for each combination).
>
> It would be nice to have an Analyzer.tokenStream(String fieldName,
> Field f), or even better tokenStream(String fieldName, Document
> doc) ... but probably it's too intrusive to change this. Although I
> would be happy to have tokenStream(String, Fieldable), because then
> I could provide my own Fieldable with metadata.
>
> In the meantime, having neither option, I came up with an idea: I
> will use a subclass of Reader, and attach my metadata there, and
> then use this Reader when creating a Field. However, I quickly
> discovered that if you set a Reader on a Field, this field
> automatically becomes un-stored - not what I wanted ... Field is
> declared final, so no luck there.
>
> In the end I implemented a Fieldable, which sort of breaks the
> contract for Fieldable - but it works :) . Namely, my Fieldable
> returns both readerValue() and stringValue(). The first method
> returns my subclass of Reader with metadata, and the second returns
> the value to be stored.
>
> The reason why it works is that DocInverterPerField first checks the
> tokenStreamValue, then the readerValue, and only then the
> stringValue that it converts to a Reader - so in my case it uses the
> supplied readerValue. At the same time, FieldsWriter, which is
> responsible for storing field values, uses just the stringValue (or
> binaryValue, but that wasn't relevant to my case), which is also set
> to non-null.
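The precedence Andrzej describes can be sketched with simplified stand-ins (these are not Lucene's real Fieldable/DocInverterPerField/FieldsWriter, just a minimal model of the two code paths): the inverter prefers readerValue(), while the stored-fields writer only looks at stringValue(), so a Fieldable returning both non-null gets analyzed via the metadata-bearing Reader yet stored as the plain string.

```java
import java.io.Reader;
import java.io.StringReader;

// Minimal model of the two Lucene code paths described in the mail.
public class DualValueSketch {

    interface Fieldable {
        Reader readerValue();
        String stringValue();
    }

    // A Reader subclass carrying per-document metadata, as in the mail.
    static class MetadataReader extends StringReader {
        final String metadata;
        MetadataReader(String text, String metadata) {
            super(text);
            this.metadata = metadata;
        }
    }

    // Mirrors DocInverterPerField's precedence: reader first, then string.
    static Reader readerForInversion(Fieldable f) {
        Reader r = f.readerValue();
        if (r != null) return r;
        return new StringReader(f.stringValue());
    }

    // Mirrors FieldsWriter: only the string value is stored.
    static String valueForStorage(Fieldable f) {
        return f.stringValue();
    }

    public static void main(String[] args) {
        final String text = "some field text";
        Fieldable f = new Fieldable() {
            public Reader readerValue() { return new MetadataReader(text, "lang=pl"); }
            public String stringValue() { return text; }
        };
        // The inverter sees the metadata-bearing Reader...
        MetadataReader r = (MetadataReader) readerForInversion(f);
        System.out.println("analyzed with metadata: " + r.metadata);
        // ...while the stored value is still the plain string.
        System.out.println("stored: " + valueForStorage(f));
    }
}
```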
>
> So, here are my thoughts on this, and I'd appreciate any comments on
> this:
>
> * is this a justified use of the API? it works, at least at the
> moment ;) and I couldn't find any other way to accomplish this task.
>
> * could we perhaps relax the restriction on Fieldable so that it can
> return non-null values from more than one method, and clearly
> document in what sequence they are processed? This is already hinted
> at in the javadoc.
>
> * I propose to add a new API to Analyzer:
>
> public TokenStream tokenStream(String fieldName, Fieldable field);
>
> to support use cases like the one I described above. The default
> implementation could be something like this:
>
> public TokenStream tokenStream(String fieldName, Fieldable field) {
>   Reader r = field.readerValue();
>   if (r == null) {
>     String s = field.stringValue();
>     r = new StringReader(s);
>   }
>   return tokenStream(fieldName, r);
> }
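For illustration, here is a self-contained sketch of how the proposed overload could be used. The classes are hypothetical stand-ins (String stands in for TokenStream so the sketch compiles without Lucene, and the metadata() accessor is an invented example): the base class carries the default implementation from the mail, and a subclass overrides the Fieldable variant so analysis can branch on per-document metadata.

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-ins, not Lucene's real Analyzer API.
public class AnalyzerOverloadSketch {

    interface Fieldable {
        Reader readerValue();
        String stringValue();
        String metadata(); // invented accessor for per-document metadata
    }

    static class Analyzer {
        // Existing Reader-based entry point (String stands in for TokenStream).
        public String tokenStream(String fieldName, Reader reader) {
            return "default:" + fieldName;
        }
        // Proposed overload, with the default implementation from the mail.
        public String tokenStream(String fieldName, Fieldable field) {
            Reader r = field.readerValue();
            if (r == null) {
                r = new StringReader(field.stringValue());
            }
            return tokenStream(fieldName, r);
        }
    }

    static class MetadataAwareAnalyzer extends Analyzer {
        @Override
        public String tokenStream(String fieldName, Fieldable field) {
            // Analysis of this field can now depend on per-document data.
            if ("lang=pl".equals(field.metadata())) {
                return "polish:" + fieldName;
            }
            return super.tokenStream(fieldName, field);
        }
    }

    public static void main(String[] args) {
        Fieldable f = new Fieldable() {
            public Reader readerValue() { return null; }
            public String stringValue() { return "text"; }
            public String metadata() { return "lang=pl"; }
        };
        System.out.println(new MetadataAwareAnalyzer().tokenStream("body", f));
        // prints "polish:body"
    }
}
```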
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>