While I think this approach would have been best to start off with rather than String, I'm concerned that it will do little more than add overhead at this point, resulting in slower code, not faster.

- If any tokenizer or token filter tries setting the termBuffer, any downstream components would need to check for both. It could be made backward compatible by constructing a string on demand, but that will really slow things down, unless the whole chain is converted to only using the char[] somehow.

- It doesn't look like the indexing code currently pays any attention to the char[], right?

- What if both the String and char[] are set? A filter that doesn't know better sets the String... this doesn't clear the char[] currently, should it?

"Yonik Seeley" <yonik@apache.org> wrote: > I had previously missed the changes to Token that add support for > using an array (termBuffer): > > + // For better indexing speed, use termBuffer (and > + // termBufferOffset/termBufferLength) instead of termText > + // to save new'ing a String per token > + char[] termBuffer; > + int termBufferOffset; > + int termBufferLength; > > While I think this approach would have been best to start off with > rather than String, > I'm concerned that it will do little more than add overhead at this > point, resulting in slower code, not faster. > > - If any tokenizer or token filter tries setting the termBuffer, any > downstream components would need to check for both. It could be made > backward compatible by constructing a string on demand, but that will > really slow things down, unless the whole chain is converted to only > using the char[] somehow.

Good point: if your analyzer/tokenizer produces char[] tokens then your downstream filters would have to accept char[] tokens.

I think on-demand constructing a String (and saving it as termText) would be an OK solution? Why would that be slower than having to make a String in the first place (if we didn't have the char[] API)? It's at least graceful degradation.
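The on-demand construction being described could look something like this sketch (class and field names are illustrative, not the actual Lucene source): the String is built lazily from the buffer and cached, so legacy consumers pay the String cost at most once per token.

```java
// Hypothetical sketch, not the actual Lucene Token: the String form is
// built on demand from the char[] and cached in termText.
public class LazyToken {
    String termText;           // null until a String is requested
    char[] termBuffer;
    int termBufferOffset;
    int termBufferLength;

    public void setTermBuffer(char[] buf, int offset, int length) {
        termBuffer = buf;
        termBufferOffset = offset;
        termBufferLength = length;
        termText = null;       // invalidate any cached String
    }

    public String termText() {
        if (termText == null && termBuffer != null) {
            // construct (and cache) the String only when asked for
            termText = new String(termBuffer, termBufferOffset, termBufferLength);
        }
        return termText;
    }
}
```

A chain that stays entirely on char[] never triggers the allocation; only a legacy filter that asks for the String pays for it.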

> - It doesn't look like the indexing code currently pays any attention > to the char[], right?

It does, in DocumentsWriter.addPosition().

> - What if both the String and char[] are set? A filter that doesn't > know better sets the String... this doesn't clear the char[] > currently, should it?

Currently the char[] wins, but good point: seems like each setter should null out the other one?
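A minimal sketch of the mutual-invalidation idea (hypothetical class, not the real Token): each setter nulls out the other representation, so whichever form was set last always wins.

```java
// Illustrative sketch, not the real Token: each setter nulls the other
// form so a stale representation can never "win".
public class TwoFormToken {
    private String termText;
    private char[] termBuffer;
    private int termBufferLength;

    public void setTermText(String text) {
        termText = text;
        termBuffer = null;     // the char[] form is now stale
    }

    public void setTermBuffer(char[] buf, int length) {
        termBuffer = buf;
        termBufferLength = length;
        termText = null;       // the String form is now stale
    }

    // Whichever form was set last is the current term.
    public String term() {
        if (termText != null) {
            return termText;
        }
        return termBuffer == null ? null : new String(termBuffer, 0, termBufferLength);
    }
}
```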

On 7/19/07, Michael McCandless <lucene@mikemccandless.com> wrote:
> Good point: if your analyzer/tokenizer produces char[] tokens then
> your downstream filters would have to accept char[] tokens.
>
> I think on-demand constructing a String (and saving it as termText)
> would be an OK solution? [...] It's at least graceful degradation.

It's the rule rather than the exception though. Pretty much everything is based on String.

> > - It doesn't look like the indexing code currently pays any attention > > to the char[], right? > > It does, in DocumentsWriter.addPosition().

Ah, thanks.

> > - What if both the String and char[] are set? A filter that doesn't > > know better sets the String... this doesn't clear the char[] > > currently, should it? > > Currently the char[] wins, but good point: seems like each setter > should null out the other one?

Certainly the String setter should null the char[] (that's the only way to keep back compatibility), and probably vice-versa.

Note that there are many existing filters that directly access and manipulate the package protected String termText. These will need to be changed.

: > Currently the char[] wins, but good point: seems like each setter
: > should null out the other one?
:
: Certainly the String setter should null the char[] (that's the only
: way to keep back compatibility), and probably vice-versa.

i haven't really thought about this before today (i missed seeing the char[] stuff get added to Token as well) but if we're confident the char[] stuff is the direction we want to go, then i believe the cleanest forward migration plan is...

1) deprecate Token.termText, Token.getTermText(), Token.setTermText
2) make Token.setTermBuffer() null out Token.termText (document this)
3) make Token.setTermText() null out Token.termBuffer
4) refactor all of the "if (null == termBuffer)" logic in DocumentsWriter into the Token class, ala...

    public final char[] termBuffer() {
      initTermBuffer();
      return termBuffer;
    }
    public final int termBufferOffset() {
      initTermBuffer();
      return termBufferOffset;
    }
    public final int termBufferLength() {
      initTermBuffer();
      return termBufferLength;
    }
    private void initTermBuffer() {
      if (null != termBuffer) return;
      termBufferOffset = 0;
      termBufferLength = termText.length();
      termBuffer = new char[termBufferLength];
      termText.getChars(0, termBufferLength, termBuffer, 0);
    }

   ...such that DocumentsWriter never uses termText, just termBuffer
5) at some point down the road, modify all of the "core" TokenStreams to use termBuffer instead of termText
6) at some point way down the road, delete the deprecated methods/variables and the Token.initTermBuffer method.

Unless I've missed something, the end result should be...

a) existing TokenStreams that use termText exclusively and don't know anything about termBuffer will have the exact same performance characteristics that they currently do (a char[] will be created on demand the first time termBuffer is used -- by DocumentsWriter)

b) TokenStreams which wind up being a mix of old and new code using both termText and termBuffer should work correctly in any combination.

c) new TokenStreams that use termBuffer exclusively should work fine, and have decent performance even with the overhead of the initTermBuffer() call (which will get better once the deprecated legacy termText usage can be removed).

Side note: Token.toString() is currently broken in cases where termBuffer is used.
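A possible fix, sketched here against hypothetical fields: toString() should prefer the buffer when it is set rather than assuming termText is current (the real method may also print offsets and other token state).

```java
// Sketch of a fix against hypothetical fields; the real toString() may
// also include offsets and other token state.
public class ToStringToken {
    String termText;
    char[] termBuffer;
    int termBufferOffset;
    int termBufferLength;

    @Override
    public String toString() {
        // Prefer the buffer when set instead of assuming termText is current.
        String term = (termBuffer != null)
            ? new String(termBuffer, termBufferOffset, termBufferLength)
            : termText;
        return "(" + term + ")";
    }
}
```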

"Chris Hostetter" <hossman_lucene@fucit.org> wrote: > > : > Currently the char[] wins, but good point: seems like each setter > : > should null out the other one? > : > : Certainly the String setter should null the char[] (that's the only > : way to keep back compatibility), and probably vice-versa. > > i haven't really thought baout this before today (i missed seeing the > char[] stuff get added to Token as well) but if we're confident the > char[] > stuff is hte direction we want to go, then i believe the cleanest forward > migration plan is... > > 1) deprecate Token.termText, Token.getTermText(), Token.setTermText > 2) make Token.setTermBuffer() null out Token.termText (document) > 3) make Token.setTermText() null out Token.termBuffer > 4) refactor all of the the "if (null == termBuffer)" logic in > DocumentsWriter into a the Token class, ala... > public final char[] termBuffer() { > initTermBuffer(); > return termBuffer; > } > public final int termBufferOffset() { > initTermBuffer(); > return termBufferOffset; > } > public final int termBufferLength() { > initTermBuffer(); > return termBufferLength; > } > private void initTermBuffer() { > if (null != termBuffer) return; > termBufferOffset=0; > termBufferLength = termText.length(); > termBuffer = char[termBufferLength]; > termText.getChars(0, termBufferLength, termBuffer, 0) > } > ...such that DocumentsWRiter never uses termText just termBuffer > 5) at some point down the road, modify all of the "core" TokenStreams to > use termBuffer instead of termText > 6) at some point way down the road, delete the depreacated > methods/variables and the Token.initTermBuffer method. > > Unless I've missed something, the end result should be... 
> > a) existing TokenStreams that use termText exclusively and don't know > anything about termBuffer will have the exact same performance > characteristics that they currently do (a char[] will be created on > demand > the first time termBuffer is used -- by DocumentsWriter) > > b) TokenStreams which wind up being a mix of old and new code using both > termText and termBuffer should work correctly in any combination. > > c) new TokenStreams that use termBuffer exclusively should work fine, and > have decent performance even with the overhead of the initTermBuffer() > call (which will get better once the deprecated legacy termText usage can > be removed.

I like this approach Hoss!

I will open an issue and work on it; I'd like to finish up through your step 5. This way "out of the box" performance of Lucene is improved for people who use the core analyzers and filters.

To further improve "out of the box" performance I would really also like to fix the core analyzers, when possible, to re-use a single Token instance for each Token they produce. This would then mean no objects are created as you step through Tokens in the TokenStream ... so this would give the best performance.

The problem is, this would be a sneaky API change and would for example break anyone who gathers the Tokens into an array and processes them later (eg maybe highlighter does?).

Maybe one way to migrate to this would be to add a new method "Token nextDirect()" to TokenStream which is allowed to return a Token that may be recycled. This means callers of nextDirect() must make their own copy of the Token if they intend to keep it for later usage. It would default to "return next()" and I would then default "next()" to call nextDirect() but make a full copy of what's returned. DocumentsWriter would use this to step through the tokens.

Analyzers would then implement either next() or nextDirect(). We would fix all core analyzers to implement nextDirect(), and then all other analyzers would remain backwards compatible.

From a caller's standpoint the only difference between nextDirect() and next() is that next() guarantees to give you a "full private copy" of the Token but nextDirect() does not.
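The proposed pair of defaults might be sketched like this (simplified Token and stream stand-ins, not the real Lucene classes; a subclass must override at least one of the two methods, otherwise the mutual delegation would recurse forever):

```java
// Simplified stand-ins for TokenStream/Token, just to show the two
// defaults delegating to each other. A subclass must override at least
// one method or the mutual delegation never terminates.
public class ReusableTokenStream {
    public static class Tok {
        public String text;
        public Tok(String t) { text = t; }
        public Tok copy() { return new Tok(text); }
    }

    // May return a recycled Token; the caller must not hold onto it.
    public Tok nextDirect() {
        return next();              // fresh tokens are trivially safe to "reuse"
    }

    // Guarantees a full private copy of the returned Token.
    public Tok next() {
        Tok t = nextDirect();
        return t == null ? null : t.copy();
    }
}
```

A stream that only overrides nextDirect() (recycling one Tok) still gives next() callers a private copy each time, which is the backward-compatibility property described above.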

We could also punt on this entirely since it is always possible for people to make their own analyzers that re-use Tokens, but I think having decent "out of the box" performance with Lucene is important. Ie, Lucene's defaults should be set up so that you get decent performance without changing anything; you should not have to work hard to get good performance and unfortunately today you still do....

> Side note: Token.toString() is current jacked in cases where termBuffer is > used)

Woops -- sorry -- I will add test case & fix it.

> Note that there are many existing filters that directly access and > manipulate the package protected String termText. These will need to > be changed.

Hmmm ... is it too late to make all of Token's package protected attrs private, so we require use of getters? Or, maybe just make the new ones (termBuffer*) private and then leave termText package protected but deprecate it and add a comment saying that core analyzers and filters have switched to using termBuffer and so you may get an NPE if you access termText directly? Plus fix all core analyzers to not directly access termText and use termBuffer instead...

On 7/21/07, Michael McCandless <lucene@mikemccandless.com> wrote: > To further improve "out of the box" performance I would really also > like to fix the core analyzers, when possible, to re-use a single > Token instance for each Token they produce. This would then mean no > objects are created as you step through Tokens in the TokenStream > ... so this would give the best performance.

How much better I wonder? Small object allocation & reclaiming is supposed to be very good in current JVMs. You got good performance out of reusing Field objects, I think, partially because of the convoluted logic-tree in the constructor, including interning the field name.

Token is a much lighter weight class, with no unpredictable branches, less state (except we increased it with the termBuffer stuff ;-) and no string interning (which presumably includes synchronization).

> The problem is, this would be a sneaky API change and would for > example break anyone who gathers the Tokens into an array and > processes them later (eg maybe highlighter does?).

I think there is probably quite a bit of code out there that needs to look at tokens in context (and thus does some sort of buffering).

> Maybe one way to migrate to this would be to add a new method "Token
> nextDirect()" to TokenStream which is allowed to return a Token that
> may be recycled. [...] I think having decent "out of the box"
> performance with Lucene is important.

I'd be surprised if we can gain meaningful performance with token reuse in a typical full-text analysis chain (stemming, stop-words, etc). Have you done any measurements, or is it a hunch at this point?

> Hmmm ... is it too late to make all of Token's package protected attrs > private, so we require use of getters?

They are package protected, which means that they are safe to change because we own all the code in that package.

Another thing that would slow down slightly with termBuffer use is hash lookups (because Strings cache their hashCode). So an analyzer with a synonym filter and a stop filter would end up calculating that twice (or constructing a String). I don't know if it would be worth maintaining that in the Token or not...
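If such caching were added, the hash could be computed directly over the char[] slice using the same recurrence String.hashCode() uses, so lookups keyed by String hashes still line up. A sketch (hypothetical helper, not a Lucene API):

```java
// Hypothetical helper, not a Lucene API: hashes a char[] slice with the
// same recurrence String.hashCode() uses, so the two hashes agree.
public class TermHash {
    public static int hash(char[] buf, int offset, int length) {
        int h = 0;
        for (int i = 0; i < length; i++) {
            h = 31 * h + buf[offset + i];   // h = 31*h + c, as in String
        }
        return h;
    }
}
```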

> The problem is, this would be a sneaky API change and would for
> example break anyone who gathers the Tokens into an array and
> processes them later (eg maybe highlighter does?).

org.apache.lucene.analysis.CachingTokenFilter does this.

"Yonik Seeley" <yonik@apache.org> wrote: > On 7/21/07, Michael McCandless <lucene@mikemccandless.com> wrote: > > To further improve "out of the box" performance I would really also > > like to fix the core analyzers, when possible, to re-use a single > > Token instance for each Token they produce. This would then mean no > > objects are created as you step through Tokens in the TokenStream > > ... so this would give the best performance. > > How much better I wonder? Small object allocation & reclaiming is > supposed to be very good in current JVMs. > You got good performance out of reusing Field objects, i think, > partially because of the convoluted logic-tree in the constructor, > including interning the field name. > > Token is a much lighter weight class, with no unpredictable branches, > less state (except we increased it with the termBuffer stuff ;-) and > no string interning (which presumably includes synchronization).

Good question ... I will run some benchmarks to see if it's worthwhile.

> I think there is probably quite a bit of code out there that needs to
> look at tokens in context (and thus does some sort of buffering).

Agreed, so we can't change the API. So the next/nextDirect proposal should work well: it doesn't change the API yet would allow consumers that don't require "full private copy" of every Token, like DocumentsWriter, to have good performance.

> I'd be surprised if we can gain meaningful performance with token
> reuse in a typical full-text analysis chain (stemming, stop-words,
> etc). Have you done any measurements, or is it a hunch at this
> point?

Will test.

> They are package protected, which means that they are safe to change
> because we own all the code in that package.

OK, I was unsure whether this is considered an API change since users can make their own classes in this package too. So I will make all of them (including Payload) private.

> Another thing that would slow down slightly with termBuffer use is > hash lookups (because Strings cache their hashCode). So an anlalyzer > with a synonym filter and a stop filter would end up calculating that > twice (or constructing a String). I don't know if it would be worth > maintaining that in the Token or not...

On 7/21/07, Michael McCandless <lucene@mikemccandless.com> wrote:
>> To further improve "out of the box" performance I would really also
>> like to fix the core analyzers, when possible, to re-use a single
>> Token instance for each Token they produce. [...]
> How much better I wonder? Small object allocation & reclaiming is
> supposed to be very good in current JVMs.

Sorry I cannot give you exact numbers now, but I know for sure that we decided to move "real analysis" into a separate phase that gets executed before entering the Lucene TokenStream and indexing, due to String in Token, and then do just simple whitespace tokenization during indexing. And this was not just for fun; there was some real benefit in it. The performance issue here is in making transformations on tokens during analysis (where this applies). You gave a very nice example, stemming, which itself generates new Strings; another nice example is NGram generation in SpellChecker, which generates rather huge numbers of small objects.

The simplest model, tokenize (without modifying) then index, ironically also benefits from char[]: things go really fast in general, so the new String() along the way gets noticed in the profiler. While testing the new indexing code from Mike, we also changed our vanilla Tokenizer to use termBuffer and there was again something like a 10-15% boost.

It's been a while since then so I do not know the exact numbers, but I have learned this many times the hard way: nothing beats char[] when it comes to text processing.

To stop bothering you people: IMHO, there is hard work to be done in the Analyzer chain before Token gets ready for prime time in Lucene core, and that is the place where String overproduction hurts.

As a side note, maybe a good place to mention it: the clever people from the mg4j project solved these polymorphic char[]/String needs (fast modification/compactness, fast hashing, safety) using something called MutableString; it is worth reading the javadoc there.

> Agreed, so we can't change the API. So the next/nextDirect proposal > should work well: it doesn't change the API yet would allow consumers > that don't require "full private copy" of every Token, like > DocumentsWriter, to have good performance.

If we eventually go this way, my preferred API for reuse is next(Token resToken) where a non-null resToken reads as "if supported, please use this object for the returned result"

"Doron Cohen" <DORONC@il.ibm.com> wrote: > > Agreed, so we can't change the API. So the next/nextDirect proposal > > should work well: it doesn't change the API yet would allow consumers > > that don't require "full private copy" of every Token, like > > DocumentsWriter, to have good performance. > > If we eventually go this way, my preferred API for reuse is > next(Token resToken) > where a non-null resToken reads as > "if supported, please use this object for the returned result"

OK, I'm taking this approach right now in the benchmarks I just tested. I added this method:

boolean next(Token resToken)

which returns true if it has updated resToken with another token, else false if end-of-stream was hit.
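In sketch form (hypothetical stream and Token stand-ins, not the benchmark code itself), that boolean-returning variant lets the consumer loop without allocating a Token per term:

```java
// Hypothetical stream over a fixed word list; the point is the shape of
// next(Token): fill the caller's object, return false at end-of-stream.
public class WordStream {
    public static class Tok {
        public String text;
    }

    private final String[] words;
    private int upto = 0;

    public WordStream(String[] words) { this.words = words; }

    // Returns true if resToken was filled with the next token,
    // false once the stream is exhausted.
    public boolean next(Tok resToken) {
        if (upto >= words.length) return false;
        resToken.text = words[upto++];
        return true;
    }
}
```

Usage is `Tok t = new Tok(); while (stream.next(t)) { ... }` — one Tok for the whole stream.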

Did you try it w/o token reuse (reuse tokens only when mutating, not when creating new tokens from the tokenizer)? It would be interesting to see what's attributable to Token reuse only (after core filters have been optimized to use the char[] setters, etc).

We've had issues in the past regarding errors with filters dealing with token properties:
1) filters creating a new token from an old token, but forgetting about setting positionIncrement
2) legacy filters losing "new" information such as payloads when creating tokens, because payloads didn't exist when the filter was written.

#1 is solved by token mutation because there are setters for the value (before, the filter author was forced to create a new token, unless they could access the package-private String).

#2 can now be solved by clone() (another relatively new addition)

So what new problems might crop up with token reuse?
- a filter reusing a token, but not zeroing out something new like payloads because they didn't exist when the filter was authored (the opposite problem from before)

I haven't tried this variant yet. I guess for long filter chains the GC cost of the tokenizer making the initial token should shrink as a fraction of the overall time. Though I think we should still re-use the initial token since it should (?) only help.

> It would be interesting to see what's attributable to Token reuse only > (after core filters have been optimized to use the char[] setters, > etc).

Good question; it could be the gains are mostly from switching to char[] termBuffer and less so from Token instance re-use. Too many tests to try :)

> #1 is solved by token mutation because there are setters for the value
> (before, the filter author was forced to create a new token, unless
> they could access the package-private String).

Ahhh, good!

> #2 can now be solved by clone() (another relatively new addition)
>
> So what new problems might crop up with token reuse?
> - a filter reusing a token, but not zeroing out something new like
> payloads because they didn't exist when the filter was authored (the
> opposite problem from before)

Would a Token.clear() be needed for use by (primarily) tokenizers?

> boolean next(Token resToken) > > which returns true if it has updated resToken with another token, > else false if end-of-stream was hit.

I would actually prefer

  Token next(Token resToken)

because:
- this way the API with reuse is very much like the one without reuse (except for the reusable param) - easier for app dev.
- it allows returning a useful result also in cases where the specific implementation does not support reuse, or null was passed (indicating reuse is not desired), or something in the data/state inhibited reuse.
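Doron's signature might look like this in sketch form (hypothetical names, not real Lucene classes): the passed-in token is only a reuse hint, so an implementation that cannot reuse (or receives null) still returns a useful Token.

```java
// Hypothetical tokenizer: resToken is only a suggestion, so callers
// that pass null (or streams that cannot reuse) still get a Token back.
public class SuggestTokenizer {
    public static class Tok {
        public String text;
    }

    private final String[] words;
    private int upto = 0;

    public SuggestTokenizer(String[] words) { this.words = words; }

    public Tok next(Tok resToken) {
        if (upto >= words.length) return null;                // end of stream
        Tok out = (resToken != null) ? resToken : new Tok();  // honor the hint if given
        out.text = words[upto++];
        return out;
    }
}
```

A buffering filter built on this shape could simply ignore resToken and return its own Token, which is exactly the flexibility being argued for.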

"Doron Cohen" <DORONC@il.ibm.com> wrote: > "Michael McCandless" wrote: > > > boolean next(Token resToken) > > > > which returns true if it has updated resToken with another token, > > else false if end-of-stream was hit. > > I would actually prefer > Token next(Token resToken) > because: > - this was the API with reuse is very much like the one without reuse > (except for the reusable param) - easier for app dev. > - it allows to return useful result also in cases where the specific > implementation does not support reuse, or null was passed (indicating > reuse > is not desired), or something in the data/state inhibited reuse.

I see. So the callee is allowed to return a different token than resToken? Ie passing "resToken" is just a suggestion that you may use resToken to return your result but you are not required to. OK I'll try this approach...

On 7/24/07, Michael McCandless <lucene@mikemccandless.com> wrote:
> > > OK, I ran some benchmarks here.
> > >
> > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > PorterStemFilter. I think it's worth pursuing!
> >
> > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > when creating new tokens from the tokenizer)?
>
> I haven't tried this variant yet. I guess for long filter chains the
> GC cost of the tokenizer making the initial token should go down as
> overall part of the time. Though I think we should still re-use the
> initial token since it should (?) only help.

If it weren't any slower, that would be great... but I worry about filters that need buffering (either on the input side or the output side) and how that interacts with filters that try and reuse.

Filters that do output buffering could be slowed down if they must copy the token state into the passed token. I like Doron's idea that a new Token could be returned anyway.

The extra complexity seemingly involved in trying to make both scenarios perform well is what prompts me to ask what the performance gain will be.

"Yonik Seeley" <yonik@apache.org> wrote: > On 7/24/07, Michael McCandless <lucene@mikemccandless.com> wrote: > > "Yonik Seeley" <yonik@apache.org> wrote: > > > On 7/24/07, Michael McCandless <lucene@mikemccandless.com> wrote: > > > > OK, I ran some benchmarks here. > > > > > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and > > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all > > > > Wikipedia content using LowerCaseTokenizer + StopFilter + > > > > PorterStemFilter. I think it's worth pursuing! > > > > > > Did you try it w/o token reuse (reuse tokens only when mutating, not > > > when creating new tokens from the tokenizer)? > > > > I haven't tried this variant yet. I guess for long filter chains the > > GC cost of the tokenizer making the initial token should go down as > > overall part of the time. Though I think we should still re-use the > > initial token since it should (?) only help. > > If it weren't any slower, that would be great... but I worry about > filters that need buffering (either on the input side or the output > side) and how that interacts with filters that try and reuse.

OK I will tease out this effect & measure performance impact.

This would mean that the tokenizer must not only produce a new Token instance for each term but also cannot re-use the underlying char[] buffer in that token, right? EG with mods for CharTokenizer I re-use its char[] buffer with every Token, but I'll change that to be a new buffer for each Token for this test.

> Tokens that do output buffering could be slowed down if they must copy > the token state to the passed token. I like Doron's idea that a new > Token could be returned anyway. > > The extra complexity seemingly involved in trying to make both > scenarios perform well is what prompts me to ask what the performance > gain will be.

Yes I like Doron's idea too -- it's just a "suggestion" to use the input Token if it's convenient.

I think the resulting API is fairly simple with this change: if you (the consumer) want a "full private copy" of the Token (like QueryParser, Highlighter, CachingTokenFilter, a filter that does input buffering, etc.) you call the next() API. If instead you can handle re-use because you will consume this Token once, right now, and never look at it again (like DocumentsWriter), then you call the next(Token) API.

"Yonik Seeley" <yonik@apache.org> wrote: > On 7/25/07, Michael McCandless <lucene@mikemccandless.com> wrote: > > "Yonik Seeley" <yonik@apache.org> wrote: > > > On 7/24/07, Michael McCandless <lucene@mikemccandless.com> wrote: > > > > "Yonik Seeley" <yonik@apache.org> wrote: > > > > > On 7/24/07, Michael McCandless <lucene@mikemccandless.com> wrote: > > > > > > OK, I ran some benchmarks here. > > > > > > > > > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and > > > > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all > > > > > > Wikipedia content using LowerCaseTokenizer + StopFilter + > > > > > > PorterStemFilter. I think it's worth pursuing! > > > > > > > > > > Did you try it w/o token reuse (reuse tokens only when mutating, not > > > > > when creating new tokens from the tokenizer)? > > > > > > > > I haven't tried this variant yet. I guess for long filter chains the > > > > GC cost of the tokenizer making the initial token should go down as > > > > overall part of the time. Though I think we should still re-use the > > > > initial token since it should (?) only help. > > > > > > If it weren't any slower, that would be great... but I worry about > > > filters that need buffering (either on the input side or the output > > > side) and how that interacts with filters that try and reuse. > > > > OK I will tease out this effect & measure performance impact. > > > > This would mean that the tokenizer must not only produce new Token > > instance for each term but also cannot re-use the underlying char[] > > buffer in that token, right? > > If the tokenizer can actually change the contents of the char[], then > yes, it seems like when next() is called rather than next(Token), a > new char[] would need to be allocated.

Right. So I'm now testing "reuse all" vs "tokenizer makes a full copy but filters get to re-use it".

> > EG with mods for CharTokenizer I re-use > > its "char[] buffer" with every Token, but I'll change that to be a new > > buffer for each Token for this test. > > It's not just for a test, right? If next() is called, it can't reuse > the char[]. And there is no getting around the fact that some > tokenizers will need to call next() because of buffering.

Correct -- the way I'm doing this now is that in TokenStream.java I have a default "Token next()" which calls "next(Token result)" but makes a complete copy before returning it. This keeps full backwards compatibility even when a consumer wants a private copy (calls next()) but the provider only implements the re-use API (next(Token result)).
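The backward-compatibility shim described above might look something like the following sketch (hypothetical code, not the actual TokenStream.java patch): the default next() delegates to the reuse-oriented next(Token) and returns a full copy, so a legacy caller gets a private Token even from a provider that internally recycles one.

```java
class Token {
    char[] termBuffer = new char[16];
    int termLength;

    void setTerm(char[] buf, int off, int len) {
        if (termBuffer.length < len) termBuffer = new char[len];
        System.arraycopy(buf, off, termBuffer, 0, len);
        termLength = len;
    }

    String term() { return new String(termBuffer, 0, termLength); }

    Token copy() {
        Token t = new Token();
        t.setTerm(termBuffer, 0, termLength);
        return t;
    }
}

abstract class TokenStream {
    // Reuse-oriented API: the provider may fill and return the supplied
    // Token, or return its own internally recycled one.
    abstract Token next(Token result);

    // Legacy API: default implementation delegates to next(Token) and
    // returns a complete copy, so old consumers still get a private
    // Token from a reuse-only provider.
    Token next() {
        Token t = next(new Token());
        return t == null ? null : t.copy();
    }
}
```

Because next() copies before returning, a provider that hands back the same internal Token on every call cannot clobber a Token the consumer is still holding.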

On 7/26/07, Steven Parkes <steven_parkes@esseff.org> wrote: > First I create a single large file that has one doc per line > from > Wikipedia content, using this alg > > Anybody disagree that the 1-line-per-doc format is better (at least for > Wikipedia)? If so, I'll get rid of the intermediate one-file-per-doc > step.

+1. Opening a lot of files is overhead, and something we aren't trying to test.

OK, I tested the case where CharTokenizer makes a new Token (and new char[] array) for every token instead of re-using a single Token. This way is 6% slower than fully re-using the Token (585 sec -> 618 sec), using the same test as described in https://issues.apache.org/jira/browse/LUCENE-969.

> > > EG with mods for CharTokenizer I re-use > > > its "char[] buffer" with every Token, but I'll change that to be a new > > > buffer for each Token for this test. > > > > It's not just for a test, right? If next() is called, it can't reuse > > the char[]. And there is no getting around the fact that some > > tokenizers will need to call next() because of buffering. > > Correct -- the way I'm doing this now is in TokenStream.java I have a > default "Token next()" which calls "next(Token result)" but makes a > complete copy before returning it. This keeps full backwards > compatibility even in the case where a consumer wants a private copy > (calls next()) but the provider only provides the "re-use" API > (next(Token result)).