Description

In LUCENE-1793, there is an off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer; the original supposition was that it provided a more compact index.

This led to the comment that a different or compressed encoding would be a generally useful feature.

BOCU-1 was suggested as a possibility. This is an algorithm patented by IBM, with an implementation in ICU. If Lucene provides its own implementation, a freely available, royalty-free license would need to be obtained.

SCSU is another Unicode compression algorithm that could be used.

An advantage of these methods is that they work on the whole of Unicode. If that is not needed, an encoding such as ISO-8859-1 (or whatever covers the input) could be used.

Robert Muir
added a comment - 18/Nov/09 13:25 Earwin, if implemented as a directory, we lose many of the advantages.
For example, if you are using BOCU-1, let's say with Hindi, then according to the stats here: http://unicode.org/notes/tn6/#Performance
you can encode/decode BOCU-1 to/from UTF-16 more than twice as fast as you can UTF-8 to/from UTF-16 (for this language).
Also, the resulting bytes are less than half the size of UTF-8 (for this language), yet sort order is still preserved, so it should work for the term dictionary, etc.
Note: I have never measured BOCU-1 performance in practice.
I took a look at the flex indexing branch, and it appears this might be possible in the future through a codec...

Earwin Burrfoot
added a comment - 18/Nov/09 16:10 > Earwin, if implemented as a directory, we lose many of the advantages.
Damn. I believed all strings pass through read/writeString() on IndexInput/Output. Naive. Well, one can patch UnicodeUtil, but the solution is no longer elegant.
Waiting for flexible indexing, hoping it's gonna be flexible..

Robert Muir
added a comment - 18/Nov/09 16:29 > Waiting for flexible indexing, hoping it's gonna be flexible..
It looked to me, at a glance, that some things would still be weird. TermVectors, for example, aren't "flexible" yet, so they wouldn't be BOCU-1.
I do not know if, in flexible indexing, it will be possible for a codec to change behavior like this...
Maybe someone knows if this is planned eventually or not?

Robert Muir
added a comment - 18/Nov/09 18:27 > The flex API will let you completely customize how the terms dict/index is encoded, but not yet term vectors.
Thanks Mike! As far as the encoding itself, BOCU-1 is available in the ICU library, so we do not need to implement it and deal with the conformance/patent stuff
(to get the royalty-free patent license you must be "fully compliant"; they have already done this).
If this feature is desired, I think something like a Codec in contrib that encodes the index with BOCU-1 from ICU would be the best.

Robert Muir
added a comment - 18/Nov/09 18:30 By the way, here are even more details on BOCU-1, with more in-depth size and performance figures than the UTN:
http://icu-project.org/repos/icu/icuhtml/trunk/design/conversion/bocu1/bocu1.html

Earwin Burrfoot
added a comment - 18/Nov/09 19:11 > as far as the encoding itself, BOCU-1 is available in the ICU library
ICU's API requires ByteBuffer and CharBuffer for input/output. And even if I missed some nice method, the encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.

Robert Muir
added a comment - 18/Nov/09 19:26 > ICU's API requires to use ByteBuffer and CharBuffer for input/output. And even if I missed some nice method, encoder/decoder operates internally on said buffers. Thus, a wrap/unwrap for each String is inevitable.
Earwin, at least in ICU trunk you have the following (public class) in com.ibm.icu.impl.BOCU:
public static int compress(String source, byte buffer[], int offset)
public static int getCompressionLength(String source)
...
But I think this class only supports encoding, not decoding (it is only used by the Collation API for so-called BOCSU).
For decoding, we might have to use the registered charset and ByteBuffer... unless there's another way.
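The per-String wrap/unwrap through the java.nio charset API that Earwin and Robert describe can be sketched roughly as follows. This is only an illustration: the class and method names are made up, and it assumes that the name "BOCU-1" resolves only when the ICU4J charset provider jar is on the classpath, falling back to UTF-8 otherwise so that the sketch runs on a plain JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Assumption: "BOCU-1" is only resolvable when the ICU4J charset provider
    // is installed; otherwise UTF-8 is used as a stand-in.
    static Charset pickCharset() {
        return Charset.isSupported("BOCU-1")
                ? Charset.forName("BOCU-1")
                : StandardCharsets.UTF_8;
    }

    // Every term has to be wrapped in a CharBuffer and unwrapped from a ByteBuffer.
    static byte[] encode(Charset cs, String term) {
        ByteBuffer bb = cs.encode(CharBuffer.wrap(term));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    // Decoding goes the other way: wrap the bytes, unwrap the chars.
    static String decode(Charset cs, byte[] bytes) {
        return cs.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        Charset cs = pickCharset();
        String term = "\u0939\u093F\u0928\u094D\u0926\u0940"; // a Hindi word
        byte[] bytes = encode(cs, term);
        System.out.println(cs.name() + ": " + bytes.length + " bytes, round trip ok = "
                + decode(cs, bytes).equals(term));
    }
}
```

The extra buffer objects per term are the overhead being complained about; a UnicodeUtil-style method writing straight into a byte[] would avoid them.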

Robert Muir
added a comment - 18/Nov/09 19:38 Earwin, I do not really like this implementation either.
So it would of course be better to have something more suitable, similar to UnicodeUtil, plus you could ditch the lib dependency.
But then I guess we have to deal with this patent thing... I do not really know what is involved with that.

Earwin Burrfoot
added a comment - 18/Nov/09 21:51 > but then i guess we have to deal with this patent thing... i do not really know what is involved with that.
CPAN holds a BOCU-1 implementation, derived from the "Sample C code", with all necessary copyrights and the patent mentioned, but there's no word of them formally obtaining a license. I'm not sure if this is okay, or just overlooked.

DM Smith
added a comment - 19/Nov/09 16:11 The sample code is probably what is on this page, here:
http://unicode.org/notes/tn6/#Sample_Code
From what I gather reading the whole page:
If we port the sample code and the test case and then demonstrate that all tests pass, then we will be granted a license.
There's contact info at the bottom of the page for getting the license. Maybe contact them for clarification?
As the code is fairly small, I don't think it would be too hard to port. The trick is that the sample code appears to deal in 32-bit arrays and we'd probably want a byte[].

Robert Muir
added a comment - 20/Jul/10 19:35 attached is a simple prototype for encoding terms as BOCU-1
So while I don't think things like wildcard, etc. will work due to the nature of BOCU-1, term and phrase queries should work fine, and it maintains UTF-8 order so sorting is fine, and range queries should work once we fix TermRangeQuery to use bytes.
the impl is probably a bit slow (uses the charset API) as it's just for playing around.
note: I didn't check the box because of the patent thing (not sure it even applies since I use the ICU impl here), but either way I don't want to involve myself with that.
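Why the byte order matters for sorting and range queries can be seen with plain UTF-8, without any BOCU-1 code. The sketch below (class and method names are illustrative only) compares two terms as Java Strings, which is UTF-16 code unit order, and as unsigned UTF-8 bytes, which is code point order. The thread states that BOCU-1 keeps the same order as UTF-8; that claim is taken from the comments above and is not demonstrated here.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Compare two byte arrays as unsigned bytes, the way a binary term
    // dictionary would order its entries.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF61";        // U+FF61, a BMP character
        String supp = "\uD800\uDC00"; // U+10000, encoded as a surrogate pair in UTF-16

        int utf16Order = bmp.compareTo(supp);
        int utf8Order = compareUnsigned(bmp.getBytes(StandardCharsets.UTF_8),
                                        supp.getBytes(StandardCharsets.UTF_8));

        // The two orders disagree for this pair of terms.
        System.out.println("UTF-16 order: " + Integer.signum(utf16Order));
        System.out.println("UTF-8 byte order: " + Integer.signum(utf8Order));
    }
}
```

A range query that compares chars while the index is sorted by bytes would therefore return inconsistent results for such terms, which is presumably why TermRangeQuery needs to compare bytes.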

Uwe Schindler
added a comment - 20/Jul/10 19:48 - edited For correctness of code: target.offset = buffer.arrayOffset() + buffer.position();
In most cases position() will be 0, but ignoring it is quite often an error. If you use limit() you have to use position() too, else it's inconsistent. arrayOffset() gives the offset corresponding to position=0. And length should be remaining() (for example, see the payload contrib IdentityEncoder).
And the default factory could be a singleton...
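Uwe's point about arrayOffset(), position() and remaining() can be checked with a small standalone example. The helper name `window` is hypothetical; the ByteBuffer calls are the standard java.nio ones.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class BufferWindow {
    // Returns {offset, length} of the readable region inside buffer.array().
    static int[] window(ByteBuffer buffer) {
        int offset = buffer.arrayOffset() + buffer.position();
        int length = buffer.remaining(); // limit() - position()
        return new int[] { offset, length };
    }

    public static void main(String[] args) {
        byte[] backing = {10, 11, 12, 13, 14, 15, 16, 17};
        // The slice starts at index 2 of the backing array:
        // arrayOffset() == 2, position() == 0, limit() == 5.
        ByteBuffer slice = ByteBuffer.wrap(backing, 2, 5).slice();
        slice.get(); // consume one byte, position() is now 1

        int[] w = window(slice);
        byte[] visible = Arrays.copyOfRange(slice.array(), w[0], w[0] + w[1]);
        System.out.println(Arrays.toString(w) + " -> " + Arrays.toString(visible));
    }
}
```

Using arrayOffset() alone, or limit() as the length, would point at the wrong bytes as soon as the buffer is a slice or has been partially read.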

Robert Muir
added a comment - 20/Jul/10 19:56 Uwe, sure, if we were to implement this I wouldn't use NIO anyway though. Like I said, I don't plan on committing anything (unless something is figured out about the patent), but it might be useful to someone.
I tested this on some hindi text:
||encoding||tii||tis||
|utf-8|60,205|3,740,329|
|bocu-1|28,431|2,168,407|

Uwe Schindler
added a comment - 20/Jul/10 19:58 And one thing more, in the non-array case:
buffer.get(target.bytes, target.offset, limit); target's offset should be set to 0 on all write operations to the BytesRef (see UnicodeUtil.UTF16toUTF8WithHash()). Otherwise the preceding grow() does not resize correctly!
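For the non-array case, a minimal sketch of the copy looks like this. It uses a plain byte[] rather than Lucene's BytesRef so it stays self-contained; the class and method names are illustrative. The destination offset is always 0 and the length is remaining(), so a target sized to exactly that length cannot overflow.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class DirectBufferCopy {
    // Copies the readable bytes of any ByteBuffer (array-backed or not)
    // into a fresh array, writing from index 0.
    static byte[] copyRemaining(ByteBuffer buffer) {
        int length = buffer.remaining();
        byte[] target = new byte[length];
        // duplicate() shares the content but has its own position,
        // so the caller's buffer is left untouched.
        buffer.duplicate().get(target, 0, length);
        return target;
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(4); // hasArray() is false here
        direct.put(new byte[] {1, 2, 3, 4});
        direct.flip();
        direct.get(); // skip the first byte

        System.out.println(Arrays.toString(copyRemaining(direct)));
    }
}
```

If the write started at a non-zero offset while the array had only been grown to `length`, the last bytes would fall outside the array, which is the resize problem described above.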

Uwe Schindler
added a comment - 20/Jul/10 20:26 Here is the polished one
In my opinion something is better than nothing. The patents are not violated here, as we only use an abstract API and the string "BOCU-1". You can use the same code to encode in "ISO-8859-1".

Uwe Schindler
added a comment - 21/Jul/10 07:07 Here is a 100% legally valid implementation:
Linking to icu4j-charsets is done dynamically by reflection. If you don't have ICU4J charsets in your classpath, the attribute throws an explanatory exception
We don't need to ship the rather large JAR file with Lucene just for this class
We don't have legal patent problems, as we neither ship the API nor use it directly
The downside is that the test simply prints a warning but passes, so the class is not tested until you install icu4j-charsets.jar. We can put the JAR file on hudson, so it can be used during nightly builds. Or we download it dynamically on build.
I added further improvements to the encoder itself:
fewer variables
correct error handling for encoding errors
remove floating point from main loop
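The reflection-based linking could look roughly like the sketch below. This is not the attached patch. It assumes that icu4j-charsets provides a class `com.ibm.icu.charset.CharsetProviderICU` with a public no-argument constructor and a `charsetForName(String)` method; verify that against the ICU4J version you use. The wrapper class name is made up.

```java
import java.nio.charset.Charset;

class OptionalIcuCharset {
    // Assumption: icu4j-charsets ships com.ibm.icu.charset.CharsetProviderICU
    // with a no-arg constructor and charsetForName(String).
    static Charset loadBocu1() {
        try {
            Class<?> provider = Class.forName("com.ibm.icu.charset.CharsetProviderICU");
            Object instance = provider.getDeclaredConstructor().newInstance();
            Charset cs = (Charset) provider.getMethod("charsetForName", String.class)
                    .invoke(instance, "BOCU-1");
            if (cs == null) {
                throw new UnsupportedOperationException(
                        "icu4j-charsets was found but did not provide a BOCU-1 charset");
            }
            return cs;
        } catch (ReflectiveOperationException e) {
            // No compile-time dependency: a missing jar only shows up here, at runtime.
            throw new UnsupportedOperationException(
                    "BOCU-1 needs icu4j-charsets.jar on the classpath", e);
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println("Loaded charset: " + loadBocu1().name());
        } catch (UnsupportedOperationException e) {
            System.out.println("Warning: " + e.getMessage());
        }
    }
}
```

As noted above, the price of this approach is that a test can only warn, not fail, when the jar is absent.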

Uwe Schindler
added a comment - 21/Jul/10 09:03 A new patch that completely separates the BOCU factory from the implementation (which moves to common/miscellaneous). This has the following advantages:
You can use any Charset to encode your terms. The javadocs should only note that the byte[] order should be correct for range queries to work
Theoretically you could remove the BOCU classes altogether; anyone who wants to use it can simply get the Charset from ICU's factory and pass it to the AttributeFactory. The convenience class is still useful, especially if we can later natively implement the encoding without NIO (when patent issues are solved...)
The test for the CustomCharsetTermAttributeFactory uses UTF-8 as charset and verifies that the created BytesRefs have the same format as a BytesRef created using UnicodeUtil.
The test also checks that encoding errors are bubbled up as RuntimeExceptions
TODO:
docs
make handling of encoding errors configurable (replace with replacement char?)
If you want your complete index e.g. in ISO-8859-1, there should also be convenience methods in the factory/attribute that take CharSequences/char[] to quickly convert strings to BytesRefs like UnicodeUtil does. That makes it possible to create TermQueries directly using e.g. ISO-8859-1 encoding.
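The "encoding errors are bubbled up as RuntimeExceptions" behaviour can be sketched with the standard CharsetEncoder API. The class below is a hypothetical stand-in, not the CustomCharsetTermAttributeFactory from the patch; ISO-8859-1 is used only because it makes an unmappable character easy to trigger.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictTermEncoder {
    // Encodes a term with the given charset and reports, rather than replaces,
    // anything the charset cannot represent.
    static byte[] encode(Charset charset, CharSequence term) {
        try {
            ByteBuffer bb = charset.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(term));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new RuntimeException("cannot encode term in " + charset.name(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(StandardCharsets.ISO_8859_1, "caf\u00e9").length + " bytes");
        try {
            encode(StandardCharsets.ISO_8859_1, "\u65e5\u672c"); // not representable in Latin-1
        } catch (RuntimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Making the error handling configurable, as the TODO suggests, would amount to passing CodingErrorAction.REPLACE instead of REPORT.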

Michael McCandless
added a comment - 21/Jul/10 09:24 This is fabulous! And a great example of what's now possible w/ the cutover to opaque binary terms w/ flex – makes it easy to swap out how terms are encoded.
BOCU-1 is a much more compact encoding than UTF-8 for non-Latin languages.
This encoding would also naturally reduce the RAM required for the terms index and Terms/TermsIndex FieldCache (used when you sort by string field) as well, since Lucene just loads the [opaque] term bytes into RAM.
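The cost that UTF-8 imposes on non-Latin scripts is easy to see from byte counts alone. The snippet below only measures UTF-8 and does not implement BOCU-1; the sample words and the class name are arbitrary choices for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Footprint {
    // Number of bytes a term occupies when stored as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String english = "lucene";                               // 6 chars
        String russian = "\u043f\u043e\u0438\u0441\u043a";       // 5 chars, Cyrillic
        String hindi = "\u0939\u093F\u0928\u094D\u0926\u0940";   // 6 chars, Devanagari

        System.out.println("english: " + english.length() + " chars, " + utf8Bytes(english) + " bytes");
        System.out.println("russian: " + russian.length() + " chars, " + utf8Bytes(russian) + " bytes");
        System.out.println("hindi:   " + hindi.length() + " chars, " + utf8Bytes(hindi) + " bytes");
    }
}
```

ASCII costs one byte per character, Cyrillic two and Devanagari three, and since term bytes are loaded into RAM as-is, that multiplier carries over to the terms index and FieldCache.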

Robert Muir
added a comment - 21/Jul/10 11:53 > You can use any Charset to encode your terms. The javadocs should only note, that the byte[] order should be correct for range queries to work
I don't think we should add support for any non-unicode character sets.
> If you want your complete index e.g. in ISO-8859-1
I am 100% against doing this.

Michael McCandless
added a comment - 21/Jul/10 12:18 Is there any reason not to make BOCU-1 Lucene's default encoding?
UTF8 penalizes non-english languages, and BOCU-1 does not, and it sounds like we expect little to no indexing or searching perf penalty (once we have a faster interface to BOCU1, eg our own private impl, like UnicodeUtil).

Robert Muir
added a comment - 21/Jul/10 12:23 > Is there any reason not to make BOCU-1 Lucene's default encoding?
in my opinion, just IBM. But maybe we can make a strong implementation and they will approve it and give us a patent license:
http://unicode.org/notes/tn6/#Intellectual_Property
> UTF8 penalizes non-english languages, and BOCU-1 does not, and it sounds like we expect little to no indexing or searching perf penalty (once we have a faster interface to BOCU1, eg our own private impl, like UnicodeUtil).
I'd like to play with swapping it in as the default, just to see what problems (if any) there are, and to make sure all queries are supported, etc. I can upload a new patch that does it this way and we can play.

Michael McCandless
added a comment - 21/Jul/10 12:42
> Is there any reason not to make BOCU-1 Lucene's default encoding?
> in my opinion, just IBM
But... ICU's license is compatible w/ ASL (I think), and includes a working impl of BOCU-1, so aren't we in the clear here? Ie we are free to take that impl, tweak it, add to our sources, and include ICU's license in our LICENSE/NOTICE?

Robert Muir
added a comment - 21/Jul/10 12:54 > But... ICU's license is compatible w/ ASL (I think), and includes a working impl of BOCU-1, so aren't we in the clear here? Ie we are free to take that impl, tweak it, add to our sources, and include ICU's license in our LICENSE/NOTICE?
I don't know... personally I wouldn't feel comfortable committing something without getting guidance first. But we can explore the technicals with patches on this jira issue and not check the box, and I think this is all ok for now.

Robert Muir
added a comment - 22/Jul/10 11:29 attached is a really really rough patch that sets bocu-1 as the default encoding.
Beware: it's a work in progress, and a lot of the patch is auto-generated (Eclipse), so some things need to be reverted.
Most tests pass; the idea is to find bugs in tests etc. that abuse BytesRef or assume UTF-8 encoding, things like that.

Robert Muir
added a comment - 22/Jul/10 11:34 btw that patch is huge because I just sucked in the ICU charset stuff to have an implementation that works for testing...
it's not intended to ever be that way, as we would just implement the stuff we need without this code, but it makes it easier to test since you don't need any external jars or to muck with the build system at all.

Robert Muir
added a comment - 27/Jul/10 21:38 attached is a patch for the start of a "BOCUUtil" with UnicodeUtil-like methods.
For now I only implemented encode (and encodeWithHash).
I generated random strings with _TestUtil.randomRealisticUnicodeString and benchmarked, and the numbers are stable.
||encoding||time to encode 20 million strings (ms)||number of encoded bytes||
|UTF-8|1,757|596,516,000|
|BOCU-1|1,968|250,202,000|
So I think we get good compression, and good performance close to UTF-8 for encode.
I'll work on decode now.

Robert Muir
added a comment - 28/Jul/10 00:06 I ran tests, and each one of Mike's optimizations speeds up the encode...
I think I agree with not unrolling the 4-byte case: the "diff" from the previous character has to be > 187659 [0x2dd0b].
This is like pottery writings and oracle bone script... but the previous ones (2x, 3x) speed up CJK and other scripts and are very useful.
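To make the 187659 threshold concrete, here is a small sketch of how many bytes BOCU-1 spends on a given difference from the previous state. The reach constants are written down from memory of the UTN #6 sample code and should be checked against it; the positive three-byte limit matches the 187659 quoted above. The sketch only classifies a difference and does not produce actual BOCU-1 bytes.

```java
class Bocu1DiffLength {
    // Assumed reach constants (from the UTN #6 sample code, to be verified):
    // the largest and smallest differences that fit in 1, 2 and 3 bytes.
    static final int REACH_POS_1 = 63;
    static final int REACH_NEG_1 = -64;
    static final int REACH_POS_2 = 10512;
    static final int REACH_NEG_2 = -10513;
    static final int REACH_POS_3 = 187659;
    static final int REACH_NEG_3 = -187660;

    // Number of bytes needed to encode one difference value.
    static int diffLength(int diff) {
        if (diff >= REACH_NEG_1 && diff <= REACH_POS_1) {
            return 1;
        }
        if (diff >= REACH_NEG_2 && diff <= REACH_POS_2) {
            return 2;
        }
        if (diff >= REACH_NEG_3 && diff <= REACH_POS_3) {
            return 3;
        }
        return 4;
    }

    public static void main(String[] args) {
        int[] samples = {0, 63, 64, 10512, 10513, 187659, 187660};
        for (int diff : samples) {
            System.out.println(diff + " -> " + diffLength(diff) + " byte(s)");
        }
    }
}
```

A four-byte difference needs a jump of more than 187659 code points between consecutive characters, which only happens when text moves between the BMP and distant supplementary planes, so leaving that branch unoptimized costs little.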

Robert Muir
added a comment - 28/Jul/10 12:11 i optimized the surrogate case here, moving it into the 'prev' calculation.
now we are faster than utf-8 on average for encode.
||encoding||time to encode 20 million strings (ms)||number of encoded bytes||
|UTF-8|1,756|596,516,000|
|BOCU-1|1,724|250,202,000|

Yonik Seeley
added a comment - 28/Jul/10 17:59 I took a stab at benchmarking encoding speed only with some different languages.
I encoded a word at a time (which happens at indexing time).
I used some text from wikipedia in different languages: english, german, french, spanish, and chinese.
I used WhitespaceAnalyzer for the first 4 and StandardAnalyzer for the chinese (but analysis speed is not measured.)
||encoding||english||german||french||spanish||chinese||
|UTF-8 size|1888|4550|4875|5123|4497|
|BOCU-1 size|1888|4610|4995|5249|4497|
|BOCU slowdown|29%|39%|47%|61%|80%|
I suspect that the StandardAnalyzer is spitting out individual CJK chars, and hence the same size of BOCU-1 and UTF-8?
I'll try and see if I can get SmartChineseAnalyzer working and re-run the chinese test.

Robert Muir
added a comment - 28/Jul/10 18:19 yonik, what were you benchmarking? I think you should benchmark overall indexing time, of which encode is just a blip (<1%).
and yes, since the start state is 0x40, the FIRST CJK char is a diff from 0x40, but any subsequent ones yield savings.
in general you won't get much compression for chinese... I'd say max 25%
for russian, arabic, hebrew, japanese it will do a lot better: max 40%
for indian languages you tend to get about 50%.
I also don't know how you encoded a word at a time, because I get quite different results. I focused a lot on 'single-byte diffs' to be fast (e.g. just subtraction) and I think I do a lot better for english than the 160% described in http://unicode.org/notes/tn6/
Furthermore, utf-8 is a complete no-op for english, so being a compression algorithm that is only 29% slower than (byte) char is good in my book, but I don't measure 29% for english.
I don't think there is any problem in encode speed at all.
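The effect of the 0x40 start state on short CJK terms can be estimated with a rough length calculator. This is a sketch under assumptions: the reach limits and the rule for the next "prev" state are reproduced from memory of the UTN #6 sample code, control characters and space (code points up to 0x20, which BOCU-1 handles specially) are not supported, and no real BOCU-1 bytes are produced. All names are hypothetical.

```java
class Bocu1LengthEstimate {
    // Bytes needed for one difference; limits assumed from the UTN #6 sample code.
    static int diffLength(int diff) {
        if (diff >= -64 && diff <= 63) return 1;
        if (diff >= -10513 && diff <= 10512) return 2;
        if (diff >= -187660 && diff <= 187659) return 3;
        return 4;
    }

    // State used as the base for the next difference (assumed from UTN #6).
    static int nextPrev(int c) {
        if (c >= 0x3040 && c <= 0x309f) return 0x3070;               // Hiragana
        if (c >= 0x4e00 && c <= 0x9fa5) return 0x4e00 + 10513;       // CJK Unihan
        if (c >= 0xac00 && c <= 0xd7a3) return (0xd7a3 + 0xac00) / 2; // Hangul
        return (c & ~0x7f) + 0x40;                                   // middle of the 128-block
    }

    // Estimated encoded length of a term that contains no code points <= 0x20.
    static int estimate(String term) {
        int prev = 0x40; // start state
        int total = 0;
        for (int i = 0; i < term.length(); ) {
            int c = term.codePointAt(i);
            i += Character.charCount(c);
            total += diffLength(c - prev);
            prev = nextPrev(c);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("one CJK char:  " + estimate("\u4e2d"));
        System.out.println("two CJK chars: " + estimate("\u4e2d\u6587"));
        System.out.println("hindi word:    " + estimate("\u0939\u093F\u0928\u094D\u0926\u0940"));
        System.out.println("english word:  " + estimate("lucene"));
    }
}
```

Under these assumptions a single-character CJK term costs the same as in UTF-8, because its only difference is measured from 0x40, which is consistent with identical sizes when an analyzer emits one character per token.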

Michael Busch
added a comment - 28/Jul/10 18:24 Yonik can you give more details about how you ran your tests?
Was it an isolated string encoding test or does BOCU slow down overall indexing speed by 29%-80% (which would be hard to believe).

Yonik Seeley
added a comment - 28/Jul/10 18:43 > Yonik can you give more details about how you ran your tests?
I'm isolating encoding speed only (not analysis, not indexing, etc) of tokens in different languages.
So I took some text from wikipedia, analyzed it to get a list of char[], then encoded each char[] in a loop. It's only the last step that is benchmarked, to isolate the encode performance. I'm certainly not claiming that indexing is n% slower.

Robert Muir
added a comment - 28/Jul/10 18:53 attached is my benchmark for english text.
UTF-8: 15530ms
BOCU-1: 15687ms
Note, i use a Sun JVM 1.6.0_19 (64bit)
Yonik, if you run this benchmark and find a problem with it, or it's slower on your machine, let me know your configuration, because I don't see the results you do.

Yonik Seeley
added a comment - 28/Jul/10 18:57 > in general you wont get much compression for chinese.. id say max 25%
Ah, OK.
I just tried russian w/ whitespace analyzer used to split and did get a good size savings:
UTF8_size=11056 BOCU-1_size=6810 BOCU-1_slowdown=32%

Robert Muir
added a comment - 28/Jul/10 18:59 Yonik, please see my issue.
the fact we can encode 100 million terms in 15 seconds means any speed stuff is meaningless (though I still insist something is wrong: either your benchmark, or it runs slower on your JDK or something, which we should try to improve).

Michael McCandless
added a comment - 28/Jul/10 19:03 The char[] -> byte[] encode time is a minuscule part of indexing time. And, in turn, indexing time is far less important than impact on search performance. So... let's focus on the search performance here.
Most queries are unaffected by the term encoding; it's only AutomatonQuery (= fuzzy, regexp, wildcard) that do a fair amount of decoding...
Net/net BOCU1 sounds like an awesome win over UTF8.

Yonik Seeley
added a comment - 28/Jul/10 19:31 OK, I just tried Robert's Benchmark.java (i.e. fake english word encoding):
UTF8=15731 BOCU-1=16961 (lowest of 5 diff runs)
But looking at the benchmark, it looks like the majority of the time could be just making random strings.
I made a modified Benchmark.java that pulls out this string creation and only tests encoding performance.
Here are my results:
UTF8=2936 BOCU-1=4310
It turns out that making the random strings to encode took up 81% of the UTF8 time.
System: Win7 64 bit, JVM=Sun 1.6.0_21 64 bit -server
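The change Yonik describes, moving string creation out of the timed region, can be sketched as below. This is not the Benchmark.java attached to the issue: the class name, token generator and sizes are invented, and only UTF-8 via the JDK is measured since no BOCU-1 encoder is assumed to be available.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class EncodeOnlyBenchmark {
    // Builds fake lowercase ASCII "words" up front so generation is never timed.
    static List<String> makeTokens(int count, long seed) {
        Random random = new Random(seed);
        List<String> tokens = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int length = 1 + random.nextInt(10);
            StringBuilder sb = new StringBuilder(length);
            for (int j = 0; j < length; j++) {
                sb.append((char) ('a' + random.nextInt(26)));
            }
            tokens.add(sb.toString());
        }
        return tokens;
    }

    // The only work inside the timed region: encode every token, sum the bytes.
    static long encodeAll(List<String> tokens) {
        long totalBytes = 0;
        for (String token : tokens) {
            totalBytes += token.getBytes(StandardCharsets.UTF_8).length;
        }
        return totalBytes;
    }

    public static void main(String[] args) {
        List<String> tokens = makeTokens(1_000_000, 42L); // setup, not timed
        encodeAll(tokens);                                // warm-up pass for the JIT

        long start = System.nanoTime();
        long bytes = encodeAll(tokens);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("encoded " + bytes + " bytes in " + elapsedMs + " ms");
    }
}
```

Summing the byte counts keeps the JIT from discarding the encode call. As noted in the replies, a micro-benchmark like this still says nothing about end-to-end indexing time.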

Robert Muir
added a comment - 28/Jul/10 19:42 That's good news, so we can encode 100 million strings in 4.3 seconds?
I don't think we need to discuss performance any further, this is a complete non-issue.

Yonik Seeley
added a comment - 28/Jul/10 19:59 > Thats good news, so we can encode 100 million strings in 4.3 seconds? I dont think we need to discuss performance any further, this is a complete non-issue.
Well... hopefully it's not an issue.
That should really be tested with real indexing when the time comes (micro-benchmarks can do funny things).

Robert Muir
added a comment - 28/Jul/10 20:05
> Well... hopefully it's not an issue.
> That should really be tested with real indexing when the time comes (micro-benchmarks can do funny things).
it's definitely not an issue: no lucene indexer can do anything with 100 million strings in any reasonable time where this will matter.
instead, most non-latin languages will be writing fewer bytes, causing less real i/o, using half the RAM at search time, etc., which is way more dramatic.
utf-8 is a non-option for our internal memory encoding. I'm suggesting bocu-1, but if you want to try to fight me all the way, then I'll start fighting for a reversion back to char[] instead... it's at least less biased.

Robert Muir
added a comment - 28/Jul/10 20:39 I don't think it's measurable. 100 million strings in 4.3 seconds? This has no effect.
keep in mind, I fixed the analysis in 3.1 and doubled the speed of the default english indexing in solr,
so if you want to improve indexing speed, I think you will be more successful looking at other parts of the code.

Yonik Seeley
added a comment - 28/Jul/10 20:54
> so if you want to improve indexing speed, i think you will be more successful looking at other parts of the code.
I have only been measuring performance at this point, and I haven't expressed an opinion about what defaults should be used.
If we convert to BOCU-1 as a default, and if UTF-8 remains an option, then I'd at least want to be able to document any trade-offs and when people should consider setting the encoding back to UTF-8.

Robert Muir
added a comment - 28/Jul/10 21:15
> I have only been measuring performance at this point
You haven't really been measuring performance, you have just been trying to pick a fight.
Any difference in encode has almost no effect on indexing speed. Like I said, 100 million strings in 4.3 seconds?
You aren't factoring I/O or RAM into the equation for the writing systems (of which there are many) where this actually cuts terms to close to half their size.
Since this is a compression algorithm (and I'm still working on it), it's vital to include these things, and not post useless benchmarks about whether it takes 2.9 or 4.3 seconds to encode 100 million strings, which nothing in Lucene can do anything with in any short time anyway.
I have a benchmark for UTF-8: I have a lot of text that is twice as big on disk, causes twice as much I/O, and eats up twice as much RAM as it should.
BOCU-1 fixes that, and at the same time keeps ASCII at a single-byte encoding (and other Latin languages are very close),
so everyone can potentially win.

Yonik Seeley
added a comment - 28/Jul/10 21:40
> You havent really been measuring performance, you have just been trying to pick a fight.
I'm sorry if it appeared that way, and apologize for anything I said to encourage that perception.
I was genuinely surprised when you reported "now we are faster than utf-8 on average for encode", so I set out to benchmark it myself and report back. In addition, I wanted to see what the encoding speed diff was for some different languages.

Robert Muir
added a comment - 28/Jul/10 21:49
> I was genuinely surprised when you reported "now we are faster than utf-8 on average for encode", so I set out to benchmark it myself and report back. In addition, I wanted to see what the encoding speed diff was for some different languages.
For all of Unicode, yes. You just didn't pick a good variety of languages, or didn't tokenize them well (e.g. using an English tokenizer for Chinese).
I've been measuring against many, and I already checked the bigram (CJKTokenizer) case to make sure that CJK was always smaller (it's not much, e.g. 5 bytes instead of 6, but it's better).

Robert Muir
added a comment - 28/Jul/10 22:09
By the way, to explain your results on French and German:
since the compression is a diff from the 'middle of the alphabet' (Unicode block), an unaccented char, accented char, unaccented char combination will cause two 2-byte diffs.
In UTF-8 this sequence is 4 bytes, but in BOCU-1 it becomes 5.
The reason you experienced anything of measure is, I think, because of WhitespaceAnalyzer (which I feel is a tad unrealistic).
For example, all the German stemmers do something with the umlauts (remove them or substitute ue, oe, etc.).
In general, lots of our analysis for lots of languages folds and normalizes characters in ways like this, which also helps the compression,
so I think if you used GermanAnalyzer on the German text instead of WhitespaceAnalyzer, you wouldn't see much of a size increase.
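
The byte counts above can be checked with a rough length estimate. The sketch below is not a BOCU-1 encoder and is not taken from the patch on this issue: it only counts how many bytes each code point difference would need, using the difference ranges and the "middle of the block" state rule as I recall them from Unicode Technical Note #6. The class name BocuLengthSketch and all method names are hypothetical, and the constants should be treated as assumptions to verify against the note.

```java
import java.nio.charset.StandardCharsets;

public class BocuLengthSketch {
    /** Exact UTF-8 length, via the JDK encoder. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed for one signed difference (ranges assumed from UTN #6). */
    static int diffLength(int diff) {
        if (diff >= -64 && diff <= 63) return 1;
        if (diff >= -10513 && diff <= 10512) return 2;
        if (diff >= -187660 && diff <= 187659) return 3;
        return 4;
    }

    /** Next base value: roughly the middle of the block that contains c. */
    static int nextPrev(int c) {
        if (c >= 0x3040 && c <= 0x309f) return 0x3070;                // Hiragana
        if (c >= 0x4e00 && c <= 0x9fa5) return 0x4e00 + 10513;        // CJK ideographs
        if (c >= 0xac00 && c <= 0xd7a3) return (0xd7a3 + 0xac00) / 2; // Hangul
        return (c & ~0x7f) + 0x40;                                    // small scripts
    }

    /** Estimated BOCU-1 length: each code point is coded as a diff from the base. */
    static int estimateBocu1Length(String s) {
        int prev = 0x40; // assumed initial state
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int c = s.codePointAt(i);
            i += Character.charCount(c);
            if (c <= 0x20) {
                total += 1;                // controls and space take one byte
                if (c < 0x20) prev = 0x40; // controls reset the state, space keeps it
            } else {
                total += diffLength(c - prev);
                prev = nextPrev(c);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] samples = { "a\u00e9b", "\u4e2d\u6587" }; // "aéb" and a CJK bigram
        for (String s : samples) {
            System.out.println(s + " UTF-8=" + utf8Length(s)
                    + " BOCU-1 (estimated)=" + estimateBocu1Length(s));
        }
    }
}
```

Under these assumptions, the unaccented/accented/unaccented sequence "aéb" comes out at 4 bytes in UTF-8 and an estimated 5 in BOCU-1, while a CJK bigram comes out at 6 bytes in UTF-8 and an estimated 5 in BOCU-1, which matches the figures quoted in this thread.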

Robert Muir
added a comment - 28/Jul/10 22:25
> But looking at the benchmark, it looks like the majority of the time could be just making random strings.
> I made a modified Benchmark.java that pulls out this string creation and only tests encoding performance.
> Here are my results:
> UTF8=2936 BOCU-1=4310
I think your benchmark isn't very reliable (I got really different results), so I added an extra 0 to do 10x more terms:
char[][] terms = new char[10000][];
ret=716132704 UTF-8 encode: 35081
ret=716132704 BOCU-1 encode: 36517
Like I said before, I don't see a 20% difference.

Yonik Seeley
added a comment - 28/Jul/10 22:36
> I think your benchmark isnt very reliable (i got really different results), so i added an extra 0 to do 10x more terms:
Did that change the ratio for you? I just tried 10x more terms, and I got the exact same ratio:
ret=708532704 UTF-8 encode: 30524
ret=708532704 BOCU-1 encode: 44635

Robert Muir
added a comment - 28/Jul/10 23:01
Yeah, it did (it didn't seem 'stable', but the first run was much different than yours, e.g. 3300 vs 3500 or so).
I just ran with -server also [using my same 64-bit 1.6.0_19 as before]:
there is more of a difference, however not as much as yours.
ret=704032704 UTF-8 encode: 32134
ret=704032704 BOCU-1 encode: 36391
But go figure, if I run with my 32-bit JVM [same JDK: 1.6.0_19], I get horrible numbers!
Here is -client:
ret=684832704 UTF-8 encode: 26237
ret=684832704 BOCU-1 encode: 54662
Here is -server:
ret=697132704 UTF-8 encode: 30062
ret=697132704 BOCU-1 encode: 46293
So there is definitely an issue with the 32-bit JVM. Are you sure yours is 64-bit?
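
For anyone trying to reproduce this kind of comparison, here is a minimal sketch of the benchmark shape discussed above: terms are generated once outside the timed region, a warm-up pass runs before timing, and the total encoded length is printed as ret= so the JIT cannot discard the work. It is not the Benchmark.java attached to this issue. The class name EncodeBenchSketch, the term count, and the choice of random Devanagari letters are all assumptions made for illustration, and only UTF-8 is timed because the JDK has no built-in BOCU-1 encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.function.ToIntFunction;

public class EncodeBenchSketch {
    /** Builds the test terms once, outside any timed region. */
    static String[] makeTerms(int count, int maxLen, long seed) {
        Random r = new Random(seed);
        String[] terms = new String[count];
        for (int i = 0; i < count; i++) {
            int len = 1 + r.nextInt(maxLen);
            StringBuilder sb = new StringBuilder(len);
            for (int j = 0; j < len; j++) {
                // Hypothetical data choice: Devanagari letters U+0905..U+0939.
                sb.append((char) (0x0905 + r.nextInt(0x35)));
            }
            terms[i] = sb.toString();
        }
        return terms;
    }

    /** Encoded length in UTF-8; a BOCU-1 encoder could be plugged in the same way. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Encodes every term iters times and returns the total byte count as a checksum. */
    static long run(String[] terms, int iters, ToIntFunction<String> encodedLength) {
        long ret = 0;
        for (int it = 0; it < iters; it++) {
            for (String t : terms) {
                ret += encodedLength.applyAsInt(t);
            }
        }
        return ret;
    }

    public static void main(String[] args) {
        String[] terms = makeTerms(10000, 10, 42L);
        run(terms, 100, EncodeBenchSketch::utf8Length); // warm-up, not timed
        long start = System.nanoTime();
        long ret = run(terms, 1000, EncodeBenchSketch::utf8Length);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("ret=" + ret + " UTF-8 encode: " + ms);
    }
}
```

Running the same class under -client and -server, and under 32-bit and 64-bit JVMs, is one way to see how much of a reported ratio comes from the JVM rather than from the encoding itself.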

DM Smith
added a comment - 06/Mar/12 14:49
Would someone be able to champion this? It appears to have been ready to go for the last 1.5 years. It looks like it is merely a permission problem. I'd like to see it get into the 3.x series.