Details

Description

Object Container Files could use a 1-byte sync marker (set to zero), relying on zig-zag and COBS encoding within blocks to efficiently escape zeros from the record data.

Zig-Zag encoding

With zig-zag encoding, only the value 0 (zero) is encoded as a single zero byte. This property means that we can write any non-zero zig-zag long inside a block without concern for creating an unintentional sync byte.
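
For illustration, a minimal sketch of the zig-zag mapping for longs (the method names are illustrative, not Avro's API):

static long zigZagEncode(long n) {
    return (n << 1) ^ (n >> 63);   // arithmetic shift smears the sign: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
}

static long zigZagDecode(long n) {
    return (n >>> 1) ^ -(n & 1);
}

Only an input of 0 encodes to 0, so a minimal varint encoding of any non-zero zig-zag long never emits a lone zero byte.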

Activity

Scott Carey added a comment - 07/May/09 09:02 - edited

An outsider here – I've got an idea on how to avoid the performance pitfalls of COBS' byte-by-byte nature, and as I thought it through I spotted many other opportunities for enhancement, since larger chunks afford a lot more bits in the Code that can be used for things other than the length of the following literal chunk.

Proposal – COLS, a modification of COBS

(for greater performance and extensibility for large data streams)

Java is particularly bad at byte-by-byte operations. The COBS paper clearly indicates its design intent: stuffing data through embedded systems and links such as telephone lines and other networks, where byte-by-byte processing of the whole payload is already mandatory.

Doing so here would be a performance bottleneck in Java. Some simple tests can be constructed to prove or disprove this claim.

I propose that rather than use COBS, one use COLS or COWS ... that is, Constant Overhead Long Stuffing or Constant Overhead Word Stuffing instead.

This would be inefficient if we expect most payloads to be small (less than 256 bytes), but I suspect most Hadoop-related payloads are large, and often very large.

I favor stuffing Longs rather than Ints, since most systems will soon be running 64 bit JVMs. Sun's next JRE release has Object Pointer Compression, which makes the memory overhead of a 64 bit JVM very small compared to a 32 bit JVM, and performance is generally faster than the 32 bit JVM due to native 64 bit operations and more registers (for x86-64 at least).
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/

I will describe the proposal below assuming a translation of COBS to COLS, from 1 byte at a time to 8 byte at a time encoding. However, it is clear that a 4 byte variant is very similar and may be preferable.

Proposed Changes – Simple Block format with COLS

name | format | length in bytes | value | meaning
sync | long | 8 | 0L | The sync long serves as a clear marker for the start of a block.
type | byte | 1 | non-zero | Expresses whether the block is metadata or normal data. Note: if this is only ever a binary flag, it can be packed into the length or sequence number as a sign bit. However, it is critical for decoding performance to keep the non-COLS header 8 byte aligned.
block sequence number | 3 byte unsigned int | 3 | 0 - 2^24 | A client can use this to resume a stream from the last successful block. This may not be needed if the metadata blocks take care of this.
length | 4 byte signed int | 4 | >= 0 | The number of bytes of COLS_payload data. Useful for skipping ahead to the next block.
COLS_payload | COLS | length as above | see COLS description below | The data in this block, encoded.

The above would cap the stream length at 2GB * 16M = 32PB. There is room to increase this significantly by taking bits from the type and giving those to the block count. 2GB blocks are rather unlikely for now, however – as are multi-PB streams.
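
For illustration, a sketch of how a writer might lay down this 16-byte, 8-byte-aligned header; the field sizes follow the table above, while the method name and everything else are assumptions:

import java.nio.ByteBuffer;

// Writes the block header described above: 8-byte sync, 1-byte type,
// 3-byte sequence number, 4-byte payload length (16 bytes total, big-endian).
static void writeBlockHeader(ByteBuffer out, byte type, int seqNo, int payloadLength) {
    out.putLong(0L);                   // sync: a full zero long marks the start of a block
    out.put(type);                     // non-zero; metadata vs. normal data
    out.put((byte) (seqNo >>> 16));    // 3-byte unsigned block sequence number
    out.put((byte) (seqNo >>> 8));
    out.put((byte) seqNo);
    out.putInt(payloadLength);         // COLS_payload length in bytes, >= 0
}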

Discussion

The entire stream would need to be 8 byte aligned in order to process it cleanly with something like java.nio.LongBuffer. This would include metadata blocks.

The sequence is assumed to be in network-order. Endianness can be handled and is not discussed in detail here.

The type can likely be encoded in a single bit in the block sequence number or length field. If more than two types of blocks are expected, more bits can be reserved for future use.

The length can be stored as the number of longs rather than bytes (bytes / 8) since the COLS payload is a multiple of 8 bytes.

The COLS payload here differs from the original proposal. It will have an entire COBS-like stream, with possibly many COLS code markers (at least one per 0L value in the block data).

One may want to have both the encoded length above, and the decoded length (or a checksum) as extra data validation. Perhaps even 4 types: METADATA, METADATA_CSUM, NORMAL, NORMAL_CSUM – where the ordinary variants store the length (fast, but less reliable) and the _CSUM variants store a checksum (slower, but highly reliable).

Basic COBS to COLS description

COBS describes a byte-by-byte encoding where a zero byte cannot exist, and a set of codes is used to encode runs of data that do not contain a zero byte. All codes but one have an implicit trailing zero. The last block is assumed to have no implicit zero regardless of the code.

COLS is a simple extension of this scheme to 64 bit chunks. In its base form, it does nothing more than work with larger chunks:

COLS Code (Long, 8 bytes) | Followed by | Meaning
0L | N/A | (not allowed)
1L | nothing | A single zero Long
2L | one long (8 bytes) | The single data long, followed by a trailing zero long *
3L | two longs (16 bytes) | The pair of data longs, followed by a trailing zero long *
nL | (n-1) longs | The (n-1) longs, followed by a trailing zero long *
MAX ** | MAX - 1 longs | MAX - 1 longs, with no trailing zero

* The last code in the sequence (which can be identified by the length header or a 0L indicating the start of the next block) does NOT have an implicit trailing zero.
** MAX needs to be chosen, and can't realistically be very large since encoding requires an arraycopy of size (MAX -1) * 8

The COLS_payload has multiple COLS Code entries (and literals), up to the length specified in the header (where a 0L should then occur).
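
For illustration, a rough sketch of the base decode loop implied by the table above (no MAX handling, no bounds checks; the signature is assumed):

import java.nio.LongBuffer;

// 'remaining' is the number of encoded longs left in the COLS_payload,
// known from the block header's length field.
static void decodeBaseCols(LongBuffer src, LongBuffer dst, int remaining) {
    while (remaining > 0) {
        long code = src.get();          // the COLS code; 0L is not allowed inside a payload
        int literal = (int) code - 1;   // code n is followed by (n - 1) literal data longs
        for (int i = 0; i < literal; i++) {
            dst.put(src.get());
        }
        remaining -= 1 + literal;
        if (remaining > 0) {
            dst.put(0L);                // implicit trailing zero long, except after the last code
        }
    }
}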

However – there are drawbacks to using such a large chunk without other modifications from COBS:

64 bits is far too large for a length field. For encoding, a COLS code block must fit in RAM, and for performance should probably fit in half an L2 cache. For decoding, however, the COLS code length is irrelevant.

If the size of the data encoded is not a multiple of 8 bytes, we need a mechanism to encode that up to 7 trailing bytes should be truncated (3 bits).

For most blocks, the overhead will be exactly 8 bytes (unless the block has a trailing 0L).

Very long data streams without a zero Long are unlikely, so very large chunk lengths are not very useful.

There are also benefits however. The above suggests that most of the 8 byte COLS code block space is not needed to encode length. Much can be done with this!
Some thoughts:

The 3 bits needed to define the truncation behavior can be stored in the COLS code.

The overhead can be reduced by encoding short trailing sequences into the upper bits rather than purely truncating – e.g. you can append 2 bytes instead of truncating 6.

Rudimentary run-length encoding or other lightweight compression can be done with the extra bits (completely encoder-optional).

We can remove the requirement that most codes have an implicit trailing zero, and encode that in one of the extra bits.

If only the lower 2 bytes of an 8 byte COLS code represent the size (MAX = 2^16 - 1), then the max literal size is 512KB - 16B. If we remove the implicit trailing zero, an encoder can optionally encode smaller literal sequences (perhaps for performance, or compression).
What can be done with the remaining 48 bits?
Some ideas:

The highest 4 bytes can represent data to append to the literal. In this way, half of the size overhead of the encoding is removed. This should generally only apply to the last COLS code in the block (for performance reasons and to maintain 8 byte alignment on all arraycopy operations), but it's encoder-optional.

The next bit represents whether the COLS block has an implicit 0L appended.

A bit can be used to signify endianness (this might be a better fit for the block header or stream metadata – detecting zeros works regardless of endianness).

The next three bits can represent how much data is truncated or appended to the literal (before the optional implicit 0L):

value | meaning
000 | do not truncate or append
100 | append all 4 leading bytes in the COLS code after the literal
111 | append the first 3 leading bytes in the COLS code after the literal
110 | append the first 2 leading bytes in the COLS code after the literal
101 | append the leading byte in the COLS code after the literal
011 | truncate the last 3 bytes of the literal
010 | truncate the last 2 bytes of the literal
001 | truncate the last byte of the literal

This leaves us with 12 bits. I propose that these be used for rudimentary (optional) compression:

Option A:

Run length only – the 12 bits represent the number of times to repeat the literal. Or 4 bits are the number of COLS chunks backwards (including this one) to repeat, and 8 bits is the number of repeats. Or ... some other form of emitting copies of entire COLS chunks.

Option B:

Some form of LZ-like compression that copies in 8 byte chunks – 4 bits represent the number of Longs to copy (so, max match size is 15 * 8 bytes), and 8 bits represents the number of Longs backwards (from the end of this COLS chunk) to begin that copy (up to 2KB). Because of the truncation/append feature, this is not constrained to 8-byte aligned copies on the output, but the encoded format is entirely 8 byte aligned and all copies are multiples of 8 bytes. I would not be surprised if this was as fast as LZO or faster, since it is very similar but operates in a more chunky fashion. Compression levels would not be that great, but like most similar algorithms to this the encoder can do more work to search for matches. Decoding uncompressed data should be essentially free (if the 4 bits are 0, do nothing – and most COLS blocks would be fairly large so this check does not occur that frequently).

Option C:

Reserve those 12 bits for future use / research

Alternatively, one to four extra bytes used for the "append" feature can be reassigned to have more than 12 bits for compression metadata.

So, with the above modifications, the COLS code looks like this:

The COLS code is 8 bytes. The low 16 bits encode basic meaning. An 8 byte COLS code cannot be 0L.

Code & 0xFFFF (low 2 bytes) | Followed by | Meaning
0x0000 | N/A | (not allowed)
0x0001 | nothing | A single zero Long
0x0002 | one long (8 bytes) | The single data long
0x0003 | two longs (16 bytes) | The pair of data longs
n | (n-1) longs | The (n-1) longs
0xFFFF | 2^16 - 2 longs | 2^16 - 2 longs

The next portion is to determine the state of truncation or appending.
Two options are listed – truncation only, and truncation/appending. The appending could be up to 5 bytes if we squeeze all the rest of the space. The example below is for up to 4 bytes appended and 3 bytes truncated.

appendCode = (Code >> 28) & 0xF;

appendCode & 0x7 | Append or truncate | From | truncate-only option
0x0 | 0 | nothing | 0
0x1 | (-)1 | nothing | (-)1
0x2 | (-)2 | nothing | (-)2
0x3 | (-)3 | nothing | (-)3
0x4 | (+)1 | Code >>> 56 | (-)4
0x5 | (+)2 | Code >>> 48 | (-)5
0x6 | (+)3 | Code >>> 40 | (-)6
0x7 | (+)4 | Code >>> 32 | (-)7

It may be wiser to choose an option between these. If 3 bytes are chosen as the max arbitrary append length, with 4 truncated, 20 bits are left for other purposes, rather than 12. The average COLS chunk would be one byte larger.

AppendCode & 0x8 | Append 0L
0 | do not append 0L (8 zero bytes)
1 | do append 0L (8 zero bytes)
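
For illustration, a sketch of how a decoder might pull apart one extended code word using the positions stated above (length in the low 16 bits, append code from (Code >> 28) & 0xF); everything beyond what the tables state is an assumption:

// Prints the fields of one extended 8-byte COLS code.
static void describeCode(long code) {
    int literalLongs     = (int) (code & 0xFFFF) - 1;      // code n => (n - 1) literal longs follow
    int appendCode       = (int) ((code >> 28) & 0xF);
    boolean implicitZero = (appendCode & 0x8) != 0;         // AppendCode & 0x8: trailing 0L
    int app              = appendCode & 0x7;
    if (app == 0) {
        System.out.println(literalLongs + " literal longs, no truncate/append, 0L=" + implicitZero);
    } else if (app <= 3) {                                   // 0x1..0x3: drop 1..3 trailing bytes
        System.out.println(literalLongs + " literal longs, truncate " + app + " byte(s), 0L=" + implicitZero);
    } else {                                                 // 0x4..0x7: append 1..4 bytes from the code's top bytes
        int appendBytes = app - 3;
        long appended = code >>> (64 - 8 * appendBytes);     // Code >>> 56, 48, 40 or 32, per the table
        System.out.println(literalLongs + " literal longs, append " + appendBytes
                + " byte(s) 0x" + Long.toHexString(appended) + ", 0L=" + implicitZero);
    }
}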

Encoding

The writer would perform processing in 8 byte chunks until the end of the block where some byte-by-byte processing would occur. Compression options would be entirely the writer's choice.
The state overhead can be very low or large at the writer's whim. Larger COLS chunk sizes require more state (and larger arraycopys), and any compression option adds state overhead.

Decoding

Decoding in all circumstances reads data in 8 byte chunks. Copies occur in 8 byte chunks, 8 byte aligned save for the end of a block if the block does not have a multiple of 8 bytes in its payload. An encoder can cause copy destinations (but not sources) to not be 8 byte aligned if certain special options (compression) or intentionally misaligned encoding is done. Generally, an encoder can choose to make all but the last few bytes of the last block in the stream aligned.

Doug Cutting added a comment - 07/May/09 21:49

Before we get too far, let's review the motivation for this. From Matt's message:
It make more sense to make that we use the same record boundary (0) for all Avro records instead of having them be random. The format would be more resilient to data corruption easier to parse. It's also possible (although improbable) that the 16-byte UUID might be part of the payload... especially given the size of the data Hadoop processes.
What's the tangible advantage of a single record boundary?
Why would this be more corruption resistant?
How likely is a collision? By my reading of http://en.wikipedia.org/wiki/Birthday_attack, we have a ~1% chance of collision in an exabyte (10^18 B) of data, roughly 1000 times today's largest datasets, if we used the same marker for the full exabyte, which we would not, since we'd choose a new marker per output partition. Switching to a 32 byte marker would raise this to 10^37 B. So we might consider that if we're worried about collisions.

Matt Massie added a comment - 08/May/09 00:29

1. What is the tangible advantage of a single record boundary?
2. Why would this be more corruption resistant?
I'm imagining a situation where you have part of an Avro Object container file minus the header/footer metablock because of data loss or subscribing to a data stream in "real-time" midstream. In that situation, determining the random 16 byte sync marker would require some work (e.g. finding recurring 16-byte values, searching for the string "schema" and working back, etc). Having a constant sync value (with an escaped payload) makes this recovery easier and the code a little cleaner. To be honest, this point is weakened by the fact that we're not planning on streaming Object container files anyway.
3. How likely is a collision?
Seems like this is a non-issue with a 16-byte sync value as it is now, but it's always good to be future-proof.
I'm curious what other Java experts (since I'm not) out there feel about COBS in Java. It sounds from Scott's comment that byte stuffing in Java is a non-starter.
There is code at https://bosshog.lbl.gov/repos/java-u3/trunk/sea/src/gov/lbl/dsd/sea/nio/util/COBSCodec.java from Lawrence Berkeley Labs to do COBS encoding in Java, with the following comment:
/* Performance Note: The JDK 1.5 server VM runs <code>decode(encode(src))</code>
* at about 125 MB/s throughput on a commodity PC (2 GHz Pentium 4). Encoding is
* the bottleneck, decoding is extremely cheap. Obviously, this is way more
* efficient than Base64 encoding or similar application level byte stuffing
* mechanisms.
*/

Scott Carey added a comment - 08/May/09 18:51

I'm curious what other Java experts (since I'm not) out there feel about COBS in Java. It sounds from Scott's comment that byte stuffing in Java is a non-starter.
That really depends on the performance requirement.
If the requirement is to be able to encapsulate data and stream at near Gigabit ethernet speed or teamed Gigabit (~100MB/sec to 200MB/sec), it will get in the way.
If other things already significantly limit streaming capability then it may not be a large incremental overhead.
For example, if the Avro serialization process is already going byte-by-byte somewhere else, this could 'piggyback' almost for free – but it would have to be embedded in that other code, in the same loop.
I also want to highlight that the byte-by-byte streaming in Java can be compared to larger chunk sizes with a fairly simple benchmark to validate (or disprove) my claims that it is slow in comparison.
The data from LBL is useful. It should be fairly easy to change that to a larger chunk size and compare on a new JVM.
I'll try to characterize this on my own time this weekend.

Todd Lipcon added a comment - 08/May/09 19:10

If the Java performance of byte-by-byte processing is the major issue, is it worth considering native code to optimize this? I don't generally like using native code, but I feel like it may be worth it if the advantages of COBS are significant enough.
On a side note, I recently read a paper that added a JVM optimization to really improve element-by-element processing of arrays by automatically eliminating bounds checking. I imagine that would apply here. Unfortunately, basing a system around a JVM that doesn't exist yet isn't so wise. But down the road this performance issue may be ameliorated.

Matt Massie added a comment - 08/May/09 20:02

The suspense was just killing me so I had to get some benchmarks myself.
Scott, I'll be interested to see if you have similar results over the weekend.
I rewrote the LBL code to use ByteBuffers instead of ArrayByteList from the older Apache commons primitives. The new API looks like...
public static void decode(ByteBuffer src, int from, int to, ByteBuffer dest) throws IOException
public static void encode(ByteBuffer src, int from, int to, ByteBuffer dest)
I chose ByteBuffers because I didn't want to realloc new byte arrays but instead operate on the same byte array for each test.
My test results are the average of 10 tests run on a 64 MB ByteBuffer running on my MacBook Pro
Model Name: MacBook Pro
Model Identifier: MacBookPro5,1
Processor Name: Intel Core 2 Duo
Processor Speed: 2.4 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 3 MB
Memory: 4 GB
Bus Speed: 1.07 GHz
Since my test wasn't multithreaded... only one core was used.
My tests verified that the byte array wasn't altered by the encoding/decoding process (there were no failures).
These numbers are meant to be ballpark values since my MacBook was "quiet" during the tests... I was cranking some Radiohead on iTunes.
One of the factors that can affect the speed of COBS is the number of zeros you need to encode/decode. In the worst case, you are encoding nothing but zeros. In that case, you'll essentially be replacing all zeros with ones.
The results from this worst case (nothing but zeros) are as follows...
Encoding at 38.22 MB/sec
Decoding at 17.85 MB/sec
If we have one zero every 10 bytes...
Encoding at 57.26 MB/sec
Decoding at 151.91 MB/sec
If you have one zero every 100 bytes...
Encoding at 74.81 MB/sec
Decoding at 846.56 MB/sec
If you have one zero every 1000 bytes...
Encoding at 73.70 MB/sec
Decoding at 1128.75 MB/sec
If you have one zero every 10,000 bytes...
Encoding at 74.40 MB/sec
Decoding at 1118.88 MB/sec
If you have no zeros at all...
Encoding at 73.98 MB/sec
Decoding at 1151.08 MB/sec
So it looks to me like... even with native Java code... we'll be able to push ~100MB/sec - 200MB/sec... (except for the worst case where we have 64MB of zeros).
I'll post my code to this Jira so others can point and laugh.
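
For illustration, one way the "one zero every N bytes" inputs could be generated (this is not the attached benchmark code; names and approach are assumptions):

import java.nio.ByteBuffer;
import java.util.Random;

// Builds a buffer of otherwise zero-free random bytes with a zero placed every
// 'interval' bytes, the density parameter varied in the results above.
static ByteBuffer makeTestData(int size, int interval, long seed) {
    Random rnd = new Random(seed);
    byte[] data = new byte[size];
    for (int i = 0; i < size; i++) {
        byte b;
        do {
            b = (byte) rnd.nextInt(256);
        } while (b == 0);
        data[i] = (i % interval == 0) ? 0 : b;
    }
    return ByteBuffer.wrap(data);
}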

Todd Lipcon added a comment - 08/May/09 23:04

It turns out the paper I read has been implemented in JDK 7. If someone has this mythical beast installed, it would be very interesting to see the results of Matt's benchmark code.
Here's a link to someone else's experiences with it:
http://lingpipe-blog.com/2009/03/30/jdk-7-twice-as-fast-as-jdk-6-for-arrays-and-arithmetic/
Whether relying on optimizations only available in a not-yet-released JVM is a good idea is certainly up for debate. Given that Avro is still in its infancy, JDK 7 might be common by the time Avro is in production use.

Scott Carey added a comment - 11/May/09 22:03

Todd: I think that many of the JDK 7 enhancements have been backported to JDK 1.6.0_u14. I'll run some experiments later.
Matt:
Great stuff! Your results make sense to me based on previous experience. I went and made some modifications myself to try out doing this 4 bytes at a time.
Unfortunately, this just made things more confusing for now.
First, on your results:
75MB/sec is somewhat slow. If anything else is roughly as expensive (say, the Avro serialization itself) then the max rate one client can encode and stream to another will be ~half that. The decode rate is good.
As a microbenchmark of sorts, we'll want to make sure the JVM warms up, run an iteration or two of the test, garbage collect, then measure.
Apple's JVM is going to be a bit off. I'll run some tests on a Linux server with Sun's JVM later, and try it with the 1.6.0_14 improvements as well.
There is a bug – the max interval between 0 byte occurrences is 256 – which is probably why the results behaved like they did.
I ran the same tests on my machine using Apple's 1.5 JVM with similar results. With Apple's (64 bit) 1.6 JVM, the results are much higher.
One 0 byte per 1000 (actually less due to the bug).
Encoding at 224.48262 MB/sec
Decoding at 1233.1406 MB/sec
All 0 bytes:
Encoding at 122.69939 MB/sec
Decoding at 62.184223 MB/sec
one in 10 0's:
Encoding at 143.20877 MB/sec
Decoding at 405.06326 MB/sec
So there is quite the potential for the latest Sun JVM to be fast ... or slow.
I wrote a "COWSCodec" to try this out with 4 byte chunks. The initial encoding results were good ... up to 300MB/sec with all 0 bytes.
However, that implementation uses ByteBuffer.asIntBuffer(). And those IntBuffer views do not support the .array() method, so I had to use the IntBuffer.put(IntBuffer) signature for bulk copies.
To do that cleanly, it made most sense to refactor the whole thing to use Java nio.Buffer style method signatures (set position, limit before a copy, use mark(), flip(), etc). After doing so, it turns out that the IntBuffer views created by ByteBuffer.asIntBuffer do not really support bulk get/put operations. The max decode speed is about 420MB/sec.
So, there is one other way to do larger chunk encodings out of a ByteBuffer source and destination – use the ByteBuffer.getInt() and raw copy stuff rather than an intermediate IntBuffer wrapper.
I can also test out a 'real' IntBuffer which is backed by an int[] rather than a byte[] which should be the fastest – but not applicable to reading/writing from network or file.
Both of those should be fairly simple – I'll clean up what I have, add that stuff, and put it up here in a day or two.
Linux tests and variations with the latest/greatest JVM will be informative as well.
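
For illustration, the warm-up-then-measure pattern described here might look like the sketch below; it borrows the encode(ByteBuffer src, int from, int to, ByteBuffer dest) signature Matt posted, and the class name and MB/sec arithmetic are assumptions:

import java.nio.ByteBuffer;

static double measureEncodeMBperSec(ByteBuffer src, ByteBuffer dest, int iterations) {
    for (int i = 0; i < 3; i++) {       // warm up so the JIT compiles the hot loop
        dest.clear();
        COBSCodec.encode(src, 0, src.limit(), dest);
    }
    System.gc();                        // keep garbage collection out of the timed region
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
        dest.clear();
        COBSCodec.encode(src, 0, src.limit(), dest);
    }
    double seconds = (System.nanoTime() - start) / 1e9;
    return (double) src.limit() * iterations / (1024.0 * 1024.0) / seconds;
}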

Doug Cutting added a comment - 11/May/09 23:08

I'm imagining a situation where you have part of an Avro Object container file minus the header/footer metablock because of data loss or subscribing to a data stream in "real-time" midstream.
But metainfo is required to make sense of the stream. You need its schema, codec, etc. Getting the sync marker doesn't seem a huge burden on top of that, unless you're figuring you'd skip to the next metadata flush before you try to make sense of the stream? How critical is this streaming-without-metadata use case? If it becomes an important use case, we might define a streaming-specific container, or use RTSP or somesuch, rather than using the existing container file format at all.
Not that this isn't an interesting area, but I'd be much more interested in, e.g., gzip and lzf compression codecs for Avro's file format, or Avro InputFormats and OutputFormats for mapreduce, or perhaps a version of Dumbo that uses the Pipes protocol to more efficiently get complex Avro data in and out of Python programs, etc.

Scott Carey added a comment - 26/May/09 05:30

Test COBS / COWS / COLS codecs. First batch of files. These three files are described as follows:
COBSCodec2.java – minor modification of the previous version for an improved testing loop. Also modified to test in batch with the other new additions.
COWSCodec.java – first, hack-ish version of a COBS-like encoding that works in 4 byte chunks. This version uses ByteBuffer.asIntBuffer(), and does all copies with the default nio 'copy from position() to limit()' behavior. This turns out to be slow. asIntBuffer does not have optimal copy operation as can be seen in the slow decode.
COWSCodec2.java – re-implemented using ByteBuffer.getInt() and putInt(). Significantly faster.
Three more files after this and a set of benchmarks on Linux with recent JRE's.
The point of all this is experimentation and optimization. Although this specific JIRA may not become relevant – the results of this investigation may be useful in other contexts as well.
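
For illustration, the 4-byte-at-a-time access pattern in question (ByteBuffer.getInt() rather than an asIntBuffer() view) looks roughly like this; the snippet only counts zero words and is not the attached codec:

import java.nio.ByteBuffer;

static int countZeroWords(ByteBuffer buf) {
    int zeros = 0;
    for (int pos = 0; pos + 4 <= buf.limit(); pos += 4) {
        if (buf.getInt(pos) == 0) {    // absolute get: no per-word position bookkeeping
            zeros++;
        }
    }
    return zeros;
}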

Scott Carey added a comment - 26/May/09 05:35

COWSCodec3.java – Slightly more optimized and cleaner version of COWSCodec2.
COLSCodec.java – A version that encodes with 8 byte chunks using ByteBuffer getLong() and putLong().
The above two have at least one minor bug left but the performance experiment should still be valid (there is a case where the decoded output can be 1 word too large). Also, these don't yet work with encoding or decoding streams that are not even multiples of 4 and 8 bytes.
COBSPerfTest.java – a class for executing a test against all the variants in one go, with various ratios of zero words. Used for performance results that I'll post later.

First, an overview:
The 64 bit JRE on MacOS X has roughly similar performance characteristics in these tests to the Linux Sun JRE 1.6.0_12. The Mac OSX 32 bit 1.5 JRE is vastly different.
A 32 bit JVM is slightly faster than a 64 bit JVM on the byte-by-byte work, roughly the same at 4 byte at a time work, and slower at 8 byte at a time work. This is mostly expected.
Variations in VM from Sun 1.6.0_12 through a few early access versions of 1.6.0_14 have roughly the same performance. That is, the performance improvements in the latest JRE (of which there are many) don't seem to have an impact here.

Larger byte chunks help decoding only a little unless zero words dominate, and then it helps a lot.
Larger chunks help encoding significantly across the board. COLS – working with 8 byte chunks – is about 4x faster than COBS.

The results below could use some formatting work – it is very verbose.
All results with CentOS 5.3
Xeon 5335 is 2.0Ghz, 4MB cache per pair of cores, 2x quad core
Xeon E5440 is 2.83Ghz, 6MB cache per pair of cores, 2x quad core

Scott Carey added a comment - 26/May/09 06:50

COLSCodec, one zero word every 10 words
Encoding at 354.09015 MB/sec
Decoding at 812.4928 MB/sec
Original array was modified!
That Sir, is the remaining bug I alluded to but didn't highlight enough in my previous comment. If you change the size of the array, the random number seed, or just about anything else it will go away (or pop up elsewhere).
The before and after arrays have the same bytes, but the one that was encoded and decoded has an extra word at the end. I stepped through that case briefly, but was too lazy to fix it. I don't think it is relevant to the overall results. (and any real Codec would be written cleaner, with plenty of unit tests to cover the corner cases).
Which reminds me, these are the main conclusions I draw not specific to this JIRA:
ByteBuffer.getInt() , getLong(), are rather optimized, as are the matching putInt() and putLong() operations. Bulk put operations are also fast on ByteBuffer, but not IntBuffer if created from ByteBuffer.asIntBuffer().
Any encoder or decoder in Java will see potentially large performance gains if it can read / write in larger chunks.
I could be evil and try the same test and misalign the array – start at position 1 instead of 0 (the JVM aligns array data to 8 byte boundaries, and many processor instructions are faster if aligned).
OK, I decided to be evil and try it on my laptop with misaligned bytes (added a put(0) to the start of the encoder and a get() to the start of the decoder, to misalign the whole thing by a byte). Now, perhaps getLong() will be a lot less efficient. Let's see:
Aligned (COLS):
COLSCodec, one zero word every 1 words
Encoding at 323.87604 MB/sec
Decoding at 419.4213 MB/sec
COLSCodec, one zero word every 10 words
Encoding at 376.7943 MB/sec
Decoding at 1041.8271 MB/sec
COLSCodec, one zero word every 10000 words
Encoding at 439.01627 MB/sec
Decoding at 1350.2242 MB/sec
COLSCodec, one zero word every 1000000 words
Encoding at 415.91876 MB/sec
Decoding at 1411.3434 MB/sec
Misaligned (COLS):
COLSCodec, one zero word every 1 words
Encoding at 327.0196 MB/sec
Decoding at 402.65366 MB/sec
COLSCodec, one zero word every 10 words
Encoding at 377.48105 MB/sec
Decoding at 974.4739 MB/sec
COLSCodec, one zero word every 10000 words
Encoding at 445.4802 MB/sec
Decoding at 1440.7946 MB/sec
COLSCodec, one zero word every 1000000 words
Encoding at 443.61166 MB/sec
Decoding at 1423.9922 MB/sec
These are within the usual margin of error, and essentially the same. Perhaps the JVM's JIT isn't smart enough to recognize that in the first case, all access is aligned and use the processor load instructions for aligned access which are faster? I could write a COLSCodec2 that operated on LongBuffer rather than ByteBuffer to see what that does.
But the main conclusion is that accessing in larger chunks has big gains when it is possible to do.

Scott Carey added a comment - 26/May/09 07:16

So, aligned access is important – however, the JVM's JIT cannot guarantee it on a ByteBuffer or byte[], but can on a LongBuffer or long[]. Here are results on my laptop akin to the above, but with a COLSCodec2 that uses a LongBuffer rather than a ByteBuffer + getLong()/putLong().
COLSCodec, one zero word every 1 words
Encoding at 939.8201 MB/sec
Decoding at 980.54034 MB/sec
COLSCodec, one zero word every 10 words
Encoding at 822.7025 MB/sec
Decoding at 1188.7073 MB/sec
COLSCodec, one zero word every 1000 words
Encoding at 1104.4512 MB/sec
Decoding at 1429.9589 MB/sec
Unfortunately, for anything reading/writing from the network or a file, byte streams and arrays are the only option. And as demonstrated before, asLongBuffer or asIntBuffer is not optimized and fairly restrictive. This seems to indicate that in the future, there is more that the JVM can do, or there are Java APIs that could be made so that the JIT can easily detect data alignment and be more efficient.
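
For illustration, the long[]-backed case measured above looks roughly like this (access through a heap LongBuffer wrapping a long[] is guaranteed 8-byte aligned, unlike a ByteBuffer.asLongBuffer() view over a byte[]); the snippet is illustrative only:

import java.nio.LongBuffer;

static int countZeroLongs(long[] words) {
    LongBuffer buf = LongBuffer.wrap(words);   // heap LongBuffer backed directly by the long[]
    int zeros = 0;
    while (buf.hasRemaining()) {
        if (buf.get() == 0L) {
            zeros++;
        }
    }
    return zeros;
}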

Doug Cutting added a comment - 23/Jun/09 20:25

I don't think this is worth pursuing at this point. While it has nice properties, it adds a non-negligible decoding operation to all data file processing.
We can potentially add this to a future file format, but for the file format specified in the 1.0 release I'd like to keep it as-is. Objections?

Scott Carey added a comment - 23/Jun/09 20:57

I agree, COBS-like encoding is only useful for streaming data where a specific character or word must be avoided, which is a format issue.
If all that is needed is identifying block boundaries, there are other methods.
A "magic number" approach can be collision-proof by detecting the collision: on encode, look for the magic number and, if present, follow it with a 'not at the end of the block' word; at the end of the block place the magic number and an 'end of block' word. On decode, look for the magic number and discard the following word; if the following word is the end-of-block word, also discard the magic word. COBS came about because the worst case scenario for a magic word approach is poor, and if the size of the magic word is small (one byte) the worst case is likely.
This might prove very useful at some point. Some of the general optimization findings here will be useful somewhere.
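
For illustration, a minimal sketch of the escape scheme described in this comment, using 8-byte words; the MAGIC, CONTINUE and END values are placeholders, not a proposed format:

import java.nio.LongBuffer;

static final long MAGIC = 0xC3A55A3CC3A55A3CL;  // placeholder magic word
static final long CONTINUE = 1L;                // "not at the end of the block"
static final long END = 0L;                     // "end of block"

static void encodeBlock(LongBuffer src, LongBuffer dst) {
    while (src.hasRemaining()) {
        long w = src.get();
        dst.put(w);
        if (w == MAGIC) {
            dst.put(CONTINUE);                  // escape a magic word occurring in the data
        }
    }
    dst.put(MAGIC);
    dst.put(END);                               // the real block boundary
}

// Returns false when the block boundary has been consumed.
static boolean decodeWord(LongBuffer src, LongBuffer dst) {
    long w = src.get();
    if (w != MAGIC) {
        dst.put(w);
        return true;
    }
    long marker = src.get();                    // discard the word following the magic number
    if (marker == END) {
        return false;                           // boundary: the magic word is discarded too
    }
    dst.put(w);                                 // escaped literal: keep the magic word
    return true;
}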