Moo-able Type for Cowtowncoder.com

Friday, October 29, 2010

Faster compression on Java platform with LZF

One open source project I have worked on a bit lately is compress-lzf,
a simple Java library that implements block- and stream-accessible
versions of the LZF compression algorithm. The LZF algorithm itself is
very simple, and is mostly (only?) documented here, in the form of a
working reference implementation that doubles as documentation.

Our implementation is a straightforward rewrite of the C version, and is
fully compatible with the reference implementation, including the file
header and block signatures (this differs from the other existing Java
version that projects like H2 include). While simple, the implementation
is not a naive translation; some effort has been spent on optimizing its
performance.

And, as should be the case for such low-level libraries, there are no
external dependencies beyond JDK 1.5, meaning it should work on a
variety of platforms, including Android (in fact, it could probably be
compiled on any 1.x JDK).

1. What is LZF?

The LZF compression algorithm is basic Lempel-Ziv with little additional
processing; so one way to think of it is as "gzip without Huffman
encoding" (gzip uses Deflate, which is Lempel-Ziv followed by
variable-bit-length Huffman encoding). And Lempel-Ziv is just a method
whereby one finds repeating sub-sequences in the input, replacing
sequences of three or more "already seen" bytes with a back-reference;
matches are typically found using a hash table indexed by a hash of the
first 3 bytes of a sequence, with the value being the offset of the last
instance of a potentially matching sequence.
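To make the back-reference search concrete, here is a small
self-contained sketch of the idea -- this is NOT the actual compress-lzf
code; the hash function, table size, and token representation are made
up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LzSketch {
    // Hash the 3 bytes at 'pos' into a small table index.
    // (Illustrative hash; real LZF uses its own scheme.)
    static int hash(byte[] in, int pos) {
        int h = (in[pos] & 0xFF) << 16 | (in[pos + 1] & 0xFF) << 8 | (in[pos + 2] & 0xFF);
        return ((h * 0x9E3779B1) >>> 20) & 0xFFF; // 4096-entry table
    }

    // Scan the input, emitting human-readable tokens: literal bytes, and
    // (back-distance, length) pairs for repeats of 3 or more bytes.
    static List<String> tokenize(byte[] in) {
        int[] table = new int[4096];
        Arrays.fill(table, -1);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length) {
            if (i + 2 < in.length) {
                int h = hash(in, i);
                int cand = table[h];     // offset where this hash was last seen
                table[h] = i;
                // verify the candidate really matches (hashes can collide)
                if (cand >= 0 && in[cand] == in[i]
                        && in[cand + 1] == in[i + 1] && in[cand + 2] == in[i + 2]) {
                    int len = 3;         // extend the match as far as it goes
                    while (i + len < in.length && in[cand + len] == in[i + len]) {
                        len++;
                    }
                    out.add("match(back=" + (i - cand) + ",len=" + len + ")");
                    i += len;
                    continue;
                }
            }
            out.add("lit(" + (char) in[i] + ")");
            i++;
        }
        return out;
    }
}
```

For input like "abcabcabc", the first "abc" comes out as three literals,
and the rest collapses into a single back-reference.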

It is worth noting that many other similar (and sometimes similarly
named) algorithms likewise use basic LZ; "FastLZ", for example, is very
similar.

2. Gzip exists, why bother with LZF?

But why does this matter? If gzip does all this and more, why would you
want to use a "lesser" algorithm?

The key benefit is speed: LZF can be up to twice as fast to uncompress;
but where it REALLY shines is compression speed -- there it can be 2x or
3x as fast (writing a stand-alone benchmark is on my to-do list, so
hopefully I will get to talk a bit more about this in the future). The
performance difference is mostly due to the relatively high overhead
that Deflate incurs for non-byte-aligned output (which is needed for
optimal Huffman encoding of codewords) -- while it definitely helps
compression, it is expensive in terms of CPU usage.

The LZF compression rate is lower than gzip's, but the compression
profile is similar: things that compress well with gzip also compress
well with LZF, and things that compress poorly do so with LZF as well.
This makes sense, given that most of the compression typically comes
from the LZ algorithm, which both implement similarly.

So it all comes down to this: gzip compression is very CPU-intensive
(read: slow); so much so that compressing something like JSON data
typically takes 5x as long as writing the content itself. With fast
networks and I/O, this often becomes the bottleneck. LZF can be a good
alternative, since its CPU overhead is much more modest, and it can
still compress "fluffy" content pretty well -- where gzip can compress
something by, say, 90%, LZF usually achieves a 75-80% ratio (this
obviously varies by type of content; basically it depends on which part
of the Deflate algorithm is more effective).
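To see the Deflate side of this comparison for yourself, here is a small
self-contained example using the JDK's own java.util.zip.Deflater (the
algorithm behind gzip) -- not LZF -- showing how strongly repetitive,
"fluffy" content compresses:

```java
import java.util.zip.Deflater;

public class DeflateRatio {
    // Compress 'input' with java.util.zip's Deflater at the given level
    // and return the compressed size in bytes, to illustrate the kind of
    // 90%+ reduction Deflate achieves on repetitive content.
    static int deflatedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);   // we only care about the size, not the bytes
        }
        d.end();
        return total;
    }
}
```

Feeding it a few thousand bytes of repetitive JSON-like text, even at
Deflater.BEST_SPEED, yields output a small fraction of the input size.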

3. Usage

Assuming the above sounds interesting, you can get the compress-lzf
library either from GitHub or via Maven. In the latter case, the
dependency information is:

<dependency>
  <groupId>com.ning</groupId>
  <artifactId>compress-lzf</artifactId>
  <version>0.6.0</version>
</dependency>

(current version at the time of this writing being "0.6.0")

And from there you can use classes from the com.ning.compress.lzf
package:

LZF is a simple command-line utility you can use for manual
compression/uncompression (the jar defines this class as its Main-Class,
so you can just invoke it with "java -jar compress-lzf.jar" to get usage
information)

LZFEncoder and LZFDecoder are block-based codecs that can be used to
encode (compress) and decode (uncompress) chunks (byte[]) of data

LZFInputStream and LZFOutputStream are the streaming counterparts, for
use with java.io streams

Both streaming and block-based interfaces have their uses: for example,
when storing BLOBs in a database, the block API works better; when
reading data over the network or from a filesystem, streams make more
sense.
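As a rough sketch of the block API (method names here are assumptions
about the 0.6 API from memory -- check the project's documentation for
the exact signatures), round-tripping a chunk looks something like this:

```java
import com.ning.compress.lzf.LZFDecoder;
import com.ning.compress.lzf.LZFEncoder;
import java.util.Arrays;

// Usage sketch only: assumes LZFEncoder.encode(byte[]) and
// LZFDecoder.decode(byte[]) as the block-API entry points.
public class BlockExample {
    public static void main(String[] args) throws Exception {
        byte[] original = "some fluffy, repetitive content, repetitive content"
                .getBytes("UTF-8");
        byte[] compressed = LZFEncoder.encode(original);  // compress a chunk
        byte[] restored = LZFDecoder.decode(compressed);  // uncompress it back
        System.out.println(compressed.length + " bytes compressed; round-trip ok: "
                + Arrays.equals(original, restored));
    }
}
```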

3.1 Note: avoid LZFOutputStream.flush() ("just let it float...")

One note on LZFOutputStream: be careful with flush()ing -- just like
other block compressors (including gzip), LZF can only make
back-references within a single block; and since LZFOutputStream.flush()
forces the current block to be closed, it can dramatically reduce the
compression rate. So it is best to avoid flush()ing if at all possible,
unless you actually want to create a logical boundary at some point.

4. Future plans for the library

While there are always improvements to make to any Open Source project,
I think the natural scope for the LZF library is quite small. Similar to
Java Uuid Generator, it is quite possible that there won't be a need for
major changes to the implementation or API; which means that future
improvements will most likely be in the area of integration: helping
other projects use LZF.

For example, given the unique CPU/IO/bandwidth trade-off LZF can offer,
it could make a really nice component for large-scale data transfer and
processing, for example:

Combining high-performance data formats (Jackson's Smile format, a
binary JSON serialization) with compression -- and this can be combined
with other use cases too.

So it will be interesting to see where things go in near future.

5. On benefits of working for a company that is comfortable with Open
Source

Unlike with many earlier Open Source projects that I have worked on,
much of the work that went into this library was done on company time.
It was also a team effort (Jon wrote the stream handling, I wrote the
block codec), compared to most other open source projects where I have
been the main author. So I have come a long way: having worked for a
semi-OS-hostile company, then a bit for a neutral (agnostic?) one; and
now working for one that is actually engaged with the community (Ning).
And I must admit that I really like that Ning is openly committed to
working with open source projects, not just using open source artifacts
others make. This is not solely meant as an ad for Ning (although we are
indeed hiring...); I just think it is important to give credit where it
is due.

Beyond this, the Ning section at Github already has some public
projects, and there are literally a dozen more being cleaned up for
publishing. Within just a few more months I think Ning will be better
recognized for its contributions in the Open Source space, and
eventually become a significant player. I see this as a major benefit,
making it a bit less difficult to hire the best developers -- especially
for a company that is still relatively small, and thus has less name
recognition than the big boys.