Sunday, November 7, 2010

The blaze-builder library: faster construction of bytestrings

Hi, I am Simon Meier, a Swiss Haskell enthusiast currently pursuing a PhD in computer science at ETH Zurich. In this blog post, I'll introduce you to the blaze-builder library.

The blaze-builder library provides you with a Builder type that you can use to efficiently construct sequences of bytes represented in a packed form as a strict or lazy bytestring. Typical use cases for a Builder are therefore saving your application data to a file in a space-efficient binary form or sending a response to some request over the network.

You probably know about the binary package, which also provides a Builder type in the Data.Binary.Builder module targeting exactly the same use case as our Builder. This is no coincidence. During this year's Google Summer of Code, Jasper Van der Jeugt and I developed the blaze-builder library to overcome performance shortcomings of Data.Binary.Builder with respect to the specific needs of the blaze-html HTML generation library. Since then, I have restructured the blaze-builder library to serve as a drop-in replacement for Data.Binary.Builder, which it improves upon in both speed and expressivity.

Usage example
We start by importing the necessary modules. We also define a convenient abbreviation for mappend, which will actually become part of the base library according to rumors I heard at this year's ZuriHac.

Our example is about serializing a very simple representation of a person to a sequence of bytes. As usual, this serialization also requires us to fix the encoding format. We encode strings using UTF-8 and prefix them with their length encoded as a 32-bit little-endian integer to make parsing unambiguous. We also encode the age of a person as a 32-bit little-endian integer. I guess the code speaks for itself.
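The original listing is not reproduced here, so the following is only a minimal sketch of what such serialization code might look like. The Person record, its field names, and the sample data are assumptions made for illustration. The sketch also assumes that the length prefix counts UTF-8 bytes rather than characters, and it uses plain mappend instead of an infix abbreviation.

```haskell
import qualified Data.ByteString      as S
import qualified Data.ByteString.Lazy as L
import Data.Monoid (mappend, mconcat)

import Blaze.ByteString.Builder
         (Builder, fromByteString, fromInt32le, toByteString, toLazyByteString)
import Blaze.ByteString.Builder.Char.Utf8 (fromString)

-- Hypothetical record; the field names are assumptions for illustration.
data Person = Person { name :: String, age :: Int }

-- UTF-8 encode a string and prefix it with the number of encoded bytes,
-- written as a 32-bit little-endian integer (assumption: byte length,
-- not character count).
fromStringLen32le :: String -> Builder
fromStringLen32le cs =
    fromInt32le (fromIntegral (S.length bytes)) `mappend` fromByteString bytes
  where
    bytes = toByteString (fromString cs)

-- A person is the length-prefixed name followed by the age.
fromPerson :: Person -> Builder
fromPerson p =
    fromStringLen32le (name p) `mappend` fromInt32le (fromIntegral (age p))

-- Builders form a Monoid, so a list of people is just the concatenation.
fromPeople :: [Person] -> Builder
fromPeople = mconcat . map fromPerson

-- Hypothetical sample data.
people :: [Person]
people = [Person "Ada" 36, Person "Zoë" 7]

-- Running the builder yields the packed bytes.
lazyBinaryPeople :: L.ByteString
lazyBinaryPeople = toLazyByteString (fromPeople people)
```

Encoding the name to a strict bytestring first costs an extra copy, but it keeps the length prefix correct for non-ASCII names such as the second one above.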

The above code is typical for serialization code based on builders. One uses the predefined functions for creating builders with a fixed encoding format from standard Haskell values. These builders are then combined using the functions from the Monoid typeclass. Builders essentially store the recipe for building their corresponding sequence of bytes. Once one needs a concrete representation of this sequence of bytes, one just calls toLazyByteString or toByteString to execute that recipe.

The benefit of using builders to construct a bytestring is twofold: First, appending two builders is an O(1) operation, which is also efficient in absolute terms, as it corresponds to a single function call. Second, when constructing the resulting lazy bytestring, blaze-builder makes sure that the average chunk size is large. A large average chunk size is important to make good use of cache prefetching in later processing steps (e.g. compression), and it also reduces the system call overhead when writing the resulting lazy bytestring to a file or sending it over the network.

For example, the above code results in the following sequence of chunk sizes.

The 170001 bytes represented by lazyBinaryCloneVillage feature an average chunk size of ~24kb. The first buffer is only ~4kb large, because for short output sequences the buffer allocation cost is significant. toLazyByteString compensates for this cost by allocating the first buffer with the minimal expected chunk size. Note that these chunk sizes reflect the default settings of toLazyByteString, which are optimized to yield efficient and well-chunked results for all lengths of output sequences. If you know more about your typical serialization tasks, then you can tune these settings in your favor.
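If you want to check the chunking of your own output, the chunk sizes of any lazy bytestring can be read off with L.toChunks. The sketch below uses a hypothetical 100000-byte payload rather than the person data from the post, and the names payload, output, and chunkSizes are made up for illustration. The sizes you see depend on the blaze-builder version and its buffer settings.

```haskell
import qualified Data.ByteString      as S
import qualified Data.ByteString.Lazy as L
import Data.Word (Word8)

import Blaze.ByteString.Builder (fromWord8s, toLazyByteString)

-- Hypothetical payload: 100000 bytes cycling through all Word8 values.
payload :: [Word8]
payload = take 100000 (cycle [minBound .. maxBound])

-- Run the builder with the default settings of toLazyByteString.
output :: L.ByteString
output = toLazyByteString (fromWord8s payload)

-- Sizes of the strict chunks that make up a lazy bytestring.
chunkSizes :: L.ByteString -> [Int]
chunkSizes = map S.length . L.toChunks
```

Evaluating chunkSizes output in GHCi lists one size per chunk; the sizes always sum to the total output length.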

Speaking of efficiency, I'm quite sure you would also like to see some benchmark figures. I'm not going to present the figures for the above example. Not because they are embarrassing; they are not. However, without good competition, the interpretation of benchmark figures is difficult, and currently I don't know of a good competitor for the above use case. Fortunately, we can also use builders to pack a [Word8] list into a strict or lazy bytestring, and there we definitely do have good competitors.

Packing [Word8]
For our benchmark, we use the following implementations for packing [Word8] lists.
The implementations S.pack, L.pack, declPackLazy, and binaryDeclPackLazy are trivial. The implementations packStrict and packLazy make use of fromWord8s :: [Word8] -> Builder, which is a very efficient function to serialize lists of bytes, as the following plot shows.
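The benchmark listing itself is not reproduced here. Based on the description above, packStrict and packLazy could look roughly like the following sketch. The definition of declPackLazy is a guess at what a "declarative" one-builder-per-byte variant might be, not the benchmarked code, and binaryDeclPackLazy is left out.

```haskell
import qualified Data.ByteString      as S
import qualified Data.ByteString.Lazy as L
import Data.Monoid (mconcat)
import Data.Word (Word8)

import Blaze.ByteString.Builder
         (fromWord8, fromWord8s, toByteString, toLazyByteString)

-- Pack into a strict bytestring via the list serializer fromWord8s.
packStrict :: [Word8] -> S.ByteString
packStrict = toByteString . fromWord8s

-- Pack into a lazy bytestring via the list serializer fromWord8s.
packLazy :: [Word8] -> L.ByteString
packLazy = toLazyByteString . fromWord8s

-- Assumed "declarative" variant: one singleton builder per byte,
-- combined with mconcat.
declPackLazy :: [Word8] -> L.ByteString
declPackLazy = toLazyByteString . mconcat . map fromWord8
```

All three produce the same bytes as S.pack and L.pack; they differ only in how the output buffers get filled.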

The plot is a log-log plot of the mean time for packing [Word8] lists using the above implementations, run on a Core2 Duo T7500 with 2GB RAM, Linux 2.6.32-24 i686, and GHC 6.12.3. I created this plot by adapting Bryan O'Sullivan's excellent Criterion benchmarking library to handle scaling benchmarks (cf. ScalingBenchmarks.hs). In the spirit of Criterion, I also generate a boxplot version for every scaling benchmark (using more transparent lines to draw the quartiles and whiskers), which allows us to judge the quality of the measurements. The boxplot version of the above plot shows that nothing went wrong during its measurement.

Note that the mean times are plotted on a logarithmic scale. Hence, a constant difference between two graphs means a constant-factor improvement. As you can see from the measurement log, using blaze-builder is a definite win for output sequences longer than 1kb: packStrict beats S.pack by almost a factor of 2, and packLazy beats L.pack by a factor of 10 and binaryDeclPackLazy by a factor of 92 (!).

The crucial ingredient for this improvement is the fromWord8s function. It is constructed using the Write abstraction Jasper introduced during his work on blaze-html. The function fromWrite8List forces and writes eight list elements at a time, which allows the compiler to bundle the actual writes to the output buffer.
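To give a feel for the Write abstraction, here is a small sketch that builds a list serializer from a per-element Write. It assumes the fromWriteList and writeWord8 functions exported by Blaze.ByteString.Builder in later releases. It does not reproduce the eight-elements-at-a-time unrolling of fromWrite8List, and the name fromWord8sViaWrite is made up for illustration.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Word (Word8)

import Blaze.ByteString.Builder
         (Builder, fromWriteList, toLazyByteString, writeWord8)

-- A Write describes how to serialize one value into the output buffer.
-- fromWriteList lifts such a per-element Write to a builder for a list.
fromWord8sViaWrite :: [Word8] -> Builder
fromWord8sViaWrite = fromWriteList writeWord8

-- Hypothetical usage: serialize three bytes to a lazy bytestring.
example :: L.ByteString
example = toLazyByteString (fromWord8sViaWrite [1, 2, 3])
```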

For shorter output sequences, the improvement gained from using blaze-builder gets smaller and S.pack is even faster for very short sequences. The following plot, its boxplot version, and the measurement log give a more detailed comparison for such short sequences.

The results are not surprising when comparing the implementations: packStrict uses toByteString, which simply runs toLazyByteString and copies all chunks into a single buffer of the appropriate size. Hence, packStrict is always slightly slower than packLazy. The S.pack function from Data.ByteString works in two passes over the input list: first, it determines the length of the list, and then it copies all bytes to the allocated buffer. Traversing linked lists of bytes is costly, so the two-pass approach pays off only for very short lists, where the output buffer allocation cost is dominant. The peak of packLazy at 64 bytes stems from the fact that it first allocates a 64 byte buffer, which is copied to a 4kb buffer once it's clear that more than 64 bytes are output. This is done to compensate for the buffer allocation cost for very short output sequences. It can be switched off using toLazyByteStringWith, if required.

Conclusions
The blaze-builder library provides an expressive and efficient way to construct both lazy and strict bytestrings. The accompanying benchmarks show that it improves (often significantly) in all cases over Data.Binary.Builder from the binary package. The benchmarks presented in this post also show that the implementation of blaze-builder compares favorably against special-purpose functions for packing [Word8] lists, at least on a Core2 Duo T7500 with 2GB RAM, Linux 2.6.32-24 i686, and GHC 6.12.3. Yeah, that's what the benchmarks state ;-). However, I expect the conclusions drawn from them to remain valid for most other settings. For example, the GHC-7.0.1 release candidate makes Data.Binary.Builder run a bit faster, but still not as fast as blaze-builder.

During the work on blaze-html, I learned from several benchmarks that ensuring a large average chunk size is very important for lazy bytestrings to be efficient. However, many encoding functions on Hackage produce strict or lazy bytestrings, so we have to copy their result again to guarantee large average chunk sizes, which is a waste of resources. I therefore suggest that encoding functions produce a builder instead of a strict or lazy bytestring. Apart from guaranteeing a fast append and a large average chunk size, this change also simplifies and generalizes the encoding code, as it separates the buffer allocation strategy from the encoding function.

In order for such a change to be effective, I suggest that the bytestring library itself provide an implementation of Data.ByteString.Builder, which would be a blessed way to incrementally create bytestrings. The blaze-builder library offers one possible implementation path for such a bytestring builder. If the community sees fit, then I'd be happy to port the builder parts to the bytestring library. The string encodings currently provided by blaze-builder would then move into their own libraries.

Well, that's it for now. I will publish more of the experiments I have done during the work on blaze-builder once I find some more time. I'm also very much looking forward to your feedback.

Yes, I've seen Data.Text.Lazy.Builder and, as I stated above, I'm in favor of packed data representations providing appropriate builders.

However, for bytestrings, I currently favor the Builder design of blaze-builder, as it is more efficient than Data.Binary.Builder in the benchmarks I've seen. See for example the results of the adaptation of the throughput benchmark that is distributed together with the binary package: https://gist.github.com/664979#file_comparison_to_binary_package.txt

The blog says:
> it first allocates a 64 byte buffer which is copied to a 4kb buffer once its clear that more than 64 bytes are output

It isn't exactly 4kb, right? Data.ByteString.Lazy intentionally uses a value less than a page boundary so that the underlying (foreign) allocation routine can fit its book-keeping and the buffer in one page, thus avoiding wasted space.

@Bryan: I started from Data.Binary.Builder with the single goal of improving the performance of 'mappend'. I streamlined the control flow such that the Builders spend as much time as possible on actually writing output rather than on function calls, copying state information, or checking whether there is space left. The current solution should be portable to Text, with the exception that you might need to track three values (base, offset, end) instead of only two values (next-free, end). A currently open question for my implementation is whether a strict pair of these two values would be beneficial.

Sorry for the German buttons. I'll see if I can rename them. I didn't find the option at first.

@TomMD: No, I do subtract the same overhead from the 4kb as bytestrings do; see http://hackage.haskell.org/packages/archive/blaze-builder/0.2.0.1/doc/html/src/Blaze-ByteString-Builder-Internal.html#defaultMinimalBufferSize