The Quest

A few months back, I was developing a little Flash widget as a tangent to
my main project, and I needed an asynchronous PNG encoder.
I found several, but none of them
were quite what I wanted. Some were pretty good, but synchronous only.
Most of the asynchronous ones did the final compression step all at once at the end,
with a call to ByteArray's deflate() method, which meant they weren't
really asynchronous -- there'd be a noticeable pause while the
compression took place.

In-Spirit's PNG encoder, which compiled
zlib using Alchemy(!), came very close to what I wanted. It maintained
a consistent framerate and offered good, configurable compression.
There was only one problem: its asynchronous mode was slow. And I mean
really slow. Actually, since I was developing on a netbook, it was
intolerably slow, even for medium-sized images. The SWC, at over 100KB,
was also a little hefty considering the small scope of my widget.

The Journey

So, being easily distracted, I decided to build my own PNG encoder.
I was officially working on a tangent of a tangent. I had two goals:
speed, and true asynchronous encoding.

I started with the haXe port of the basic as3corelib PNG
encoder from Adobe. I spent a while optimizing it for speed: I removed
unnecessary casts, unrolled loops, and inlined function calls. I moved
as much as I could into domain memory, which is
basically a raw hunk of memory, byte-addressable, that is really, really
fast since the reads and writes are done using Alchemy opcodes (made
possible through the awesomeness of haXe). Because only one chunk of memory
can be selected as the domain memory at a time, it needs to be partitioned
manually into regions for different purposes. For example, I have a region with
a CRC lookup table in it, and another for the raw pixel data to be compressed,
and another for doing the compression, etc. Another caveat stemming from the singularity
of the domain memory is that if two different packages use it, and they
both assume sole ownership, then at least one of them will end up manipulating the wrong
data. To make sure this didn't happen with my PNG encoder, everywhere that I
use domain memory, I first save what it was before overwriting it, then set it back
when I'm done. Because Flash is single-threaded, this works beautifully to ensure that
my encoder will never cause conflicts with any other code that uses domain memory,
even if it assumes sole ownership.
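The save-and-restore discipline can be sketched like this (in Python, since the real code uses haXe's Flash API; `DomainMemory` and `encode_step` are hypothetical stand-ins for the single globally selected memory chunk and one unit of encoder work):

```python
class DomainMemory:
    """Hypothetical stand-in for Flash's single, globally selected domain memory."""
    current = None  # whatever ByteArray is currently selected

    @classmethod
    def select(cls, byte_array):
        cls.current = byte_array

def encode_step(my_memory, work):
    # Save whatever domain memory some other library may have selected...
    previous = DomainMemory.current
    DomainMemory.select(my_memory)
    try:
        work()  # the fast byte-level reads and writes happen here
    finally:
        # ...and restore it before yielding control. Because Flash is
        # single-threaded, no other code can observe the temporary swap.
        DomainMemory.select(previous)
```

Since nothing else runs between the select and the restore, two libraries that each assume sole ownership never see each other's data.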

There are two main parts to PNG encoding. First, the bitmap data has to
be converted into the right RGBA format (BitmapData.getPixels() and friends unfortunately
yield data in ARGB format). During this conversion, a filter can be applied that increases
the compressibility of the data (e.g. by using deltas between adjacent
pixel values in place of the actual values).
The second phase is to compress the pixel data. There's also some bookkeeping
related to the PNG format, which stores things in "chunks" with CRC-32
checksums, and there's a header and footer chunk, but those are minor details.
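The chunk bookkeeping really is minor: each chunk is just a big-endian length, a four-byte type, the data, and a CRC-32 over the type plus data. Here's a minimal Python sketch (using the standard library's `zlib.crc32`, not my haXe lookup table):

```python
import struct
import zlib

def png_chunk(chunk_type: bytes, data: bytes = b"") -> bytes:
    """Build one PNG chunk: 4-byte big-endian data length, 4-byte type,
    the data itself, then a CRC-32 computed over the type and data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

# The empty IEND chunk that terminates every PNG file:
iend = png_chunk(b"IEND")
```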

I managed to optimize the first phase, which transformed the raw pixel data
into the right format, to the point where it was no longer the bottleneck.
The compression phase was taking over 60% of the entire encoding time -- but there
was nothing I could do about it, since it was abstracted away to a single
call to deflate(). This also meant I couldn't make it asynchronous.
So, of course, I decided to go on a tangent of a tangent of a tangent, and
implement zlib and DEFLATE from scratch (as described by RFCs
1950 and 1951).
This took a lot longer than I expected, but it
was a success! I managed to write a compression algorithm competitive with the
built-in one in terms of speed and compression ratio (on the GOOD setting), and
much faster for highly redundant data (as many images are) on the FAST setting.
What's more, since I had complete control over the implementation, I was able to
write it in such a way that it would be easy to adapt to a chunk-at-a-time,
asynchronous architecture.
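To give a feel for the container formats those RFCs describe: a zlib stream (RFC 1950) is a two-byte header, a DEFLATE payload (RFC 1951), and an Adler-32 checksum. The sketch below builds a valid stream using only uncompressed "stored" DEFLATE blocks; it's nothing like my actual compressor (which does real LZ77 matching and Huffman coding), just the simplest legal encoding of the wrapper:

```python
import zlib

def zlib_stored(data: bytes) -> bytes:
    """Wrap data in a valid zlib stream whose DEFLATE payload uses only
    uncompressed "stored" blocks -- the simplest legal encoding."""
    out = bytearray(b"\x78\x01")  # CMF/FLG: 32K window, no preset dictionary
    pos = 0
    while True:
        block = data[pos:pos + 65535]  # a stored block holds at most 65535 bytes
        pos += len(block)
        final = 1 if pos >= len(data) else 0
        out.append(final)  # BFINAL in bit 0, BTYPE=00 (stored) in bits 1-2
        n = len(block)
        out += bytes([n & 0xFF, n >> 8, (~n) & 0xFF, ((~n) >> 8) & 0xFF])  # LEN, NLEN
        out += block
        if final:
            break
    out += (zlib.adler32(data) & 0xFFFFFFFF).to_bytes(4, "big")  # checksum of raw data
    return bytes(out)
```

Any standard inflater will decompress the result, which makes the format easy to poke at while implementing the real thing.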

Making it Asynchronous

When Flash developers use the word "asynchronous", they typically don't mean it in
its usual sense of "multiple things happening at once", since Flash is single-threaded.
They use it to refer to an algorithm that spreads its processing across multiple frames
so that the UI doesn't appear to lock up (and, in extreme cases, cause the script to time out).
The Flash AVM2 has, at its core, what's been described as the "elastic racetrack".
Basically, a loop which dispatches events and updates the display goes around and around
as fast as it can in order to maintain the chosen frame rate as best as possible. If the frame
rate is low, or updating the display is very fast, then it might go through several event dispatching
cycles (provided there are pending events) before rendering the next frame.

I wanted to make the asynchronous mode of the encoder complete as quickly as possible, but
without degrading the frame-rate intolerably. Another PNG encoder, from BIT-101,
handled a fixed number of scanlines (each horizontal row of pixels is called a scanline) per
frame. This method has a couple of disadvantages: Different Flash programs have different processing
loads, and different platforms yield different performance (which varies depending on background
load too). These both lead to either sub-optimal frame rates, or wasted cycles around the racetrack.
To avoid these issues, I tried an adaptive approach that continuously monitored both the
framerate, and how fast the encoder could process a single scanline of a given image. I could then
use this information to make an estimate for the number of scanlines to process during the next
update -- I called this the "step" size. Theoretically, all I had to do was increase the step size
until the frame rate decreased (to use up any slack space of free cycles), and decrease the step size
as needed to maintain that frame rate. In practice, that turned out to be tricky and error-prone, partly
because of timing inaccuracies, but mostly because the framerate can vary from external factors outside
of the encoder. I ended up just including a target FPS setting (it defaults to 20) and aiming
for that. This was simpler, and gave much better performance. Each update, the step size is updated
to match the target FPS as closely as possible; if the current framerate is more than 15%
worse than the target, then a correction is made to attempt to bring the error delta to zero.
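The correction logic boils down to a small feedback rule. A rough Python sketch (the constants and scaling here are illustrative, not the library's actual values):

```python
def next_step_size(step, measured_fps, target_fps=20, tolerance=0.15):
    """Adjust the per-update scanline count ("step") toward a target FPS.
    If the measured frame rate is more than `tolerance` below the target,
    shrink the step in proportion to the error; otherwise grow it gently
    to soak up any slack cycles."""
    if measured_fps < target_fps * (1 - tolerance):
        # Too slow: scale the step down proportionally, but never below 1.
        return max(1, int(step * measured_fps / target_fps))
    # At or above target: cautiously take on more work per frame.
    return int(step * 1.1) + 1
```

The proportional scaling is what makes this adapt to both the image and the machine: a netbook under load converges to a small step, a fast desktop to a large one.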

Discoveries

First, images are big data. I was mostly using a teeny 200x200px image for testing.
That's 40000 pixels total. And each pixel is 4 bytes (with alpha), giving a total size of
160000 bytes. Anything you do 160000 times (at least, on a netbook processor ;-) )
is magnified hugely. Changes in the order of if conditions would give significant
performance speedups or slowdowns. I ended up with a very tight loop iterating
over every byte, and everything in there mattered, and everything outside
that loop didn't. A regular-sized 1024x768px image is over three million bytes
of raw pixel data!

Also, jumps are slow. Anything involving a jump was automatically suspicious;
unrolling loops, working around ifs, and inlining function calls gave huge
boosts to performance. Nearly every function of the library
was inlined. Disclaimer: I was not using a high resolution timer to measure
performance, so, of course, my measurements had a fairly large margin of error; however,
I only included optimizations that brought down the total time to encode (which was
coarse-grained enough not to worry about timing errors). As always, do your own
benchmarks, and never trust sweeping generalizations like mine ;-)

Finally, it turns out that, despite Flash being single-threaded, it's possible for event handlers,
particularly timer tick event handlers, to be re-entrant (i.e. the event handler function gets
called before a previous call to the same handler has completed). How is this possible given
that there's no multithreading? Well, it turns out that any call to dispatchEvent() (for whatever
purpose) might prompt Flash to deal with pending events during that call to dispatchEvent().
For example, imagine a high-frequency timer with an event handler that, when fired, dispatches a
progress event. During the progress dispatchEvent() call, Flash notices that another timer event is due,
and dispatches the same event handler that caused the progress event in the first place! This is
actually fairly easy to work around (just queue all events you want to dispatch until you're done
doing everything else), but can cause all sorts of nasty, subtle bugs if you're not aware of it.
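The workaround looks roughly like this (a minimal Python sketch; real Flash code would wrap EventDispatcher, and the class and method names here are made up for illustration):

```python
class SafeDispatcher:
    """Queue outgoing events and flush them only after all per-frame work
    is done, so a dispatch can never re-enter the handler that produced it."""
    def __init__(self):
        self.pending = []
        self.listeners = []

    def queue_event(self, event):
        # Never dispatch mid-update; just record the event.
        self.pending.append(event)

    def flush(self):
        # Called once at the end of the update. Handlers fired here cannot
        # interleave with the encoder's own processing.
        events, self.pending = self.pending, []
        for event in events:
            for listener in self.listeners:
                listener(event)
```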

The Results

It took me a while to complete the library and write this blog post, but I've finally finished!
I've imaginatively named my new encoder "PNGEncoder2". You can grab it from GitHub.
See the README file for the full feature list, installation instructions, and usage examples.

Here's a benchmark (source) comparing PNGEncoder2 with other PNG encoders (both synchronous and asynchronous).
Note that the first run might take a little longer than others because of one-time initializations and such.
My encoder uses the Paeth filter to improve the compressibility of the pixel data; for the other PNG
encoders that support filters, I've set them to also use Paeth to match (not all do support filters, however).
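For reference, the Paeth predictor itself (defined in the PNG specification) is tiny; here it is sketched in Python:

```python
def paeth_predictor(a: int, b: int, c: int) -> int:
    """PNG's Paeth predictor: a = byte to the left, b = byte above,
    c = byte above-left. Picks whichever neighbour is closest to the
    linear estimate a + b - c; the filter stores each byte minus this
    prediction, which makes smooth image regions highly compressible."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c
```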

No doubt there are still a few bugs; if you find one, I'd
appreciate a link to a sample image that exhibits the bug.
At one point, there was a particularly nasty bug caused by
a typo in one constant out of a column of 32, which only
showed up when encoding a particular image of a unicorn.
(They are magical!)