Saturday, February 21, 2015

I'm a big fan of Rad's Telemetry profiler, which I used at Valve for various Linux profiling tasks, but it's unfortunately out of reach to independent developers. So I'm just making my own mini version of Telemetry, using imgui for the UI. For most uses Telemetry itself is actually overkill anyway. I may open source this if I can find the time to polish it more.

I've optimized LZHAM v1.1's compressor for more efficient multithreading vs. v1.0, but it's still doesn't scale quite as much as I was hoping. Average utilization on Core i7 Gulftown (6 cores) is only around 50-75%. Previously, the main thread would wait for all parsing jobs to complete before coding, v1.1 can now overlap parsing and coding. I also optimized the match finder so it initializes more quickly at the beginning of each block, and the finder jobs are a little faster.

Here are the results on LZHAM v1.1 (currently on github here), limited to 6 total threads, after a few hours of messing around learning imgui. Not all the time in compress_block_internal() has been marked up for profiling, but it's a start:

Thread 0 is the main thread, while threads 1-5 are controlled by the task pool. The flow of time is from left to right, and this view visualizes approx. 616.4ms.

At the beginning of a block, the compressor allocated 5 total threads for match finding and 2 total threads for parsing.

The major operations:

- Thread 0 (the caller's thread) controls everything. compress_block_internal() is where each 512KB block is compressed. find_all_matches() prepares the data structures used by the parallel match finder, kicks off a bunch of find_matches_mt() tasks to find matches for the entire block, then it finds all the nearest len2 matches for the entire block. After this is done it begins parsing (which can be done in parallel) and coding (which must be done serially).

The main thread proceeds to process the 512KB input block in groups of (typically) 3KB "parse chunks". Several 3KB chunks can be parsed in parallel within a single group, but all work within a group must finish before proceeding to the next group.

- Coding is always done by thread 0.

Thread 0 waits for each chunk in a group to be parsed before it codes the results. Coding must proceed sequentially, which is why it's on thread 0. The first chunk in a group is always parsed on thread 0, while the other chunks in the group may be parsed using jobs on the thread pool. The parse group size is dynamically increased as the match finders finish up in order to keep the thread pool's utilization high.

- Threads 1-5 are controlled by the thread pool. In LZHAM v1.1, there are only two task types: match finding, and parsing. The parsers consume the match finder's output, and the main thread consumes the parser's output. Coding is downstream of everything.

The match finders may sometimes spin and then sleep if they get too far ahead of the finders, which happens more than I would like (match finding is usually the bottleneck). This can cause the main thread to stall as it tries to code each chunk in a group.

Here's a zoom in of the above graph, showing the parsers having to wait in various places:

Some of the issues I can see with LZHAM v1.1's multithreading:
- find_all_matches() is single threaded, and the world stops until this function finishes, limiting parallelization.
- find_len2_matches() should be a task.
- It's not visualized here, but the adler32() call (done once per block) can also be a task. Currently, it's done once on thread 0 after the finder tasks are kicked off.
- The finder tasks should take roughly the same amount of time to execute, but it's clear that the finder job on thread 4 took much longer than the others.

Here's an example of a very unbalanced situation on thread 3. These long finder tasks need to be split up into much smaller tasks.

- There's some serial work in stop_encoding(), which takes around 5ms. This function interleaves the arithmetic and Huffman codes into a single output bitstream.

v1.1 is basically ready, I just need to push it to the main github repo. I'm going to refine the profiler and then use it to tune LZHAM v1.1's multithreading more.

Sunday, February 8, 2015

Improved tiny block performance (i.e. calling lzham_decompress_memory() on a 88 byte string) by 3.3x by profiling the decompression of 400k blocks and focusing on the hotspots.

The top bottleneck was related to initializing the default Huffman tables, as I expected. At decomp startup, all the symbols have a frequency of 1, and the min/max code sizes are either equal (if the table's num_syms==pow2) or only differ by 1 (for non-pow2 tables). So this case was easy to optimize throughout the Huffman decoder table construction code.

The next major bottleneck were some calls to malloc()/free() (up to a total of 34K worth, excluding the dictionary when using unbuffered decompression). I fixed this by adding a malloc_context parameter to any object or container that allocated/freed memory (which was a big pain), then allowing the user to optionally specify a fixed-size memory arena when they create the malloc context. The allocator functions in lzham_mem.cpp then try to allocate from this arena, which just treats it as a simple stack. Only the decompressor uses an arena, because its allocation patterns are very simple.

I won't be pushing these changes up until a lot more testing. I should probably make a branch.

Saturday, February 7, 2015

Over the previous week I've increased the decompressor's average speed by roughly 1.5%-2%, by enabling Huffman decoding acceleration tables on the distance LSB symbol table.

I'm currently testing several changes which decrease the time it takes for the decompressor to initialize, and increases avg. throughput on tiny (<10KB) streams:

The initial set of Huffman tables are very simple because the symbol frequencies are all 1's. The decompressor now quickly detects this case and bypasses the more general canonical Huffman codesize computation step.

The decompressor bypasses the construction of Huffman decoding acceleration tables near the beginning of the stream. It computes the approx. cost of constructing the accel. tables and decoding the next X symbols using those tables, verses just decoding the next X symbols without accel. tables, and chooses whatever is predicted to be cheapest. The Huff decode acceleration tables can be rather big (up to 2048 entries), so this is only a win at the beginning of streams (when the tables are rapidly updated).

These optimizations should significantly reduce the time it takes to reinit() and restart decompression (hopefully around 30-50%). I'm going to profile the decompressor over the time it takes to startup and decode the first few KB next.

Wednesday, February 4, 2015

Igor Pavlov and the devs at encode.ru gave me some pointers on how to create a custom 7z codec DLL. Here's the first 32-bit build of the 7zip custom codec plugin DLL, compatible with 7zip 9.38 beta x86 (note I've updated this with 1 decompressor bugfix since the first release a couple days ago):

To use this: install 9.38 beta (I got it from http://sourceforge.net/projects/seve...es/7-Zip/9.38/), manually create a subdirectory named "Codecs" in your install directory (i.e. "C:\Program Files (x86)\7-Zip\Codecs"), and extract LzhamCodec.DLL into this directory. If you run "7z i" you should see this:

-md=64K, etc. - Override dictionary size. If you don't set this the dictionary size is inferred via the -mx=# parameter setting.

Example usage:

7z -m0=LZHAM -mx=8 a temp *.dll

To use LZHAM from the GUI, you can put custom parameters on the command line (without -m prefixes), i.e.: "0=bcj 1=lzham x=8". (I think you can only specify algorithm and level options here.) Note if you set the Compression Level pulldown to "Ultra" LZHAM will use best of 4 (extreme) parsing and be pretty slow, so the highest you probably want to use is "Maximum":

Sunday, February 1, 2015

I integrated the LZHAM codec into the awesome open source 7zip archiver a few years ago for testing purposes, but I was hesitant to release it because I was still frequently changing LZHAM's bitstream. The bitstream is now locked, so here it is in case anyone else out there finds it useful. (Updated: See my next post for a 7zip 9.38 beta compatible codec plugin.)

Important note: Please do *not* depend on this for anything important, this is only for testing purposes. There are very likely bugs in here, and the LZHAM codec id will be changing. I'll be releasing an official codec within the next week or two.

Note I haven't updated the makefiles yet, just the VS 2008 project files. This has only been tested by me, and I'm not an expert on the very large 7zip codebase (so buyer beware). I did most of this work several years ago, so this is undoubtedly an outdated version of 7zip.
I've only been able to compile the 32-bit version of 7zip so far, so the max. dictionary size is limited to 64MB. (Important note: I'm not trying to fork or break 7zip in any way, this is *only* for testing and fooling around and any archives it makes in LZHAM mode shouldn't be distributed.)

I'll be merging my changes over into the latest version of 7zip, probably next weekend. Also, LZHAM is statically linked in at the moment, I'll be changing this to load LZHAM as a DLL.

Here are some example command line usages (you can also select LZHAM in the GUI too). The method may range from 1-9, just like LZMA, and internally it's converted to the following LZHAM settings. You can use the "-md=16M" or "-md=128K" option to override the dictionary size. The -mmt=on/off option controls threading, which is on by default (i.e. -mmt=on or -mmt=off), and this new option controls deterministic parsing (which defaults to *off*): -mz=on

In practice, beware using anything more than -mx=8 ("Maximum" in the GUI) unless you have a very powerful machine and some patience. Also, unless you're on a Core i7 or Xeon LZHAM's compressor will seem very slow to you, because the compressor is totally hamstrung on single core CPU's. (LZHAM is focused on decompression speed combined with very high ratios, so compression speed totally takes back seat.)

About Me

Back in the day I worked for several years at Digital Illusions on things like the first shipping deferred shaded game ("Shrek" - 2001), software renderers, and game AI. Then, after working for Microsoft at Ensemble Studios for 5 years as engine lead on Halo Wars, I took a year off to create "crunch", an advanced DXTc texture compression library. I then worked 5 years at Valve, where I contributed to Portal 2, Dota 2, CS:GO, and the Linux versions of Valve's Source1 games. I was one of the original developers on the Steam Linux team, where I worked with a (somewhat enigmatic) multi-billionare on proving that OpenGL could still hold its own vs. Direct3D. I also started the vogl (Valve's OpenGL debugger) project from scratch, which I worked on for over a year. In my spare time I work on various open source lossless and texture compression projects: crunch, LZHAM, miniz, jpeg-compressor, and picojpeg.