This code works better than the first versions and runs at approximately the same speed as gzip (which is a pretty well-optimized piece of code). The size of the whole library is also acceptable (~1400 bytes).

So, how can this code be improved? Is there some other, faster technique able to provide this functionality? MMX?

Having written the full boat of deflate and inflate in x86_64, I can only tell you how I implemented the bit buffering... both deflate and inflate in my library are on average 30% faster than the reference gzip/zlib.

I am sure you are probably already aware of how the reference C version does it: it keeps a "hold" and a count of the bits in it, and just snips off the lower bits as needed (and occasionally, of course, puts them back when necessary).
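For readers who haven't dug into zlib's sources, that hold/bitcount scheme can be sketched in C roughly like this (a minimal sketch with illustrative names, not zlib's actual identifiers):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch of zlib-style bit buffering (names are illustrative,
 * not zlib's actual identifiers). The decoder keeps a "hold" of pending
 * bits and a count of them, refilling one byte at a time. */
typedef struct {
    const unsigned char *next;  /* next input byte */
    size_t avail;               /* input bytes remaining */
    uint32_t hold;              /* bit accumulator */
    unsigned bits;              /* number of valid bits in hold */
} bitreader;

/* Ensure at least n bits are buffered (n <= 25 so a byte always fits). */
static int need_bits(bitreader *br, unsigned n) {
    while (br->bits < n) {
        if (br->avail == 0) return 0;        /* out of input */
        br->hold |= (uint32_t)(*br->next++) << br->bits;
        br->avail--;
        br->bits += 8;
    }
    return 1;
}

/* Snip off the lower n bits of the hold. */
static uint32_t get_bits(bitreader *br, unsigned n) {
    uint32_t v = br->hold & ((1u << n) - 1);
    br->hold >>= n;
    br->bits -= n;
    return v;
}
```

The "putting back" mentioned above corresponds to restoring unconsumed bits to the hold when a decode step over-reads.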

My macros to actually feed my own version wouldn't make sense here in standalone fashion, but since I am living in the land-o-plenty register-wise, I dedicated a register to each: my hold is 64 bits wide, and I use a 32-bit register to keep track of its bit count.

I disliked how the reference version only ever feeds its hold 8 bits at a time (due to its portability requirements it must be that way), so I feed my "accumulator" 32 bits at a time, and it is quite fast. All of the non-general-purpose instructions for shuffling bits around are indeed much slower.
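In C, that wide-accumulator refill looks roughly like this (illustrative only; the real version is x86_64 assembler with the accumulator in a register, and this sketch assumes little-endian input with at least 4 readable bytes at the pointer):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a 64-bit accumulator refilled 32 bits at a time, instead
 * of zlib's byte-at-a-time feed. Illustrative names, not the poster's
 * actual macros. Assumes little-endian input and that at least 4
 * bytes are readable at `next`. */
typedef struct {
    const unsigned char *next;
    uint64_t acc;       /* 64-bit hold */
    unsigned bits;      /* valid bits in acc */
} wide_reader;

static void refill32(wide_reader *r) {
    if (r->bits <= 32) {            /* room for a whole 32-bit chunk */
        uint32_t chunk;
        memcpy(&chunk, r->next, 4); /* unaligned-safe load */
        r->acc |= (uint64_t)chunk << r->bits;
        r->next += 4;
        r->bits += 32;
    }
}

/* Look at the low n bits without consuming them. */
static uint64_t peek_bits(const wide_reader *r, unsigned n) {
    return r->acc & (((uint64_t)1 << n) - 1);
}

/* Consume n bits. */
static void drop_bits(wide_reader *r, unsigned n) {
    r->acc >>= n;
    r->bits -= n;
}
```

Because the hold is 64 bits wide and a refill adds 32, a single refill guarantees enough bits for several decode steps in a row, which is where the speed comes from.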

My actual full-spec-compatible inflate routine is 12736 bytes including embedded table and jump data; toss in the adler32, crc32, file/memory management, and everything else needed to make a standalone inflate, and my final static binary is ~30K.

Anyway, using my accumulator method ended up being nearly 1:1 with how the reference zlib does it... in the below code snippet, the accumulator (especially early on) is prepopulated, so the macros don't actually do anything other than verify that there are indeed enough bits in it, and then the code accesses the accumulator's lower bits directly (the accumulator here is r12):

@redsock - To be honest, I have never read the sources of zlib or gunzip, simply because I am not skilled enough in C/C++ and reading such sources is very tedious for me. My implementation follows RFC 1951 entirely.

Although, your code looks pretty similar to the one I use, except that it is 64-bit (it is good to have so many registers).
Anyway, it seems gzip for 64-bit does not use all the power of the 64-bit platform, because I didn't notice any difference in speed between the 32-bit and 64-bit versions.

BTW, isn't 12K too much for such an algorithm? My implementation is 1400 bytes (well, after some size optimizations) and it inlines all inner-loop code for speed. (Read that as "all code", because Inflate is actually one big inner loop.)

Hahah, about 2500 bytes of it is fixed inline tables; the actual spec-compliant inflate "one big inner loop" is ~2100 lines of assembler (which of course includes a good many inlined macro calls that expand to more still).

I think you'll find that in order to be zlib (wrap == 1) and gzip (wrap == 2) compliant, it must be around this size.

Also, re: your earlier comment about gzip/zlib not taking advantage of 64-bit: it has to do with several things that are a core part of the zlib design. I find it most amusing that the contributed x86_64 assembler "longest_match" function actually slows it down (which is why it isn't part of the actual x86_64 build by default).

Most people don't realize how it all looks from a profiling perspective. For example, this is a profiler run from my deflate routine (profiling of course adds a great deal of overhead, but gives a very accurate picture of each routine and where time is spent):

The input for that profiler run is 460MB.

Inflate is obviously much simpler, and here is a profiler run from inflating the output of the previous step:

My code is already 30% faster than the reference version, but you can see a great deal of optimization room here, more so if I opted not to do proper crc32/adler32/etc...

I could spend a full year doing nothing but optimizing them further, haha... the register stalls and dependency chains through it are nasty.

The difference between 32-bit and 64-bit zlib isn't "dramatic" due to the way it deals with bit buffering, windows, and longest-match searching, and short of gutting a good chunk of it (thus ruining portability), hmm, I'd say the reference version is the way it is quite intentionally...

So, after some optimization, my routine is now about 20% faster than the gzip implementation and is under 1300 bytes in size, which I have decided is a good enough result.

Profiling the code, I discovered that most of the time is spent scanning the stream bit by bit and traversing the Huffman tree, so optimizing memory access or multi-bit reading gives a very small performance gain and is practically useless.
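For illustration, the kind of bit-by-bit Huffman walk described above can be sketched in C like this (the node layout is hypothetical, not my actual structure):

```c
#include <stdint.h>

/* Sketch of a bit-by-bit Huffman tree walk (hypothetical node layout,
 * for illustration). Each input bit selects one child until a leaf is
 * reached, so every decoded symbol costs a branch and a memory access
 * per code bit -- which is why the traversal itself, rather than
 * memory-access tuning, dominates the profile. */
struct hnode {
    int16_t child[2];   /* indices of the 0/1 children, or -1 */
    int16_t symbol;     /* valid at leaves (both children are -1) */
};

/* Walk the tree one bit at a time; `bits` holds 0/1 values. */
static int decode_symbol(const struct hnode *tree, int root,
                         const int *bits, int *pos)
{
    int n = root;
    while (tree[n].child[0] != -1 || tree[n].child[1] != -1)
        n = tree[n].child[bits[(*pos)++]];
    return tree[n].symbol;
}
```

The inner loop here is a serial dependency chain (each bit decides the next load), which is what makes it so hard to speed up with local code optimizations.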

Also, all kinds of code alignment were useless - the speed was always the same.

I am not using zlib for testing. I simply have one big compressed file and use "time gzip -d testfile.gz" and "time TestDeflate" to compare speeds. Note that I am not handling the zip or gz file formats; I am implementing only the Inflate algorithm. My test program reads the compressed file, decompresses it, and writes the output file. This is implemented only as quick test code, not to be released as a gzip replacement, but all the integrity and error checks for the Deflate stream are implemented and working. My main goal is to provide the algorithm for use in a PNG decoding library; that is why only the Inflate part is implemented for now. Maybe some day I will write Deflate as well, but not now.

The paper describes a mutation of the Huffman tree with an added code-size-based lookup table that can improve performance.
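For reference, the general length-aware lookup-table idea can be sketched in C as follows (a generic illustration of table-driven Huffman decoding, not the paper's exact scheme; codes are taken LSB-first as they appear in a Deflate-style stream):

```c
#include <stdint.h>

/* Generic sketch of a length-aware Huffman lookup table (illustrative
 * names; not the paper's exact scheme). The table is indexed by the
 * next MAXBITS input bits at once; each entry stores the decoded
 * symbol and the true code length, so one load replaces a bit-by-bit
 * tree walk. Codes here are given LSB-first, as in a Deflate stream. */
#define MAXBITS 4

struct entry {
    uint8_t len;    /* true code length in bits */
    uint8_t sym;    /* decoded symbol */
};

/* Fill every table slot whose low `len` bits equal `code`. */
static void fill(struct entry *tab, unsigned code, unsigned len, unsigned sym) {
    for (unsigned i = code; i < (1u << MAXBITS); i += 1u << len) {
        tab[i].len = (uint8_t)len;
        tab[i].sym = (uint8_t)sym;
    }
}

/* Decode one symbol: peek MAXBITS bits, do one table load, consume len. */
static unsigned decode(const struct entry *tab, uint32_t *hold, unsigned *bits) {
    struct entry e = tab[*hold & ((1u << MAXBITS) - 1)];
    *hold >>= e.len;
    *bits -= e.len;
    return e.sym;
}
```

Each short code occupies several table slots (all bit patterns that start with it), which is what lets a single indexed load resolve a variable-length code.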

I read this article and the trick is really interesting, but I am afraid it is not usable for the Deflate/Inflate algorithm, because it needs the whole Huffman tree building algorithm to be changed. That way it would become a new compression/decompression algorithm, not compatible with Deflate/Inflate.

Yeah, that's what I figured. My only point was that, from the code you posted, there were no crc32 verifications, nor code/distance error checking... so it might be misleading to say your routine is 20% faster than gzip -d, considering that gzip -d is also doing crc32 validation against all of its input, as well as inner-loop distance/code validations. I think you may find that if you modified gzip/zlib's inflate_fast routine to behave as yours does, zlib's would be faster still.

I also use my inflate routine for PNG support, though I didn't bother with the entire set of PNG encodings, as I only wanted PNG24 support.

Well, crc32 is much faster than the Inflate decoder, so I don't think there will be a significant difference. But still, I am not trying to beat gzip. I only wanted to have a code with normal performance. Anyway, as I wrote above, my research shows that a big performance increase can be achieved only with different Huffman-tree traversal algorithms, not with simple code optimizations.

@JohnFound: Absolutely no offense intended, but you are way off the mark here.

Rather than discussing the internals of zlib (which is effectively gzip/gunzip) in the abstract, I have taken the liberty of modifying zlib-1.2.8 specifically to highlight the differences that are really happening when you take your simplistic timing view.

I would like to point out that the _only_ reason I feel compelled to do so is that by stating your routine is 20% faster than gzip, you are doing a tremendous disrespect to Mark Adler (whom I do not know, and am not affiliated with whatsoever). Zlib itself has been carefully reviewed and optimized by a great many respected people. When I say my code is 30% faster than it, I am performing the exact same amount of work (to detect bitrot in archives, etc.). And it isn't that my own library and/or statements are somehow privileged; it is only that you don't seem to understand just how it all comes together (nor should you, given that you have only read the RFC).

I appreciate that you are not trying to outperform gzip, but there are a great many factors involved in what gzip is that your code simply does not do, and passing crc32 off as insignificant/negligible is just plain wrong.

So, back to my case in point: I downloaded a fresh copy of zlib-1.2.8 from zlib.net and then spent 10 minutes modifying its inflate routines to include provisions for JOHNFOUND #ifdefs.
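To make the shape of such a change concrete (this is an illustration, not the actual patch; the JOHNFOUND_NO_CHECK macro and helper names here are invented for the example), the idea is to guard the per-buffer checksum update with an #ifdef so both sides of the timing comparison do equal work. The crc32 below is a minimal table-driven version of the standard reflected CRC-32 (polynomial 0xEDB88320) that gzip uses:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only -- not the poster's actual zlib patch. A minimal
 * table-driven CRC-32 (reflected, polynomial 0xEDB88320, the same CRC
 * gzip validates), with the update guarded by a hypothetical #ifdef. */
static uint32_t crc_table[256];

static void crc_init(void) {
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[n] = c;
    }
}

static uint32_t crc32_update(uint32_t crc, const unsigned char *p, size_t n) {
    crc ^= 0xFFFFFFFFu;                 /* pre-condition */
    while (n--)
        crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;           /* post-condition */
}

/* Called once per decompressed output chunk. Define JOHNFOUND_NO_CHECK
 * to skip validation, mimicking an inflate that does no crc32 work. */
static uint32_t checksum_output(uint32_t crc, const unsigned char *buf,
                                size_t len) {
#ifndef JOHNFOUND_NO_CHECK
    crc = crc32_update(crc, buf, len);
#endif
    return crc;
}
```

Note that this checksum pass touches every output byte once, so whether it is enabled or not directly changes the amount of per-byte work being timed.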

If I went further and disabled all of the invalid length/distance/code checking (and there is a lot of it), I think you'd find that zlib itself would go yet again much faster.

My main point is to highlight to other readers that you can't say your code is faster than someone else's when in fact it is not; whether you appreciate the associated nasties involved in the what/why/how isn't really important.

$$rant complete$$ hahah. Again, I am not trying to be rude or offensive; I have just spent a LOT of time dwelling in this area (which is why I latched onto your thread in the first place).

JohnFound wrote:

Well, crc32 is much faster than the Inflate decoder, so I don't think there will be significant difference.

Modern CPUs are very complex, and I think you can't simply shrug off such a thing without actually doing a test to see the difference. There are many things interacting and competing for resources, and even a seemingly simple change can have a dramatic effect under the right (wrong?) conditions.

JohnFound wrote:

But still I am not trying to beat gzip. I only wanted to have a code with normal performance.

Then I guess you must have changed your goal after the initial post:

JohnFound wrote:

what is the fastest way

I expect many people are confused by your response since it appears to go against the initial purpose.
