I have been waiting for this to hit 1.0 and, more importantly, get popular so that I can use it everywhere. I am really a fan of Yann Collet's work. It is extremely impressive, especially when you consider that lz4 seems to be better than Snappy (from Google) and Zstandard better than LZFSE (from Apple). I think he is the first one to write a practical fast arithmetic coder using ANS. And look at how his Huffman implementation blazes past zlib's, though it compresses less than FSE [0]. I also like reading his blog posts. While a lot of them go over my head, I can generally make sense of what he is trying and why something is working, despite the complexity.

Cyan4973/Yann also is well known for xxhash[1], which is one of the faster hashers[2] out there. Post a new hasher and people will probably ask about xxhash (ex: metrohash[3]). Guy is an absolute machine. If you search Google for 'zstd' right now, you'll find him, not Facebook, namely: https://github.com/Cyan4973/zstd . Glad his work is being supported by someone now! Immensely well deserved, after so many years of helping make everyone fast.

PS - I didn't know a lot of the terms you used, but the Finite State Entropy (FSE) link you provided does a good job intro'ing some of them, and the linked paper Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding [4] (ANS) seems interesting.

Yes, I forgot about xxHash :). I think it was created as part of checksumming in LZ4 (not entirely sure). That is another amazing part of his work. He is producing these things, xxHash, FSE, Huff0, that are state-of-the-art projects in their own right, alongside his state-of-the-art compression algorithm, making sure others can benefit the most from his work without reinventing it. At the same time he is also blogging his thought process and experiments in a manner that even a layman like me gets the gist of, for the most part. Whether it's useful to an expert in the field I can't say.

There are a lot of other gems in his blog. He points to resources and other people he has worked with, making it much easier to get a bigger picture of the related items.

> I think he is the first one to write a practical fast arithmetic coder using ANS.

I don't think he is the first; although RAD Game Tools has been cagey about the details, many strongly suspect their recently announced Kraken, etc. use ANS in some form, as do some of their previous products; see the discussion here:

Yeah, I read about Kraken a while ago and found the encode.ru thread you pointed to. It's unfortunate that we can't compare them side by side. Those guys seem to be another set of compression geniuses. Overall, the world of compression is moving fast and in a good direction these days.

But even with that, Yann might be the first. Yann's work on FSE is old [0], and if I'm reading this correctly it is a mix of his own work and Jarek Duda's. But it seems one of his failed attempts challenged Jarek to invent another variant of ANS called rANS [1], and as you can see in the encode.ru comments from RAD Game Tools, they seem to be using rANS at least for Kraken. Regardless, these are impressive works, and these people are bouncing ideas off each other, being challenged and inspired, which is a very good thing for the rest of us :).

There are now actually open source decompressors[0] for Oodle formats, and yes, many use rANS. LZNA's clever use of SSE2 to decode and update 16-symbol models is especially interesting. FSE (which is tANS) is a different beast, though, and more suitable for rarely updated models.

That was because of the belief that he had used an illegal copy of the DLL, and that he had copied all implementation details (neither is true). A representative from RAD said they would withdraw the DMCA complaint now after that misunderstanding was cleared.

Now here things get really interesting [0]. The claim is that a higher level of compression has negligible impact on decode speed, making Kraken really impressive for compress-once, decompress-many situations.

I agree. I just started using Zstd for uniprot.org (dev branch), and it looks like it is a lot faster in decoding than deflate/zlib, with even smaller files on disk. It is one of those things where the improvement really matters for us and gives faster results for our users.

i.e. download speed is up to 20% faster with Zstd as compression algo for backing store compared to Deflate. Assuming bandwidth is available ;)

There is just so much awesome stuff in this article. Finite State Entropy and Asymmetric Numeral System are completely new concepts to me (I've got 7 open tabs just from references FB supplied in the article), as is repcode modeling. I love that they've already built in granular control over the compression tradeoffs you can make, and I can't wait to look into Huff0. If anyone outside of Facebook has started playing with it or is planning to put it into production right away I'd love to hear about it.

The best TL;DR of ANS is something like this (without being too wrong). It's still too long:

Huffman requires at least one bit to represent any symbol, because it finds unique prefix codes for every symbol by varying the leading bits.

Arithmetic coding encodes symbols as fractional numbers of bits, by using binary fractions.
It divides up the range to make this work. In the end, you get one fraction per "message" that can be decoded back into the message (i.e., a fraction like 0.53817213781271231...).

Range coding is similar; it just uses integers instead of floating point. You get one number that can be decoded back into the message (i.e., a number like 12312381219129123123).
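The range-division idea can be sketched with exact fractions (a toy illustration, not a production coder; the two-symbol model here is made up, and real coders must manage precision incrementally):

```python
# Toy illustration of arithmetic-coding interval subdivision.
# Exact fractions sidestep the rounding issues a real coder must manage.
from fractions import Fraction as F

probs = {"a": F(3, 4), "b": F(1, 4)}  # made-up static model: 'a' 75%, 'b' 25%

def encode_interval(msg):
    low, width = F(0), F(1)
    for s in msg:
        # Narrow [low, low + width) to the sub-range assigned to s.
        start = F(0)
        for t, p in probs.items():
            if t == s:
                break
            start += p
        low += width * start
        width *= probs[s]
    return low, width  # any number in [low, low + width) encodes msg

def decode_interval(x, n):
    out = []
    for _ in range(n):
        start = F(0)
        for t, p in probs.items():
            if start <= x < start + p:
                out.append(t)
                x = (x - start) / p  # rescale; continue with the remainder
                break
            start += p
    return "".join(out)

low, width = encode_interval("aab")
assert decode_interval(low, 3) == "aab"
```

Note how likely symbols shrink the interval less, so they cost fewer digits of the final fraction.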

Note that in both of these, you have not changed the number system at all. The even and odd numbers still have the same density.

Another way to look at the above is based on what a bit selects.

In Huffman, a single bit is generally not enough to select something; the prefixes are long. So you need to walk a bunch of bits to figure out what you've got.

In arithmetic or range coding, a single bit selects a very large range. They change the proportions of those ranges, but at some point, something has to describe that range. This is because it's a range. Knowing you have gotten to the 8 in 0.538 doesn't tell you anything on its own; you need to know "the sub-range for symbol a is 0.530 ... 0.539", so it's a. So you have to transmit that range.

ANS is a different trick. Instead of encoding things using the existing number system, it changes the number system.
That is, it redefines the number system so that our even and odd numbers are still uniformly distributed but have different densities. If you do this in the right way, you end up with a number system that lets you only require one number to determine the state, instead of two (like in a range).
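A toy sketch of the rANS flavor of that idea (a hedged illustration with made-up frequencies; real codecs add renormalization to keep the state inside a machine word):

```python
# Toy rANS: encode a message into a single integer state, no renormalization.
# Python big ints let the state grow unboundedly; real implementations
# stream bits out to keep it in a 32/64-bit register.

freqs = {"a": 3, "b": 1}      # made-up model: 'a' is 3x as likely as 'b'
M = sum(freqs.values())       # total frequency (4); slots 0..3 per cycle
cum, acc = {}, 0              # cumulative start of each symbol's slot range
for s, f in freqs.items():
    cum[s] = acc
    acc += f

def encode(symbols):
    x = M                     # initial state
    for s in reversed(symbols):   # encode backwards so decoding reads forwards
        f, c = freqs[s], cum[s]
        x = (x // f) * M + c + (x % f)
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        slot = x % M          # the state's low "digit" selects the symbol
        s = next(t for t in freqs if cum[t] <= slot < cum[t] + freqs[t])
        out.append(s)
        x = freqs[s] * (x // M) + slot - cum[s]
    return "".join(out)

assert decode(encode("aabab"), 5) == "aabab"
```

Frequent symbols grow the state less per step, which is how one number carries a fractional number of bits per symbol without transmitting a range.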

With regular arithmetic coding and Huffman coding you don't need to send over a dictionary. Instead, you can have an adaptive model that learns to compress the data as it goes (e.g. keeps running track of symbol frequencies), and it will still be reversible.

Static coding is much cheaper: it uses the same probabilities for the entire data block (e.g. 30 kB), with the probabilities stored in the header. Practically all Huffman and tANS compressors work this way (though there are exceptions: https://en.wikipedia.org/wiki/Adaptive_Huffman_coding ).

I know that's what adaptive coding means; I thought that was impossible with rANS. All the descriptions of rANS I could find (back when I looked at it) described it in terms of a static model, and I ran into problems trying to generalize it to an adaptive one.

Imgur album now compares brotli and zstd. Looks like zstd is always significantly faster, but Brotli compresses slightly better for every dataset except the XML one.

This is with brotli level 1, by the way. My understanding is that brotli is pretty quick through the first few levels, but the levels that ask for the highest compression are insanely slow (which is a valuable thing to have as an option, for things like game assets or something, which are compressed once and delivered many times!)

For "compress once, decompress many times" workloads like game assets, you might want to check out Oodle, a proprietary library that kills zstd in decompression speed while achieving better compression ratios.

Not just tuned for web workloads in general, but for specific web workloads. The Brotli dictionary is mostly composed of English words and phrases, and fragments of HTML, CSS, and Javascript. It would perform poorly on non-English text.

I have a feeling that the dictionary was designed with the specific goal of performing well on a specific corpus similar to the Large Text Compression Benchmark[1]. It has quite a few words and phrases that I'd associate with Wikipedia's "house style".

I understand that brotli is incredibly slow compared to LZMA when using these high-ratio settings (q11, w24), so slow as to be impractical in production even in a write-once, read-many scenario if you have any non-trivial amount of data being regularly produced. I do not want to have a farm of machines just to handle the brotli compression load of our data sets because it is 10x slower than LZMA.

>What is the compression time for brotli to achieve compression ratios comparable to LZMA...

Compression time is indeed the achilles heel of Brotli, which is why it's something I would only use for compress once (and preferably decompress very often) scenarios.

Compared to lzma at its best compression setting for this particular data (-mx9 -m0=LZMA:d512m:fb273:lc8), brotli took 6 minutes and 4 seconds to compress, while the same data took 1 minute and 47 seconds for lzma.

On the other hand, brotli decompressed the same data in 0.6 seconds, while it took lzma 2.2 seconds.

Yann will be giving a talk on Zstandard at today's @Scale 2016 conference, and the video will be posted. He can answer the most technical questions about Zstandard, but I may be able to answer some as well; we both work on compression at Facebook.

I am really looking forward to this. I usually prefer reading to videos, but for a complicated topic with a good presenter, a talk can actually be a comprehensive starting point. Will the video be posted today, or will we have to wait?

One thing I haven't figured out from either today's post or Yann's blog is whether Zstandard switches between huff0 and FSE depending on compression level, or somehow uses both together. Also, the post says it's both OoO-friendly and multi-core-friendly, but are the speed benchmarks single-core or multi-core? Is only the format/algorithm multi-core-friendly, or can the standard CLI run multi-threaded?

All benchmarks today are single threaded. The algorithm itself is single threaded, but can be parallelized across cores. We will soon release a pzstd command line utility to demonstrate this, similar to pigz, which accelerates both compression and decompression.

Zstandard uses both huff0 and FSE together when it compresses -- it doesn't switch between them based on the input.

The modern trend of compressors is to use more memory to achieve speed. This is good if you're using big-iron cloud computers...

"Zstandard has no inherent limit and can address terabytes of memory (although it rarely does). For example, the lower of the 22 levels use 1 MB or less. For compatibility with a broad range of receiving systems, where memory may be limited, it is recommended to limit memory usage to 8 MB. This is a tuning recommendation, though, not a compression format limitation."

8MB for the smallest preset? Back in the mid-2000s, I was attending a Jabber/XMPP discussion, about the viability of using libz for compressing the stream. It turned out that even just a 32kb window is huge when your connection server is handling thousands of connections at a time, and they were investigating the effect of using a modified libz with an even smaller window (it was hard-coded, back then).

I know Moore's law is in ZStandard's favor w.r.t. memory usage (what's 8MB when your server's got 64GB or more?), but I think it's useful to note that this is squarely aimed at web traffic backed by beefy servers.

Any modern server that handles a thousand or more concurrent connections on commodity hardware already uses only about as many threads as there are processor cores. In that architecture it's trivial to also limit the number of compression threads to the number of processor cores. That architecture gives the best performance and very low memory use.

In the mid-2000 it was still accepted norm to spawn one thread for each connection, where memory usage of the compressor would have been a problem. I doubt that it's a problem with today's software architecture.

A server like this could only work by buffering an entire response before compressing it once, requiring (compressed + uncompressed) bytes of temporary space. In reality, most servers of the design you mention operate in a streaming fashion, flushing the compressor's output just in time as the backend fills its input buffer. In designs like that (most of them), a compression context per connection is still required.

Not sure I agree. The 8 MB is the recommended upper limit, so I don't think anyone is planning to use that for web traffic. I think it's designed to be faster and compress better even at lower window sizes, though I'm not sure how low. It most likely performs better than zlib even at 32 kB, or is at least faster, I would assume. Now, if you are a jabber/chat server holding thousands of long-running connections, it can be an issue, but you already said how even standard zlib doesn't work there.

I don't think 8MB is the smallest preset, the text you quoted says that the lower levels use "1 MB or less".

The concern I have is that this makes it sound like the compressor can choose how much memory the decompressor will need to use. Does this mean that zstd can't be used in a potentially adversarial environment? (Eg. is there a denial-of-service vector here by forcing the server to use large amounts of memory to decompress my requests?)

It will not use (much) more memory than the size of the output in any case. 8MB is the window here, which just means the decompressor can discard data that falls outside this 8MB window as it is decompressing.

You can get pretty far with "amnesiac" zlib for networking, too. You collect up writes in your out-buffer, and use zlib to compress it before transmission. The trick is that you don't retain context or find matches between chunks, so there's no memory overhead between sends.
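A minimal sketch of that "amnesiac" pattern (the function names are mine, and in practice you'd tune the compression level and buffer sizes):

```python
# "Amnesiac" compression: each chunk gets a fresh zlib context, so no
# per-connection compressor state survives between sends. The trade-off
# is losing matches that span chunks.
import zlib

def send_chunk(payload: bytes) -> bytes:
    # New context per call: nothing is retained afterwards.
    return zlib.compress(payload, 6)

def recv_chunk(wire: bytes) -> bytes:
    # Decompression likewise needs no long-lived state.
    return zlib.decompress(wire)

msg = b"status: ok\n" * 100
wire = send_chunk(msg)
assert recv_chunk(wire) == msg and len(wire) < len(msg)
```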

I'm a complete dunce when it comes to compression and how it fits in the industry, so help me out here. Say that everyone accepts that Zstandard is amazing and we should start using it. What would the adoption process look like? I understand individual programs could implement it since they would handle both compression and decompression, but what about the web?

Would HTTP servers first have to add support, then browser vendors would follow?

The browser sends the server a request header indicating which compression methods it understands. Current Firefox for example sends

Accept-Encoding: gzip, deflate, br

meaning the server is free to send a response compressed with either gzip, deflate or brotli. Or the server can choose to send the data uncompressed.

This means the adoption path for the web would be an implementation in at least one major browser, which advertises the capability with the Accept-Encoding header. Then any server can start using Zstandard for clients that accept it.
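Server-side, the negotiation could be sketched like this (a hypothetical helper; the "zstd" token is an assumption, since no Accept-Encoding name had been registered for it at the time, and q-values are ignored for brevity):

```python
# Sketch of Accept-Encoding negotiation: pick the server's most
# preferred encoding that the client also advertises.
def choose_encoding(accept_encoding: str) -> str:
    supported = ["zstd", "br", "gzip"]  # server preference order (assumed)
    offered = [tok.split(";")[0].strip() for tok in accept_encoding.split(",")]
    for enc in supported:
        if enc in offered:
            return enc
    return "identity"  # fall back to sending uncompressed

assert choose_encoding("gzip, deflate, br") == "br"
assert choose_encoding("gzip, deflate, br, zstd") == "zstd"
```

A real implementation would also honor q-values and the `identity;q=0` case.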

It's also possible to implement a decompressor in javascript to support browsers which don't do it natively. The performance would likely suck but if you're truly bandwidth constrained and don't mind users having a bit of a lag, it's an option...


On another note, I wonder what happens when personal and professional lives interfere. The profile of Yann Collet in the blog post is the above link, and I can't help but think: why am I on Facebook looking at a baby's picture on someone's profile instead of GitHub or at least LinkedIn? Seems like something he might want to keep private (and yes, I'm totally assuming here; I don't know his preferences).

Now, I know you can restrict the stuff you post to friends only instead of public like he did, but it is still something people (and Facebook, for its employees) should consider if they want their Facebook profile to become their professional contact page.

The numbers are so dramatically different that I ran several different tests, but those results showed the same rough results. I used default command-line options for both tools, and both created very similar compression ratios.

Note that LZFSE has a somewhat different goal, however: it's designed to be the most power-efficient compression algorithm out there, in other words on mobile devices LZFSE optimizes for bytes-per-watt rather than bytes-per-second. Zstandard, on the other hand, runs multiple pipelines and such; it's banking on having a server-class processor to run on.

Edit: hardware is a 2013 MacBook Pro, pretty fast flash storage, and 2 cores/4 threads. I warmed cache before each run and sent output to /dev/null, so the numbers above are best-case.

"Note that LZFSE has a somewhat different goal, however: it's designed to be the most power-efficient compression algorithm out there, in other words on mobile devices LZFSE optimizes for bytes-per-watt rather than bytes-per-second."

Sure, but is that even relevant? I mean, is there any way that lzfse could possibly be more power-efficient per byte than zstd when zstd is 3-4 times faster for the same compression ratio? According to the docs zstd doesn't have any support for multiple threads right now, so it should be a fair comparison.

I use "fastest to complete == least power usage" as a rule of thumb because of "race to sleep". I suppose that might be thrown off by power usage characteristics varying based on number of cores working? How does one even begin to write code that prioritizes power-efficiency over performance?

It can be. Certain operations are more power efficient than others: subtraction followed by a check for a negative value instead of a comparison, certain vector operations, loop unrolling, using less memory/bus traffic...

>> Are there cases where an algorithm takes 10x the wall clock time to execute, but actually uses less energy on the same chip?

Slower code can be more power efficient. You're just tuning for different results.

I'd imagine this also requires very detailed knowledge of the chip, microcode, etc. Probably hard for x86, but I guess this kind of thing would happen on ARM, where the programmer can have deep vertical access (like, as mentioned, Apple).

That's what I was thinking. I haven't found any validation of, or even the rationale behind, LZFSE's supposed lower power usage. I can think of two possibilities.

1. Apple created LZFSE as a fast and reasonably well-compressing step up from LZ4, improving power usage, because nothing like it existed, but it has since been beaten out handsomely by Zstandard.

2. Following a parent comment, Zstandard might do some things that depend on a highly OoO CPU with lots of cache and an extremely good branch predictor, and could be significantly slower on ARM, even Apple's, despite how good those chips are. Or it could still be faster, but on ARM the gap might not be as big and the decision not as cut and dried as it seems now.

Would love to know what the actual case is from someone involved in LZFSE.

From the bits of testing I've done today, it's phenomenally fast on x86. Much better than gzip (and pigz for that matter) in every metric I think I generally care about: CPU Usage, Compression Speed, Decompression Speed, Compression Ratio.

On other architectures the picture gets a bit murkier: it seems to get handily beaten by pigz through, at first blush I'd guess, sheer parallelism. It's got solid performance, and is without a shadow of a doubt faster than vanilla gzip. If/as/when I get time, it'll be interesting to dig into why performance is worse there.

Dug in a bit further. On the non-x86 architecture I use, it looks like it's really just straight core performance that explains it. pigz's only advantage there really seems to be the brute force parallelism.

In particular note the huge difference in branches between gzip and zstd on decompress:

I'm not a C programmer, so understanding what happened is a bit beyond me, but:
1) to compile on Linux it needs the -pthread flag passed to it, and the Makefile is missing that (it compiles fine on OS X)
2) decompression over stdin appears to be effectively impossible; it still demands an input file. Compression over stdin works fine.

This is an awesome blog post that is very well written, but the lack of incompressible-data performance analysis prevents it from providing a complete overview of zstd.

Incompressible performance measurements are important for interactive/realtime workloads and the numbers are extremely interesting because they can differ dramatically from the average case measurements. LZ4 for instance has been measured at doing 10GB/sec on incompressible data on a single core of a modern Intel Xeon processor. At the other end of the spectrum is the worst case scenario for incompressible data where performance slows to a crawl. I do not recall any examples in this area, but the point is that it is possible for algorithms to have great average case performance and terrible worst case performance. Quick sort is probably the most famous example of that concept.

I have no reason to suspect that zstd has bad incompressible performance, but the omission of incompressible performance numbers is unfortunate.

A recent compression discussion I saw involved how compressors fare on incompressible input. For example, suppose you wanted to add compression to all your outbound network traffic. What would happen if compressible traffic were mixed with the incompressible kind? A common case would be sending HTML along with JPEG.

Good compressors can't squeeze any more out of a JPEG, but they can back off fast and go faster. Snappy was designed to do this, and even implementations of gzip do it too. It greatly reduces the fear of CPU overhead to always on compression. I wonder how Zstd handles such cases?
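I don't know zstd's actual behavior here, but a common back-off heuristic looks something like this (the thresholds are made up, and zlib stands in for whatever fast compressor is in use):

```python
# Back-off sketch: test-compress a small sample; if it barely shrinks,
# assume the data is incompressible and pass it through untouched.
import os
import zlib

def maybe_compress(data: bytes, sample_size: int = 4096,
                   min_ratio: float = 0.95) -> tuple[bool, bytes]:
    sample = data[:sample_size]
    if len(zlib.compress(sample, 1)) >= len(sample) * min_ratio:
        return False, data                  # looks incompressible: store raw
    return True, zlib.compress(data, 6)     # worth compressing fully

jpeg_like = os.urandom(65536)          # high-entropy stand-in for a JPEG body
html_like = b"<div class=x>" * 4096    # repetitive, compresses well
assert maybe_compress(jpeg_like)[0] is False
assert maybe_compress(html_like)[0] is True
```

The win is that the incompressible path costs only one small trial compression, not a full pass at the highest effort.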

If Facebook wanted zstd in the browsers, which would make sense for them to be able to reduce bandwidth and improve performance, the patent grant seems to make that impossible: Google would never put zstd in Chrome with such a clause.

I don't understand these things well, but if true this would be really bad. LZ4 is everywhere because it was completely free, and I think most of Zstandard's real work was done pre-Facebook by Yann alone. To have its hands tied now because of his job at Facebook would be the worst possible outcome.

I just read the Opus patent summary. It seems like if zstd followed the same license, using it wouldn't grant Facebook any license, but if I sue Facebook I lose the license to use zstd. Am I correct in that?

Facebook's license seems superior for those parties who wish to end software patents

It's mostly superior for Facebook. If you're a party that wants to end software patents, but intends to use zstd in any place you might want to interface with a company that doesn't hold the same position, then you're screwed.

If you want to end software patents, and Facebook sues with one (not applying to zstd), you're also still screwed. This is relevant because even if you're against software patents, you can take them out defensively. But this license makes that useless.

Compare to GPL vs LGPL, or how the free software codecs all eventually moved to BSD.

I think for typical JS/CSS/HTML sizes and decompression times, maximum compression ratio, followed by decompression speed, is what I'd look for. I don't care too much about compression speed: if I have to spend 1 minute compressing JS to crunch it by 10%, but I serve that file a million times, then as long as decompression doesn't negate the gain in network time saved, it's a win.
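Back-of-the-envelope version of that tradeoff (every number here is an illustrative assumption):

```python
# One extra minute of compression vs. network time saved over many serves.
size_mb = 1.0            # asset size (assumed)
extra_ratio = 0.10       # 10% smaller from the slower compressor (assumed)
serves = 1_000_000       # downloads over the asset's lifetime (assumed)
link_mb_per_s = 1.0      # per-client throughput (assumed)

compress_cost_s = 60.0   # one-time extra compression work
saved_transfer_s = serves * size_mb * extra_ratio / link_mb_per_s

# ~100,000 seconds of transfer saved dwarfs the 60 s spent compressing,
# as long as decompression overhead doesn't eat the difference.
assert saved_transfer_s > compress_cost_s
```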

I guess the other factor for mobile is, besides memory and decompression speed, how do various compression schemes fare battery wise?

Regarding decompression times, I think it's much more important to save transferred network data than to prioritize quicker decompression.

HTTP request times have a long tail; they can get really slow for the many, many people limited to slow connections. Decompression times are going to be much more consistent. Our aim here should be to improve the 10% slowest requests, and you do that by optimizing the actual transfer.

If Facebook hopes the new compression algorithm will become a standard, why doesn't it publish an IETF RFC draft? Will it follow OpenDNS's dnscrypt approach of open-sourcing the reference implementation without publishing any IETF RFC draft?

The following link points to a fairly good benchmark / tool that showcases the tradeoffs in real life: since (de)compression takes time, what is the fastest way to transmit data at a given transfer speed?

How difficult is this new standard going to be to implement in another language? It seems highly sophisticated, which is great, of course, but the cost of that is relying on giants like Facebook to maintain their One True Implementation. For software this is (usually) fine; for a new standard, it's a problem.

The format itself is documented (https://github.com/facebook/zstd/blob/master/zstd_compressio...) with the intention of other implementations and language bindings being readily available. We also have a zlib-compatible API for easier porting to applications already using Zlib. Our hope is that Zstandard is both easy to use and easy to contribute to.

The majority of the work was done by a single person, though. I think by the time Facebook joined, the majority of the design was done. Granted, he doesn't seem like a normal person, so maybe that doesn't matter.

But for compression, wouldn't the majority of languages use bindings rather than native implementations, for performance, making this moot?

I would assume the road to finding the optimal solution and understanding why it worked is much more complicated than the actual code. And a quick look doesn't suggest it's a very big code base for the library itself.

turbohf claims to be 4x faster than zlib's Huffman coding and 2x faster than FSE, and is a generic CPU implementation. Even if the claims are only partially true, if turbohf were a clean drop-in replacement for zlib and the licensing were friendly, the appeal of zstd would drop substantially in my book.

This isn't bikeshedding. Bikeshedding is about quibbling over unimportant details. Names are critically and absolutely important. Lots of great things have been hobbled or ruined by poorly-chosen names. A terrible name can cause something worthy to be ignored in favor of something inferior but with a better name. And you don't need to even be competent in the inner workings of a project to criticize its name or suggest better ones. The people who name cars aren't the same people who design the engine-control algorithms for them.

If you disagree, what do you think of naming your kid "Adolph Hitler [lastname]"? Most people agree that a name like that will cause great harm to a child growing up because of the constant ridicule and ostracization he'd inevitably face. That's an extreme example, but names are important.

Note that I don't think this project's name is horrible, but I don't think it's very good either, and could be a lot better.

I don't see the problem: only 3 members of that namespace are currently claimed (4 out of the 36-member alphanumeric namespace, if you count 7z), so we have room for 23 (or 32) more compression standards before running out. We've been using gzip for, what, 20 years now? Only recently have we gotten xz. At this rate, we won't run out of compression standards using this scheme for roughly 153 years. And after that, we could always start using capital Zs, like the old "compress" standard that used the .Z extension. Or we could go to a 3-letter extension ending in z, such as ".fbz", which gives you 676 more options and 4507 years. Considering that general-purpose data compression really hasn't moved much since the DEFLATE algorithm took over, and only recently saw any real change with the advent of LZMA (used in p7zip and xz), and perhaps this new zstd (too soon to tell), my time estimates here are probably too short.

There's no way to know what's going to be a common algorithm in the future unless you have a time machine. When DEFLATE was first invented, it wasn't common either, it was brand-new. Now it's everywhere. This new algorithm might become just as ubiquitous in 10 years, or it might turn into the next bzip2, or worse, the next ZOO.