Monday, August 17, 2009

Recently, I've been studying ways of reducing the download size of Javascript applications produced by Google Web Toolkit, while preserving or improving startup time. There are a number of ways to do this, the first of which is to transform Javascript code into a form that is naturally more succinct, while eliminating unused code, or deferring the load of some code until later.

Unfortunately, there is a tradeoff inherent in some of the fancier ways of making JS more succinct that use run-time code generation. This is put to good use in libraries like jQuery. For a large application, like Google Wave, this would increase startup time. Another technique is to teach a compiler even better optimizations for reducing redundant code, work which is ongoing in Google Web Toolkit.

Other attempts to reduce JS size have looked at ways to remove extraneous whitespace, shorten identifiers, and tokenize/pack JS statements, and while the former two are win-win one-way transforms, the latter has the trade-off requirement of being a reversible transform, meaning that it has to be decoded by an additional JS stage after load, a step that hurts startup performance.

Instead, I was drawn to a reversible transform the browser already includes support for: gzip compression, and decided to ask the question: what effect does the large-scale structure of the JS output code have on the DEFLATE algorithm of GZIP which is used to serve up compressed script? The answer it turns out, is substantial.

That leaves us with the built in GZIP compression that most web browsers include support for. The question is, can we improve compression while remaining compatible with the browser's decoder? There is an existing example of improving GZIP by injecting a reversible transform: bzip2 and the Burrows-Wheeler Transform.

Which brings up an interesting idea, can we do something like BWT's sort but for Javascript, in a way that doesn't require an extra pass to 'undo' the sorting?

Deflate: A digression

Before answering that question, it would be helpful to look at some of the restrictions of the deflate algorithm, and how code ordering could affect the outcome.

The deflate algorithm is a combination of two compression algorithms: LZ77 and Huffman coding. Huffman coding is a variable length code technique where symbols are replaced with codes based on their frequency of occurrence. So for example, the most common letter in the English alphabet is the letter e, so replacing e with a short code, but giving q, the least frequent character, a longer code, leads to shorter text overall. Huffman encoding is order independent, so that 'eq' is the same length as 'qe'.

LZ77 on the other hand, is a sliding window compression algorithm based on replacing strings with backwards references to previous strings in the input. For example, the string "this is a test" contains the substring 'is' repeated twice in a row, separated by a space, so that the second occurance of 'is' can be replaced with a length (2 characters, and a backwards distance (-3 positions), called the length-distance pair. The compressor typically scans backwards in the input within a certain window (e.g. 8,192 characters or 32,768 characters) looking for matches and then encoding them as length-distance pairs. The compressor has some freedom as to how hard it will search for a match before giving up (something I'll get to later).

One important effect of the sliding window limit is that if two Javascript functions with common substrings are separated by more than this distance, they cannot be matched.

But DEFLATE has another trick up its sleeve. It encodes the output of the LZ77 algorithm using Huffman encoding, but uses one Huffman tree for the character literals and length codes, and another Huffman tree for the backwards distances.

Which suggests another potential gain is to try and arrange for the backwards distances to be both small, and frequently the same, so as it produce shorter Huffman codes.

A sort with no undo

Thus far, intuition would tell us that if we could rearrange the input in order to bring more closely matched text closer together, we might be able to push up compression ratios, but how to do this without something to reverse the sort? Fortunately, unlike BWT, we are not working on plain text, but machine readable program code. We already know of something that rearranges code and moves it around, it's called a compiler!

We would not want a one-way text compression sort which say, brings Hamlet's prologue, climax, and epilogue together and randomly rearranges the rest, but your browser has no problem running your Javascript code if function foo is declared after bar, or before it. Thus, as long as two statements do not have order-dependencies, we can arrange them freely, in fact, all top-level function declarations can be rearranged arbitrarily.

Sort by Clustering

So, we've finally arrived at the point in which we have to devise our sort algorithm. What we want to do is, for any function Foo ensure that the best-match for this function in the whole program appears closest within the sliding window that the compressor will use. But bringing two functions that are most similar together in a greedy algorithm fashion won't necessarily produce the optimal result, as it is possible that by moving function Bar closer to function Foo, you've moved it away from lots of other functions that were good matches as well. As an example, consider these strings:

"Hello World""Hello World is a common test string used."...several thousand strings later including the phrase "hello world" ..."Common test strings used are metasyntactic variables like Foo and Bar.""Variables like Foo and Bar are very common."

If we selectively moved the second string close to the first, we might prevent it from matching the other good matches later, especially if the window fills up.

One idea I started to think about was to repurpose Document Clustering techniques towards code. Document Clustering is commonly used in information retrieval systems to find related documents. Typically, a document is encoded using some technique to measure word importance, such as representing each word by its term frequency inverse document frequency. Then, any two documents can be compared by some distance metric, for example, taking the tf-idf weightings of terms as a vector in N-space and computing the cosine between them.

In this case, we'd let each function be a separate document, and the entire program be like the corpus of documents. We'd then choose some encoding to weigh Javascript grammar nodes by importance in a way that would produce good LZ77 matches, and then proceed in a bottom-up clustering fashion. First, we'd construct all the pairs of functions which match best. Pick a function, pair it with its best match, call that Cluster 1. Pick another function, pair it with its best match, call that Cluster 2, and so on. After this procedure is done, pick a Cluster, and find its nearest Cluster (according to some metric) and pair them up in a Cluster of 4 functions. After that's done, pair up 4-Clusters into Clusters of 8, and so on, until the final cluster encompasses the whole program.

What's a good metric?

Ideally, a good metric for comparing two functions would take into account the way the GZIP compressor searches for matches and encodes them. It is an interesting theoretical question, but for practical implementation purposes, I needed something that performs reasonably well, now. One algorithm that is pretty good at finding strings of rough similarity, even in the presence of noise, is the dynamic programming edit-distance algorithm. It's deployed widely in one variant or another in the bioinformatics industry for gene sequence alignment (Smith-Waterman, HMMR, etc), but the version commonly used for general CS work is the Levenshtein Distance.

Results

Taking Levenshtein Distance as my metric, I produced a greedy variant as a patch to the GWT compiler. The greedy variant does not implement bottom-up clustering, but instead, sorts all functions by length first (suggested by GWT team member Lex Spoon), and then performs a linear scan over the sorted functions, picking the best match each time to the previous output. The source code is here. Remarkably, even this simple algorithm produces nice gains? How much? Well, if you do nothing else, it produces a 5-7% gain in GWT's Showcase application when compressed with gzip -9. But there's more that we can do!

Optimizing GWT's Obfuscation

When the GWT compiler is executed in obfuscated mode, it renames every single Javascript identifier in the whole program except for foreign Javascript. Ideally, you want the shortest id possible. Up until recently, GWT limited the first character of an identifier to an alphabet of 32-characters, and for strings of length 2 or more, it used base-64 characters. However, due to a clever patch by GWT community member Andriasyan (Issue #2448), the first character can actually be chosen from a base-54 alphabet. This has the effect of shrinking output size by up to 1.75% prior to compression.

We're not done yet! The GWT compiler has other tricks up its sleeve. It performs the renaming from the bottom-most scopes upwards, letting each scope reuse variable identifiers as they become free. However, it unfortunately did not insure that identifiers were picked in a stable order. Thus, a function of 3 variables could be declared as function foo(a,b,c){ or as function foo(b,c,a){. Obviously, this would lead to suboptimal compression since every function of 3 variables should have the same suffix (a,b,c){. The effect of making obfuscated identifier allocation have a stable sort order combined with the base-54 patch produces an incredible gain of 10.5% when compressed with gzip -9.

Choosing a different GZIP implementation

The deflate algorithm actually gives some leeway to the compressor implementor in terms of how matches are found. ZLIB's implementation on which GZIP is based is actually not the best implementation, although it might be the best patent unencumbered one. Rather, the inventor of the LZMA algorithm has his own DEFLATE implementation in his 7-zip utility, which produces 4% better output than gzip by my estimates.

Combining base-54/base-64 obfuscated identifier encoding, stable sort-order for identifier allocation, my greedy clustering-by-edit-distance sort algorithm, and 7-zip as a gzip-compatible compressor, yields an incredible 21% reduction of the Showcase application. On a large 500k Javascript application, this means an additional 100k bandwidth is saved, with no performance penalty!

Conclusion

A general purpose technique (Cromwell Clustering Transform? (CCT) :)) for compilers to rearrange code for compression efficiency (vs say, cache locality) has been presented, which achieves non-trivial compression efficiency gains in Javascript output from the GWT compiler. Some of these techniques can also be applied to hand-written Javascript as well and included in 3rd party JS minification utilities.

Addendum

From reading a description of the algorithm, it may be hard to visualize. Here is sample output from GWT's Showcase application:

@Bart, There is a pure Java LZMA provided by the 7-zip author, but it won't be compatible with GZIP in browsers. However, the 7-zip utility does contain a GZIP/DEFLATE utility that produces better compression than gzip. There is also something out there called "gzip 1.2.4-hack" IIRC, that implements improvements to the gzip-compressor.

@Tom, Yes, it will be in GWT 2.0.

@nFriedly, Not a typo, BWT is the "Burrows Wheeler Transform" used as part of the BZIP2 compressor.

@Lee, Interesting question. Something to look at. GWT does have a CSS optimizer in ClientBundle.

@zachleat, YUI Compressor is a minifier and obfuscator. Like GWT has long done, it strips extraneous whitespace and other characters, while obfuscating identifiers. It does not however, introduce code-motion to increase compression efficiency.

If you compare GWT output to the original source like these JS tools do, the compression ratio can be as high as 100 to 1 (Not a typo). See my GWT Extreme/GWTQuery presentation for an example of reducing 30k of code to less than 1,000 bytes. That's because GWT can obfuscate global symbols as well as prune all fields, method, and objects which are never referenced, as well as inlining and removing those that are referenced.

Wow, I have vague dreams about how GWT could have some kind of UBEROBF mode that recognizes repeated process chains and somehow figures out how to build prototypes that reuse a finite set of generic data processing algorithms by standardizing parameter / filed names...

@Lee: I just recently put together a small Java application that will attempt to optimise CSS files for gzip compression. Although it doesn't reorder the CSS rules (as doing so could have many unwanted side effects without knowledge of the DOM), it does ensure that the order of properties and their values is consistent throughout the document.