In this chapter, we are careful to distinguish between three kinds of software:

"data compression software", the implementation of algorithms designed to compress and decompress various kinds of files -- this is what we referred to as simply "the software" in previous chapters.

"executable files" -- files on the disk that are not merely some kind of data to be viewed or edited by some other program, but are the programs themselves -- text editors, word processors, image viewers, music players, web browsers, compilers, interpreters, etc.

"source code" -- human-readable files on the disk that are intended to be fed into a compiler (to produce executable files) or fed into an interpreter.

Executable files are, in some ways, similar to English text -- and source code is even more similar -- and so data compression software that was designed and intended to be used on English text files often works fairly well with executable files and source code.

However, there are some techniques and concepts that apply to compressing executable software but don't really make sense for any other kind of file, such as:

some kinds of code size reduction may make a subroutine, when measured in isolation, appear to run slower, yet improve the net performance of the whole system. Conversely, loop unrolling, inlining, "complete jump tables" vs "sparse tests", and static linking all may make a subroutine appear -- when measured in isolation -- to run faster, but may increase instruction cache misses, TLB cache misses, and virtual memory paging activity enough to reduce the net performance of the whole system.[5]

various technologies for reducing disk storage, such as storing only the source code (possibly compressed) and a just-in-time in-memory compiler like Wikipedia: Tiny C Compiler (or an interpreter), rather than storing only the native executable or both the source code and the executable.

Selecting a machine language or higher-level language with high code density[16]

various ways to "optimize for space" (including the "-Os" compiler option)[17]
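The space-versus-speed tradeoff in the first item above can be sketched in a few lines. This is a toy illustration in Python, not a real compiler transform: the same reduction is written as a compact loop and as a 4-way unrolled loop. In a compiled language the unrolled version's machine code would be roughly four times larger, which is exactly the kind of growth that "-Os" avoids.

```python
def checksum_rolled(data):
    """Compact version: small code, one loop test per byte."""
    total = 0
    for b in data:
        total = (total + b) & 0xFFFF
    return total

def checksum_unrolled(data):
    """4-way unrolled version: more code, one loop test per 4 bytes."""
    total = 0
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        # four additions per loop test -- faster in isolation,
        # but a larger body competing for instruction cache
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
    for i in range(n, len(data)):   # leftover tail bytes
        total += data[i]
    return total & 0xFFFF
```

Both functions compute the same 16-bit checksum; an "optimize for space" build would prefer the first form, even if a microbenchmark of this one routine favors the second.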

Some very preliminary early experiments[22] give the surprising result that compressed high-level source code is about the same size as compressed executable machine code, but compressing a partially-compiled intermediate representation gives a larger file than either one.
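A toy analogue of that experiment can be run with standard Python tools: compress a program's source text and a compiled form of it with the same general-purpose compressor and compare sizes. Here CPython bytecode stands in for the compiled representation (the cited experiment used real machine code and a real intermediate representation), so this sketch only illustrates the method, not the result.

```python
import types
import zlib

# Some source text, repeated so the compressor has something to work with.
source = b"""
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
""" * 20

code_obj = compile(source.decode(), "<toy>", "exec")
# Collect the module's bytecode plus the bytecode of any nested code objects.
bytecode = code_obj.co_code + b"".join(
    c.co_code for c in code_obj.co_consts if isinstance(c, types.CodeType))

packed_src = zlib.compress(source, 9)
packed_byte = zlib.compress(bytecode, 9)
print("source:  ", len(source), "->", len(packed_src), "bytes")
print("bytecode:", len(bytecode), "->", len(packed_byte), "bytes")
```

For a snippet this small the numbers are dominated by noise; a serious replication would use whole programs and a stronger compressor.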

Many data compression algorithms "filter", "preprocess", or "decorrelate" the raw data before feeding it into an entropy coder. Filters for images and video typically have a geometric interpretation. Filters specialized for executable software include:

Instead of decompressing the latest version of an application in isolation, starting from nothing, start from the old version of an application, and patch it up until it is identical to the new latest version. That enables much smaller update files that contain only the patches -- the differences between the old version of an application and the latest version. This can be seen as a very specific kind of data differencing.

The algorithm used by BSDiff 4 uses suffix sorting to build relatively short patch files.[23]

Colin Percival, for his doctoral thesis, has developed an even more sophisticated algorithm for building short patch files for executable files.[23]

"disassemble" the code, converting all absolute addresses and offsets into symbols; then patch the disassembled code; then "reassemble" the patched code. This makes the compressed update files for converting the old version of an application to the new version of an application much smaller.[24]
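The delta-update idea described above can be sketched with the standard library alone. This is a minimal illustration, not BSDiff: it uses difflib's matching (rather than suffix sorting) to build a patch of "copy from the old file" and "insert literal bytes" operations, so only the differences travel over the wire.

```python
import difflib

def make_patch(old: bytes, new: bytes):
    """Return a list of ('copy', i1, i2) / ('insert', literal) operations."""
    sm = difflib.SequenceMatcher(None, old, new, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse bytes from the old file
        elif j2 > j1:                           # 'replace' or 'insert'
            ops.append(("insert", new[j1:j2]))  # ship only the new bytes
        # 'delete': old bytes simply not copied
    return ops

def apply_patch(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

# Two versions of a "file" that differ only in a small region.
old = b"A" * 1000 + b"call 0x4010; ret" + b"B" * 1000
new = b"A" * 1000 + b"call 0x5020; nop; ret" + b"B" * 1000

patch = make_patch(old, new)
literal = sum(len(op[1]) for op in patch if op[0] == "insert")
assert apply_patch(old, patch) == new
print("new file:", len(new), "bytes; literal bytes in patch:", literal)
```

Real tools go further: bsdiff finds approximate matches via suffix sorting, and the resulting patch is itself compressed before distribution.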
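The "disassemble, patch, reassemble" filter can likewise be sketched in miniature. The instruction format below is invented for illustration: each instruction is an (opcode, absolute-target) pair. Symbolizing replaces absolute targets with labels, so inserting one instruction no longer shifts every address in the file -- which is what makes a byte-level diff of raw executables so large.

```python
def symbolize(program):
    """program: list of (op, target_index_or_None) using absolute targets.
    Returns (label_or_None, op, target_label_or_None) triples."""
    targets = {addr for _, addr in program if addr is not None}
    out = []
    for i, (op, addr) in enumerate(program):
        label = "L%d" % i if i in targets else None
        tgt = "L%d" % addr if addr is not None else None
        out.append((label, op, tgt))
    return out

def assemble(symbolic):
    """Recompute absolute targets from label positions."""
    addr_of = {lab: i for i, (lab, _, _) in enumerate(symbolic) if lab}
    return [(op, addr_of[tgt] if tgt else None) for _, op, tgt in symbolic]

prog = [("nop", None), ("jmp", 4), ("nop", None), ("nop", None),
        ("call", 6), ("ret", None), ("ret", None)]

sym = symbolize(prog)
assert assemble(sym) == prog            # round-trip is lossless

# Patch the symbolic form: insert one instruction near the top.
new_sym = sym[:2] + [(None, "nop", None)] + sym[2:]
new_prog = assemble(new_sym)
# Every absolute target after the insertion point moved (4 -> 5, 6 -> 7),
# yet the symbolic diff is a single inserted line.
print(new_prog)
```

A diff of the two symbolic programs is one line; a diff of the two assembled programs also touches every instruction whose target shifted, which is why symbolizing before diffing shrinks update files.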

Several programmers believe that a hastily written program will be at least 10 times as large as it "needs" to be. [25][26]

A few programmers believe that 64 lines of source code is more than adequate for many useful tools.[27]

The aim of the STEPS project is "to reduce the amount of code needed to make systems by a factor of 100, 1000, 10,000, or more." [28]

Most other applications of compression -- and even most of these executable compression techniques -- are intended to give results that appear the same to human users, while improving things in the "back end" that most users don't notice. However, some of these program compression ideas (refactoring, shared libraries, using higher-level languages, using domain-specific languages, etc.) reduce the amount of source that a human must read to understand a program, resulting in a significantly different experience for some people (programmers). That time savings can lead to significant cost reduction.[29] Such "compressed" source code is arguably better than the original -- in contrast to image compression and other fields, where compression gives, at best, something identical to the original, and often something worse.

↑ "Java: Trees Versus Bytes", thesis by Kade Hansson. "A Compact Representation For Trees ... is the most dense executable form that he knows of. ... The creation of compact tree representations ... achieving better compression and developing faster codecs will no doubt be a growth area of research in coming years. One such project in this area may lead to an eventual replacement for the Java class file format."