The Enduring Challenge of Compressing Random Data

By Mark Nelson, Guest Editor, November 06, 2012

Not all files are compressible. The challenges are figuring out how small you can possibly make random data and how simple the algorithm can be.

Ten years ago, I issued a simple challenge to the compression community: Reduce the size of roughly half a megabyte of random data — by as little as one byte — and be the first to actually have a legitimate claim to this accomplishment.

Ten years later, my challenge is still unmet. After making a small cake and blowing out the candles, I thought this would be a good time to revisit this most venerable quest for coders who think outside the box.

Some History

In George Dyson's great book on the early history of electronic computers, Turing's Cathedral, he describes how much of the impetus for computation in the 40s and 50s was from the US military's urge to design better fission and fusion bombs. A powerful technique used in this design work was the Monte Carlo method, which relied on streams of random numbers to drive simulations.

The problem then, as now, is that coming up with random numbers is not always an easy task. John von Neumann was intimately involved in all this, and is famously quoted as having said:

Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.

The numbers in question are the RAND Corporation's published table of a million random digits. Since my tax dollars paid for those numbers, I thought it only fair to make them the basis of my challenge. I took the decimal digits, converted them to a base-two number, stored it in AMillionRandomDigits.bin, and challenged the world to compress it. (The original USENET post was followed by a more findable blog posting that contains a bit more detail.)
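As a rough illustration of that conversion step, here is a minimal Python sketch. The function names, the big-endian byte order, and the zero-padding on the way back are assumptions made for the example; the article does not say exactly how AMillionRandomDigits.bin was packed.

```python
import sys

# Python 3.11+ caps int<->str conversions at 4300 digits by default; lift the
# cap so a million-digit string can be parsed. Older versions lack this hook.
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)


def digits_to_bytes(digits: str) -> bytes:
    """Treat a string of decimal digits as one integer and pack it big-endian.

    Big-endian order is an assumption for this sketch.
    """
    value = int(digits)
    length = max(1, (value.bit_length() + 7) // 8)  # at least one byte, even for 0
    return value.to_bytes(length, "big")


def bytes_to_digits(blob: bytes, digit_count: int) -> str:
    """Inverse of digits_to_bytes: unpack, then restore any leading zeros."""
    return str(int.from_bytes(blob, "big")).zfill(digit_count)


# Hypothetical ten-digit input, not taken from the RAND table.
packed = digits_to_bytes("0123456789")
restored = bytes_to_digits(packed, 10)
```

Each decimal digit carries about 3.32 bits, which is why a million digits land in the neighborhood of 415,000 bytes rather than a million.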

Ten years later, there have been no serious entrants, although there are a few dedicated souls who are continuing to attack the problem. Unfortunately for all who are working on it, it seems that those RAND scientists back in the 50s did a really, really good job of scrubbing those numbers.

The 2012 Edition

For a few different reasons, it is time to close down the older versions of the contest and issue an updated 2012 challenge. Nothing much has changed, but over the years, I've bumped into a few points of confusion that I can clear up; and in addition, I would like to slightly widen the contest's scope. Most importantly, the comments section of the previous contest is way out of hand, and issuing an update lets me close that stream down and open a new one.

For the 2012 edition, I am actually giving entrants two possible ways to win the prize. The first challenge is essentially a reprise of the original, with minor tweaks and updates. The second poses a more difficult problem that is a superset of the first.

Meeting either challenge brings you worldwide fame, a cash prize of $100, and the knowledge that you have defeated a problem that many said was untouchable.

Likewise, both problems are governed by one meta-rule, which seeks to implement an overarching principle: The point of this is to win algorithmically, not to game the contest. I will disqualify any entry that wins through means such as hiding data in filenames, environment variables, kernel buffers, or whatever. Someone can always find a way to win with monkey business like this, but that is beside the point of the contest. No hiding data.

Challenge Version 1: A Kolmogorov Compressor

The original version of the challenge is basically unchanged. Your goal is to find the shortest program possible that will produce the million random-digit file. In other words, demonstrate that its Kolmogorov complexity is less than its size. So the heart of Challenge 1 is the question of whether the file is compressible à la Kolmogorov and standard, general-purpose computing machines.

The interesting part about this challenge is that it is only very likely impossible. Turing, and Gödel before him, made sure that we can't state with any certainty that there is no program of size less than 415,241 bytes that will produce the file. All it takes is a lucky strike. Maybe the digits are a prime? Maybe they just happen to be nested in the expansion of some transcendental number? Or better yet, maybe the RANDians overlooked some redundancy, hidden somewhere in a fifth order curve, just waiting to be fit. There is no telling how many different ways you could hit the jackpot.

However, the dismal logic of The Counting Argument tells us that there are always going to be some files of size 415,241 bytes that are not compressible by the rules of Challenge 1. And of course, it's actually much worse than that — when you cast a critical eye on the task, it turns out that nearly all files are incompressible. But for a given file, we don't really have any way of proving incompressibility.
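The counting argument can be made concrete with a few lines of exact integer arithmetic. This is only a sketch: the function names are invented for the example, and it assumes a fixed decoder, so that each shorter byte string can stand for at most one original file.

```python
from fractions import Fraction


def files_shorter_than(n_bytes: int) -> int:
    """Number of distinct byte strings with length 0 .. n_bytes - 1."""
    # Geometric series: 1 + 256 + 256**2 + ... + 256**(n_bytes - 1)
    return (256 ** n_bytes - 1) // 255


def max_fraction_compressible(n_bytes: int, saved_bytes: int) -> Fraction:
    """Upper bound on the share of n-byte files that can shrink by at least
    saved_bytes, given that every output must decode to a unique input."""
    available_outputs = files_shorter_than(n_bytes - saved_bytes + 1)
    return Fraction(available_outputs, 256 ** n_bytes)


# There are fewer short strings than 4-byte files, so some file must lose out.
some_file_is_incompressible = files_shorter_than(4) < 256 ** 4

# Share of 4-byte files that could possibly shrink by one byte or more.
bound = max_fraction_compressible(4, 1)
```

For any file size, the bound for saving at least one byte stays just under 1/255, and each additional byte saved divides it by roughly 256. That is the sense in which nearly all files are incompressible.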

Rulings

I want everyone to have the best chance possible to win this challenge. The basic rule is that your program file, possibly combined with a data file, must be less than 415,241 bytes. Clarifications on questions that have come up in the past include:

Programs written in C or some other compiled language can be measured by the length of their source, not their compiled product.

Using external programs to strip comments, rename variables, etc. is all fine. It would be nice to have the unmangled source available as your input, with the mangling step part of the submission process.

Programs that link to standard libraries included with the language don't have to include the length of those libraries against their total. Hiding data in standard libraries is, of course, not allowed. (And don't even think of hiding it in the kernel!)

Source code can be submitted in a compressed container of your choice, and I will only count the bytes used in the container against you.

Likewise, any data files can be submitted in a compressed container, and I will only count the bytes used in the container against you.

You own the code, and just because you win, I don't have the right to publish it. If you insist on an NDA, I may be willing to comply.

In general, you need to submit source code that I can build and execute in a relatively standard VM. If you are paranoid and insist on binaries only, we might be able to come to terms, but no guarantees.

The nature of this contest is such that gaming the rules is pointless. You aren't entering in a quest to beat the rules, you are entering in a quest to beat the data.

Your program might take a long, long time to run, but we will have to draw the line somewhere.

If there is anyone who deserves to beat this file, it is Ernst Berg, who has been relentlessly attacking it from various directions for years now. Ernst doesn't seem to be doing this to feed his ego — he'll share his results with all comers, and is always willing to listen to someone's new approach. I consider him to be the unofficial Sergeant-at-Arms of this enterprise.

But Ernst will also be the first to tell you what a harsh mistress the file can be — always taking, but never giving.

Challenge Version 2: A General-Purpose Random Compressor

Challenge 1 is interesting because it is nearly, but not assuredly, impossible. Challenge 2 is more along the lines of troll bait, because it is patently impossible: Create a system to compress and then decompress any file of size 415,241 bytes. In other words, create a compressed file that is smaller than the input, then use only that compressed file to restore the original data.

Unlike Challenge 1, there are no size limitations on Challenge 2. Your compressor and decompressor can be as large as you like. Because the programs have to be able to handle any input data, size is of no particular advantage — there is no data to hide.

This challenge is for the contestant who is sure that he or she has figured out a way to compress the million-digit file, but finds that the program takes 100 MB of space. That's fine: we will first see whether it can compress the file. It must then correctly compress and decompress other files of the same size, say 1,000 different ones.

To keep it simple, the files will be simple permutations of the million digit file — scrambled with an encryption system, or perhaps a random number generator, or whatever. The point is, they should have all the same numeric characteristics as the original file, just organized slightly differently.
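A test harness for this procedure might look something like the sketch below. Everything in it is an assumption made for illustration: the function names, the use of a seeded shuffle as the scrambling step, and zlib as a stand-in for a contestant's compressor and decompressor. It is not the actual judging code.

```python
import random
import zlib
from typing import Callable


def scrambled_copy(data: bytes, seed: int) -> bytes:
    """Return the same bytes in a seeded pseudo-random order."""
    shuffled = bytearray(data)
    random.Random(seed).shuffle(shuffled)
    return bytes(shuffled)


def count_wins(
    compress: Callable[[bytes], bytes],
    decompress: Callable[[bytes], bytes],
    original: bytes,
    trials: int,
) -> int:
    """Count the scrambled files that come out smaller and round-trip intact."""
    wins = 0
    for seed in range(trials):
        sample = scrambled_copy(original, seed)
        packed = compress(sample)
        if decompress(packed) != sample:
            raise ValueError(f"round trip failed for seed {seed}")
        if len(packed) < len(sample):
            wins += 1
    return wins


# Stand-in input: 4,096 pseudo-random bytes rather than the real 415,241-byte file.
rng = random.Random(2012)
stand_in = bytes(rng.getrandbits(8) for _ in range(4096))
wins = count_wins(zlib.compress, zlib.decompress, stand_in, trials=20)
```

To qualify under the terms described here, an entry would need a win on every one of the 1,000 files, not just on some of them. A general-purpose compressor such as zlib typically makes random input slightly larger, because its framing overhead outweighs any accidental savings.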

Again, I will emphasize that Challenge 2 is at its heart provably impossible to beat. No program can compress all files of a given size, and the chances of any program being able to compress 1,000 different files of this length are so vanishingly small that they can safely be ruled out, even if every computer on earth were working on it from now until we are engulfed in flames during Sol's red giant phase.

Conclusion

It seems unlikely that there will be any developments in lossless compression that change the terms of this contest. No doubt I'll reissue the challenge in another ten years or so, but if it is beatable, the tools are already at hand. Good luck to those of you who are tackling the challenge. Beating this will not get you the fame associated with something like the Clay Millennium Prizes, but in the small world of data compression, you will sit alone on a throne of your own making, and deservedly so.
