CMIX

cmix is a lossless data compression program aimed at optimizing compression ratio at the cost of high CPU/memory usage. It achieves state-of-the-art results on several compression benchmarks. cmix is free software distributed under the GNU General Public License.

cmix runs on Linux, Windows, and Mac OS X. At least 32GB of RAM is recommended to run cmix. Feel free to contact me at byron@byronknoll.com if you have any questions.

enwik8

Some language modeling benchmarks use enwik8 split into three sets: the first 90% for training, the next 5% for validation, and the last 5% for testing. Models are usually trained using multiple passes over the training set. This is not a standard way of benchmarking compression programs, but the performance of cmix can still be measured using this setup:

File                         Original size (bytes)   Compressed size (bytes)   Cross entropy
enwik8                       100000000               14955482                  1.1964
training set                 90000000                13548217                  1.2043
test set (no training)       5000000                 835351                    1.3366
test set (after training)    5000000                 693239                    1.1092

It was necessary to make a small change to the cmix source code in order to compute "test set (after training)". The code was modified to compress the test set after making a single pass through the training data.
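
The cross-entropy column can be reproduced directly from the sizes in the table: cross entropy in bits per byte is 8 × (compressed size) / (original size).

```python
# Cross entropy (bits per byte) from the sizes in the table above.
rows = {
    "enwik8": (100000000, 14955482),
    "training set": (90000000, 13548217),
    "test set (no training)": (5000000, 835351),
    "test set (after training)": (5000000, 693239),
}
for name, (orig, comp) in rows.items():
    print(f"{name}: {8 * comp / orig:.4f}")
# enwik8: 1.1964
# training set: 1.2043
# test set (no training): 1.3366
# test set (after training): 1.1092
```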

The preprocessing stage transforms the input data into a form which is more easily compressible. This data is then compressed using a single pass, one bit at a time. cmix generates a probabilistic prediction for each bit and the probability is encoded using arithmetic coding.
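
The coding step can be illustrated with an exact-arithmetic sketch: an arithmetic coder keeps an interval [low, high) and narrows it by each bit's predicted probability, so the final code length approaches the sum of -log2(p) over the probabilities assigned to the bits that actually occurred. (Toy example with made-up probabilities; a real coder uses fixed-precision integers and emits bits incrementally.)

```python
from fractions import Fraction
import math

def encode(bits, probs):
    """Arithmetic coding sketched with exact fractions: the interval
    [low, high) is narrowed by each bit's predicted probability."""
    low, high = Fraction(0), Fraction(1)
    for bit, p1 in zip(bits, probs):
        split = low + (high - low) * (1 - p1)  # lower part codes a 0
        if bit:
            low = split
        else:
            high = split
    return low, high

bits  = [1, 0, 1, 1, 0]
probs = [Fraction(9, 10), Fraction(1, 5), Fraction(4, 5),
         Fraction(7, 10), Fraction(1, 10)]  # p(bit == 1) per position
low, high = encode(bits, probs)

# The final interval width is the product of the probabilities the model
# assigned to the bits that actually occurred, so the code length is
# -log2(width) = sum of -log2(p(bit)) -- about 1.46 bits here.
print(math.log2(1 / float(high - low)))
```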

cmix uses an ensemble of independent models to predict the probability of each bit in the input stream. The model predictions are combined into a single probability using a context mixing algorithm. The output of the context mixer is refined using an algorithm called secondary symbol estimation (SSE).
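
The combination step can be sketched as logistic mixing: stretch each model's probability with the logit, take a weighted sum, and squash the result back to a probability. Below is a minimal illustration (the learning rate and initial weights are made up, and SSE, which refines the mixed probability with an adaptive table, is omitted; this is not cmix's actual mixer).

```python
import math

def stretch(p):
    # logit: map a probability in (0, 1) onto the real line
    return math.log(p / (1 - p))

def squash(x):
    # inverse of stretch (the logistic function)
    return 1 / (1 + math.exp(-x))

def mix(probs, weights):
    # weighted sum in the logit domain, squashed back to a probability
    return squash(sum(w * stretch(p) for w, p in zip(weights, probs)))

def update(probs, weights, bit, lr=0.02):
    # online gradient step on cross entropy: each weight moves in
    # proportion to the prediction error and its model's stretched input
    p = mix(probs, weights)
    for i, pi in enumerate(probs):
        weights[i] += lr * (bit - p) * stretch(pi)
    return p

# three hypothetical model outputs for one bit that turned out to be 1
probs = [0.7, 0.9, 0.4]
weights = [0.3, 0.3, 0.3]
p = update(probs, weights, bit=1)
```

After the update, the weight of the model that predicted the bit confidently (0.9) grows, while the weight of the model that predicted wrongly (0.4) shrinks.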

Architecture

Preprocessing

cmix applies a transformation to three types of data:

Binary executables

Natural language text

Images

The preprocessor uses separate components for detecting the type of data and actually doing the transformation.

For images and binary executables, I used code for detection and transformation from the open source paq8pxd program.

I wrote my own code for detecting natural language text. For transforming the text, I used code from the open source paq8hp12any program. This uses an English dictionary and a word replacing transform. The dictionary is 463,903 bytes.

As seen on the Silesia benchmark, additional preprocessing using the precomp program can improve cmix compression on some files.

Model Prediction

cmix v16 uses a total of 2,011 independent models. There are a variety of different types of models, some specialized for certain types of data such as text, executables, or images. For each bit of input data, each model outputs a single floating point number, representing the probability that the next bit of data will be a 1. The majority of the models come from other open source compression programs: paq8l, paq8pxd, and paq8hp12any.

Every neuron in the context-mixing network directly tries to minimize cross entropy, so there is no backpropagation of gradients between layers.

The inputs to each neuron (values between 0 and 1) are transformed using the logit function.
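
Why no backpropagation is needed follows from the per-neuron gradient: for output p = squash(sum of w_i * x_i) and target bit y, the cross-entropy gradient is simply (p - y) * x_i, so each neuron can update its own weights directly. A small numeric check (hypothetical weights and inputs):

```python
import math

def squash(x):
    return 1 / (1 + math.exp(-x))

def loss(w, x, y):
    # cross entropy of a single neuron's prediction against target bit y
    p = squash(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

w = [0.3, -0.2, 0.5]   # hypothetical weights
x = [1.2, -0.7, 0.4]   # hypothetical logit-domain inputs
y = 1                  # the bit that actually occurred

p = squash(sum(wi * xi for wi, xi in zip(w, x)))
grad = [(p - y) * xi for xi in x]   # analytic gradient: dL/dw_i = (p - y) * x_i

# verify the first component by central finite difference
eps = 1e-6
num = (loss([w[0] + eps, w[1], w[2]], x, y)
       - loss([w[0] - eps, w[1], w[2]], x, y)) / (2 * eps)
print(abs(num - grad[0]) < 1e-6)  # True
```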

Only a small subset of neurons is activated for each prediction. The activations are based on manually defined contexts (i.e. functions of the recent input history). One neuron is activated for each context. The context-dependent activations improve prediction and reduce computational complexity.

Instead of using a global learning rate, each context set has its own learning rate parameter. There is also learning rate decay.
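
The last two ideas can be sketched together: each context value selects its own weight vector, and each context keeps its own learning rate, which decays as that neuron sees more updates. (The parameter values and decay schedule below are hypothetical, not cmix's.)

```python
import math

def stretch(p):
    return math.log(p / (1 - p))

def squash(x):
    return 1 / (1 + math.exp(-x))

class GatedMixer:
    """Sketch of context-gated mixing: one weight vector ("neuron") per
    context value, with a per-context learning rate that decays with the
    number of updates that context has received."""
    def __init__(self, base_lr=0.05, decay=0.001):
        self.weights = {}   # context value -> weight vector
        self.counts = {}    # context value -> number of updates seen
        self.base_lr = base_lr
        self.decay = decay

    def predict(self, ctx, probs):
        w = self.weights.setdefault(ctx, [0.2] * len(probs))
        return squash(sum(wi * stretch(p) for wi, p in zip(w, probs)))

    def update(self, ctx, probs, bit):
        p = self.predict(ctx, probs)
        n = self.counts.get(ctx, 0)
        lr = self.base_lr / (1 + self.decay * n)   # learning rate decay
        w = self.weights[ctx]
        for i, pi in enumerate(probs):
            w[i] += lr * (bit - p) * stretch(pi)
        self.counts[ctx] = n + 1
        return p

mixer = GatedMixer()
probs = [0.9, 0.5]   # model 0 is confident, model 1 is neutral
before = mixer.predict("order1", probs)
for _ in range(50):
    mixer.update("order1", probs, bit=1)
after = mixer.predict("order1", probs)
```

Only the weight vector for the active context ("order1" here) is touched on each update, which is what keeps the per-prediction cost low.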