to the Super User community. Well, this seemingly simple question sparked quite a discussion, so I have collated all the comments and answers and come up with what I hope is a definitive answer, along with the why and wherefore of it all.

In order to not only answer this question, but also to understand the answer, we first need to know how both compression and encryption work.

Let’s start with compression.

We’ve all come across compression in some form or other, but what actually is compression, and how does it work?

Well, the actual meaning of compression in the English language is:

… the result of the subjection of a material to compressive stress, which results in reduction of volume as compared to an uncompressed but otherwise identical state.

In other words, squashing something to make it smaller. And this can be applied to data as well as physical objects. No, data compression doesn’t mean physically pushing the magnetic particles in your hard drive closer together (though that would be a neat trick); it means changing the data so that it takes up less space but can still be interpreted in the same way.

Compression can be loosely grouped into two major categories: Lossless and Lossy.

Lossless compression is any compression system where, if you take a file and compress it, then decompress it, you get exactly the same file out. So, if you compressed the phrase “The quick brown fox jumped over the lazy dog” and then decompressed the result, you’d end up with “The quick brown fox jumped over the lazy dog”.
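If you’d like to see that round trip for yourself, here is a minimal sketch using Python’s standard zlib module. The choice of zlib is mine, purely for illustration; any lossless compressor behaves the same way.

```python
import zlib

original = b"The quick brown fox jumped over the lazy dog"

# Compress, then decompress again.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the round trip gives back exactly the bytes we started with.
print(restored == original)
```

Don’t expect a phrase this short to shrink much, by the way: there are hardly any repeated patterns in it for the compressor to work with.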

Lossy compression, on the other hand, doesn’t give you back exactly what you put in, but instead returns an approximation of it. For example, if we were to compress the quick brown fox phrase with lossy compression and then decompress it again we might end up with something like: “Quick fox jumped over lazy dog” – the meaning of the phrase is still the same, but it’s not the exact same phrase.

So what would be the point of lossy compression then? It sounds pretty useless, doesn’t it? Not so! Are you listening to some music at the moment? Is it an MP3? You’re listening to music compressed with lossy compression. There’s an awful lot of extra information in a piece of music that you don’t hear (without very good equipment and sensitive hearing), so why keep it? And the same with pictures: JPEG images are compressed in a lossy way (examine them closely and you can see what are known as “compression artifacts” – subtle blockiness and speckles that you only notice if you look closely).

Throwing away detail that you won’t miss means that you can compress highly complex data to a much smaller size.

But how does compression work?

Well, there are many, many different ways of performing compression, and the mathematics behind some of them is very complex, so we won’t go into it here. The simplest form of compression, though, called “Run Length Encoding”, is very easy to understand and demonstrate, so I’ll give you a quick overview.

(This is just a convenient string of numbers for showing you how this works)

Now, you notice you have groups of numbers there. In Run Length Encoding those groups of numbers are simply replaced with one number, plus a count of how many there should be:

[5]1[2]3[4]4[4]3[3]87[3]879[8]32

Already it looks shorter, and the data can be interpreted just as before. It can be taken a step further though – you see that in the middle you have [3]87[3]87? Well, that’s also a repeated phrase – not of the same number, but of the same sequence of numbers. So why not write that bit as two runs of three 8s and a 7? We end up with:

It doesn’t look much of a difference when written out, but that’s just because of the space taken up by the brackets. Take those away and mark the run numbers a different way, and you get:

5123444323879832

Quite a considerable saving in space, yes? That is basically how most lossless compression works – finding patterns and replacing them with something smaller that represents that pattern. So you can see how you get to a point (as above) where you can’t compress the file any further – there are no more patterns to be found, and trying to compress a file like this often produces something bigger than what you started with.
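For the curious, basic Run Length Encoding can be sketched in a few lines of Python. This is only an illustration of the idea: the function names and the sample string are made up for this example, and it handles runs of a single repeated character only, not repeated sequences.

```python
from itertools import groupby

def rle_encode(text):
    """Replace each run of identical characters with a (count, character) pair."""
    return [(len(list(group)), char) for char, group in groupby(text)]

def rle_decode(runs):
    """Expand the (count, character) pairs back into the original string."""
    return "".join(char * count for count, char in runs)

# Hypothetical sample data with obvious runs in it.
sample = "1111133344448887"
runs = rle_encode(sample)
print(runs)  # [(5, '1'), (3, '3'), (4, '4'), (3, '8'), (1, '7')]

# Decoding restores the input exactly, so this is a lossless scheme.
print(rle_decode(runs) == sample)

# Data with no runs at all gains nothing: four characters become four pairs.
print(rle_encode("9832"))  # [(1, '9'), (1, '8'), (1, '3'), (1, '2')]
```

The last line shows the “ends up bigger” effect: with no runs to collapse, every character still needs a count stored next to it.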

So now we know about compression, but what about encryption?

Encryption is the process of transforming information (referred to as plaintext) using an algorithm (called cipher) to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key. The result of the process is encrypted information (in cryptography, referred to as ciphertext).

So encryption takes something understandable and makes it gibberish, only understandable if you have the key to it – kind of like a box with a padlock on it.

Most encryption systems basically work on the principle of combining the source data with some known data (called the key) and passing it through some mathematical formula to create something that can’t be broken without having that known piece of data. Cracking the cipher without that first bit of known data is quite a hard task and usually involves, like in compression, looking for patterns. Patterns in encrypted data can indicate such things as the size of the original key, and can even reveal clues as to what the key may have been. Trying different decryption methods with different keys and looking for known valid data (like English words) is also a common way of trying to decrypt data without the key. This is known as a ‘Brute Force’ attack.

So the more you can ensure that the output of your encryption contains no patterns, the harder it will be to decrypt.
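To make the point about patterns concrete, here is a deliberately weak toy cipher in Python: it simply XORs the data with a short repeating key. This is my own illustration, not something any real system should use, and the `xor_cipher` name and the key are hypothetical. It is useful precisely because it leaks patterns.

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy cipher (NOT secure): XOR each byte with a repeating key."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

plaintext = b"A" * 12          # highly patterned input
key = b"key"                   # hypothetical 3-byte key

ciphertext = xor_cipher(plaintext, key)

# XOR is its own inverse, so applying the same key again decrypts.
print(xor_cipher(ciphertext, key) == plaintext)

# Split the ciphertext into key-sized blocks and count the distinct ones.
blocks = [ciphertext[i:i + 3] for i in range(0, len(ciphertext), 3)]
print(len(set(blocks)))  # 1: every 3-byte block is identical
```

The ciphertext repeats every three bytes, which tells an attacker the key length without them knowing the key. A well designed cipher is built so that its output shows no such regularity.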

Encryption is incredibly complex and involved, so we won’t go into any detail here. (Besides, if I told you, I’d have to kill you)

So putting those together, what does that mean?

Well, we have:

Compression searches for patterns and replaces them with smaller tokens representing those patterns

Encryption obfuscates the data, ideally creating an output with no discernible patterns in it

If the encryption is done properly then the result is basically random data. Most compression schemes work by finding patterns in your data that can be in some way factored out, and thanks to the encryption now there are none; the data is completely incompressible.
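You can check this claim without any encryption library. In the sketch below I use `os.urandom` as a stand-in for well-encrypted data (an assumption on my part: good ciphertext should look just like random bytes to a compressor) and compare it with some very repetitive text of the same length.

```python
import os
import zlib

# Very repetitive text: lots of patterns for the compressor to find.
text = b"The quick brown fox jumped over the lazy dog. " * 200

# Stand-in for properly encrypted data: random bytes of the same length.
random_like = os.urandom(len(text))

compressed_text = zlib.compress(text)
compressed_random = zlib.compress(random_like)

print(len(text), "->", len(compressed_text))           # shrinks dramatically
print(len(random_like), "->", len(compressed_random))  # no real saving
```

The patterned text collapses to a small fraction of its size, while the random bytes come out no smaller, and typically a few bytes larger because of the compressor’s own overhead.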

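More important: compression increases the entropy density of the data. It doesn’t add any information; it removes redundancy, so every remaining byte is less predictable. That is good for your encryption, because there is less recognizable structure in the plaintext for an attacker to exploit (for example in known-plaintext attacks).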

Which is basically saying that a compressed file is harder to decrypt through brute force attacks because there should be no (or very little) recognizable data there to confirm if the decryption was successful or not.

So, there is our answer:

If you want to both encrypt and compress files (or any data for that matter) you should compress it first and then encrypt it.
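Here is a small Python sketch comparing the two orders. Python’s standard library has no real cipher, so I’ve improvised a toy one by XORing the data with a keystream made from SHA-256 hashes of the key and a counter. The `toy_encrypt` function and the key are hypothetical and exist purely for this demonstration; don’t use anything like it to protect real data.

```python
import hashlib
import zlib

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy stream cipher (illustration only): XOR with a SHA-256 keystream.

    Running it a second time with the same key decrypts.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        keystream = hashlib.sha256(key + counter).digest()  # 32 bytes
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, keystream))
    return bytes(out)

data = b"The quick brown fox jumped over the lazy dog. " * 200
key = b"hypothetical key"

compress_then_encrypt = toy_encrypt(zlib.compress(data), key)
encrypt_then_compress = zlib.compress(toy_encrypt(data, key))

print("original:             ", len(data))
print("compress then encrypt:", len(compress_then_encrypt))
print("encrypt then compress:", len(encrypt_then_compress))

# Undo the steps in reverse order to recover the original.
recovered = zlib.decompress(toy_encrypt(compress_then_encrypt, key))
print(recovered == data)
```

Compressing first gives a small output. Encrypting first leaves the compressor with nothing to work on, so the result is about as big as the original.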

4 Comments

Nice post! Thanks for all the awesome info. Do you know then how encrypted ZIP files work? Encryption seems to be built into many compression formats like zip, rar, 7z, etc. Do these usually compress and then encrypt, or somehow do both at once?

Well, nicely said, but if you think a bit outside the box (which is often necessary), you will find that this definitive answer of yours is not generally true. If you are still interested in this particular question, I recommend reading the article by Klinc et al., “On compression of data encrypted with block ciphers” (it is from 2009, btw.). If there is a will, there is a way ;).

I’m glad that you specifically used the words pattern and gibberish. RLE isn’t really used much these days for anything of import, so while it helps demonstrate the issue, it is a bit contrived. Using the terms “pattern” and “gibberish” makes it clear what is actually happening in a more generally applicable way: that compression hinges on finding patterns and that encryption destroys patterns.