When brute-forcing a password (e.g. in the classic attacks on DES), where you have ciphertext only, you need a way to assess whether a candidate decryption is the right one. I believe the EFF DES machine does this by checking whether the characters are printable. Of course, this only works for ASCII files, not things like images.

I'd like to measure the observed 0th-order byte-level entropy and see whether it is high enough to be attributed to randomness.

Is that a good method?

Is there another well known method?

Where can I find tables that indicate, for a randomly generated message of size X, the probability p that its entropy is below a given threshold? (Tables of this kind are used, for instance, in chi-squared testing.)

In response to D.W.: I have 5 ciphertexts, each 1 KB, and no information about them. I was able to decrypt one by dictionary attack; it is plaintext ASCII. I imagine the others are also. If that is the case, it's easy to automatically distinguish valid decrypts from invalid ones (check that all chars are printable, not < 0x20 or > 0x80). Of course, the others might not be ASCII files, but until I try, I have no way of knowing.
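For concreteness, the printable-character check described above can be sketched like this (the function name is mine, and allowing tab/CR/LF as well as printable characters is my own choice, since real text files contain them):

```python
# Sketch of the "all chars printable" test for distinguishing a valid
# ASCII decrypt from random-looking garbage. Allowing tab, CR, and LF
# in addition to the 0x20-0x7F range is an assumption of mine.
def looks_like_ascii(data: bytes) -> bool:
    """Return True if every byte is printable ASCII or common whitespace."""
    return all(0x20 <= b < 0x80 or b in (0x09, 0x0A, 0x0D) for b in data)
```

A 1 KB block of bytes from a wrong-key decryption will almost certainly contain at least one byte outside this range, so the test rejects wrong keys with very high probability.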

But I'm interested in this method in general. For instance, when looking at sample output of an unknown cryptosystem, it would be a good way to assess whether there are any obvious statistical problems. Or, when reverse engineering binary applications and looking at data, it could help assess whether the data is plain binary or encrypted. I also find the question of mathematical interest quite apart from applications.

I think you should tell us exactly what situation you have. What ciphertext do you have? What do you know about the plaintext? Is this a real question, or a theoretical one? If it is a theoretical one, please also tell us why you are asking.
–
D.W. Sep 18 '11 at 4:39

2 Answers

Well, from your previous questions, I'm assuming that you're writing a utility to brute-force decrypt a password-protected file (encrypted with a certain encryption utility), and you're looking for a way to determine whether a trial decryption is plausible.

Normally, when an attacker attempts to decrypt something, he has some idea about what it is (why else would he invest the effort?), and even if that guess doesn't give him an entire plaintext block, his partial information about what it is (be it an IP packet or a Word document) will help him recognize it.

Thomas states that the usual assumption is that you do have a full plaintext block; that may be the assumption you make when you're designing a cipher (actually, you assume the attacker has a lot more than that), but that's not always true for all attackers (for example, yourself).

Your suggestion (running a chi-squared test on the byte frequencies to see whether they are consistent with the bytes being generated randomly, the idea being that decrypting the data with the wrong key will produce random-looking output) isn't a bad idea; whether it is the right idea depends a great deal on whether it will actually pick up the plaintext, without the sensitivity needing to be so high that you are deluged with false hits. This is a real concern; I suspect that (say) ZIP files have fairly even byte distributions, and the chi-squared test would be hard pressed to identify those.
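A minimal sketch of such a chi-squared statistic over byte frequencies, assuming a uniform expected distribution (the function name and any rejection threshold are choices of mine, not part of any standard tool):

```python
# Chi-squared statistic comparing observed byte counts against the
# uniform distribution expected of random data. With 255 degrees of
# freedom, values far above roughly 300 suggest the data is not
# uniformly random (that cutoff is my own rough guide, not a standard).
from collections import Counter

def chi_squared_statistic(data: bytes) -> float:
    expected = len(data) / 256.0
    counts = Counter(data)
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(256))
```

English text, with its lopsided byte distribution, scores enormously higher than random bytes, which is exactly the signal being exploited; compressed data, as noted above, may not.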

Hence, my suggestion would be to rely on a series of tests, each designed to pick up a particular file format (as well as a general purpose test that has a shot at recognizing file formats you didn't anticipate).

Some ideas for these tests might be:

Checking whether the most significant bits of the bytes are mostly the same. This will catch anything in a text format (and yes, a chi-squared test would also catch it, but you'd need to decrypt a lot more data to run that test).

Check whether the start of the file is consistent with a specific file format (for example, check for the magic number that appears in a ZIP file header or a JPEG file header).

Your chi-squared-based test is a decent general-purpose test, which has a good shot at picking up most file formats that aren't heavily compressed. If I were doing this, I'd pick a simpler but related characteristic (such as "is there a byte value that occurs more than N times"); I have no good feel as to which would work better in practice.
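The first two tests suggested above might be sketched as follows. The magic numbers shown are the standard file signatures for ZIP, JPEG, and PNG, but the function names, the 90% threshold, and the particular set of formats checked are my own choices:

```python
# Sketches of the "MSB mostly clear" test and the magic-number test.
# The 0.9 threshold is an assumption of mine; tune it empirically.
def msb_mostly_clear(data: bytes, threshold: float = 0.9) -> bool:
    """Text formats keep the top bit of most bytes clear."""
    clear = sum(1 for b in data if b < 0x80)
    return clear / len(data) >= threshold

# Standard leading signatures for a few common formats.
MAGIC_NUMBERS = {
    b"PK\x03\x04": "zip",            # ZIP local file header
    b"\xff\xd8\xff": "jpeg",         # JPEG SOI marker
    b"\x89PNG\r\n\x1a\n": "png",     # PNG signature
}

def match_magic(data: bytes):
    """Return the name of the first recognized file signature, or None."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None
```

The magic-number test is cheap and nearly free of false positives, which makes it a good first filter before the more expensive statistical tests.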

Great answer! Yes, the chi-squared test won't catch compressed data. But it's still very useful to have a general-purpose test, not only for brute-forcing but for other types of analysis and reverse engineering.
–
S. Robert James Sep 18 '11 at 2:53

The usual assumption is that the attacker knows a full plaintext block; that's what the EFF DES-cracking machine uses. That machine knows exactly 8 consecutive plaintext bytes and the corresponding ciphertext block; it stops when it finds a matching key. Since there are 2^56 possible DES keys, and 2^64 possible 8-byte blocks, chances are high that there is only one matching key (i.e. when the EFF machine stops on one key, it is the right key).

That's not an unreasonable assumption; see for instance this response to a similar question (on security.SE) for more details on that (yeah, I am self-promoting).

As for entropy: entropy is a measure of what some data could have been. It makes no sense to talk about "the entropy of a given message": the entropy is a property of the process which generates messages, not of any message itself.
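For concreteness, the "observed 0th-order byte-level entropy" the question refers to can be sketched like this: treat the observed byte frequencies as if they were the source distribution, and compute the Shannon entropy in bits per byte (the function name is mine):

```python
# Empirical 0th-order byte entropy: plug the observed frequencies into
# the Shannon entropy formula H = -sum(p_i * log2(p_i)). This measures
# a property of the frequency profile, not of the true source.
import math
from collections import Counter

def empirical_byte_entropy(data: bytes) -> float:
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Random bytes score close to the maximum of 8 bits per byte; English text typically scores far lower, which is the gap such a test would exploit.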

When we say "the entropy of a given message" we mean: assuming the source picks bytes independently and at random, at the frequencies found in the message, what would its entropy be?
–
S. Robert James Sep 18 '11 at 2:52


@Thomas Pornin: no, the EFF DES-cracking machine was not restricted to known plaintext. According to the designer (thanks, Wayback Machine): "The DES Key Search Machine uses a sieve-and-check search process that can find keys even when little is known about the plaintext. Each chip processes two separate ciphertexts and contains a 256-bit vector specifying which bytes can appear in the plaintext -- making it possible, for example, to find a key if the input message is simply known to consist of ASCII text."
–
fgrieu Sep 19 '11 at 11:18