A Friday Puzzler: Un-XORing Two Plaintexts

I’ve been reading Cryptography Engineering by Niels Ferguson, Bruce Schneier, and Tadayoshi Kohno, on the theory that someone who writes about privacy and surveillance as much as I do ought to have a somewhat more detailed understanding of how modern cryptosystems work, even if I’m never going to be competent to work with the actual code. At one point, the authors mention a potential problem with certain kinds of ciphers. Stream ciphers work by combining a secret cryptographic key with a (supposedly) unique number used once per message—a “nonce”—to generate a “keystream.” The keystream is then XORed with the plaintext message to produce the encrypted ciphertext.

For the non-computer-geeks: that just means that for every bit in the sequence of ones and zeroes that makes up the plaintext, if the keystream has the same value in that position, then the corresponding bit of the ciphertext will get written as a 0, and if they have different values in that position, the corresponding bit of the ciphertext gets written as a 1. (This corresponds to the logical operation “exclusive or”: It outputs a 1, meaning “true,” just in case one or the other but not both of the inputs is true.) So, for instance, the capital letter “A” is normally encoded as the binary string: 01000001. A lowercase “z” is represented as 01111010. If you XOR them together, you get: 00111011. If you XOR in the “z” again, you get “A” back out… but that assumes you know at least one of the two original pieces of the puzzle: There’s a vast number of different ways to XOR two bytes together to produce 00111011.
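For the curious, here’s a quick Python sketch of that round trip, just to make the bit patterns concrete:

```python
# XOR 'A' against 'z', then XOR 'z' back in to recover 'A'.
a, z = ord('A'), ord('z')         # 0b01000001, 0b01111010
combined = a ^ z
print(format(combined, '08b'))    # 00111011
print(chr(combined ^ z))          # A
```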

In theory, there should be no way to reverse the process without knowing the keystream, which requires knowing the cryptographic key. But there’s a loophole: If the designer of the system messes up and allows that supposedly-unique “nonce” to be reused, then you end up with two messages that have been encrypted (XORed) with the same keystream. That doesn’t tell you what the keystream is. But if an attacker knows which two messages have been encrypted with the same keystream, he can just XOR those two ciphertexts together. The result is to mathematically cancel out the keystream, giving you the same result as if you’d just XORed the two original plaintexts together. Once you’ve got this, Schneier et al. warn, an attacker will often be able to easily reverse the process and decompose that into the two original messages—provided the original messages aren’t just random gibberish, but something that exhibits patterns, like written English. But they didn’t bother explaining exactly how this could be done, so I ended up spending 15 minutes doodling on a legal pad trying to suss out how an attack would work. Even some of my geekier friends seemed to think it wasn’t possible when I floated the question on Twitter—and for some cases, it won’t be. For instance, if the two original messages are identical—meaning they have the same value at every bit position—then the result of XORing them is always going to be a string of zeroes, which makes it obvious the two initial messages were identical, but doesn’t give you any hint at the content of the messages.
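To see the cancellation concretely, here’s a small Python sketch; the “keystream” is just random bytes standing in for a real cipher’s output, and the messages are made up:

```python
# Sketch of the nonce-reuse problem: two plaintexts encrypted with the
# same keystream. XORing the two ciphertexts cancels the keystream,
# leaving exactly the XOR of the two plaintexts.
import os

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

p1 = b"ATTACK AT DAWN"
p2 = b"RETREAT TO FOR"        # truncated to equal length for simplicity
keystream = os.urandom(len(p1))

c1 = xor_bytes(p1, keystream)
c2 = xor_bytes(p2, keystream)

# The attacker never sees the keystream, yet:
assert xor_bytes(c1, c2) == xor_bytes(p1, p2)
```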

Special cases aside, though, there definitely are some generally viable strategies for decomposing a file generated by XORing two messages—let’s assume they’re ordinary written English in ASCII character format—back into the original pair of texts. How would you go about it? I’ll update the post with the solutions that I came up with (or found online) later this weekend.

Update: I’m pleased, though not at all surprised, to see that I have a bunch of very smart readers who came up with basically all the strategies I did, and in some cases stated them with a good deal more sophistication than I could have. You’re probably better off just reading the comments, but I’ll summarize the basic ideas below the fold.

(1) Exploit Headers, Padding, & Formatting: I didn’t suggest any particular document structure beyond written English, but realistically you’d want to look for signs of some standard header information that would be at least partly identical (and of some characteristic length) for a pair of e-mail messages, HTML pages, Word documents, etc. The strictly identical and overlapping parts will be represented as strings of zero-bytes in the XOR file, which may be of some help to the extent that you can make educated guesses about what you’re looking at from the length and pattern of the overlapping strings.

Even more useful, once you’ve got some evidence that you’re dealing with a particular kind of structured document, you can try XORing in different characteristic header strings like “Content-Type: text/html” or “Received-From:” in different positions and seeing if and where it spits out something intelligible. I’m assuming throughout that you’ve got frequency tables that allow a search algorithm to quickly assess whether a given output string is likely to appear in written English (or a given type of structured document) or just random gibberish. For strings like “Received-From” that you’d expect to be in approximately—but not exactly—the same position in the document, you might just precompute the strings you get in the region of overlap when you XOR it against itself offset by one or more characters. Then you can just quickly scan for matches with your precomputed “signatures”.

You can apply the same trick in the body if you think you’re dealing with a particular type of structured document. If you think at least one of the original plaintexts might use HTML, for instance, you’d XOR in long predictable strings like:

<a href="http://

at every position until it spits back something intelligible.
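Here’s a minimal Python sketch of that “crib dragging” idea; the two plaintexts and the crib are invented for illustration, and the “looks like text” test is deliberately crude:

```python
# Slide a guessed string (the crib) across the XOR of the two plaintexts
# and keep the offsets where the other side decodes to printable text.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def crib_drag(xored, crib):
    hits = []
    for i in range(len(xored) - len(crib) + 1):
        out = xor_bytes(xored[i:i + len(crib)], crib)
        if all(32 <= c < 127 for c in out):   # crude "looks like text" test
            hits.append((i, out.decode('ascii')))
    return hits

p1 = b'<a href="http://example.com">link'
p2 = b'the quick brown fox jumps over th'
xored = xor_bytes(p1, p2)
print(crib_drag(xored, b'<a href="http://'))
```

At offset 0 the crib lines up exactly with the first plaintext, so what comes back is the opening of the second one.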

On the other end, if you’ve got messages of different lengths, check to see if one of them was padded out with zeroes (or some other characteristic padding string): XOR some padding characters in at the end of the file and see if what comes back looks intelligible. If you XOR in a bunch of zeroes and get out “Sincerely, Agent Smith” you’re probably on to something.

(2) Exploit ASCII: The ASCII character set systematically distinguishes between spaces and most common punctuation marks (which all begin 001), capital letters (010), and lowercase letters (011). There are some fairly uncommon punctuation marks that share initial bits with the letters, but we can initially ignore those for practical purposes. This turns out to give us a really useful shortcut for learning something about the structure of the two texts. Two of the same type of character XORed together, of course, are always going to start 000 (i.e. no difference), and most of the time that’s going to be two lowercase letters. But you can infer that a byte starting 010 means a space or punctuation mark XORed against a lowercase letter, 011 a space or punctuation against a capital letter, and 001 means a lowercase letter overlapping a capital. Since the space is by far the most common character in that block, you probably want to default to treating an isolated 010 byte as a space. That gives you likely word breaks, which are handy for making your dictionary attack more efficient. Not all of them, mind you—they’ll be concealed when they coincide—but enough to be useful.
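A quick Python illustration of reading those three-bit prefixes off an XORed stream (the two sentences here are made up):

```python
# Classify each byte of the XORed stream by its top three bits:
# 000: same class (usually lower vs. lower); 010: space/punct vs. lower;
# 011: space/punct vs. capital; 001: lower vs. capital.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def prefixes(xored):
    return [format(b, '08b')[:3] for b in xored]

p1 = b"meet me at noon"
p2 = b"the Plan is off"
print(prefixes(xor_bytes(p1, p2)))
```

Position 3 ('t' against a space) comes out 010, and position 4 (a space against capital 'P') comes out 011, just as the table above predicts.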

You can also look for characteristic sequences. If a sentence break in one document hits the middle of a word in the other, you’d have a sequence of bytes beginning 010, 010, 001—whereas a sequence beginning with 010, 010, 000 is more likely to be a punctuated word break (comma, semicolon, colon) within a sentence falling across a word.

To see how this would be helpful in practice, imagine you’ve got:

I love cheese!!

XORed with:

You should too.

The first three bytes of the XORed file are going to be:

00010000 01001111 00011001
How convenient, that looks like a one-letter word (or a contraction) in one of the original plaintexts right at the start! Immediately you’d want to try (or realistically, have your analytic algorithm try) XORing a capital “A” and a capital “I” in the first two positions. The first gives you the rather unpromising “Qo” in the first two places, which can be safely discarded, but “Yo” is rather more promising. Word frequency analysis helps you continue: You’ve got two characters of the same type next, and then another 010 byte. An English sentence with two sequential one-letter words would be unusual, but the string “You” followed by a space is extremely common, and testing it gives you the highly plausible complement “I lo”—and so on. Obviously, I’m somewhat artificially describing the process as a human codebreaker would approach it—the computerized attack would replace our intuitive sense of plausibility with a probability assignment based on the frequency with which possible strings of characters appear in English in different positions in a word or sentence. (The sequence “ing” appears with relatively high frequency at the ends of words, far less often at the start.)
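Replaying that by machine (same two sentences as above):

```python
# XOR the two sentences, confirm the first three bytes, then replay the
# "I " vs. "A " guess at the start.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

p1 = b"I love cheese!!"
p2 = b"You should too."
xored = xor_bytes(p1, p2)
print([format(b, '08b') for b in xored[:3]])

print(xor_bytes(xored[:2], b"I ").decode())   # the promising "Yo"
print(xor_bytes(xored[:2], b"A ").decode())   # the discardable "Qo"
```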

(3) Whack It With a Dictionary: Aided by reasonable guesses about the format of the plaintexts and the locations of (some) word breaks and punctuation generated in the first two steps, you fill in the gaps by XORing in test words in every position—starting with the 500 most commonly used words in English—and once again recording the ones that generate output sequences also found frequently in English. When XORing in a word generates a possible word fragment, you test the various ways to complete it in the appropriate spot. In other words, if you plug in “the” and get out “vac”, you check whether the subsequent bytes generate anything interesting when you XOR in “uum”, “ant”, “ation”, “illate”, and so on.
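A toy Python version of that completion step, with invented plaintexts chosen so the “the”/“vac” situation actually occurs:

```python
# XOR in a common word, and if the other side yields a word fragment,
# test plausible completions at the following bytes.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

p1 = b"the council met"
p2 = b"vacation begins"
xored = xor_bytes(p1, p2)

# XORing "the" at offset 0 exposes a fragment on the other side...
frag = xor_bytes(xored[:3], b"the").decode()
print(frag)   # vac

# ...so test completions of "vac" and keep the ones whose complement
# still looks like English.
for tail in (b"uum", b"ant", b"ation"):
    guess = frag.encode() + tail
    other = xor_bytes(xored[:len(guess)], guess)
    print(guess.decode(), "->", other.decode())
```

Only the completion “vacation” yields a plausible complement (“the coun”); the others come back as gibberish and get discarded.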

If you want to optimize the speed of a dictionary attack, you can also adjust the word list you’re using as you get probable hits. Ideally, you’d have a frequency table of uncommon words whose occurrence tends to be correlated, so that (for instance) your algorithm knows that if XORing in words has generated the sequence “litigation,” it should prioritize testing various legal terms that would otherwise be lower on the list, or at least pump the same word back in as a search input. The word “encryption” may be pretty rare in written English, but if you know it appears once in a given text, the probability of a second occurrence rises dramatically.

None of this guarantees that the original texts will be retrievable: As the limit case of identical plaintexts shows, there just may not be enough information remaining in the XORed file. But for most texts—different enough to leave traces of that difference, but similar enough to exploit some common regularities—these strategies should get you quite a lot. Probably there are others I’ve missed.

16 responses so far

For plain ASCII text of mostly English language copy, you want to start slicing and dicing a dictionary. XOR “aardvark” with “abalone” and go from there, varying words as well as the overlapped word portions, and searching through the ciphertext for the results. Actually you probably want to use a different dictionary ordering, tailored to the expected content of the messages you’re attacking (unless your target likes to discuss obscure animals that start with “a”). If you have more than one ciphertext to attack, you can amortize your attack with a table of stored XOR fragments, much like unsalted password hashes can be attacked with rainbow tables. You can also develop heuristics, e.g. expect to find “Dear Sir:” near the beginning and “Sincerely yours,” toward the end. Once you’ve found a likely overlap (that isn’t an exact overlap of two words of the same length), start trying to grow from both ends.

I don’t think you can prove anything about this process, but I would expect once you’ve done it a few times you should have some nice polished algorithms. Of course, if one of the texts was cryptographically random this would be a one-time pad, which cannot be attacked in this fashion.

For the purposes of this exercise you’ve specified ASCII, but if the file had any additional structure (which any file type besides ‘text/plain’ has) you could probably leverage that. For instance, we’d expect web pages to have “<html>” toward the beginning, and a number of other common tokens distributed throughout. You can tune your dictionary ordering to the type of file you hypothesize you’re seeing.

I think if you are trying to completely get back the plaintext and have nothing to go on but the XOR of the two messages it’s pretty tricky, though there’s definitely stuff you can do. (For example, assuming it’s in ASCII, you can find places where both have the same letter, or capital and lower-case versions of the same letter. You can also see where a lot of punctuation is, since all letters have a leading digit of 1 in the binary, which would XOR to 0 almost all the time, whereas periods would lead to that spot XORing to 1. If periods are by far the most common 0-leading symbol, then when you see a leading 0 you can assume that’s a period and get a guess for the letter that must line up with it that will be right a lot of the time. If you get enough clues like this, you might be able to get back the messages, though it’d be a tricky puzzle. I’m not sure if there are algorithms that do it automatically.)

But in crypto, we worry about much more specific situations. Maybe you have some idea of what is being said, but not a lot. You could try out some options and see which ones make the other message also look like English. You can also get all sorts of partial information. Just being able to figure out that two messages are the same can help a lot. Lots of programs might repeatedly send a given message when nothing is wrong, but then change their message when something noteworthy happens, and you’d be able to tell when the message changed.

Nice idea Adam, but if this is really ASCII then all values are <= 0x7f, so the high-order bit of all characters is 0. Of course in that case your analysis works identically for the second-highest-order bit, but you're probably better off assuming 0x20 space than 0x2e period for your most common punctuation. There are probably loads of other little heuristic tricks like this to try. The best combination probably depends on the target.

Assume that every byte that leads with 010 is punctuation (or a space) XORed with a lowercase letter (there is punctuation outside the 001 block, like @, \, and _, but those symbols are pretty rare). Assume that any byte that leads with 001 is a capital letter XORed with a lowercase letter, and every byte that leads with 011 is a capital letter combined with punctuation.

Then we look for two punctuations in a row, followed by a capital letter. That indicates a period (or exclamation point or question mark, much less likely), space, and beginning of a new sentence, and tells us two letters of the other text. We also look for punctuation, capital letter, punctuation. That indicates ” I “. Lowercase letter, punctuation, punctuation, lowercase letter indicates letter, comma or colon or semicolon, space, letter.

Do a sanity check on all punctuation. First try space. If that gives you an unlikely character like bracket, slash, caret, underscore, etc., then try others. Apply this to check the above patterns too.

XOR the word “the” in every possible slot. If it creates a coherent string on the other side, make note of it. Do the same thing with any word that you expect in one or both of the messages, such as a name.

Assume any “0000000” is two XORed spaces unless you have reason to believe otherwise.

Here’s a very rough sketch. This assumes the plaintext is just English words and spaces; adding punctuation isn’t difficult in principle but makes things more complicated.

Start with a dictionary of English words, sorted from most to least common. Take the first word, add a space, and xor it with the beginning of the bit string. If the result is gibberish, move on to the next word. Otherwise, if the result looks like it could be the start of the second plaintext, then make a list of all words that could “complete” the fragment of plaintext and recursively repeat the above procedure on that list.

Continue this procedure recursively until you get to the end of the string (producing a pair of English plaintexts) or you run out of words (in which case there are no possible English plaintexts).
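The recursive procedure this commenter describes might look something like the following Python sketch. The word list, the example messages, and the simplifying assumption that word boundaries line up in both texts are all invented for illustration:

```python
# Toy recursive reconstruction: peel "word + space" chunks off both
# plaintexts simultaneously, backtracking when the complement of a
# guessed word isn't itself a dictionary word.

WORDS = ["we", "to", "attack", "retire", "at", "by", "dawn", "dusk"]

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def solve(xored, sofar1=b"", sofar2=b""):
    if len(sofar1) == len(xored):
        return sofar1, sofar2
    for w in WORDS:
        chunk = (w + " ").encode()
        end = len(sofar1) + len(chunk)
        if end > len(xored):
            continue
        other = xor_bytes(xored[len(sofar1):end], chunk)
        # keep only guesses whose complement is also "word + space"
        if not other.endswith(b" ") or other[:-1].decode("ascii", "ignore") not in WORDS:
            continue
        result = solve(xored, sofar1 + chunk, sofar2 + other)
        if result:
            return result
    return None

p1 = b"we attack at dawn "
p2 = b"to retire by dusk "
print(solve(xor_bytes(p1, p2)))
```

Given only the XOR of the two messages, the search recovers both plaintexts; a real attack would use a frequency-ordered dictionary and a statistical plausibility test instead of exact word matches.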

Another approach, perhaps less efficient but more theoretically pleasing, would be to anneal a Markov random field. Think of the ASCII values of the n-byte messages as nodes of a 2 x n lattice, and use a corpus to define the horizontal node potentials according to a simple Markov model of English (where the likelihood of each letter is a function of its k surrounding letters). The vertical edge potentials are determined by the fact that the XORs of the two nodes in each column are given. Let it run for a few minutes, and see what you get.

Ed-
These are the same solutions (albeit more rigorously phrased) that I came up with sitting and thinking it over for a half hour; I can’t believe they’re not floating around out there in the public domain already.

Indeed, all of these are sort of floating around the ether, and have been for some time.

There’s a really excellent description toward the end of Cryptonomicon of how it’s possible to attack even a “one-time pad” if the random noise you’re using for encryption isn’t truly random. The methods there, involving letter frequency, also work for attacking normally encrypted documents. In fact, they work much better.

(I didn’t participate earlier because I’m sort of a cryptography hobbyist and a longtime reader of Schneier et al, so I have a library of methods in my head that I’d use in the scenarios given already)

It actually may be pretty hard to figure out what two messages that are XORed together mean. Here is an example where they figured it out based on the fact that the second message was a retransmission with slight differences:

British cryptographers at Bletchley Park had deduced the operation of the machine by January 1942 without ever having seen a Lorenz machine, made possible because of a mistake made by a German operator.
Interception

Known by Y Station operators used to listening to Morse code transmission as “new music”, originally Tunny traffic interception was concentrated at the Foreign Office Y Station operated by the Metropolitan Police at Denmark Hill in Camberwell, London. But due to lack of resource at this time (~1941) it was given a low priority. A new Y Station, Knockholt in Kent, was later constructed specifically to intercept Tunny traffic so that the messages could be efficiently recorded and sent to Bletchley Park.[17] The head of Y station, Harold Kenworthy, moved to head up Knockholt. He was later promoted to head the Foreign Office Research and Development Establishment (F.O.R.D.E).
Code breaking

On 30 August 1941, a message of some 4,000 characters was transmitted from Athens to Vienna. However, the message was not received correctly at the other end, so (after the recipient sent an unencoded request for retransmission, which let the codebreakers know what was happening) the message was retransmitted with the same key settings (HQIBPEXEZMUG); a forbidden practice. Moreover, the second time the operator made a number of small alterations to the message, such as using abbreviations, making the second message somewhat shorter. From these two related ciphertexts, known to cryptanalysts as a depth, the veteran cryptanalyst Brigadier John Tiltman in the Research Section teased out the two plaintexts and hence the keystream. Then, after three months of the Research Section failing to diagnose the machine from the almost 4,000 characters of key, the task was handed to mathematician Bill Tutte. He applied a technique that he had been taught in his cryptographic training, of writing out the key by hand and looking for repeats. Tutte did this with the original teleprinter 5-bit Baudot codes, which led him to his initial breakthrough of recognising a 41 character repeat.[18][9] Over the following two months up to January 1942, Tutte and colleagues worked out the complete logical structure of the cipher machine. This remarkable piece of reverse engineering was later described as “one of the greatest intellectual feats of World War II”.[9]

After this cracking of Tunny, a special team of code breakers was set up under Ralph Tester, most initially transferred from Alan Turing’s Hut 8. The team became known as the Testery. It performed the bulk of the subsequent work in breaking Tunny messages, but was aided by machines in the complementary section under Max Newman known as the Newmanry.[19]

Not sure if someone else mentioned this, but something that helps with this in ASCII is that no two letters will XOR together to make another letter. However, a space XORed with a letter flips its case, so that can help you discover positions of known letters and guaranteed spaces.
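In Python terms, the case-flip observation is just:

```python
# Upper- and lowercase ASCII letters differ only in bit 5 (0x20), which
# happens to be the code for a space. So XORing a space against a letter
# flips the letter's case.
print(chr(ord('g') ^ ord(' ')))   # G
print(chr(ord('G') ^ ord(' ')))   # g
```

And since two letters never XOR to another letter, an XOR byte that is itself a letter is strong evidence of a space on one side.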

A bit late to add this, but your comment that a message XORed with itself leaves all zeroes … That isn’t the same as reusing the key, because there both the plaintexts and the ciphertexts are identical.

The late Dr Tutte was the founder and head of the Department of Combinatorics and Optimization, Faculty of Mathematics, at the University of Waterloo. His paper on the Lorenz is on the uwaterloo.ca site somewhere.
