I'm trying to learn more efficient coding, so I'm doing some challenges. In one challenge I'm given some hex data. I know the hex data is XOR-encoded with a four-character ASCII key. The only thing I know about the cleartext is that it contains "a lot of '#' and space characters".

I don't know what they mean by "a lot", and there are many possibilities that end up with more than 50 '#' and space characters, though most have fewer than 10 of each. So I figured the only way to do this would be to narrow down the number of possibilities.

In the following code I'm iterating through every possible four-character ASCII key using the Cartesian product (keys). I'm using a namedtuple to store the number of '#' and space characters for each key that yields more than 10 of either. I'm hoping this will help me identify good candidates to sift through manually.

1 Answer

The very first thing I do when I'm running into performance issues with my code is answer three questions:

How fast does my code need to be?

How fast is it now?

What is slowing me down?

You're the only one who can provide an answer to the first one, so I'll pass on that for now. The second and third ones can be answered simultaneously while profiling your code, using cProfile. I saved your code as "bruteforce_xor.py", so by running

C:\Users\Dannnno\Desktop\> python -m cProfile bruteforce_xor.py

I was able to find out how long it takes to run in total, and what is actually taking so long. This was the output:

I don't have any! I let this run for a while and it never finished, which shouldn't come as a surprise once you finish reading below.

Before we apply the (lack of) results of our profiling, let's look at some general improvements.

Imports

In general, each import deserves its own line. This helps with readability.
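As a sketch (I'm guessing at which modules your script actually pulls in; adjust to match), prefer this:

```python
# One import per line reads more clearly than a combined
# statement like "import itertools, re, string":
import itertools
import re
import string
```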

Whitespace

You generally want whitespace around operators, commas, etc. Exceptions include the = sign for default values and keyword arguments, and omitting spaces to indicate precedence.

On the flip side, don't pad the inside of a call's parentheses with spaces, like

possibles.append( Possible(key, hcount, scount) )

instead, use

possibles.append(Possible(key, hcount, scount))

Naming

Some of your variables could be more clearly named. Instead of data, I'd call that chunks. Instead of Possible I'd use Candidate, and possibles -> candidates. Instead of for k in keys just type for key in keys. Things of that nature.

Regex

I question the regex you're using... The regex '....?' will match any 3- or 4-character chunk, because you've made the last character optional. Based on your comments, you only want 4-character chunks.
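You can see the difference directly (a small sketch; 'abcdefg' is just an illustrative string):

```python
import re

# '....?' greedily matches 4 characters where it can, but falls
# back to 3 at the tail, so a trailing 3-character chunk sneaks in:
loose = re.findall('....?', 'abcdefg')   # ['abcd', 'efg']

# '.{4}' (equivalently '....') matches only full 4-character chunks:
strict = re.findall('.{4}', 'abcdefg')   # ['abcd']
```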

Itertools

You use itertools some of the time, but not always. Use it for zipping and mapping as well. That being said, I don't really see the point of

map(bytearray, [word, key])

Usually I prefer reserving higher order functions for cases where I don't actually know how many elements there are. In this case

bytearray(word), bytearray(key)

seems much cleaner
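Concretely, the per-chunk XOR can then be written without map at all (a sketch in Python 3 syntax; word and key here are made-up stand-ins for your data):

```python
word = b"abcd"  # a hypothetical 4-byte chunk of ciphertext
key = b"test"   # a hypothetical 4-character key

# XOR the chunk against the key byte-by-byte; no map() needed.
plain = bytes(w ^ k for w, k in zip(bytearray(word), bytearray(key)))
```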

Your speed is going to be very, very hard to improve if you keep using a brute-force algorithm.

string.printable consists of 100 ASCII characters. When you take the Cartesian product of that with itself (4 times) you get 100,000,000 (!!!) combinations. Then you iterate over each of those combinations.

Inside that iteration, you then iterate over every "word" in your data. Using re.findall('....?', blah) you're going to generate 1889 chunks of length 4 and 1890 chunks of length 3, for a grand total of 3779 chunks. This means you're essentially doing 377,900,000,000 (!!!!!) iterations of your loop.
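The arithmetic checks out (a quick sanity check, taking the chunk counts above as given):

```python
import string

keys = len(string.printable) ** 4   # 100 ** 4 == 100,000,000 candidate keys
chunks = 1889 + 1890                # 3779 chunks checked per candidate
total = keys * chunks               # 377,900,000,000 inner-loop iterations
```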

That's bad. That's really bad. The number one thing you absolutely must improve is that: find another method that reduces how many times you look at everything. I don't have a good suggestion for that; I don't know enough about your problem, or what (if any) expectations you can have for the input, to suggest a good way to filter. However, I can give you some other suggestions.

Use lazy evaluation

As I mentioned earlier, don't call zip and map eagerly. Instead, defer the calculations until you actually need them using itertools.izip and itertools.imap.
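A sketch of the difference (note: in Python 3 the built-in zip and map are already lazy, which is why itertools.izip and itertools.imap were removed there):

```python
from itertools import islice

# A lazy zip yields pairs on demand instead of materializing a full
# list, so pairing two huge ranges costs nothing up front.
pairs = zip(range(10**9), range(10**9))  # instant; nothing is computed yet
first_three = list(islice(pairs, 3))     # only now are three pairs produced
```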

Don't repeat calculations

You transform every chunk into a bytearray every single time you go through your loop. Instead of repeating that calculation, why not convert them all first? The same goes for your Cartesian product (except there I'd convert string.printable up front rather than the products themselves).
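A sketch of hoisting the conversion out of the loop (the chunks and candidate keys here are hypothetical stand-ins for your data):

```python
chunks = ['ABCD', 'EFGH']   # hypothetical 4-character chunks
keys = ['test', 'keyz']     # hypothetical candidate keys

# Convert each chunk to a bytearray exactly once, up front...
chunk_bytes = [bytearray(c, 'ascii') for c in chunks]

# ...so the inner loop only does the XOR, not the conversion.
for key in keys:
    key_bytes = bytearray(key, 'ascii')
    for cb in chunk_bytes:
        decoded = bytes(b ^ k for b, k in zip(cb, key_bytes))
```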

Finally

Bitwise operations are generally executed in constant time so there isn't a whole lot you can do about that. All I might suggest is using an LRU cache, but honestly I don't see that speeding up the process.
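If you did want to experiment with caching, Python 3's functools.lru_cache is the off-the-shelf option (a sketch; as noted, I wouldn't expect it to help much for an operation as cheap as XOR):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def xor_bytes(a, b):
    # Results are memoized; repeated (a, b) pairs skip recomputation.
    return a ^ b

xor_bytes(0x41, 0x20)  # computed
xor_bytes(0x41, 0x20)  # served from the cache
```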

As a final piece of food for thought, this (poorly timed) code runs (on my admittedly slow netbook) as follows:

import time
start = time.time()
# 377,900,000,000 is too big for an arg to xrange in Python 2.7
for j in xrange(1000):
    for i in xrange(377900000):
        j ^ i
diff = time.time() - start
print diff

Output:

None, again. I ran this for 5 minutes (again, on my admittedly crummy netbook) and it couldn't finish. This is the equivalent number of iterations, with a single constant-time operation.
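To put a rough number on it (the loop rate here is an assumption: CPython manages on the order of ten million trivial loop iterations per second on modest hardware):

```python
iterations = 377900000000   # total inner-loop iterations from above
rate = 10 ** 7              # assumed: ~10 million iterations/second
hours = iterations / rate / 3600
# Roughly ten and a half hours for the bare XOR alone, before any
# of the regex, bytearray, or counting work in the real loop.
```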