That's true, but don't forget that on standard S-boxes the non-polynomial coefficients are 3A-congruent with the inverse configuration. You can't just generalise this across all sub-expressions, as Lehey proved. So you are right for some specific constraints on standard tightly-bound S-boxes, but that is not applicable in this case. You really have to focus on the general Booleans for any improvement. And 17% isn't too far from the theoretical limit of 22% either. So kudos to the team!

We were not the first to generate and try to optimize Boolean expressions for the S-boxes. Other researchers worked on this before, starting in 1997 when Eli Biham wrote his classic paper on bitslice DES. 17% is our improvement compared to those previous results. To me, it is impressive that after 14 years and numerous attempts by others, including successful ones, it was still possible to improve on the previous best results by as much as 17% at once. My gut feeling is that further improvements, while definitely possible, will be more limited. But then again, some people I spoke to had thought that our 17% was not possible.

Who cares what some CPU can do. The real question is how fast it runs on a modern GPU. Generally speaking, encryption cracking and the like is way faster on a GPU, since this problem is highly parallel in nature.

TFA has nothing to do with the target platform. If you can improve the algorithm itself by 17%, then it doesn't matter if that means (33/0.83)MB/s on a CPU or (333/0.83)MB/s on a GPU.

Granted, some optimizations may adapt worse (or better!) to the instructions available on a GPU than they do on a CPU, but the task of password cracking inherently parallelizes almost perfectly, so 17% just means 17%, period.

3DES can offer up to 112 bits of security; even if your GPU could check one key per nanosecond with 10,000-way parallelism, you are looking at something to the effect of 16,464,665,330,209 years of work for a brute-force key search.
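A back-of-envelope check of that figure, using the hypothetical rates stated above (one key per nanosecond, 10,000-way parallelism):

```python
# Sanity check of the 3DES brute-force estimate above, under the
# stated hypothetical: a GPU testing one key per nanosecond, with
# 10,000-way parallelism, against 3DES's 112-bit effective keyspace.
keyspace = 2 ** 112
keys_per_second = 1e9 * 10_000        # 1 key/ns x 10,000-way parallel
seconds = keyspace / keys_per_second
years = seconds / (365.25 * 24 * 3600)
print(f"{years:.3e} years")           # on the order of 1.6e13 years
```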

Actually, a lot of people care about CPUs. I spoke to someone from a penetration testing company the other day. They run a lot of password hash cracking. And they have 10x more CPUs (used for other purposes as well) than GPUs (bought specifically for password cracking). Given that performance of DES-based crypt(3) on GPUs is by far not as impressive as it is for other hash types, they typically test this sort of hashes on CPUs and not GPUs.

Here are some specific performance numbers for DES-based crypt(3) on GPUs (for comparison, recall that we're reporting over 20M c/s on a CPU):

oclhashcat-plus is reported to achieve 55M on ATI HD 5970, only 25M on NVidia GTX570 at 1600 MHz core clock, 310M on 8x ATI HD 6970, 181M on 7x NVidia GTX580 (1594 MHz). The numbers for oclhashcat-lite are very similar (57M, 26M, 297M, 181M, respectively). These are off the hashcat website. This does not use our new S-boxes yet (I expect that future versions of *hashcat tools will).

Notice how the number for high-end NVidia is on par with that for our CPU, and for ATI is less than 3x better. Of course, GPUs do have an advantage, but it still does make sense to use CPUs as well, which a typical organization has more of and doesn't need to spend extra time to deploy, install drivers for, etc.

Now, our new S-boxes and other optimizations will provide better performance. Per discussions with a tripcode cracker author, I expect all the way up to 400M c/s on ATI HD 5970, which is close to its theoretical peak speed (approx. 80% of it per some estimates). This is a 20x improvement over our figure for the Core i7 CPU, which is significant. (There's a little room for improvement on the CPU as well, though - specifically, if we pre-generate or runtime-patch the code for each salt as opposed to using pointers at runtime like we do now. This kind of optimization is assumed in the 400M figure for the GPU. So with both having the optimization, the GPU's advantage will be less than 20x.)

Curiously, 400M c/s for 25 iterations of DES will mean that a single ATI HD 5970 with proper code will be able to crack 56-bit DES keys in just 42 days on average.
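The arithmetic behind that 42-day figure, assuming the 400M c/s estimate holds and each crypt is 25 DES iterations:

```python
# 400M crypts/s with 25 DES iterations each is an effective 1e10
# single-DES operations per second; a brute-force search covers half
# of the 2^56 DES keyspace on average before finding the key.
des_rate = 400e6 * 25                 # effective DES encryptions/s
avg_keys = 2 ** 55                    # half the 2^56 keyspace
days = avg_keys / des_rate / 86400
print(f"{days:.1f} days")             # roughly 42 days
```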

So, yes, GPUs have an advantage, and we have contributed to that as well.

Very cool. Any chance of using Intel's AES-NI instructions on the i5 and i7 to shortcut some of the work for the CPU version? Or are they too specific to AES? AES-NI can run AES (I forget exactly which one) at something like 3 GBytes/sec on an i7.

(And, of course, programming it directly into a FPGA, but that's a separate topic).

AES-NI is definitely too specific to AES, not reasonably reusable for DES. Yes, we have achieved a speed for DES comparable to that of AES with AES-NI.

We're actually considering building a password hashing method on top of something like this, where bitslice DES has the advantage of being scalable to arbitrary SIMD vector widths and not requiring specialized instructions for efficient implementation. DES is also FPGA-friendly (more so than AES), and we have a project to implement password hashing for authentication.
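To illustrate what "scalable to arbitrary SIMD vector widths" means here, a toy sketch of bitslicing (the gate below is made up, not an actual DES S-box expression): each machine word holds one bit position of N independent instances, so one bitwise operation advances all N instances at once, and widening the word (SSE2: 128 bits, AVX: 256, etc.) scales the parallelism without changing the logic.

```python
# Emulate a 64-bit register; on wider SIMD hardware the same code
# runs on 128/256/512-bit vectors with no change to the logic.
MASK = (1 << 64) - 1

def toy_gate(a, b, c):
    """A made-up 2-op gate, evaluated for 64 instances in parallel."""
    return ((a & b) ^ c) & MASK

a = 0xFF00FF00FF00FF00
b = 0xF0F0F0F0F0F0F0F0
c = 0x00000000FFFFFFFF
print(hex(toy_gate(a, b, c)))   # 0xf000f0000fff0fff
```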

Yah, that's me (DragonFly). You mean for the default password encryption? We switched the password encryption to SHA-256 by default as one of the Google Code-in projects. I wasn't particularly involved but I'm sure you are correct. All I can say is that it was better than what we had before :-)

I guess SHA is turning out to be not quite as secure for its key length as RSA was/is.

My personal position on password files is that it doesn't matter how good the encryption is. Social, physical, and psychological attack vectors dominate.

I have so many comments on what you wrote that I don't dare to post them. :-) But I'll say a few things:

Password policies still make sense to me when combined with modern (salted and stretched) password hashes, particularly for large user databases where each account is of relatively little value (your Sony example applies here). Rather than absolutely require certain character classes, users should also be given the option to use longer passphrases, where the number of required character classes can be relaxed.

Bitwise operations are not an issue. Besides, we have versions of our S-box expressions that primarily use "bit select" instructions, such as ATI's BFI_INT - these work on PowerPC/AltiVec in JtR 1.7.8, but I think they will see even more use on high-end AMD/ATI GPUs (this is what we primarily had in mind).
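For readers unfamiliar with "bit select": it is a per-bit multiplexer. A minimal sketch (the function name `bitsel` is mine, not from JtR) shows why it matters: plain code needs three operations, while ATI's BFI_INT or AltiVec's vsel does the same thing in one instruction, so S-box expressions built mostly from selects map well to that hardware.

```python
def bitsel(a, b, mask):
    # Per bit: take the bit from b where mask is 1, else from a.
    # Three ops here; a single BFI_INT/vsel instruction on hardware
    # that provides a native bit-select.
    return (a & ~mask) | (b & mask)

print(hex(bitsel(0x00FF, 0xFF00, 0x0F0F)))   # 0xff0
```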

The real issue is register pressure (bitslice DES needs a lot of registers) and memory latencies. In our S-box expressions, we tried to minimize not only gate count, but also the number of registers needed

Why did it take so long to find a shorter Boolean expression? Aren't there programs that take in a truth table and churn through all the expressions that can generate it? And isn't the S-box I/O size pretty small to begin with?

Determining whether two Boolean circuits are equivalent is a famously difficult problem. In fact, it can be reduced to SAT (http://en.wikipedia.org/wiki/Boolean_satisfiability_problem), the very first problem shown to be NP-complete, which in general means it is impractical to solve on real computers.

Huge SAT problems are routinely solved on computers. In fact the CPU in your computer was probably formally tested in software by solving a huge SAT problem. NP-completeness does not necessarily mean that a problem can't be solved in practice even if it is huge. Complexity theory does provide an approximation to what is tractable, but it isn't all that accurate.

Sure, but you're not being given an arbitrarily large input size; it's something like 8 inputs/8 outputs at most. The traveling salesman problem isn't horrendously intractable if you only have e.g. 10 cities.
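In code, the point is easy to see: with only 6 input bits, checking whether two candidate circuits compute the same function is a trivial 64-case exhaustive test; the hard part is searching for a small circuit, not verifying one. The two expressions below are made up for illustration.

```python
from itertools import product

def f1(a, b, c, d, e, f):
    return ((a & b) ^ (c | (d & ~e)) ^ f) & 1

def f2(a, b, c, d, e, f):
    # Same function with the terms reordered.
    return (f ^ (a & b) ^ (c | (d & ~e))) & 1

# Exhaustive equivalence check over all 2^6 = 64 inputs.
equivalent = all(f1(*bits) == f2(*bits)
                 for bits in product((0, 1), repeat=6))
print(equivalent)   # True
```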

6-to-4 is large enough that you can't realistically find a perfect solution (the absolute smallest gate count) on present computers and with present knowledge. You can do it for 5-to-1, though. Also, generic Boolean expression minimization tools produce relatively poor results for DES S-boxes; specialized algorithms are the way to go. IIRC, I tried Espresso - http://en.wikipedia.org/wiki/Espresso_heuristic_logic_minimizer - in the late 1990s. It couldn't even get close to Matthew Kwan's results from 1998.

Neither is any competition for AES at all. Also, 3DES is still of dubious security, and DES can be brute-forced at home today. So this is really non-news, as it does not hold any relevance for people who care about having secure encryption.

NTLMv1 was the problem, and it hasn't been used in a very long time except when talking to very old servers. You could always turn off NTLM password storage (since XP at least), and in Vista and Windows 7 it is off by default.

NTLMv2 uses a challenge-response system, so you can't offline-crack it in the same way.


John the Ripper, in the -jumbo versions (community-enhanced), includes support for cracking of both NTLMv1 and NTLMv2 challenge/responses - see NETNTLM_README in the documentation and NETNTLM_fmt.c and NETNTLMv2_fmt.c in the source tree.

Slow? DES used to be slow prior to bitslicing. The 33 Gbps figure I mention is on par with that for AES using specialized instructions, but without reliance on such instructions. Sure, 3DES is 3x slower. But even for 3DES we get around 10 cycles/byte on one CPU core, which is on par with AES without specialized instructions. That said, data encryption with DES/3DES is in fact not the primary intended use for our results. We realize perfectly well that people want to hear "AES" these days.
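A rough consistency check of those figures, under an assumed clock speed and core count (the post states neither): ~10 cycles/byte for 3DES implies ~3.3 cycles/byte for plain DES, and on a hypothetical 3.3 GHz quad-core Core i7 that lands near the quoted 33 Gbps.

```python
des_cycles_per_byte = 10 / 3          # 3DES does 3x the DES work
clock_hz = 3.3e9                      # assumed per-core clock
cores = 4                             # assumed core count
gbps = clock_hz / des_cycles_per_byte * cores * 8 / 1e9
print(f"~{gbps:.0f} Gbps")            # close to the quoted 33 Gbps
```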

AES is a pretty good algorithm, but I don't think it is good enough precisely because of the side channel weaknesses. I would imagine that another round of competition within the crypto community is necessary right after the SHA-3 competition. Currently, 3DES is probably a better choice for many functions than AES because of the possible side channel attacks.

Regarding the hashing, you are probably better off choosing Skein. It comes with its own cipher (Threefish), which would also allow some more interesting possibilities.

Question: given that you can't really prove the security of any encryption algorithm, why isn't enciphering data with multiple algorithms and independent keys more common? Is it too difficult to do or perhaps too slow and impractical?

It's slow, impractical (we're having enough problems with protocols) and may not offer the same security as simply adding more rounds or complexity to existing algorithms. It's likely to take more memory (think embedded or smart card) as well. It may not help at all against many side channel attacks. And as I said, most of the time it's not the algorithm that's the problem. It's the system that it is deployed in that's vulnerable, not the algorithm itself.

I do understand your arguments. The primary problem I have is that since DES can be brute-forced, you need to take a lot more care when using it as a building block. True, if you really understand what you are doing, it can be used securely in a number of ways. Using it as part of a hashing scheme is valid. However, DES is entirely unsuitable for its primary purpose, namely encryption, today. And when doing hashing, it competes against a number of hash functions as well. The article was lacking these limitations.

Nowadays, if the block size is 8 bytes, you can be pretty sure it is DES or triple DES. Keeping the algorithm secret has never worked well in cryptography.

Decrypting single DES is now so easy* that the attacker would probably try it as one of the first things. DES is considered only good enough for real time crypto where offline analysis is not feasible or useful.

So if I encrypt with DES, then run a trivial extra transform (e.g. ROT12 of the DES output), it won't be exactly DES, and just running undes(ciphertext) won't decrypt it.

My point isn't to be annoying. It's that there could be value to DES between parties that know it's "DES + ROT12", or whatever trivial function is added, despite DES itself being without real protection value. Especially since DES is now so fast to run.