Date: Mon, 28 Jun 2010 13:15:55 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: faster DES on Atom
Hi,
I think Dango-Chu should have posted this a year ago, but since he did
not, I am posting it myself (better late than never), just to keep the
"JtR knowledge" in one place.
It turned out that for Atom CPUs it is beneficial to change JtR's
bitslice DES SSE2 assembly code to use plain SSE instructions (not
SSE2), because those are one byte shorter (they lack the 0x66 prefix),
which helps the decoder and/or caching. I had expected this to be the
case on some CPUs, but when working on this code in 2006 I only
encountered CPUs that executed both SSE and SSE2 versions of the code at
the same speed (AMD CPUs) and those that executed the SSE2 code much
faster (Intel CPUs), which is why the decision to go with SSE2 only was
made.
The switch to plain SSE is trivial to make - just replace all
occurrences of five SSE2 instructions with their SSE equivalents in
x86-sse.S and/or x86-64.S. This can be done with the following command
(uses recent GNU sed):
sed -i 's/movdqa/movaps/; s/pandn/andnps/; s/pand/andps/; s/por/orps/; s/pxor/xorps/' x86-sse.S x86-64.S
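A more cautious variant of the same substitution, as a minimal sketch: it matches the five mnemonics as whole words only (so that, e.g., "por" inside a longer word in a comment is left alone) and lets you preview the result as a diff before modifying anything. The helper name sse2_to_sse is made up for this example, and \b is a GNU sed extension. If the sources ever embed these mnemonics inside longer identifiers, this variant will skip them where the command above would not, so compare the two results.

```shell
# Hypothetical helper (not part of JtR): rewrite the five SSE2 mnemonics
# to their plain SSE equivalents, matching whole words only.
# Reads the named files, or stdin if none are given; writes to stdout.
sse2_to_sse() {
    sed -e 's/\bmovdqa\b/movaps/g' \
        -e 's/\bpandn\b/andnps/g' \
        -e 's/\bpand\b/andps/g' \
        -e 's/\bpor\b/orps/g' \
        -e 's/\bpxor\b/xorps/g' "$@"
}

# Preview the change as a unified diff without touching the sources.
for f in x86-sse.S x86-64.S; do
    [ -f "$f" ] || continue
    sse2_to_sse "$f" | diff -u "$f" - || true
done
```

Once the diff looks right, the in-place sed -i command above (or this function with its output redirected to a new file) applies the change.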
According to Dango-Chu's benchmarks, this provides a 10% speedup for
32-bit builds, but only a 0.5% speedup for 64-bit builds, both measured
on an Atom. The numbers could be different on other Atom CPUs, and
they are very different on non-Atom CPUs: for example, the same change
causes a 2x slowdown on a Core i7 (just tried).
With JtR 1.7.6+, you may additionally need to edit this check in x86-64.h:
#if defined(__SSE2__) && \
((__GNUC__ == 4 && __GNUC_MINOR__ >= 4) || __GNUC__ > 4)
#define DES_BS_ASM 0
[...]
The purpose of this check is to disable the assembly code in favor of
gcc-generated SSE2 code when gcc 4.4.0 or newer is being used. To force
the use of plain SSE instead of SSE2, yet compile with gcc 4.4+, you'll
need to override this check. This might not result in any improvement,
though, not even on an Atom, because the speedup from plain SSE for a
64-bit build, as measured by Dango-Chu, was negligible (see above).
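To see whether the quoted condition would be true for your compiler, you can look at gcc's predefined macros. The following is only a sketch: the helper name des_bs_asm_check is made up for this example, it evaluates just the two lines of the condition quoted above (not the elided remainder of the check), and it assumes default compiler flags. Pass the same flags your build uses if they differ.

```shell
# Hypothetical helper (not part of JtR): read `gcc -dM -E` output on stdin
# and evaluate the quoted condition:
#   defined(__SSE2__) &&
#   ((__GNUC__ == 4 && __GNUC_MINOR__ >= 4) || __GNUC__ > 4)
des_bs_asm_check() {
    awk '
        BEGIN { sse2 = 0; major = 0; minor = 0 }
        $1 == "#define" && $2 == "__SSE2__"       { sse2 = 1 }
        $1 == "#define" && $2 == "__GNUC__"       { major = $3 + 0 }
        $1 == "#define" && $2 == "__GNUC_MINOR__" { minor = $3 + 0 }
        END {
            if (sse2 && ((major == 4 && minor >= 4) || major > 4))
                print "match"
            else
                print "no match"
        }'
}

# Dump the predefined macros of the installed gcc and evaluate the condition.
if command -v gcc >/dev/null 2>&1; then
    gcc -dM -E - </dev/null | des_bs_asm_check
fi
```

"match" means the condition holds, so DES_BS_ASM is defined as 0 and the assembly code is not used unless you override the check.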
The original blog post by Dango-Chu, in Japanese:
http://dango.chu.jp/tripper/20090429.html#p01
The same change might also be beneficial on some other CPUs - anyone
with a Pentium M?
Alexander