The news that Intel will include an AES instruction set in a
range of new products was warmly welcomed by the cryptographic community.
Not only does this provide additional resistance to a range of
side-channel attacks, but it also provides huge performance benefits.
We implemented in assembly all the SHA3 candidates that could
benefit from the new Intel AES instructions set and herein summarize the results.

summary

performance comparison details

Performance estimates for all SHA-3 candidates considered below
come from Nehalem measurements on a Core i7 920 processor clocked
at 2.67 GHz with GNU/Linux Debian running a

.
Since the new Westmere CPUs have been released, we were able to check these estimates with a
Core i5 540M processor clocked at 2.53 GHz, running Ubuntu with a

2.6.34-amd64

kernel. As
one can notice, our estimates were pretty accurate.
For those algorithms that solely use the AES round in its entirety, we
give the number of AES rounds/Byte and also the cost, which is computed as
the number of cycles/AES round. In general terms the lower
the cost, the more efficiently the AES round is being used with respect to AES-NI.

implementation (using 8 xmm registers)∝ implementation and specification mismatch, see here× Grøstl, though not using a full AES round, benefits from AES-NI (figures collected by using the ebash supercop package on our Core i5)

Click on the algorithm name to download its assembly implementation.
See there for instructions on compiling and running the code.

While we tried to implement each of the above algorithms
as efficiently as possible, improvements might be
possible. If you have other implementation ideas that you would
like to share with us or obtain better figures with a
supporting public assembly implementation, do not hesitate to
get in touch, we will update this page accordingly.

methodology

As the Westmere processors (the first with AES-NI) were not yet available at the time, we have proposed a
new methodology that can be used to get an accurate emulation of AES-NI.
Our methodology relies on the fact that Westmere (formerly Nehalem-C) and
Nehalem processors share the same micro-architecture. This means that if we can
find suitable instructions patterns that behave exactly as AES-NI instructions,
we will get very good estimates for the future performance of AES-based SHA-3
candidates on a Westmere processor, but using a Nehalem processor.
Previously, a substitution instruction
was proposed
for Westmere processors. However this substitution does not exhibit the correct
behaviour for Westmere and can give misleading results.
In a companion paper, we provide a particularly accurate replacement
instructions pattern for aesenc and we explain how to derive it from publicly
available information only. The corresponding slides have been presented to the Asiacrypt 2009 conference.

compiling and running

package structure

For each candidate, the package consists in one or two subfolders,
depending on the implemented versions (256- and/or 512-bit versions).
Each subfolder contains three main files:

nist_api.h:
the NIST API, basically from the
candidate's package, sometimes modified for compatibility with
our implementation of the compression
function.

wrap.c:
the main routine with two modes:
a benchmark mode that uses the "replacement instructions"
in the compression function (see our companion
paper) to be run on Nehalem, and an "emulator" mode that uses the AES-NI
genuine instructions to be run on Westmere or inside an emulator on other processors.

compiling

To compile a candidate, a Makefile is present at the root of the
candidate's folder. There are five (or three, depending on the hash length
variants) compilation toggles in the Makefile:

westmere256 and westmere512:
compiles the Westmere AES-NI version of the algorithm. Two binaries are generated:
• A first one to check the implementation correctness with
Intel's SDE emulator
or by genuine execution on Westmere CPUs.
• A second one to perform speed measurements on Westmere.

nehalem256 and nehalem512:
compiles the replacement versions of the algorithm
for benchmarking purposes (estimates on Nehalem).

clean:
removes compiled files.

running

The benchmark binaries are run with a single argument, a number
which represents the repeat factor:

./westmere_binary name_of_file_to_be_hashed

The Westmere correctness checking binaries can be either run on Westmere CPUs, or through
Intel's SDE:

sde -- westmere_binary name_of_file_to_be_hashed

measurements

In order to get stable measurements, it is recommended to shut down all
services. To lower the impact of the remaining process, we also
set the CPU affinity and maximize its priority (this requires root privileges).
Although the number of samples needed to get stable results depends on the
code and the surrounding "OS noise", we found that about
108 samples
give stable and clean results for all the candidates.

Finally, timings may also vary depending on the BIOS settings
for the Core i7 920 and the Core i5 540M. The results given here have been
obtained with:

Genuine frequency and QPI (no overclocking)

Hyperthreading off

Turbo Boost off

compiling with gcc

It is also possible to compile our source codes with gcc insead of icc (gcc supports AES-NI in 4.4 and above).
For this, set the CC variable in the Makefile to read:

CC=gcc -funsafe-math-optimizations

Compiling without the above flag will not give proper results.
This is related to the Nehalem AES-NI replacement nature: the mulps instructions might
indeed generate floating point exceptions at run time (because of uncontrolled overflows),
leading to huge penalties during the handling of the exceptions.
The gcc compilation flag "-funsafe-math-optimizations" changes the mxcsr register so that
the overflow FPU/SSE exceptions are masked, see
gcc documentation.

The icc compiler does not need special care as an equivalent flag "-fp-model fast"
is implied by -O1 or -O3.
Another workaround is to
manually set
the mxcsr register at the beginning of the loop measuring the compression function.

Finally, switching to gcc also implies, for some candidates, to change icc's