Overview

SHA1 is unusual among message digests in that it has
a large amount of parallelism available. The paper "SHA:
A Design for Parallel Architectures?" shows that with a typical
32-bit RISC instruction set you would need seven single-cycle ALUs and
26 registers to achieve the minimum critical path length of 160 cycles
per 64-byte block (2.5 cycles per byte).

Analysis of the SHA1 specification shows that a portion of the
computation is very well suited to a SIMD architecture such as IA32
with SSE2. Taking advantage of this makes it possible to come
significantly closer to the minimum critical path described in the
paper above.

Performance Data

Dec 18, 2004 NOTE: This data is out of date! For updated data, see
this note.

Using a specialized timing harness which iterates over the same buffer
continuously, the following cycles per byte results are achieved (lower
is better) on four IA32 processors supporting SSE2:

message size   Xeon/P4   Pentium-M   Opteron   Efficēon
64             16.2      13.9        9.16      6.80
256            14.1      13.4        8.73      6.09
1024           13.6      13.4        8.65      5.95
8192           13.7      13.3        8.59      5.90

For reference, here is SHA1 data from OpenSSL 0.9.7c. Note that this
is not exactly comparable to the above data due to API differences
(especially in the 64-byte message case!), but is indicative of the
general performance of integer-only implementations of SHA1:

message size   Xeon/P4   Pentium-M   Opteron   Efficēon
64             67.5      69.2        42.2      48.3
256            31.1      26.3        16.9      19.8
1024           22.1      15.6        10.6      12.7
8192           19.5      12.5        8.73      10.7

For a real world, in-system test, sha1sum from GNU coreutils 5.0
was modified with this new implementation of SHA1. The input is 256MB
of /dev/zero, and the numbers are in cycles per byte (lower is better):

The SHA1 message schedule is defined (for t = 16..79) by

    W[t] = rol(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1)

Unrolled four steps at a time this is:

    W[t]   = rol(W[t-3] ^ W[t-8]  ^ W[t-14] ^ W[t-16], 1)
    W[t+1] = rol(W[t-2] ^ W[t-7]  ^ W[t-13] ^ W[t-15], 1)
    W[t+2] = rol(W[t-1] ^ W[t-6]  ^ W[t-12] ^ W[t-14], 1)
    W[t+3] = rol(W[t]   ^ W[t-5]  ^ W[t-11] ^ W[t-13], 1)

Note that if W is aligned to 16 bytes, and t
is a multiple of 4, then this has a very clear mapping onto a 128-bit
SIMD architecture: each of the 5 columns of W above
is 128 bits of data. The only glitch in calculating all 4 of
W[t]..W[t+3] in parallel is the last line, which itself
depends on W[t].

In the implementation, the first four steps are calculated in parallel
using SIMD hardware. The fifth step is also calculated using SIMD
hardware even though it represents very little parallelism.
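One way to realize this is a zero-then-fix-up scheme: compute all four
lanes with 0 substituted for the unknown W[t], then patch the last lane.
The sketch below is a scalar model of that idea, under the assumption that
this matches the implementation's approach; the real code uses 128-bit SSE2
registers, and the helper names here are illustrative, not from the source:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t rol(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* Compute W[t..t+3] "vector-wise": XOR the four 128-bit columns,
 * substituting 0 for the not-yet-known W[t] in the last lane,
 * rotate all four lanes left by 1, then patch the last lane with
 * the contribution of the freshly computed W[t]. */
static void sched4(uint32_t W[80], int t)
{
    uint32_t v[4];
    for (int i = 0; i < 4; i++) {
        uint32_t wt3 = (i < 3) ? W[t - 3 + i] : 0;   /* lane 3: W[t] unknown */
        v[i] = rol(wt3 ^ W[t - 8 + i] ^ W[t - 14 + i] ^ W[t - 16 + i], 1);
    }
    v[3] ^= rol(v[0], 1);   /* fix up the W[t] dependence in the last lane */
    memcpy(&W[t], v, sizeof v);
}
```

The fix-up works because rol distributes over XOR: the true W[t+3] is
rol(W[t], 1) XOR the value already in the lane, and v[0] is exactly W[t].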

Define prep[t] = K[t] + W[t]. Since the
K[t] are 32-bit constants, it should be clear that we can
calculate prep[t]..prep[t+3] in parallel once we have
calculated W[t]..W[t+3].
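In scalar form this step looks like the sketch below (the K values are the
standard SHA1 round constants; in the real code the four additions would be
a single SSE2 packed add, and `prep4` is an illustrative name, not from the
source). Because the constant changes only at t = 20, 40, and 60 -- all
multiples of 4 -- one constant serves all four lanes:

```c
#include <assert.h>
#include <stdint.h>

/* Standard SHA1 round constants, one per 20-step group. */
static const uint32_t K[4] = { 0x5A827999u, 0x6ED9EBA1u,
                               0x8F1BBCDCu, 0xCA62C1D6u };

/* prep[t+i] = K[t+i] + W[t+i] for i = 0..3.  For t a multiple of 4 the
 * group boundaries (20, 40, 60) never split a 4-lane block, so a single
 * broadcast constant K[t/20] is correct for all four lanes. */
static void prep4(uint32_t prep[4], const uint32_t W[], int t)
{
    for (int i = 0; i < 4; i++)
        prep[i] = K[t / 20] + W[t + i];
}
```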

The calculation of prep[t] is all that is performed in the
SIMD hardware. The remainder of the calculation is performed in ALU
hardware as it is in a traditional SHA1 implementation. The dataflow
for a single block can be envisioned like so:

message block -> SIMD -> local stack -> ALU -> update hash

All 80 steps are completely unrolled, and the SIMD computation
precedes the ALU computation by 12 steps so that prep[t]
is available (on the stack) when the integer computation requires it.
This "software pipelining" is continued around the loop back edge --
when there are more blocks to be processed, the SIMD portion of the
pipeline begins prior to the hash update of the previous block completing.

There are other peephole optimizations present in the code such
as refactoring some f_t functions, and replacing some
SSE2 operations with others to improve parallelism on various SSE2
implementations.
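As an illustration of the kind of f_t refactoring meant here (a standard
bitwise identity; whether this exact rewrite is the one used is not stated
in the text): the round-1 choice function (b AND c) OR (NOT b AND d) can be
rewritten as d XOR (b AND (c XOR d)), saving one operation and the NOT:

```c
#include <assert.h>
#include <stdint.h>

/* Round-1 "choice" function, straight from the SHA1 spec: 4 operations. */
static uint32_t f1_spec(uint32_t b, uint32_t c, uint32_t d)
{
    return (b & c) | (~b & d);
}

/* Refactored form: 3 operations, no NOT needed. */
static uint32_t f1_fast(uint32_t b, uint32_t c, uint32_t d)
{
    return d ^ (b & (c ^ d));
}
```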

The implementation is in C using Intel's SSE2 intrinsics, and the
Intel C compiler is used to compile the code. There are over 1200 x86
instructions in the resulting code, which would make hand scheduling
a nightmare. In theory the code can be compiled by a modern GCC;
however, it trips an internal compiler error (GCC bug 11627).

The end result is very well suited to a wide-issue processor such
as Transmeta's
Efficēon. Efficēon has two 64-bit memory units, two 32-bit
ALUs, and two 64-bit SIMD units (plus others not relevant here), and it
can issue to all 6 in every cycle. On average 3.5 pipelines are fed
an instruction in every cycle of this SHA1 implementation, and many
cycles issue to all 6 pipelines. There are several unimplemented
enhancements possible in Efficēon's dynamic translator (known as
CMS) which would increase the average issue width beyond 4 on this
workload.

Even after this SIMD "factoring", this SHA1 implementation remains
ALU-bound. In the Efficēon translated code there are approximately
340 instructions issued to each of the ALUs, while only 250 are issued
to each of the SIMD units and 100 to each memory unit. However, no
further portions of the computation have been identified which could
be moved to the SIMD side of the machine.

There remain 160 integer rotation instructions, and of the four CPUs
mentioned earlier, only the Opteron has more than one integer shift unit.
The Xeon is penalized by an unpipelined integer rotation instruction --
the shift unit is busy for 4 cycles per rotation, which yields
a 160*4 = 640 cycle minimum critical path per 64-byte block and accounts
for 10 of its ~14 cycles/byte result.

The Xeon is further limited by a 3 μop per clock issue width, and by
issue port contention between the integer shift unit and SIMD operations.
The Pentium-M has similar issue width and port contention restrictions.
This code is somewhat unlike typical code in that it has both integer
and SIMD operations available in nearly every clock -- these
processors favour instruction mixes which are heavily weighted toward
integer or toward SIMD/FP, but not both. Further details can be found
in the IA-32 Intel Architecture Optimization Reference Manual.

Future Work

I'd like to make a patch against OpenSSL; however, I'm presently
wrestling with three problems:

There are few compilers which can do the code justice. Intel C++
Compiler versions 7.0 and 7.1 do a fine job with the code -- v7.1 was used
for the results presented above. Intel made a rather unfortunate decision
in v8.x to "optimise" away all FPU/SIMD->ALU data movement using
MOVD/PSRLDQ instructions. This is an attempt to hide the long latency
of such data movement on the P4 -- but their compiler does not take
into account that my code is already software pipelined to hide
this latency. There is no way around this, and it reduces performance
by at least 30%.

GCC versions 3.4 and later can compile this code, but do not generally
produce code which is better than the integer-only implementation. There
appear to be many reasons for this inefficiency, but so far only a couple
of PRs have been opened to address it -- I'm presently waiting for the
latest -fnewra support to be merged forward onto the mainline.

In theory it should be possible to use a single code path for
inputs of either endianness with essentially no penalty. (Note that the
numbers quoted above include a byte swap of the entire input, which is
required because SHA1 is big-endian and IA32 is little-endian.) This
theory is based on the earlier observations about how many ALU operations
there are relative to SSE2 operations. In practice I've managed to make
a merged codebase like this, but it underperforms the always-byte-swap
version. The problem appears to be that the compiler does not mix the
code streams well enough for the P4, Pentium-M, and K8 (Efficēon doesn't
mind so much because it reschedules regardless).
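The byte swap amounts to loading each message word big-endian. A portable
scalar version is shown below as a sketch; the actual code presumably does
the equivalent swap in the SSE2 path, 16 message bytes at a time, though the
exact instruction sequence is not described here:

```c
#include <assert.h>
#include <stdint.h>

/* Load one big-endian 32-bit message word regardless of host endianness.
 * On a little-endian machine such as IA32 this is where the byte swap
 * happens; on a big-endian machine it compiles down to a plain load. */
static uint32_t load_be32(const uint8_t *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
         | (uint32_t)p[2] << 8  | (uint32_t)p[3];
}
```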

OpenSSL's API passes the input to the digest function
one block at a time. For larger messages it is desirable to begin
the SSE2 processing of the next block before finishing the ALU
processing of the current block.