Friday Sep 05, 2008

Recent discussion about whether AMD's upcoming SSE5 instructions
can be used to significantly accelerate (5X) AES can be seen here.
Given that they don't seem to provide dedicated AES instructions, it's
tricky to see how this can be readily achieved, especially given
that, for AES-CBC, even special-purpose AES instructions only provide
around 6X.

Wednesday Aug 13, 2008

The busstat tool can be a useful performance tuning aid, allowing
one to drill into the load an application is placing on the memory
sub-system. However, one note of caution: on the T2/T2+ the
bank_busy_stalls counters should not be used, as they return erroneous
results that make it look as though the application is causing
bank busy stalls, even when this is not the case. Future revs of
busstat are aware of this, but in the interim, this is a performance
counter to ignore when tuning your app.

Friday Jul 11, 2008

On a recent rev of Nevada, I just ran some ECC (elliptic curve
cryptography) microbenchmarks, comparing an UltraSPARC T2 using the HW
crypto accelerators and a 3GHz Xeon:

These numbers are for ECDSA operations. The Xeon #s are from
openssl speed (optimized compilation), while the T2 #s are generated
by interacting directly with the framework. These numbers are for
binary curves using Galois field operations.

While cryptography is typically viewed as computationally
intensive (and so less well suited to CMT processors), software
implementations of common cryptographic algorithms can be readily
optimized to excel on CMT processors. Current software
implementations have been optimized for traditional processors, with
multiple lookup tables sized to all fit in a processor's small level-1
caches. It is this use of multiple small tables that leads to the
high computational overheads associated with most cryptographic
implementations -- due to the significant arithmetic operations
needed to manipulate access to the tables and recombine the results
from the various tables.

For example, consider the Kasumi
algorithm, which is essential to 3G mobile telephony. In Kasumi, a
block is 8-bytes, the key is 128-bits (although it is expanded to a
1024-bit key schedule before use), and processing consists of 8
rounds per block. While a variety of operations are performed per
block, the most costly operation is termed FI. It operates on in
and subkey, which are two-byte variables, using S9, a 512-element
lookup table, and S7, a 128-element lookup table. This operation is
performed three times per round, for a total of 24 times per block.
Each FI operation requires 22 instructions (for SPARC), for a total
of 528 (24 x 22) FI-derived instructions per block. Given the
abundance of logical and shift operations, it is apparent that
superscalar processors will perform this function very well, with an
Instructions Per Cycle (IPC) of 2.5 or more. In contrast, the Niagara
single-strand IPC is around 0.65. Further, due to the compute
intensive nature of the code, the single Niagara strand uses around
two thirds of the processor core's issue resources. As a result,
performance does not scale as additional Kasumi threads run on a
core.

To overcome this
problem, the implementation can be optimized to radically reduce the
instruction count. A reduction in instruction count may be achieved
by replacing large parts of the FI function using a large lookup
table. In the original Kasumi code, the 16-bit elements are divided
into two smaller elements, one 7-bits and one 9-bits. These smaller
elements are processed independently and the results combined. While
this ensures that the lookup tables are small, significant logical
and arithmetic operations are required to split the 16-bit elements
and later recombine the smaller 7-bit and 9-bit elements back into
the 16-bit elements. Significant computational saving may be achieved
by processing an entire 16-bit element at once, using large lookup
tables, as shown below:

t0 = LT0[in];
t0 = t0 ^ subkey;
in = LT1[t0];

The new lookup tables (LT0 and
LT1) are now much larger, each being composed of 65536 2-byte
elements (128 KB per table). Note that the lookup tables are
constant, may be precomputed, and are independent of the keys. Using
this approach, the FI function now only requires five instructions,
a more than fourfold reduction from the original implementation.
Further note that in both the optimized and the original code, the
lookup table accesses are dependent and cannot be performed in
parallel or prefetched in advance.
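
The equivalence between the split and the table-driven forms of FI can be checked with a small C sketch. The split version below is assumed to follow the structure of the FI routine in the public Kasumi specification (3GPP TS 35.202). The S7 and S9 contents used here are arbitrary stand-ins, not the real Kasumi S-boxes, and all helper names (half_round, build_tables, fi_split, fi_table, count_mismatches) are hypothetical. LT0 and LT1 are built by precomputing each half of FI for all 65536 possible inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in S-boxes. These are NOT the real Kasumi tables; they only
 * have the right shapes (7-bit and 9-bit entries). */
static uint16_t S7[128];
static uint16_t S9[512];

/* Combined tables: 65536 two-byte entries each, i.e. 128 KB per table. */
static uint16_t LT0[65536];
static uint16_t LT1[65536];

/* Fill S7 and S9 with arbitrary placeholder values of the right width. */
static void init_placeholder_sboxes(void)
{
    for (unsigned i = 0; i < 128; i++)
        S7[i] = (uint16_t)((i * 5u + 3u) & 0x7F);
    for (unsigned i = 0; i < 512; i++)
        S9[i] = (uint16_t)((i * 7u + 11u) & 0x1FF);
}

/* One half of FI: two small-table lookups plus the mixing XORs.
 * The result is packed with the 7-bit part above the 9-bit part. */
static uint16_t half_round(uint16_t nine, uint16_t seven)
{
    assert(nine < 512 && seven < 128);
    nine  = (uint16_t)(S9[nine] ^ seven);
    seven = (uint16_t)(S7[seven] ^ (nine & 0x7F));
    return (uint16_t)((seven << 9) | nine);
}

/* Split form: the 16-bit value is handled as 9-bit and 7-bit pieces. */
static uint16_t fi_split(uint16_t in, uint16_t subkey)
{
    uint16_t nine  = (uint16_t)(in >> 7);
    uint16_t seven = (uint16_t)(in & 0x7F);

    nine  = (uint16_t)(S9[nine] ^ seven);
    seven = (uint16_t)(S7[seven] ^ (nine & 0x7F));

    seven ^= (uint16_t)(subkey >> 9);
    nine  ^= (uint16_t)(subkey & 0x1FF);

    nine  = (uint16_t)(S9[nine] ^ seven);
    seven = (uint16_t)(S7[seven] ^ (nine & 0x7F));

    return (uint16_t)((seven << 9) + nine);
}

/* Precompute both halves of FI for every 16-bit input. */
static void build_tables(void)
{
    for (uint32_t v = 0; v < 65536; v++) {
        /* LT0 sees the raw input: 9-bit part in the high bits. */
        LT0[v] = half_round((uint16_t)(v >> 7), (uint16_t)(v & 0x7F));
        /* LT1 sees the packed intermediate: 7-bit part in the high bits. */
        LT1[v] = half_round((uint16_t)(v & 0x1FF), (uint16_t)(v >> 9));
    }
}

/* Table-driven form: two dependent loads and one XOR. */
static uint16_t fi_table(uint16_t in, uint16_t subkey)
{
    uint16_t t0 = LT0[in];
    t0 ^= subkey;
    return LT1[t0];
}

/* Count inputs for which the two forms disagree under a given subkey. */
static unsigned long count_mismatches(uint16_t subkey)
{
    unsigned long bad = 0;
    for (uint32_t v = 0; v < 65536; v++)
        if (fi_split((uint16_t)v, subkey) != fi_table((uint16_t)v, subkey))
            bad++;
    return bad;
}
```

After init_placeholder_sboxes() and build_tables() have run, count_mismatches(k) returns 0 for any subkey k. The single XOR works because XORing the packed intermediate with the 16-bit subkey is the same as XORing its 7-bit and 9-bit halves separately.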

The lookup tables that once fitted
in the L1 cache are now much larger and will now largely reside in
the L2 cache. The instruction count reduction has come at the
expense of increased memory stalls, but here we are playing to the
strengths of a CMT processor. At first glance, it would appear that
the performance of the code will remain largely unchanged, having
traded a reduced instruction count for increased memory stalls.
However, this optimization technique is beneficial for at least two
reasons. First, MT (multithreading) performance is improved. In the
initial implementation, a single strand is capable of consuming
almost all of a processor core's resources, so as additional VT/SMT
strands are leveraged, they rapidly start to deprive the other
strands of resources, and aggregate core performance improves very
little.

In contrast, in the optimized version, the strands spend most of
their time stalled waiting for accesses to the lookup tables to
complete and consume a much smaller fraction of a processor core's
resources. As a result, as the number of strands is increased,
performance scales almost linearly. Indeed, for Niagara, per-core
Kasumi performance is around 8 times the performance of a single
strand, and per-chip Kasumi performance is close to 64X
single-strand performance. Moreover, single-core Kasumi performance
is around 1.3X the performance of a single core of a 3GHz Xeon
processor.
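
The scaling argument can be made concrete with a toy model. This is an illustrative assumption, not a measurement: each strand is taken to need a fixed fraction of the core's issue slots to run at its single-strand speed, and the core saturates once the strands together demand all of the slots. The function name core_speedup and its parameters are hypothetical.

```c
#include <assert.h>

/* Toy model (assumption): a strand needs `issue_fraction` of the core's
 * issue slots to run at full single-strand speed. Aggregate speedup over
 * one strand grows linearly with the strand count until the slots are
 * exhausted, and is capped at 1 / issue_fraction after that. */
static double core_speedup(int strands, double issue_fraction)
{
    assert(strands > 0);
    assert(issue_fraction > 0.0 && issue_fraction <= 1.0);

    double demand = strands * issue_fraction;
    return demand <= 1.0 ? (double)strands : 1.0 / issue_fraction;
}
```

With the two-thirds figure quoted for the original code, core_speedup(4, 2.0 / 3.0) stays near 1.5 however many strands are added. Linear scaling to 8X with 8 strands, as reported for the optimized code, would correspond in this model to each strand needing no more than one eighth of the issue slots.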

Monday Jun 30, 2008

I've been gradually expanding the crypto wiki (which can be found
here);
adding additional info and some code examples. Please let me know
what additional information would be useful to add, how the wiki
could be improved, and even add your own thoughts....

Monday Jun 23, 2008

In my recent CommunityOne
Microparallelism presentation, one of the case studies discusses how
to convert high-ILP code on superscalar processors into TLP
implementations on CMT processors. The case study is discussed with
reference to the SPARC implementation of SHA-1, which I wrote several
years ago. The code, tuned for sun4u processors, can actually be
found in OpenSolaris here.
The message expansion portion of the SHA-1 computation is performed
in parallel with the compression function portion using the VIS
instructions. The SIMD nature of the VIS instructions is not
leveraged, merely the fact that they allow integer operations to be
performed on the FP pipelines. As a result, the IPC on an UltraSPARC
IV+ processor is increased from around 2 to almost 4,
improving performance by over 1.7X.

On CMT processors, such as T2, this doesn't deliver optimal
performance. However, given the low inter-thread synchronization
costs, one can consider performing these two portions of the SHA-1
computation using two threads, one for message expansion and one for
the compression function.
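
To make the two portions concrete, here is a minimal single-threaded C sketch that keeps message expansion and compression as separate stages with a clean hand-off (the 80-word schedule w). It is not the OpenSolaris sun4u code and does not use VIS or threads; the function names (sha1_expand, sha1_compress, sha1_short) are hypothetical, and the constants are the standard SHA-1 ones. In a two-thread version, one strand would be assumed to run the expansion stage and the other the compression stage.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* Stage 1: message expansion. Depends only on the 64-byte input block. */
static void sha1_expand(const uint8_t block[64], uint32_t w[80])
{
    for (int t = 0; t < 16; t++)
        w[t] = ((uint32_t)block[4 * t] << 24) |
               ((uint32_t)block[4 * t + 1] << 16) |
               ((uint32_t)block[4 * t + 2] << 8) |
               (uint32_t)block[4 * t + 3];
    for (int t = 16; t < 80; t++)
        w[t] = rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1);
}

/* Stage 2: compression. Consumes w[] and updates the chaining state h. */
static void sha1_compress(uint32_t h[5], const uint32_t w[80])
{
    uint32_t a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];

    for (int t = 0; t < 80; t++) {
        uint32_t f, k;
        if (t < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }

        uint32_t tmp = rotl32(a, 5) + f + e + k + w[t];
        e = d; d = c; c = rotl32(b, 30); b = a; a = tmp;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* Hash a message shorter than 56 bytes, which fits in one padded block. */
static void sha1_short(const uint8_t *msg, size_t len, uint32_t h[5])
{
    uint8_t block[64] = {0};
    uint32_t w[80];

    assert(len < 56);
    memcpy(block, msg, len);
    block[len] = 0x80;                       /* padding marker */
    uint64_t bits = (uint64_t)len * 8;       /* big-endian bit length */
    for (int i = 0; i < 8; i++)
        block[63 - i] = (uint8_t)(bits >> (8 * i));

    h[0] = 0x67452301u; h[1] = 0xEFCDAB89u; h[2] = 0x98BADCFEu;
    h[3] = 0x10325476u; h[4] = 0xC3D2E1F0u;

    sha1_expand(block, w);    /* candidate work for one strand */
    sha1_compress(h, w);      /* candidate work for a second strand */
}
```

Because sha1_expand reads only the message block, the expansion of block i+1 does not depend on the compression of block i. A two-thread design could therefore let the expanding strand run ahead and hand completed schedules to the compressing strand, which is where low inter-thread synchronization cost matters.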

On the UltraSPARC T1, T2 and T2plus
processors, there is one cryptographic accelerator per core, such
that an 8-core processor provides 8 accelerators. The algorithms
supported by these accelerators vary with processor and are
illustrated in the following table:

Algorithm                  UltraSPARC T1    UltraSPARC T2/T2Plus

Public-key algorithms
  RSA, DSA, DH             yes              yes
  ECC, ECDSA, ECDH         X                yes

Symmetric algorithms
  RC4                      X                yes
  DES, 3DES                X                yes
  AES-{128,192,256}        X                yes

Cryptographic hashes
  MD5                      X                yes
  SHA-1                    X                yes
  SHA-224/256              X                yes

(X = not supported)

The public-key operations are performed
by the accelerator's modular arithmetic unit (MAU), while symmetric cipher
and cryptographic hash operations are performed by the accelerator's
cipher and hash unit (CHU). The UltraSPARC T1 accelerators are
composed of just a MAU, while the UltraSPARC T2/T2plus accelerators
have both MAU and CHU, both of which can operate in parallel. The
accelerators operate at the core frequency (in parallel with the
core) and are capable of delivering cryptographic performance that is
typically an order of magnitude better than can be achieved on
traditional processors in software, as is illustrated in the
following table:

Algorithm      UltraSPARC T1 (1.2GHz)         UltraSPARC T2/T2Plus (1.4GHz)

RSA-1024       20,000 sign                    37,000 sign
               operations/sec/chip (8-core)   operations/sec/chip (8-core)
AES-128-CBC    X                              44Gb/s/chip (8-core)
SHA-1          X                              32Gb/s/chip (8-core)

This article describes how to code your
application such that it can leverage these hardware accelerators.
Many important applications will already leverage the UltraSPARC
hardware accelerators, either directly out-of-the-box or with minimal
configuration. These include the Sun Studio webserver, the Apache
webserver, KSSL and IPsec, to name but a few. More details of how to
configure these applications are provided in a Sun cryptographic
blueprint [1].

Using the UltraSPARC hardware cryptographic accelerators

Access to the cryptographic
accelerators is controlled by the Solaris Cryptographic Framework.
For non-privileged applications, access is via the userland
cryptographic framework (UCF), while for kernel modules (such as KSSL
or IPsec) access is via the kernel cryptographic framework (KCF).
This article focuses on the userland cryptographic framework.

The remainder of this article focuses
on how to interact with the UCF directly and indirectly via JCE,
OpenSSL and NSS.

Direct interaction with UCF

For PKCS#11 compliant applications,
libpkcs11.so is the gateway to the UCF, and it's just a simple matter
of linking against this library [located in /usr/lib]. Given the
fairly widespread use of the PKCS#11 interface, especially with
respect to traditional off-chip cryptographic accelerators (such as
Sun's SCA6000 card), many applications already leverage PKCS#11. If
an application doesn't already use the PKCS#11 interface, it is
pretty straightforward to modify the application, with documents
showing example implementations readily available [3].

Offload via OpenSSL

If the application uses OpenSSL for its
cryptographic requirements (and many do), access to the accelerators
can be achieved by using a version of OpenSSL that has been modified
to support the PKCS#11 engine. A patched version of OpenSSL is
supplied with Solaris 10 and is located in /usr/sfw/lib, so
applications can be compiled and linked against it.

For operations that are to be
offloaded, it is necessary to restrict use to the EVP_ functions and
explicitly indicate the use of the PKCS#11 engine; something like the
following works for bulk ciphers (the process for RSA is similar):

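#include <openssl/engine.h>
#include <openssl/evp.h>

ENGINE *e;
EVP_CIPHER_CTX ctx;

ENGINE_load_builtin_engines();
e = ENGINE_by_id("pkcs11");       /* NULL if the engine is unavailable */
ENGINE_set_default_ciphers(e);    /* route EVP cipher calls to the engine */

EVP_CIPHER_CTX_init(&ctx);
EVP_EncryptInit(&ctx, EVP_des_cbc(), key, iv);
EVP_EncryptUpdate(.....);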

PKCS#11 engine patches are available
from OpenSSL.org for a number of different versions of OpenSSL, if
the version of OpenSSL that ships with Solaris isn't suitable [4].

Offload via JCE

For applications that utilize the Java
Cryptography Extension (JCE), the application should simply be
configured to utilize the SunPKCS11-Solaris provider. Accordingly, in
order for applications to use the hardware accelerators
automatically, it is just necessary to ensure that
sun.security.pkcs11.SunPKCS11 is configured as the first
provider in the $JAVA_HOME/jre/lib/security/java.security
file.

The SunPKCS11-Solaris provider can also be explicitly selected as
follows:

String provider = "SunPKCS11-Solaris";
Cipher aescipher = Cipher.getInstance("AES/ECB/NoPadding", provider);

It should be noted that the
SunPKCS11-Solaris provider currently only offloads a subset of the
chaining modes supported by the hardware, so make sure that the
chaining mode and padding mode are supported [5]. The modes supported
by the hardware accelerators are illustrated in the following table:

Cipher      Supported chaining modes

AES         ECB, CBC, CTR
DES/3DES    ECB, CBC, CFB64

Offloading via NSS

In order for NSS to use the hardware
cryptographic accelerators, the Solaris cryptographic framework
should be added as a provider for NSS. This is achieved by modifying
the appropriate NSS security databases. For example, firefox can be
set up in this way to offload RSA operations to the hardware.

When adding the framework, the
use of the mechanism option indicates that the Solaris Cryptographic
Framework should be the default provider for RSA operations [6].

Observability

When operations are submitted to the
cryptographic framework, the cryptographic framework will, as
appropriate, route processing for these operations to the Niagara
cryptographic provider (ncp) device driver for public-key
operations, and the Niagara-2 cryptographic provider (n2cp) device
driver for symmetric cipher and cryptographic hash operations.
These device drivers then perform the actual offload to the hardware
accelerators and return the results to the framework. The interaction
between these drivers and the cryptographic framework is controlled
via cryptoadm.

kstat can be used to
provide insight into the cryptographic operations that ncp and n2cp
are handling, as follows:

kstat -m ncp | less

kstat -m n2cp | less

Additionally, cputrack can
be utilized to determine the activity of the hardware accelerators
directly (use cputrack -h to determine which counters to
track).

Concluding thoughts

Cryptographic processing overheads are
finding their way into an ever wider array of applications as
security becomes ever more important. By providing on-chip hardware
cryptographic accelerators, the UltraSPARC processors can vastly
reduce these overheads, and in many situations enable respectable
performance even when operating securely.

Via the Cryptographic Framework, Solaris
provides a simple way for applications to leverage the
benefits of the UltraSPARC hardware accelerators, while continuing to
ensure application portability.

Wednesday May 21, 2008

I've started looking at how to leverage
the UltraSPARC T2 hardware cryptographic accelerators to improve
MySQL performance and there are a couple of interesting
opportunities:

SSL is used to secure
communication between a potentially remote MySQL client and the MySQL
server. One option is to modify the appropriate SSL libraries to use
the T2 hardware accelerators where appropriate, which is pretty
straightforward. Another option that I'm currently investigating is
trying to use the Solaris Kernel SSL proxy (KSSL). KSSL already uses
the UltraSPARC T2 HW crypto accelerators, and so could be a very
elegant solution to offloading MySQL SSL processing.

A variety of operations are
supported by MySQL to secure database information, such as
aes_encrypt() and des_decrypt(). Support for DES and SHA1 is also
provided. Again, it is fairly straightforward to modify this code
to use the T2 hardware accelerators where appropriate.

Tuesday May 20, 2008

Just been playing with crypto on a 2-way UltraSPARC T2plus
(Victoria Falls) system. The system, with 16-cores running at 1.2GHz
was running Nevada 89, and my crypto microbenchmarks scaled very
nicely. I was able to hit HW peak performance (~6.8GB/s/system with
AES-128-CBC), with suitable object sizes and a bunch of threads. More
details as time allows.

Wednesday May 14, 2008

I've just been experimenting with the Java
Cryptography Extension (JCE) on the UltraSPARC T2 processor and it
is important to remember which algorithms/modes/padding are supported
for offload to the cryptographic hardware. While the UltraSPARC T2
processor supports most common chaining modes, offloads from JCE
occur via the SunPKCS11-Solaris provider, for which the supported
algorithms/modes/padding are somewhat more restrictive; they are
listed here.
If an unsupported mode is specified, the operation will not be
offloaded to the HW, but will be performed in software.

Thursday May 08, 2008

As stated in an earlier entry, when running on an UltraSPARC T2
processor, applications using the Java cryptographic extensions (JCE)
should (when applicable) automatically leverage the on-chip
cryptographic accelerators.

Following a recent conversation with a Java Guru, you should check
the following, if you experience problems:

Java on Solaris automatically sets SunPKCS11-Solaris
(which calls into the Solaris Crypto Framework) as the default
security provider, so you need to do nothing.

This has been the case
since some version of J2SE 5.0. You can check the
${java.home}/lib/security/java.security file, which should contain a
line configuring the SunPKCS11 provider.

Wednesday Apr 30, 2008

I typically witter on about crypto
performance at the microbenchmark level, but I was recently browsing
the SPECweb2005 results and I was impressed to see how the T2 performs,
especially on the Banking workload, which is 100% HTTPS.

Pretty impressive! So a single-socket
UltraSPARC T2 processor provides equivalent performance to 4-socket
x64 systems containing quad-core processors! On a per-socket basis, T2
outperforms the competition by over 2.7X!

Now, this performance leadership is not
all down to the HW crypto support – I'm sure the on-chip NICs,
and abundance of threads help somewhat too. However, the
cryptographic overheads associated with HTTPS are pretty significant
– RSA ops for session establishment and then RC4 and MD5 (these
are the algorithms used for SPECweb2005 anyway) operations to secure
and authenticate the subsequent traffic. In fact, looking at the
following figures:

Figure 1: Relative costs in an HTTPS transaction for different file sizes.
Referenced from here

Figure 2: Typical breakdown of overheads for SPECweb2005 banking

it is apparent that a significant
proportion of the total application-level overheads are associated
with cryptographic processing. It's therefore not surprising that
providing HW support to accelerate cryptographic processing provides
a significant performance advantage to the UltraSPARC T2 processor on
SPECweb2005 banking.

It's nice to see that the good microbenchmark numbers actually translate into
significant gains at an application level.

Monday Apr 14, 2008

The other day I was looking for a C
code example illustrating how to leverage the softtoken key store
when directly interacting with the Solaris crypto framework. There's
substantial documentation available but I couldn't find a basic
example. So here's what I concocted:

where the first operation updates the
passphrase required to access the keystore. If the keystore doesn't
exist, the keystore is first created. The default passphrase is
“changeme”. The second operation creates a 128-bit AES
key and installs it in the keystore – the label associated with
the key is “test_key”. The third operation displays the
contents of the keystore, so it is possible to confirm that the key
has been created correctly.

2. Use the AES key in the keystore from
an application that is directly using the Solaris crypto framework:

About

Dr. Spracklen is a senior staff engineer in the Architecture Technology Group (Sun Microelectronics), which is focused on architecting and modeling next-generation SPARC processors. His current focus is hardware accelerators.