Introduction

The forthcoming Intel Pentium 4 processor (code-named Willamette) will
feature SSE2, a new set of SIMD instructions
that improve the capabilities of both the MMX
and SSE instruction sets. The key benefits of
SSE2
are that MMX instructions can work on 128-bit data blocks, and that SSE
instructions now support 64-bit floating-point values. Extending the width of
MMX parallel computations
puts Intel’s integer SIMD processing capabilities on a par with
Motorola’s
AltiVec,
used in the Macintosh G4 series. In the next section we will analyze
the
performance benefits of doubling the data block size and the effort
required
to turn old MMX code into shiny new SSE2 code.

The original SSE
instruction set worked
on 32-bit floating-point data elements, processing 4 of them in
parallel
(4 x 32 = 128 bits). This approach is well suited to 3D game engines,
which perform lots of matrix by vector multiplies: the SSE multiplier
can
multiply a 4-element vector by a row of a 4x4 matrix with a single
instruction,
yielding an effective 4x speed-up. The benefits of SSE-accelerated
geometry
setup are likely to fade in the near future, thanks to the new
generation
of graphics boards that feature hardware-assisted triangle setup and
lighting,
but there is a long list of multimedia and scientific applications that
could be greatly enhanced by parallel floating-point computations.
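
To make the row-by-vector multiply concrete, here is a minimal C sketch using the SSE compiler intrinsics from <xmmintrin.h>. The function names dot4 and mat4_mul_vec4 are hypothetical, the matrix is assumed to be stored row-major, and a scalar fallback is used when the compiler does not target SSE. Note that a single _mm_mul_ps produces the four products for one row; the products still have to be summed, which takes a few extra shuffle and add instructions.

```c
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */
#endif

/* Dot product of one matrix row with a 4-element vector (hypothetical helper). */
static float dot4(const float *row, const float *v)
{
#if defined(__SSE__)
    /* One multiply instruction yields all four products: p = row * v. */
    __m128 p = _mm_mul_ps(_mm_loadu_ps(row), _mm_loadu_ps(v));

    /* Horizontal sum of the four lanes of p. */
    __m128 shuf = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1)); /* p1 p0 p3 p2 */
    __m128 sums = _mm_add_ps(p, shuf);       /* lane 0: p0+p1, lane 2: p2+p3 */
    shuf = _mm_movehl_ps(shuf, sums);        /* lane 0: p2+p3 */
    sums = _mm_add_ss(sums, shuf);           /* lane 0: p0+p1+p2+p3 */
    return _mm_cvtss_f32(sums);
#else
    /* Scalar fallback for compilers that do not target SSE. */
    return (row[0] * v[0] + row[1] * v[1]) + (row[2] * v[2] + row[3] * v[3]);
#endif
}

/* out = m * v, with m a row-major 4x4 matrix (assumed layout). */
void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = dot4(&m[4 * i], v);
}
```

A caller would pass a 16-float matrix, a 4-float vector and a 4-float output buffer. The unaligned loads keep the sketch safe for arbitrary pointers; real engines would more likely align their data to 16 bytes and transform many vertices per call.
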
Current
RISC processors, such as the Digital Alpha, still offer better FP
performance
than x86 CPUs, even Athlons at 1 GHz, and therefore they are the ideal
platform to run scientific simulations. As this kind of software often
performs computations on large data sets in a regular order, we can
reasonably
state that SSE instructions could be successfully applied and close the
performance gap between x86 and RISC processors.

Unfortunately, some of these applications require the
extra precision of 64-bit floating point, which current SSE instructions do not support.
The
lack of 64-bit support should not be blamed on Intel designers: the
main
target for SSE is mainstream multimedia software, especially 3D games,
where the precision difference between 32-bit and 64-bit FP
computations
would be hardly noticeable. However, Intel has always shown great
interest
in the scientific field: as an example, consider the Pentium processor,
whose FP unit was much more powerful than the integer unit, making it a
strong contender for several applications, such as CAD.

SSE2 is designed to fix
this problem:
it supports both 32-bit and 64-bit floating-point values, but
keeping
the data block size fixed at 128 bits means that SSE2 instructions can
only process two 64-bit data values in parallel. Even if the potential
speed-up halves from four down to two, it is still compelling, as it
enables
a level of performance that normal FP code cannot match until 3+ GHz
processors
come around. What’s more, peeking at the Pentium 4 microarchitecture
reveals
that the performance gain achieved by using SSE2 could actually be much
greater than 2x, as the scalar FP unit suffers latencies that are much
longer than on the P6 core, while the SSE2 unit is streamlined to offer
blazing speed. The conclusion is that developers may be forced to use
SSE2
instructions to effectively harness the FP power of the Pentium 4, and
that the speed of current FP-intensive applications is likely to be
disappointing,
considering the 2.0+ GHz core frequency.
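
As an illustration of processing two 64-bit values per instruction, the following is a minimal C sketch of the classic y = a*x + y loop written with the SSE2 intrinsics from <emmintrin.h>. The function name daxpy_sketch is hypothetical, the code makes no claim about actual Pentium 4 performance, and it falls back to plain scalar code when the compiler does not target SSE2.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_mul_pd, ... */
#endif

/* y[i] = a * x[i] + y[i] for i in [0, n) (hypothetical example function). */
void daxpy_sketch(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Broadcast the scalar a into both 64-bit lanes of a 128-bit register. */
    const __m128d va = _mm_set1_pd(a);

    /* Main loop: each iteration handles two doubles at once. */
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(y + i, _mm_add_pd(_mm_mul_pd(va, vx), vy));
    }
#endif
    /* Scalar tail (or the whole loop when SSE2 is not available). */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Each pass of the vector loop replaces two scalar multiplies and two scalar adds with one packed multiply and one packed add, which is where the nominal 2x figure comes from.
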