The new Xeon 45nm (or Xeon 54xx series) arrived in the
lab weeks ago, so a server CPU benchmark update is overdue. However, there is
another reason we decided to write this article. When AMD launched their newest
quad-core chips, we could only give you a preview of their performance - not a
full review. We did not fully understand many of the performance aspects of the
new AMD architecture, so we decided to delve a little deeper.

We will only focus on performance in this article,
as our primary goal is to get an idea where Barcelona (AMD's quad-core), Harpertown
(Intel 45nm quad-core Xeon), and Clovertown (Intel quad-core 65nm Xeon) stand. To
do so, we performed a minimal profiling of each of our benchmarks and we used several
new micro benchmarks that will tell you a lot more than some real world benchmarks
can. If you like understanding the benchmarks out there a bit better, dig in.

The new 45nm Xeon

The new 45nm Xeon 54xx series, aka "Harpertown",
is still based on the Core architecture, but it has been tweaked a bit. We have
already discussed those improvements in detail here and here, so we will limit
ourselves to a quick overview:

The Radix-16 divider is the most interesting of
these improvements. Division is performed as a repetition of subtract, "does-it-fit"
test, and shift steps. If you can retire four quotient bits per iteration instead of
two, you cut the number of iterations required to get your result in half.
The square root calculation works similarly and also benefits from these
improvements. While divisions and square roots are rather rare in common software,
each occurrence has a very significant performance impact. Contrary to the more "popular"
instructions, they are not pipelined and their latency is high.
For example, a floating-point multiply takes five cycles and the "Clovertown
Core" architecture can finish one every two cycles thanks to pipelining.
However, a floating-point division takes no less than 32 cycles and cannot be
pipelined at all.
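To see why retiring four quotient bits per iteration instead of two roughly halves the latency, here is a minimal sketch of the iteration-count arithmetic. This is an illustration only; Intel's actual divider pipeline has additional setup and rounding stages not modeled here.

```python
def division_iterations(mantissa_bits: int, radix: int) -> int:
    """Iterations needed when each step retires log2(radix) quotient bits."""
    bits_per_step = radix.bit_length() - 1  # log2 of the radix (power of two)
    return -(-mantissa_bits // bits_per_step)  # ceiling division

# A double-precision mantissa has 53 bits (52 stored + implicit leading 1).
radix4 = division_iterations(53, 4)    # 2 bits per iteration
radix16 = division_iterations(53, 16)  # 4 bits per iteration

print(radix4, radix16)  # 27 vs 14: the iteration count is roughly halved
```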

Nanotechnology
is here: a CPU with 820 million transistors, and you can fit at least two of them
on one coin. (Photo by Tjerk Ameel)

Besides being an improved "Clovertown",
the new Xeon is also a marvel of nanotechnology with no less than 410 million transistors
on a die of only 107 mm². Two dies make up one quad-core Xeon 54xx. Considering
that this CPU is close to behemoth CPUs like Itanium and Power 6 in SPECint
performance, the new Intel is a formidable adversary for AMD's newest
quad-core.

There is more to a platform than the CPU, of course. The new CPU works on the old "Bensley/5000P
chipset" platform, though we were not able to get it running on our
P5000PSL Intel motherboard despite the fact that we applied the BIOS update
that came out early this month.

There is also a new HPC/workstation platform for
the Intel Xeon thanks to the Seaburg chipset, which features an improved snoop filter.
Besides reducing the snoop traffic, the new chipset should also be able to extract
more bandwidth out of the same FBDIMMs. The support for DDR2-800 FBDIMMs should
bring another performance boost but our current test platform is only stable
with DDR2-667 FBDIMMs.

The Opteron 2360SE - the Facts

Getting back to AMD, the new quad-core chip still
lives under a veil of secrecy. Quite a few rumors and myths are going around, and we
investigated them one by one so we could be sure that you only get the facts.

Fact 1: The B2 stepping does not have
a much faster memory controller than the B1 stepping

The controller found in the B2 stepping
might be a tiny bit faster, but we have not found any significant difference. Our
Stream benchmarks were only a tiny bit faster on the 2.5GHz chip (stepping B2) than on
the 2GHz one (stepping DR-B1), and so were the latency numbers.

The 2.5GHz Barcelona is a
newer stepping than the 2GHz sample we tested earlier.

Fact 2: The 2.5GHz review sample is running at
1.2V; it is not overvolted

CPU-Z reported that the chip was running at
1.5V, while the 2GHz quad-core was running at 1.2V. However, power measurements
show that the 1.2V reading in the BIOS of our ASUS board is accurate and that
CPU-Z is not. The final piece of evidence is of course the laser marking on the
quad-core Opteron.

AMD is capable of producing 2.5GHz quad-core,
but not in large quantities at this time. The 2.5GHz part should arrive at the end
of this year, with large quantities expected in the first quarter of 2008.

Fact 3: The memory controller always
runs RAM at the rated frequency

In this case, the new quad-core Opteron is
completely different from what we have seen with previous Opteron (and Athlon
64) processors. In the first and second generation Opteron, the memory
controller ran at a divisor of the CPU clock. This resulted in very odd memory
clock speeds at times, particularly with odd multipliers. For example, a 2.2GHz Opteron
(11X multiplier) uses a divisor of 7 and ends up running DDR2-667 (333.5MHz
clock) at 314MHz. That gives DDR2-628 instead of DDR2-667. In reality, this doesn't have
any major performance impact, and it is only measurable with "Stream-like"
benchmarks. In contrast, Barcelona's memory controller runs at its own frequency
and will run the DIMMs at the rated speed.
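The divisor rule described above can be sketched in a few lines. The helper below is our own illustration of the arithmetic, not AMD's actual BIOS algorithm, which may differ in details:

```python
import math

def opteron_mem_clock(cpu_mhz: float, rated_mhz: float) -> float:
    """Memory clock = CPU clock divided by the smallest integer divisor
    that keeps the result at or below the rated DRAM clock."""
    divisor = math.ceil(cpu_mhz / rated_mhz)
    return cpu_mhz / divisor

# A 2.2GHz Opteron (11 x 200MHz) targeting DDR2-667 (333.5MHz clock):
print(round(opteron_mem_clock(2200, 333.5), 1))  # 314.3 MHz, i.e. "DDR2-628"
```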

Fact 4: By running the Northbridge at
a lower speed, the new quad-core loses a bit of performance but saves power

The core of the Opteron 2360 runs at 2.5GHz, but
the L3 cache runs at Northbridge frequency, which is 2GHz. It seems that AMD's engineers
felt that running the L3 at core frequency would not have resulted in significantly
higher performance, but significantly higher power dissipation. From another
point of view, given a certain power envelope, running the Northbridge and the
L3 cache at higher frequencies would result in lower core frequencies.

Fact 5: The L3 cache was a good choice,
but…

The L3 cache does increase the latency of accessing
main memory, but it decreases the average latency seen by the CPU. This leads
to the question of whether the relatively slow L3 cache is really an advantage.
The L3 cache has a latency of 43 cycles (2GHz) or 48 cycles (2.5GHz), but it's
still quite a bit faster than system memory, which takes about 130 to 170
cycles to access.

In addition, it has one main advantage for server
workloads. If more than one core is accessing a cacheline in the L3 cache, the
line will remain in the L3. If not, the L3 cache behaves like a fully exclusive
cache: it sends the cacheline to the L1 and evicts it from the L3 to make room for
a "victim" of the L2. This allows relatively fast sharing of data between
threads, which is important for applications with a large code footprint, such
as databases. Single-threaded applications effectively get a 2.5MB L2 cache,
although with an average latency of about 20 cycles.
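A back-of-the-envelope calculation shows why a 44-48 cycle L3 still lowers the average latency seen by the core. The hit rates below are purely illustrative assumptions, not measured values:

```python
def avg_latency(l2_hit, l3_hit, l2_cyc=15, l3_cyc=44, mem_cyc=150):
    """Weighted average load latency over L2 hit / L3 hit / memory access."""
    miss_l2 = 1.0 - l2_hit
    return (l2_hit * l2_cyc
            + miss_l2 * l3_hit * l3_cyc
            + miss_l2 * (1.0 - l3_hit) * mem_cyc)

# With an L3, many L2 misses are caught at 44 cycles instead of ~150:
with_l3 = avg_latency(l2_hit=0.90, l3_hit=0.60)
without_l3 = avg_latency(l2_hit=0.90, l3_hit=0.0)
print(round(with_l3, 2), round(without_l3, 2))  # 22.14 vs 28.5 cycles
```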

Still, there is no doubt that the L3 cache of
Barcelona could have been a bit bigger to score even better in the larger database
benchmarks such as TPC. We have to guess that a larger L3 cache became a victim
(pun intended) of the already large 283 mm² die size. Still, a 44 cycle latency
(and more) is rather disappointing for only 2MB of L3 cache.

Fact 6: Dual-Link is possible with AMD
2xxx Opterons

Several readers asked us how it was possible that
our ASUS KFSN4-DRE board linked our 2350 CPUs with two HyperTransport
point-to-point connections instead of one, as the 23xx Opteron supports only one
coherent HyperTransport link. However, the constraint is not the number of
links but the number of coherent responses that are supported. Our
ASUS board does feature twice as much bandwidth for CPU-to-CPU traffic (snoops,
access to remote memory, etc.).

Words
of Thanks

Many people gave us assistance with this project,
and we would of course like to thank them.

The memory subsystem, bandwidth

Most of the applications in the server and HPC
world are well multi-threaded and scale nicely with more cores. With the exception
of some rendering engines, this also means that our hard working quad-core CPUs
will require quite a bit more bandwidth when they are processing these multi-threaded
applications. In our previous review, we found out that:

Barcelona's L2 cache is 50 to 60% faster
than the older Opteron's (22xx), so each core gets at least 50% more L2 bandwidth.

Each of Barcelona's L2 caches is almost as fast as the
shared L2 cache of a similarly clocked 65nm Core based Xeon.
The Intel Xeon has a big advantage of course: its L2 cache is 8 to 12 times larger!

Barcelona's single-threaded memory bandwidth is 26% to 50% better than
the older Opteron and almost twice as good as what a similar Intel Xeon gets.

The problem is of course the word "single-threaded".
Those bandwidth numbers are not telling us what we really want to know: how well
does the memory subsystem keep up if all cores are processing several heavy threads?

We only had access to the Intel and GCC compilers,
and we felt we should use a different compiler to create our multi-threaded stream
binary. GCC would probably not create the fastest binary and Intel's compiler might
give the Core architecture too many software prefetch hints (or other tricks that
might artificially boost the bandwidth numbers). Alf Birger Rustad helped us
out and sent us a multi-threaded, 64-bit Linux Stream binary based on v2.4 of Pathscale's
C-compiler. We used the following compiler switches:

-Ofast -lm -static -mp

We tested with one, two, and four threads. "2
CPUs" means that we tested with four threads on the dual dual-core systems and
eight threads on the dual quad-core systems. In the one- to four-thread settings
we used only one CPU; the second CPU was only used in the "2 CPUs" setup.

Note that clock speeds do not really matter, except
for the Socket-F Opteron. Although we did not include this in the graph above (to
avoid color chaos), the clock speed of the socket-F Opteron only matters for the
single-threaded bandwidth numbers. Look at the table below:

AMD vs. AMD Multi-threaded Stream (MB/s)

                                    1 thread   2 threads   2 CPUs
Dual Opteron 2212 2.0                   5474        6330    12220
Dual Opteron 2222 3.0                   6336        6472    12664
Difference 3GHz vs. 2GHz                 16%          2%       4%
Dual Opteron 23xx                       6710        8232    16614
Difference Opteron 23xx vs. 22xx         23%         30%      36%

With one thread, the 2GHz Opteron 2212 is
clearly not fast enough to take advantage of the bandwidth that DDR2-667 can
deliver. However, once you make both cores work, this is no longer the case. The
Opteron 23xx numbers make clear that the deeper buffers really help: each
quad-core has about 30% more bandwidth available than the dual-core. That
should be more than enough to keep twice as many cores happy.

The graph above also quantifies the platform
superiority that many ascribe to AMD. Likewise, it confirms that the new Intel
platform has a much better memory subsystem thanks to the Seaburg chipset. To understand
this we calculated the bandwidth numbers, with the "Bensley + Clovertown"
platform representing our baseline 100%.

AMD vs. Intel Multi-threaded Stream (Bensley + Clovertown = 100%)

                                    1 thread   2 threads   4 threads   2 CPUs
Opteron 23xx                            232%        207%        150%     308%
Xeon 54xx + Seaburg + 800MHz RAM        164%        225%        158%     172%
Xeon 54xx + Seaburg + 667MHz RAM        159%        196%        128%     138%

If you use two CPUs, the Opteron 23xx has no less
than three times the bandwidth of the "old" 65nm Xeon.
However, it is much less likely that bandwidth will be a bottleneck for the "new"
45nm Xeon, as it has 40% to 60% more bandwidth (with the same kind of memory) than
the "old" Xeon. If necessary, you'll be able to use 800MHz FBDIMMs
that offer even more bandwidth (9GB/s versus 7.7GB/s).

It becomes clear why even a 3GHz Xeon 5365 is
not able to beat AMD in SPECfp2006 rate: running eight instances of SPECfp2006 is
bandwidth limited.

The memory subsystem, latency

To
understand the memory subsystem of the different CPUs, we also need to look at
latency. We have noticed that many latency measurement benchmarks are inaccurate
when you have two CPUs running, so we tested with only one socket filled. Below
you can see the numbers for a stride of 128 Bytes, measured with the CPU-Z 1.41
latency test.

CPU-Z Memory Latency (cycles)

Data size (KB)   Opteron 2212 2.0   Opteron 2350   Opteron 2360SE   Dual Xeon 5472 (DDR2-667)   Xeon E5365
4                       3                 3               3                   3                     3
8                       3                 3               3                   3                     3
16                      3                 3               3                   3                     3
32                      3                 3               3                   3                     3
64                      3                 3               3                  15                    14
128                    12                15              15                  15                    14
256                    12                15              15                  15                    14
512                    12                15              15                  15                    14
1024                   12                44              48                  15                    14
2048                  114                44              48                  15                    14
4096                  117               111             121                  15                    14
8192                  117               113             126                 242                   215
16384                 117               113             125                 344                   282
32768                 117               113             126                 344                   282

The quad-core
Opteron had to make a compromise or two. As the 463 million transistor chip is
already 283 mm² in size, each core only gets a 512 KB L2 cache. That means that
in some situations (>512 KB) the old 90nm Opteron 22xx is better off, as it has
access to a very fast 12 cycle L2 cache while the Opteron 23xx has to access a rather
slow 44-48 cycle L3 cache.

Note
also that the 2.5GHz Opteron 2360 "sees" a slower L3 cache than the
2350: 48 cycles versus 44. The memory controller seems to be fine: the slightly higher
latency compared to the Opteron 22xx series is a result of the fact that the Opteron
23xx cores have to check the L3 cache tags, while the Opteron 22xx doesn't have
to do that. Notice that the memory latency of the on-die memory controller is still
far better (+/- 60 ns) than what the Seaburg or Blackford chipset (+/- 70-90 ns)
can offer the Xeon cores.
We have encountered situations where Barcelona's memory controller accesses memory
with much higher latencies (86 ns and more) than the Opteron 22xx, but we have to
study this in more detail to understand whether this has a real-world impact or not.

Native quad-core versus dual
dual-core, part 2

Cache2Cache
measures the propagation time from a store by one processor to a load by the other
processor. The results that we publish are approximately twice the propagation time.

We noticed
that our Cache2Cache benchmark (see here and here) gives more accurate results
if you measure same-die latency with only one CPU installed, and then measure
die-to-die latency with two CPUs installed. Cache2Cache quantifies
the delay that a "snooping" CPU encounters when it tries to get up-to-date
data from another CPU's cache.
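Since the published numbers are roughly twice the propagation time, the one-way figures quoted in the analysis below are simply the table values halved:

```python
def one_way_ns(ping_pong_ns: float) -> float:
    """Approximate one-way propagation time from the round-trip figure."""
    return ping_pong_ns / 2.0

print(one_way_ns(53))   # 26.5 ns: Xeon cores on the same die (shared L2)
print(one_way_ns(107))  # 53.5 ns: Opteron 2360 SE cores on the same die
print(one_way_ns(199))  # 99.5 ns: Opteron, different sockets
print(one_way_ns(237))  # 118.5 ns: Xeon, different sockets
```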

Cache coherency ping-pong (ns)

                               Same die,      Different die,   Different die,
                               same package   same package     different socket
Opteron 2350 - Stepping B1          127            N/A              199
Opteron 2360SE - Stepping B2        107            N/A              199
Xeon E5472 3.0                       53            150              237
Xeon E5365 3.0                       53            150              237

The Xeon
syncs very quickly via its shared L2 cache (26.5 ns), but quite a bit slower from
a core on one die to a core on the other die (75 ns). AMD's native quad-core design
is a bit faster in the latter case (53.5 ns with the 2360 SE). The difference is
slightly smaller when you have to sync between two sockets (99.5 ns versus 118.5 ns).

Floating-point Analyses (Linux 64-bit)

AMD's
newest quad-core behaves rather oddly when it comes to floating-point applications.
In some FP intensive applications (CINEBENCH and LINPACK for example) a 2GHz
quad-core cannot even keep up with Intel's older 2GHz 65nm quad-core CPUs; in
other applications it is a close match (3ds Max, POV-Ray); finally, in
applications like zVisuel's 3D Engine and SPECfp, Barcelona is clearly faster
clock-for-clock than the older generation. Our aim is to understand this
situation a little better and to see what the 45nm Xeon 54xx can achieve.

To understand
this we first tested with two synthetic, but completely opposite FP benchmarks:

LINPACK,
which calculates on massive matrices

FLOPS, which fits in an 8 KB L1 cache

Let us start with LINPACK. LINPACK, a
benchmark application based on the LINPACK TPP code, has become the industry
standard benchmark for HPC. It solves large systems of linear equations by using
a high performance matrix kernel.

We used
a workload of square matrices of sizes 5000 to 30000 in steps of 5000, and we ran
four (dual dual-core) or eight (dual quad-core) threads. As the system was equipped
with 8GB of RAM, all of the large matrices ran in memory. LINPACK results are
expressed in GFLOPS (billions of floating-point operations per second).
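LINPACK's GFLOPS figure follows from the well-known operation count of a dense n x n solve, roughly 2/3*n^3 + 2*n^2 floating-point operations. A quick sketch of the arithmetic; the 300-second run time below is an assumed example, not one of our measurements:

```python
def linpack_gflops(n: int, seconds: float) -> float:
    """GFLOPS for an n x n dense solve using the standard LINPACK
    operation count of 2/3*n^3 + 2*n^2."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# A hypothetical 30000 x 30000 solve finishing in 300 s would score:
print(round(linpack_gflops(30000, 300), 1))  # 60.0 GFLOPS
```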

We used
two versions of LINPACK:

Intel's
version of LINPACK compiled with the Intel Math Kernel Library (MKL)

A fully optimized "K10-only" version for AMD's quad-core

The "K10-only"
version uses the ACML version 4.0.0, compiled using the PGI 7.0.7. We used the
following flags:

pgcc
-O3 -fast -tp=barcelona-64

The graph above may come
as a surprise to quite a few people. At the lower matrix sizes, AMD's quad-core
is even a bit faster with the "Intel version" than with the specially
optimized version. Only while calculating with the larger matrices does the
heavily tuned version pull ahead. The K10-only version of LINPACK is about 6% faster,
and the most important reason for that improvement is the ACML library of AMD. However,
it is clear that the Intel MKL and compiler are not slowing the AMD core down when
it is running LINPACK.

There
is more. At first sight, the AMD 2360SE scores seem rather poor: just a tiny bit
faster than the 2.33GHz quad-core of Intel. However, the Intel CPU scales rather
poorly with clock speed: a 3GHz Clovertown is only 6% faster than a 2.33GHz one
while the clock speed advantage is 28%. The Barcelona core however scales 19% from
a 20% clock speed boost. The new Seaburg platform cannot help here: a 3GHz Xeon
E5365 was capable of 57.1 GFLOPS, while it got 57 GFLOPS with the older chipset.

Intel's
clever compiler engineers have already found a way around this, as the newest release
of their LINPACK version is quite a bit faster on both Clovertown and Harpertown.
The LINPACK score increases to 70 GFLOPS for the Xeon 5472 3GHz (60.5 in our test)
and 63 GFLOPS for the Xeon E5365 3GHz (57 in our test). Unfortunately, we don't have
any data on what has changed, so we decided to freeze our benchmark code for
now.

Raw FPU power: FLOPS

Let us see if we understand floating-point
performance better when we look at FLOPS. FLOPS was
programmed by Al Aburto and is a very floating-point intensive benchmark. Analysis
shows that this benchmark contains:

70% floating-point
instructions

Only 4% branches

Only 34% of instructions are memory instructions

Benchmarking with FLOPS is not a real world test, but it isolates raw FPU
power. Al Aburto states the following about FLOPS:

"Flops.c is a 'C' program which attempts to estimate your systems
floating-point 'MFLOPS' rating for the FADD, FSUB, FMUL, and FDIV operations based
on specific 'instruction mixes'. The program provides an estimate of PEAK MFLOPS
performance by making maximal use of register variables with minimal interaction
with main memory. The execution loops are all small so that they will fit in any
cache."

FLOPS shows the maximum double precision power that the core
has, by making sure that the program fits in the L1 cache. FLOPS consists of
eight tests, and each test has a different but well-known instruction mix. The most
frequently used instructions are FADD (addition, 1st number), FSUB (subtraction, 2nd number), and FMUL (multiplication, 3rd number). FDIV is the fourth number in the mix. We focused on four tests (1, 3, 4, and 8) as the other tests were very similar.

We compiled three versions:

Gcc
(GCC) 4.1.2 20070115 (prerelease) (SUSE Linux)

Intel C Compiler (ICC) version 10.1 20070913:

An x87 version

A fully vectorized SSE2 version

Let us restate what we
are measuring with FLOPS:

Single-threaded
double precision floating performance

Maximum performance not constrained by bandwidth bottlenecks or latency delays

Perfect clock speed scaling

This allows us to calculate the FLOPS per cycle number: we divide
the MFLOPs numbers reported by FLOPS by the clock speed. This way we get a good
idea of the raw floating-point power of each architecture. First we test with the
basic '-O3' optimization.
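The normalization itself is straightforward; the sample numbers below are illustrative, not our measured results:

```python
def flops_per_cycle(mflops: float, clock_mhz: float) -> float:
    """Normalize a reported MFLOPS figure to the clock speed so that
    architectures can be compared clock for clock."""
    return mflops / clock_mhz

# e.g. a core reporting 2000 MFLOPS at 2500MHz sustains 0.8 FLOPs per cycle:
print(flops_per_cycle(2000, 2500))  # 0.8
```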

Don't you love the little FLOPS benchmark? It is far from a real
world benchmark, but it tells us so much.

To understand this better, know that the raw speed of the AMD out
of order x87 FPU (Athlon K7 and Athlon 64) used to be between 20% and 60% faster
clock for clock than any P6 based CPU, including Banias and Dothan. However, this
x87 FPU advantage has evaporated since the introduction of the Core architecture
as the Core Architecture has a separate FMUL and FADD pipe. (Older P6 architectures
could execute one FADD followed by one FMUL, or two FP operations per two clocks
at the most). You can see that when we test with a mix of FADD/FSUB and FMUL, the
Core architecture is slightly faster clock for clock.

The only weakness remaining in the Core x87 architecture is the
FP divider. Notice how even a relatively low percentage of divisions (the 4th number in the mix) kills the performance
of our 65nm Xeon. The Opteron 22xx and 23xx are 70% faster (sometimes more)
when it comes to double precision FP divisions. However, the new Xeon 54xx closes
this gap completely thanks to lowering the latency of a 64-bit FDIV from 32 cycles
(Xeon 53xx) to 20 cycles (Xeon 54xx, Opteron 23xx). The Xeon 54xx is only 1% to
5% slower in the scenarios where quite a few divisions happen. That is because the
Opterons are capable of somewhat pipelining FDIVs, which allows them to retire one
FDIV every 17 cycles. The clock speed advantage of the 45nm Xeon (3.2GHz vs. 2.5GHz
maximum at present) will give it a solid lead in x87 performance.

However, x87 performance is not as important as it used to be. Most
modern FP applications make good use of the SSE SIMD instruction set. Let us see
what happens if we "vectorize" our small FLOPS loops. Remember, FLOPS
runs from the L1 cache, so L2 and memory bandwidth do not matter at all.

When it comes to raw SSE performance, the Intel architectures are
3% to 14% faster in the add/subtract/multiply scenarios. When there are divisions
involved, Barcelona absolutely annihilates the 65nm Core architecture with up to
80% better SSE performance, clock for clock. It even manages to outperform the newest
45nm Xeon, but only by 8% to 18%. Notice once again the vast improvement from the
2nd generation Opteron to the 3rd generation Opteron when
it comes to SIMD performance, ranging from 55% to 150% (!!).

As not all applications are compiled using Intel's
high-performance compiler, we also compiled with the popular open source
compiler gcc. Optimization beyond "-O3" had little effect on our FLOPS
binary, so we limited our testing to one binary.

GCC paints about the same picture as ICC without vectorization.
The new AMD Barcelona core has a small advantage, as it is the fastest chip in 3
out of 4 tests. Let us now see if we can apply our FP analyses of LINPACK and FLOPS
to real world applications.

3ds Max 9 (32-bit Windows)

We tested
with the 32-bit version of 3ds Max version 9, which has improvements that help
multi-core systems but which is not as aggressively tuned for SSE as LINPACK
and zVisuel. We used the "architecture" scene, which has been a favorite
benchmarking scene for years. We performed all tests with 3ds Max's default
scanline renderer, we enabled SSE support, and we rendered at HD 720p
(1280x720) resolution. We measured the time it takes to render ten frames (frames
20 to 29).

As promised,
we profiled our different benchmarks to understand them better. We performed
profiling with AMD's CodeAnalyst; VTune profiling will follow later. 3ds Max runs
four modules when you render:

Render
(46% of the time)

Ray-FX (28%)

Geometry (15%)

Core (11%)

To keep things
simple, we summarized our findings with a weighted average over all modules.

3dsmax Profiling

Average IPC (on Opteron 2350)         1

Instruction mix
  Floating Point                      39%
  SSE                                 12%
  Branches                            13%
  L1 datacache ratio                  0.56
  L1 Instruction ratio                0.27

Performance indicators on Opteron 2350
  Branch misprediction                6%
  L1 datacache miss                   1%
  L1 Instruction cache miss           5%
  L2 cache miss                       0%

As you can see,
3ds Max is mostly about floating-point performance with a bit of SSE instructions.
It runs perfectly in the L1 and L2 cache of our CPUs. To make the graph easier to
read we did not report our results in the classic way (rendering time) but expressed
them in images rendered per hour (10 images * 3600 seconds divided by render time).
Higher is therefore better.
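The conversion used for the graph can be sketched as follows; the 600-second render time is an assumed example, not one of our results:

```python
def images_per_hour(render_seconds: float, frames: int = 10) -> float:
    """Convert the time to render a batch of frames into images per hour."""
    return frames * 3600.0 / render_seconds

# e.g. rendering 10 frames in 600 s corresponds to 60 images per hour:
print(images_per_hour(600))  # 60.0
```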

The Xeon
5472 is about 8% faster than its older brother and widens the gap with the AMD
armada. We included quite a few results of older tests. This benchmark focuses
on the CPU; chipset and RAM choices don't impact performance much. Interestingly,
the Opteron 2350 is about as fast as four 2.4GHz single-core Opterons. Thus, in
software with a "small dash" of SSE, the new architecture is about
20% faster. If we extrapolate our AMD quad-core results to 3GHz, the result would
be about 59 images per hour, which indicates that AMD's newest is about 10% slower
than Intel clock for clock. That is no real surprise anymore: FLOPS showed us that
the raw x87 FP and SSE power of AMD's latest architecture is slightly lower than
the newest Xeon's. It can also only overpower the Xeon 53xx if there are enough
divisions involved. AMD's Barcelona architecture will only show a real advantage in
bandwidth limited FP situations such as SPECfp2006 and many HPC applications.

Software Rendering: zVisuel (32-bit Windows)

This benchmark is the zVisuel Kribi 3D test, which is exclusive
to AnandTech.com and which simulates the assembly of a mechanical watch. The complete
model is very detailed with around 300,000 polygons and a lot of texture, bump,
and reflection maps. We render more than 1000 frames and report the average FPS
(frames per second). All this is rendered on the "Kribi 3D" engine, an
ultra-powerful real-time software rendering 3D engine. That all this happens at
reasonable speeds is a result of the fact that the newest AMD and Intel architectures
contain four cores and can perform up to eight 32-bit FP operations per clock cycle
and per core. The people of zVisuel told us that - in reality - the current Core
architecture can sustain six FP operations in well-optimized loops. Profiling
for Barcelona architecture is not yet complete, so we did our best with
CodeAnalyst 2.74 for Windows. We only profiled the non-AA benchmark so far.

ZVisuel Kribi3D Profiling

Average IPC (on Opteron 2350)         1

Instruction mix
  Floating Point                      31%
  SSE                                 35%
  Branches                            6%
  L1 datacache ratio                  0.63
  L1 Instruction ratio                0.22

Performance indicators on Opteron 2350
  Branch misprediction                8%
  L1 datacache miss                   1%
  L1 Instruction cache miss           1%
  L2 cache miss                       0%

This is a very different engine than the scanline renderer
of 3ds Max. SSE instructions play a dominant role, and the zVisuel Kribi
3D benchmark gives us a view of how the different CPUs perform in
well-optimized SSE applications. While the application seems to run almost perfectly
from the L2 cache, this seems to be a result of well-tuned, predictable access to
memory. We noticed that hardware prefetching and the new Seaburg chipset help
this benchmark a lot:

Zvisuel Intel Platform Performance Comparison

CPU                               HW Prefetch on   HW Prefetch disabled   Difference
Dual Xeon E5365 3.0 (Blackford)        99.9               87.7               14%
Dual Xeon E5365 3.0 (Seaburg)         110                104.2                6%
Dual Xeon E5472 3.0 (Seaburg)         124.8              110                 13%

Let us see all the results.

Although we haven't done a detailed analysis,
we can assume that the "Super Shuffle Engine" and the "Radix-16"
divider that Intel implemented in the Xeon 5472 are paying off here. The AMD Opteron
2360 SE at 2.5GHz can overtake the best 65nm Xeon, but the new Xeon has a tangible
lead. A silver lining to the cloud hanging over AMD is that the Opteron 23xx series
scales perfectly with clock speed: compare the 2GHz with the 2.5GHz results. Still,
Intel has the advantage when it comes to SSE processing.

The results with AA show that the memory
subsystem of the Xeon 53xx is a major bottleneck, but the new Seaburg chipset
has made this bottleneck a bit smaller. The result is a crushing victory for the
latest Intel architecture. Enough FP testing, let us see what Barcelona can do when
running typical integer server workloads.

64-bit
Linux Java Performance: SPECjbb2005

If you are not familiar with SPECjbb2005, please read our
introduction to it here. SPECjbb2005 from SPEC (Standard Performance Evaluation Corporation)
evaluates the performance of server side Java by emulating a three-tier client/server
system with emphasis on the middle tier.

The BEA JVM uses memory more aggressively,
making more use of the assigned memory. A heap size of 2GB would probably result
in the JVM gobbling up too much memory, which could result in errors or poor performance
on our 8GB system. That is why we lowered the JVM heap size from 2GB to 1.8GB. We
also applied slightly more aggressive tuning to the BEA JVM, as their customers
are more interested in squeezing out the last bit of performance.

We also used four JVMs per system. The reason
is that most IT departments emphasize consolidation today, and it is very rare that
one JVM gets eight cores. We fully disclosed our testing methods here. However, note that you cannot
compare the results below with our previous findings as we use newer JVMs now.

Below you can find the final score reported by
SPECjbb2005, which is an average of the last four runs.

It took us hours, but we managed to complete one
run of SPECjbb with the faster 800MHz DDR2. With four JVMs, it shows that the memory
subsystem of the Xeon systems is a clear bottleneck. It is also important to note
that SPECjbb is one of the benchmarks most sensitive to memory bandwidth. If
SPECjbb improves 6% when we use 800MHz RAM instead of 667MHz, this is probably
about the maximum boost the new Xeon will get from the slightly faster 800MHz
memory.

Let us see how the systems fare with the BEA JRockit
JVM.

BEA uses some clever compression techniques to
alleviate the pressure on the memory system. However, it is a bit funny to see how
the hardware prefetching gets in the way. While it boosted performance by 14% in
our zVisuel Kribi 3D, it is now slowing down the SPECjbb benchmark by 14%. When
we disable it, the Xeon 5472 takes the lead.

The 3GHz Xeon looks like a sports car with its
wheels deeply entrenched in the mud: all that raw power just keeps the wheels
spinning for lack of memory bandwidth. A 3GHz chip is hardly faster than a
2.33GHz chip, and here AMD's quad-core at 2GHz beats the Xeon.

64-bit MySQL (Linux
64-bit)

MySQL has released version 5.1.22, which supposedly
can scale up to eight CPU cores. That would be a huge improvement considering that
all versions earlier than 5.0.37 could only make good use of two CPU cores. In our
experience, this new binary scales well up to four cores, but an eight-core
configuration is easily 20% slower than a four-core one. Thus, we tested with a
maximum of four cores.

Please note that these results cannot
be compared with our earlier MySQL results. MySQL v5.1.22 is a completely
different binary than v5.0.2x we tested previously. Although it still doesn't scale
beyond four cores, it is up to 70% (!!) faster than v5.0.26 that came standard with
our SLES 10 SP1. For smaller servers with four cores, MySQL is once again an
ultra fast database.

Moreover,
the third generation of Opterons absolutely loves this new MySQL version. At 2.5GHz,
it is just as fast (the margin of error is up to 4%) as the mighty Xeon 5472 at
3GHz. As we failed to profile MySQL in depth (CodeAnalyst for Linux still has some
quirks), we cannot pinpoint the exact reason why the Opteron 23xx is so good at
this. The MySQL database is mostly limited by synchronizing the locks, so we suspect
that the slightly faster cache coherency syncing on the Opteron 23xx might be one
of the reasons AMD's latest performs so well.

WinRAR 3.62
(Windows 32-bit)

WinRAR 3.62 is a completely
different kind of workload.

WinRAR 3.62 Profiling

Average IPC (on Opteron 2350)         0.36

Instruction mix
  Floating Point                      0%
  SSE                                 0%
  Branches                            9%
  L1 datacache ratio                  1.13
  L1 Instruction cache ratio          0.35

Performance indicators on Opteron 2350
  Branch misprediction                7%
  L1 datacache miss                   4%
  L1 Instruction cache miss           0%
  L2 cache miss                       3%

Notice that, contrary
to the other workloads we have profiled so far, WinRAR does not run entirely in
the L1 or L2 cache. Second, notice the huge number of loads: more than
one per retired instruction.

The massive bandwidth
that Barcelona can offer multi-threaded software pays off here. You can also see
that the Seaburg chipset improves the score of the 3GHz quad-core Xeon by 7%.

Fritz Chess Benchmark (Windows 32-bit)

WinRAR and Fritz Chess are probably less important
to most people than the other benchmarks we ran. However, the reason why we include
them in this article is that their profile is so different from the other applications.
In this way, we get more insight into the different new architectures.

Fritz Chess Profiling

Profile                                     Total
-------------------------------------------------
Average IPC (on AMD 2350)                   0.99
Instruction mix
  Floating Point                            4%
  SSE                                       0%
  Branches                                  17%
Performance indicators (on Opteron 2350)
  Branch misprediction                      12%
  L1 datacache ratio                        0.65
  L1 Instruction ratio                      0.37
  L1 datacache miss                         1%
  L1 Instruction cache miss                 1%
  L2 cache miss                             0%

We have said this before but it warrants
repeating: you'll find a decent number of complex branches in a chess program.
17% branches is not that extraordinary, but the fact that 12% of those branches
are mispredicted is. Comparing a 2GHz Opteron 22xx with an Opteron 23xx should
show whether the improvements in branch prediction pay off.

The Opteron 2350 is about 3% faster than the Opteron
22xx, core for core, clock for clock. We believe we can assume that the branch prediction
improvements are minor, as the Fritz chess benchmark runs in the L1 and L2
cache.

HPC

Several of the HPC benchmarks are too
expensive for us to test, but we can get some information from AMD's and Intel's
own benchmarking. According to Intel, the new Intel Xeon 5472 (1.89 score) is about
26% faster than the Xeon 5365 (1.5 score) when running the Fluent benchmark. According to AMD, the
Opteron 2350 is about 10% to 60% faster than a 2.33GHz Xeon E5345. That doesn't
give us much comparison data, but at first sight it seems that AMD will be competitive
in Fluent even at lower clock speeds (2.5GHz versus 3GHz).

Intel's own marketing material seems to admit
that a Xeon E5472 with 800MHz memory is just as fast as AMD's quad-core at
2GHz. AMD's 2.5GHz model will surely take the lead in LS-DYNA. Looking at the Fluent
and LS-DYNA benchmarks it appears that AMD will remain very competitive in the HPC
market.

One benchmark where Intel's newest chip really
shines is the Black-Scholes algorithm: as most of the calculations involve divisions,
the new Xeon 54xx chips are about 50% faster than their older Xeon 53xx siblings,
clock for clock. Unfortunately, our compilation of Black-Scholes failed on the quad-core
AMD, so we have to postpone those results for now.

Conclusion

When it comes to floating-point performance, we
feel we can say we have a very good picture of what AMD's and Intel's best are capable
of. The Barcelona floating-point architecture is able to beat the 53xx in quite
a few benchmarks, but the Xeon 5472 shows that AMD's third generation Opteron is
late to the party. Our FLOPS, LINPACK, and rendering benchmarks show that the
Xeon 5472 is at least as good as or better than AMD's latest in raw FP
performance on a clock-for-clock basis.

We have less data on "pure" integer
performance, with the exception of our Fritz Chess benchmark. This benchmark gave
us a first hint that the improvement in integer performance from the Opteron 22xx
to the Opteron 23xx is probably rather small. The single-threaded SpecInt2006 numbers
published by IBM are probably not optimal, but also confirm this:

This indicates that the Opteron 23xx is about
10% faster in integer tasks than the 22xx series. Considering that the best SPECint_rate2006
score of AMD's quad-core at 2.5GHz is 102 while Intel's 5460 (3.16GHz) is already
at 138, we think it is safe to assume that the integer performance of AMD's Barcelona
is still not up to Intel Core levels. The Xeon 5365 at 3GHz is also able to deliver
a significantly higher score (117). This, together with our own benchmark data,
makes us believe that the Xeon 54xx based on the Penryn architecture will beat the
best AMD chips on every aspect of raw processing performance: integer, legacy x87
FP, and SIMD (SSE). It is clear now why Intel's CPUs are so dominant in desktop
and workstation workloads.

Add to this a significant clock advantage: there
is already a 3.2GHz Xeon 5485 (150W). If you prefer a less power-hungry CPU, Intel
can provide a 3GHz 5472 that is still clocked 20% higher than what AMD will be able
to deliver 2 to 3 months later. Although the 3GHz models are quite pricey
(>$1000), you can already find a 2.5GHz quad-core Xeon for $316. That's the
same price as a 1.9GHz Opteron 2347 chip. There is little doubt in our mind that
a 2.5GHz Xeon is faster in almost every application we can think of, so Intel's
newest Xeon takes the price/performance crown as well.

While AMD loses quite a few battles, the war is
far from lost. The server/HPC situation is entirely different from the desktop scene
where the Core 2 Quad overpowers the Phenom in almost
every benchmark. There is more to server and HPC performance than simple raw processing
power. Intel's flagship still has an Achilles heel: the platform it is running on
has higher latency and much lower bandwidth than AMD's platform. Once you
really stress all those cores with many threads, AMD's platform starts to pay
off.

Look at the summary of our benchmarking
below; each entry shows how AMD's Opteron 2360SE compares to the Intel
chip at the top of the column.

AMD vs. Intel Performance Summary

                                  Opteron 2360SE vs.    Opteron 2360SE vs.
                                  Xeon E5365            Xeon 5472
--------------------------------------------------------------------------
General applications
  WinRAR 3.62                     23% faster            6% faster
  Fritz Chess engine              24% slower            26% slower
HPC applications
  LINPACK                         4% slower*            9% slower*
3D Applications
  3DS Max 9                       19% slower            25% slower
  zVisuel 3D Kribi Engine         7% faster             14% slower
  zVisuel 3D Kribi Engine (AA)    2% slower             23% slower
Server applications
  SPECjbb (Sun)                   28% faster            11% faster
  SPECjbb (BEA)                   12% faster            12% slower
  MySQL                           14% faster            Equal

* Faster LINPACK binaries from Intel were available
at the time that we finished this article.

To put it in car terms, our SPECjbb, LINPACK,
and MySQL benchmarks have shown that Intel's "powerful CPU engines" sometimes
have problems putting the "massive torque" to the "wheels".
You may feel, for example, that using four instances in our SPECjbb test favors
AMD too much, but there is no denying that using more virtual machines on fewer
physical servers is what is happening in the real world. Intel's best have a solid
lead over AMD's quad-core in rendering benchmarks, but some HPC, Java and MySQL
benchmarks show that the 2.5GHz Barcelona is able to keep up with (or come close
to) a 3GHz Xeon 5472. That is impressive, provided we finally see some
higher clocked Opteron 23xx chips in commercially available servers.

We still cannot draw any solid conclusion on
the server performance of AMD's quad-core as no MS Exchange, SAP ERP, TPC-C, or
TPC-H results have been published. In fact, with the exception of the SPECjbb and
MySQL numbers in this article, all server benchmarks on AMD's third generation Opteron
are MIA. This situation will probably continue for a few more months as most of
these benchmark results traditionally come from OEMs and not AMD.