Friday Jan 22, 2010

Customers would love to have their performance levels linked to their hardware. But more often than you think, they migrate from System X (designed 10 years ago) to System Y (fresh from the oven) and are surprised by the performance improvements.
In the past two years, we have completed many successful migrations from F15k/E25k servers to the new M9000 Enterprise Servers. Customers have reported great improvements in throughput and response time. But what can you really expect, and what percentage of the improvement is actually due to operating system enhancements?
Can the recent small frequency increase on our SPARC64 VII chipset be at all interesting?
The new SPARC64 VII 2.88 GHz, available on our M8000 and M9000 flagships, offers no architectural change, no additional features and a modest frequency increase from 2.52 GHz to 2.88 GHz, for a ratio of 1.14. We could stop our analysis there and label this change 'marginal' or 'not interesting'. But my initial tests showed a comparative OLTP peak throughput way higher than this frequency-based ratio.

What happened?

A passion for Solaris

Most long-term Sun employees have a passion for Solaris. Solaris is the uncontested Unix leader and includes such a huge number of features that, once you are a Solaris addict, it is difficult to fall in love with another operating system. And Oracle executives made no mistake: Sun has the best UNIX kernel and performance engineers in the world. Without them, Solaris would not scale today to a 512-hardware-thread system (M9000-64).

But of course, Solaris is a moving target. Every release brings its truckload of features, bug fixes and other performance improvements. Here are the critical fixes made between Solaris 10 Update 4 and the brand new Solaris 10 Update 8 that influence Oracle performance on the M9000:

In Solaris 10 Update 7 (05/09), we enhanced MPxIO as well as the PCI framework (cr=6449810 and others) and improved thread scheduling (cr=6647538). We also enhanced mutex operations (cr=6719447).

Finally, in Solaris 10 Update 8, after long customer escalations, we fixed the single-threaded nature of callout processing (cr=6565503-6311743). [This is critical for all calls made to nanosleep and usleep.] We also improved the throughput and latency of the very common e1000g driver (cr=6335837 + 5 more) and optimized the mpt driver (cr=6784459). We cleaned up interrupt management (cr=6799018) and optimized bcopy and kcopy operations (cr=6292199). Finally, we improved some single-threaded operations (cr=6755069).
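To get a feel for why timer handling matters to an application that sleeps often, here is a minimal sketch that measures how far short timed sleeps overshoot their requested duration. It is not the test used for this article: the function name `measure_sleep_overshoot` and its parameters are hypothetical, and it relies on Python's `time.sleep` as a stand-in for direct nanosleep/usleep calls.

```python
import statistics
import time


def measure_sleep_overshoot(requested_s=0.001, samples=200):
    """Return the overshoot (actual - requested) of each short sleep, in seconds."""
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(requested_s)  # stand-in for a nanosleep/usleep call
        elapsed = time.perf_counter() - start
        overshoots.append(elapsed - requested_s)
    return overshoots


if __name__ == "__main__":
    data = measure_sleep_overshoot()
    print(f"median overshoot: {statistics.median(data) * 1e6:.0f} us")
    print(f"max overshoot:    {max(data) * 1e6:.0f} us")
```

Running several copies of this loop in parallel on a loaded system and comparing the overshoot across OS releases would be one rough way to observe a change in timer wake-up behavior.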

My initial SPARC64 VII iGenOLTP tests were done with Solaris 10 Update 4. But I could not test the new SPARC64 VII 2.88 GHz with this release because it was not supported! Therefore, I had to compare the new chip's performance to the SPARC64 VII 2.52 GHz using both S10U4 and S10U8. We will see below that most of the improvement comes not from the frequency increase but from Solaris itself.

Chips & Chassis

Below are the key characteristics of the chips we tested:

| Chips | UltraSPARC IV+ | SPARC64 VI | SPARC64 VII | SPARC64 VII (+) |
|---|---|---|---|---|
| Manufacturing | 90nm | 90nm | 65nm | 65nm |
| Die size | 356 sq mm | 421 sq mm | 421 sq mm | 421 sq mm |
| Transistors | 295 million | 540 million | 600 million | 600 million |
| Cores | 2 | 2 | 4 | 4 |
| Threads/core | 1 | 2 | 2 | 2 |
| Total threads | 2 | 4 | 8 | 8 |
| Frequency | 1.5 GHz | 2.28 GHz | 2.5 GHz | 2.88 GHz |
| L1 I-cache | 64 KB | 128 KB/core | 512 KB | 512 KB |
| L1 D-cache | 64 KB | 128 KB/core | 512 KB | 512 KB |
| On-chip L2 | 2 MB | 6 MB | 6 MB | 6 MB |
| Off-chip L3 | 32 MB | None | None | None |
| Max Watts | 56 W | 120 W | 135 W | 140 W |
| Watts/thread | 28 W | 30 W | 17 W | 17 W |

Note on (+): The new SPARC64 VII is not officially labeled with a plus sign, reflecting the absence of new features.

Now, here is our hardware list. Note that to avoid the need for a huge client system, we ran this iGenOLTP workload in Console/Server mode. This means that the Java processes sending SQL queries via JDBC run directly on the server under test. While this model was unusual ten years ago in the era of Client/Server, it is more and more commonly found today in new customer deployments.

| Servers | E25k | M9000-32 | M9000-32 | M9000-32 |
|---|---|---|---|---|
| Chip | UltraSPARC-IV+ | SPARC64 VI | SPARC64 VII | SPARC64 VII+ |
| # chips | 8 | 8 | 8 | 8 |
| Total hardware threads | 16 | 16 | 32 | 32 |
| Frequency | 1.5 GHz | 2.28 GHz | 2.52 GHz | 2.88 GHz |
| System Clock | 150 MHz | 960 MHz | 960 MHz | 960 MHz~ |
| RAM | 64 GB | 64 GB | 64 GB\* | 64 GB\* |
| Operating System | Solaris 10 Update 4 | Solaris 10 Update 4 | Solaris 10 Update 4 & 8 | Solaris 10 Update 8 |

| Component | Model | Details |
|---|---|---|
| Storage | SE9990V [shared] | 64 GB cache, 25 TB, 200 Hitachi HDD (15k RPM), 8x2Gbit/s |
| Console system | X4240 | Opteron quad-core, 2x2.33 GHz |

Note on (~): While the system clock has not changed, the new M9000 CMUs are equipped with an optimized Memory Access Controller labeled MAC+. The MAC+ chipset is critical for system reliability, in particular for the memory mirroring and memory patrolling features. We have not identified performance improvements linked to this new feature.

Note on (\*): Those domains have 128 GB of total memory. To compare apples to apples, 64 GB of memory are allocated, populated and locked in place with my very own _shmalloc tool.

Chart

The iGenOLTPv4 workload is a Java-based lightweight OLTP database workload. Simulating a classic Order Entry system, it is tested in stream mode (i.e. no wait time between transactions). For this particular exercise, we created a very large database of 8 terabytes in total. This database is stored on the SE9990V using Oracle ASM. We query 100 million customer identifiers on this very large database in order to create an I/O-intensive (but not I/O-bound) workload similar to the largest OLTP installations in the world (example: the E25ks running the bulk load of Oracle internal applications). The exact throughput in transactions per second and the average response times are reported and coalesced for each scalability level. For this test, we used Solaris 10 Update 4 & 8, Java version 1.6 build 16, and the Oracle database server 10.2.0.4.
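As an illustration of what "stream mode" means in practice, here is a minimal sketch of a closed-loop driver: each stream issues transactions back to back with no think time, and throughput and average response time are reported per scalability level. This is not iGenOLTP code. The names `run_transaction`, `stream_worker` and `run_level` are hypothetical, and the transaction is a placeholder sleep rather than a real JDBC query.

```python
import threading
import time


def run_transaction():
    """Hypothetical stand-in for one order-entry transaction."""
    time.sleep(0.002)  # placeholder work, not a real database query


def stream_worker(deadline, latencies, lock):
    """Issue transactions back to back (no think time) until the deadline."""
    local = []
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        run_transaction()
        local.append(time.perf_counter() - start)
    with lock:
        latencies.extend(local)


def run_level(n_streams, duration_s):
    """Run one scalability level; return (transactions/s, avg response time in s)."""
    latencies, lock = [], threading.Lock()
    begin = time.perf_counter()
    deadline = begin + duration_s
    threads = [
        threading.Thread(target=stream_worker, args=(deadline, latencies, lock))
        for _ in range(n_streams)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - begin
    tps = len(latencies) / elapsed
    avg = sum(latencies) / len(latencies) if latencies else float("nan")
    return tps, avg


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        tps, avg = run_level(n, duration_s=1.0)
        print(f"{n:2d} streams: {tps:8.1f} tx/s, avg response {avg * 1000:.2f} ms")
```

Because there is no wait time between transactions, throughput at each level is bounded only by how fast the system under test answers, which is why peak throughput and response time are read together.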

Performance notes:

At peak, the new SPARC64 VII 2.88 GHz produces 1.10x the OLTP throughput of the 2.52 GHz on S10U8.

But compared to the 2.52 GHz chips on S10U4, the ratio is 1.54x, and compared to the SPARC64 VI it is 2.38x.

For a customer willing to upgrade an E25k equipped with 1.5 GHz chips, the throughput ratio is 4.125! It means that we can easily replace an 8-board E25k with a 2-board M8000 for better throughput and improved response times.
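The ratios above can be combined with a little arithmetic. The sketch below is only a reading aid: it assumes the chip gain and the Solaris gain multiply, and it sizes the replacement under the simplifying assumptions of equal chips per board and linear scaling. The helper names `solaris_factor` and `boards_needed` are hypothetical.

```python
import math

# Peak-throughput ratios quoted above, all relative to the
# SPARC64 VII 2.88 GHz running Solaris 10 Update 8.
VS_VII_252_S10U8 = 1.10  # same OS, 2.52 GHz chip
VS_VII_252_S10U4 = 1.54  # 2.52 GHz chip on the older OS
VS_E25K = 4.125          # E25k with 1.5 GHz UltraSPARC IV+


def solaris_factor(total, chip_only):
    """Implied S10U4 -> S10U8 gain on the 2.52 GHz chip (assumes gains multiply)."""
    return total / chip_only


def boards_needed(old_boards, throughput_ratio):
    """Boards needed to match the old throughput (assumes linear scaling)."""
    return math.ceil(old_boards / throughput_ratio)


if __name__ == "__main__":
    print(f"frequency ratio:   {2.88 / 2.52:.2f}")
    print(f"implied OS factor: {solaris_factor(VS_VII_252_S10U4, VS_VII_252_S10U8):.2f}")
    print(f"boards to replace an 8-board E25k: {boards_needed(8, VS_E25K)}")
```

Under these assumptions the Solaris update accounts for roughly a 1.40x gain on the same 2.52 GHz chip, noticeably more than the 1.10x measured for the new chip on the same OS.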

As expected, Oracle OLTP improvements due to the new SPARC64 VII chip are modest when using the latest Solaris 10. However, all customers already in production on a previous release of Solaris 10 will see throughput improvements of up to 1.54x. Most likely, this is enough to motivate a refresh of their systems. And all E25k customers now have a very interesting value proposition with our M8000 and M9000 chassis.