Oracle Blog

Richard Elling's Weblog

Monday Oct 13, 2008

Reliability, Availability, and Serviceability (RAS) in the Sun
SPARC Enterprise T5440 builds upon the solid foundations created for
the Sun SPARC Enterprise T5140, T5240, and Sun Fire X4600 M2
servers. The large number of CPU cores available in the T5440 needs
large amounts of I/O capability to balance the design. The physical
design of the X4600 M2 servers was a natural candidate for the new
design: modular CPU and memory cards along with plenty of slots
for I/O expansion. We've also seen good field reliability from the
X4600 M2 servers and their components. The T5440 is an excellent
example of how leveraging the best parts of these other designs has
resulted in a very reliable and serviceable system.

The trade-offs required for scaling from a single board design to
a larger, multiple board design always impact reliability of the
server. Additional connectors and other parts also contribute to
increased failure rates, or lower reliability. On the other hand, the
ability to replace a major component without replacing a whole
motherboard increases serviceability – and lowers operating costs.
The additional parts which enable the system to scale also have an
impact on performance, as some of my colleagues have noted. When
comparing systems on a single aspect of the RAS and performance
spectrum, you can miss important design characteristics, or worse,
misunderstand how the trade-offs impact the overall suitability of a
system. To get better insight into how to apply highly scalable
systems to a complex task, we prefer to do a performability
analysis.

The T5440 has almost exactly twice the performance capabilities of
the T5220. If you have a workload which previously required four
T5220s with a spare (for availability), then you should be able to
host that workload on only two T5440s and a spare. Using benchmarks
for sizing is the best way to compare, and we can generally see that
a T5440 is six times more capable than a Sun Fire V490 server, so the
same workload would need twelve V490s and a spare. This completes a
comparable performance sizing.
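
The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the benchmark-based sizing method itself: the relative capacities are derived from the ratios quoted in the text (a T5440 is about 2x a T5220 and about 6x a V490), and the names `RELATIVE_CAPACITY` and `servers_needed` are made up for this example.

```python
import math

# Relative capacity per server, normalized so that a Sun Fire V490 = 1.
# Assumption: derived from the ratios quoted in the text, not from benchmark data.
RELATIVE_CAPACITY = {
    "T5440": 6.0,
    "T5220": 3.0,
    "V490": 1.0,
}


def servers_needed(workload, capacity, spares=1):
    """Return (active, spares): servers required to carry `workload` capacity units."""
    active = math.ceil(workload / capacity)
    return active, spares


# The workload from the text: it fills four T5220s.
workload = 4 * RELATIVE_CAPACITY["T5220"]

sizing = {name: servers_needed(workload, cap)
          for name, cap in RELATIVE_CAPACITY.items()}

for name, (active, spares) in sizing.items():
    print(f"{name}: {active} + {spares}")
```

Under these assumed ratios, the workload lands on 2 + 1 T5440s, 4 + 1 T5220s, or 12 + 1 V490s.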

On the RAS side, a single T5440 is more reliable than two T5220s,
so there is a reliability gain. But for a performability analysis,
that gain must be weighed against the smaller number of T5440s. For
example, if the workload requires 4 servers and we add a spare, then
the system is considered performant when 4 of 5 servers are available.
As we consolidate onto fewer servers, the model changes accordingly:
for 2 servers and a spare, the system is performant when 2 of 3
servers are available. The reliability gain of using fewer servers can
be readily seen in the number of yearly service calls expected. Fewer
servers tend to mean fewer service calls. The math behind this can
become complicated for large clusters and is arguably counter-intuitive
at times. Fortunately, our RAS modeling tools can handle very
complicated systems relatively easily.
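
To give a feel for the "k of n servers available" idea, here is a minimal sketch of the simplest such model. It is not the RAS modeling tool mentioned above: it assumes identical, independent servers and ignores repair times, service response, and common-mode failures. The per-server availability `a` is a hypothetical value chosen only for illustration.

```python
from math import comb


def k_of_n_availability(k, n, a):
    """Probability that at least k of n independent servers are up,
    when each server is up with probability a (binomial model)."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))


# Hypothetical per-server availability, not a measured or published figure.
a = 0.999

four_of_five = k_of_n_availability(4, 5, a)  # 4 servers needed, 1 spare
two_of_three = k_of_n_availability(2, 3, a)  # 2 servers needed, 1 spare

print(f"4 of 5: {four_of_five:.8f}")
print(f"2 of 3: {two_of_three:.8f}")
```

With the same per-server availability, the 2-of-3 arrangement comes out ahead in this toy model. Real systems differ in their per-server figures, which is one reason intuition alone is not enough.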

We build availability models for all of our systems and use the
same service parameters to permit easy comparisons. For example, we
would model all systems with an 8-hour service response time. The models
are then compared as follows:

System                              Units    Performability   Yearly Services
Sun SPARC Enterprise T5440 server   2 + 1    0.99999903       0.585
Sun SPARC Enterprise T5240 server   4 + 1    0.99999909       0.661
Sun SPARC Enterprise T5140 server   4 + 1    0.99999915       0.687
Sun Fire V490 server                12 + 1   0.99998644       1.402

In these results, you can see that the T5440 clearly wins on the
number of units and yearly services. Both of these metrics impact
total cost of ownership (TCO), as the complexity of an environment is
generally attributed to the number of OS instances, and fewer servers
generally mean fewer OS instances. Fewer service calls mean fewer
problems that require physical human interactions.

You can also see that the performability of the T5x40 systems is
very similar. Any of these systems will be much better than a system
of V490 servers.
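
One way to read the performability column is to convert it into expected time per year spent below the required capacity. The sketch below assumes performability can be treated as the long-run fraction of time the system is performant; that interpretation is an assumption made here, and the dictionary keys are just labels for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

# Performability values copied from the comparison table.
performability = {
    "T5440 (2 + 1)": 0.99999903,
    "T5240 (4 + 1)": 0.99999909,
    "T5140 (4 + 1)": 0.99999915,
    "V490 (12 + 1)": 0.99998644,
}


def non_performant_minutes(p):
    """Expected minutes per year below required capacity,
    assuming p is the long-run fraction of time the system is performant."""
    return (1.0 - p) * MINUTES_PER_YEAR


minutes = {name: non_performant_minutes(p) for name, p in performability.items()}

for name, m in minutes.items():
    print(f"{name}: about {m:.2f} minutes per year")
```

Read this way, the three T5x40 configurations each come to roughly half a minute per year, while the V490 configuration comes to about seven minutes.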