Recently, industry trade journals
have announced Oracle's ability to migrate Virtual Machines (VMs)
safely and securely from one execution environment to another on
SPARC T-based servers. Actually, the OracleVM Server for SPARC Data
Sheet claims what might be considered rather run-of-the-mill for
state-of-the-art server virtualization. Even assuming basic
virtualization capabilities, the architecture of the Sun-cum-Oracle
SPARC T processors does not lend itself well to “tacking on”
virtualization and expecting low-overhead performance. If one is stuck with
either unportable Solaris source code or SPARC binary images and
simply needs the code to run, albeit degraded, using OracleVM to host
an old Solaris 8 VM on a SPARC T server could be a temporary
solution. However, as for OracleVM and a SPARC T being a powerful
virtualized consolidation platform, one needs to think twice. Here is
why:

When considering server consolidation,
it soon becomes clear that without good planning, including adding
multiple paths to network connections and storage spindles, having
more than sufficient RAM (roughly, closing one's eyes and at least
summing up what each stovepipe server used) and choosing an appropriate
target execution architecture, problems soon arise. Modern
conventional processors such as IBM POWER and Intel x86 were designed
for wide instruction execution capability using compilers that extract
maximum Instruction Level Parallelism. Effective execution of wide
ILP code requires a processor design including advanced
branch prediction, large out of order execution windows, large and
extremely fast caches, enhanced with the ability to execute more than
one execution thread per core -- simultaneously. A single 8-core IBM POWER7 processor, for example, can execute 32 threads all in the same
clock tick. This is in contrast with Oracle's SPARC T
processors, which at any time execute only one thread per clock per
core, regardless of the number of thread contexts that are pending.
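
To make the contrast concrete, here is the arithmetic as a trivial script (the SPARC T3 figures are the ones cited later in this piece; treat the exercise as illustrative):

    # Threads actually executing in one clock tick (illustrative figures).
    power7_cores, power7_smt = 8, 4        # POWER7: 8 cores, SMT4
    print(power7_cores * power7_smt)       # 32 threads per clock

    t3_cores, t3_contexts = 16, 8          # SPARC T3: 16 cores, 8 contexts each
    print(t3_cores * t3_contexts)          # 128 thread contexts held...
    print(t3_cores * 1)                    # ...but only 16 threads per clock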

Modern, multi-threaded operating
systems such as AIX, Solaris, Linux, and even Windows keep track of each
thread context's status, switch between threads on the order of
milliseconds, enable prioritized thread preemption, task management
and switching, and page virtualized memory in and out. It is no
coincidence that processors with clever branch prediction,
large and sophisticated caches, fast clocks, and large memory spaces
not only have the highest performance (as seen on www.spec.org,
for example) but are successful platforms for server consolidation.

If we reduce the execution clocks of
these successful processors by half, reduce the cache sizes by four
or eight times, eliminate the L3 cache completely, reduce instruction
execution width to one, and remove any branch prediction, can we expect
spectacular server consolidation performance from these nearly
choked processors? Not a chance! This characterizes the SPARC T
processor. To be fair, there are applications that lend themselves
to a processor that switches available thin thread contexts on L1
cache misses, but those are generally associated with applications
such as specific web farms and functions such as the UNIX dd command.

In virtualized environments, system RAM
is doubly virtual: on top of each operating system's (HW MMU-assisted)
address translation, the hypervisor must satisfy and keep track of
address translation for multiple VMs. Threads should not switch on an L1
cache miss, but rather when the VMs demand it. Even on the latest SPARC T3
processor, with 8KB instruction and 6KB data L1 caches and 16
cores sharing a 6MB L2 cache, cache thrashing and thread stalling must be
tremendous in a virtualized environment.
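
To see why translation work multiplies under virtualization, consider a minimal two-level lookup sketch: every guest virtual address is resolved first through the guest OS page tables and then through the hypervisor's per-VM map. The table contents below are hypothetical stand-ins, not any vendor's actual MMU format:

    # Minimal sketch: two-level address translation under a hypervisor.
    PAGE = 4096                   # assumed 4KB pages

    guest_pt = {0: 8}             # guest-virtual page -> guest-physical page
    host_pt = {8: 2}              # guest-physical page -> host-physical page (one map per VM)

    def translate(vaddr):
        vpn, offset = divmod(vaddr, PAGE)
        gppn = guest_pt[vpn]      # level 1: guest OS page tables (HW MMU)
        hppn = host_pt[gppn]      # level 2: the hypervisor's per-VM map
        return hppn * PAGE + offset

    print(hex(translate(0x123)))  # 0x2123 -- two lookups for every access

Every TLB or cache miss along this chain is extra memory traffic, which is exactly where tiny L1 caches hurt.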

In stark contrast, IBM PowerVM
is not only based on one of the highest-performance processor
architectures (IBM POWER7: 32KB L1, 256KB L2, and 8 cores sharing a 32
MB L3), but has the ability to virtualize each core into 10 logical
processor increments (including the four simultaneously executing
threads), create processor pools, cap or uncap logical domains, and
migrate domains, all without a single reported hypervisor security fault.
Go to http://web.nvd.nist.gov/view/vuln/search
and enter powervm, oraclevm, and vmware separately to see the
current results.

There has been quite a lot of talk
about ARM Holdings and the ARM processor lately. Some of this is due
to the pervasiveness of its architecture in many mobile devices, some
of it is due to extensive hype over “new technology” versus “old
technology” – an unfortunate metaphor.

Are we to believe processor designers
who license the rights to the ARM processor technology are going to
“one up” traditional server processor architectures simply
because they started out with a stripped down, energy-efficient CPU? Let's take a look at why not!

Benchmark results specifically
targeting these low-power processors have begun to be published.
Many of these benchmarks are based on the Dhrystone benchmark, run on
8088-class processors back in the 1980s! Performance for this class
of processor is usually measured in DMIPS (Dhrystone Millions of
Instructions per Second), roughly normalized to a VAX 11/780 MIPS. These benchmarks
are a far cry from industry standard benchmarks such as the SPEC
suites or the TPC-C warehouse database benchmark. Before one starts yelling
that the ARM class of processors cannot be expected to do well on such
benchmarks, consider: one cannot simultaneously reject so-called “old
technology” while extolling the wonders of a 30-hour-battery handheld
tablet processor in micro servers. It would indeed be interesting to see
SPECint2006 results for these processors, but none seem to exist. The
same goes for a TPC-C result. It is noteworthy that a dual-core 1.6 GHz
Atom processor generates about 8000 DMIPS and a dual-core Cortex A9
about 4000. This means that if Intel had to drop its clock to, say,
1 GHz to be in the same heat dissipation range as the Cortex A9, they
would have “similar” performance – in a single socket
environment.
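
The clock-scaling claim is simple proportionality; a rough check, assuming DMIPS scale linearly with clock (itself a simplification):

    # Rough DMIPS scaling check using the figures quoted above.
    atom_dmips, atom_ghz = 8000, 1.6       # dual-core 1.6 GHz Atom
    a9_dmips = 4000                        # dual-core Cortex A9

    atom_at_1ghz = atom_dmips * (1.0 / atom_ghz)
    print(atom_at_1ghz)                    # 5000 -- same ballpark as the A9's 4000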

In reality, “new technology” (ARM)
and “old technology” (Intel, AMD, IBM POWER) are two different
technologies, neither chronologically distinct. If we expect to see a
farm of micro servers each with 100 ARM or ARM-like Systems on a Chip
in 1U form factors, one should expect they will be running commercial
grade applications, the least of which would be web and database
servers. Would we see a SPECweb2005 result published for a 1024-socket
ARM-based micro web server? We had better.

Is one supposed to assume that
designers of Intel x86 or IBM POWER are simply wasting millions of
transistors due to negligence? No! Will the processors in the “new
technology” micro servers use a new way for cache coherency
heretofore unknown to the world? I doubt it. SMP cache coherency uses
transistors and consumes bandwidth. As more performance is demanded
from these ARM-class micro servers, processor designers will slowly
be incorporating techniques from “old technology” such as huge
out of order execution windows, complex caches, novel inter-socket
communications, multi-threaded execution and the ability to address
huge memory spaces. All these require complexity, transistors, and
watts. By the time all this has been accomplished, the wheel will have
been re-invented again, with these micro servers dissipating about the
same heat as the “old technology” processors. If it takes a
given number of transistors to perform some advanced function such as
wide instruction execution or complex branch prediction, the
ARM class of processors cannot perform such functions while
simultaneously violating the laws of solid state physics.

The hype surrounding this “new
technology” sounds strikingly familiar to what Sun Microsystems
claimed in the last half of the previous decade regarding its
“disruptive” Niagara “technology”. Sun said Thread Level
Parallelism was taking over the data center, since single thread
(Instruction Level Parallelism) was out of gas. Intel didn't think
so! AMD didn't think so! IBM didn't think so! Sun placed eight very
simplistic SPARC cores on a die, with each core executing, at any given
clock tick, one of up to eight thread contexts. Sun claimed clock
speed didn't matter because slow memory interfaces and long latencies
determined system throughput, not clock. Sun could claim something on
the order of a watt per [thin] thread context, versus perhaps 25W per
[heavy] thread from its competition. Well, about half a decade later
Sun+Oracle have reached a point where their processors now dissipate
basically the same amount of heat as established Intel, AMD, or IBM
POWER processors, and are considering reducing thread count and
cranking up the clock – to catch up with the competition.
Sun's [now Oracle's] competition never felt the need to sacrifice
single thread performance, all the while adding cores and real
Simultaneous Multi Threading. The IBM POWER7 now has eight cores,
each capable of executing 4 instruction threads at the same time. A
single POWER7 can execute 32 threads simultaneously at a clock rate
nearly triple that of Oracle's Niagara-based processors. So much for
the hype! Something similar will have to happen with the “new
technologies” such as ARM-class processors in micro servers if they
expect to play with the “old technology” big boys.

This one is for the books! It appears
Sun has passed the “report an obscure benchmark you do well on”
tradition on to Oracle. Sun used to report success with the
Manugistics NetWORKS Fulfillment Benchmark. Good luck finding the
latest results for that “industry standard” benchmark.

While the JD Edwards "Day In the
Life" is an active benchmark, it certainly is not in the
category of industry standard, such as the SPEC suites or TPC-C. It is
so obscure that Oracle didn't bother to provide a direct reference in
their announcement for the reader to make sense of the results. A
googling of “JD Edwards "Day In the Life" benchmark”
produced an IBM white paper that provided the following reference in Appendix B:

For a single socket SPARC T3 to
have 25% better results than a single socket POWER7, Oracle needed
twice the number of cores and four times the threads as the IBM
POWER7.
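
Normalizing per core makes the point plainly (a quick sketch using only the ratios quoted above):

    # Per-core comparison from the ratios quoted above.
    result_ratio = 1.25          # T3 socket result: 25% better
    core_ratio = 2.0             # ...achieved with twice the cores

    print(result_ratio / core_ratio)   # 0.625 -- each T3 core does ~5/8 the work of a POWER7 core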

Oracle compared their just
released SPARC T3-1 results with that of an IBM POWER6, a product
announced almost FOUR YEARS ago. This is very disingenuous of Oracle
and assumes their customers will not bother to check if Oracle is
making apples-to-apples comparisons. Sun used to assume this.

Oracle then compared their new
SPARC T3-1 server results to an IBM x3650M2, 2x2.93 GHz X5570, with
64GB of memory – half the RAM of the Oracle T3 machine. It is
incumbent on Oracle to compare their new machine with a comparably
configured IBM x86 server, that is, one with 128GB, or provide
results for SPARC T3-1 server with 64GB of RAM. Neglecting to do so
will result in more of Oracle's performance claims coming under
increased scrutiny.

Oracle claims
their SPARC T3-1 is 5X faster than the IBM x3650 M2. This claim
is not conclusive. A server cannot be 5X faster simply because the
benchmark reports it serviced 5X the number of users. Moreover, the
IBM x3650 M2's response time is 0.29 seconds compared with the latest
SPARC T3-1's 0.523 seconds. If response time is more important to the
user, the year-and-a-half-old IBM x3650 M2, with half the RAM and half
the core count, is about 2X as responsive as the latest SPARC T3-1
servers. In fact, the amount of available system RAM usually has a
direct relationship to the number of users. It will be
interesting to see what results the latest IBM x3650 M3 with 128GB of
RAM will produce.
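
The responsiveness arithmetic is straightforward:

    # Response-time ratio from the published figures above.
    x3650_m2_rt, t3_1_rt = 0.29, 0.523     # seconds
    print(round(t3_1_rt / x3650_m2_rt, 2)) # ~1.8 -- about 2X as responsive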

If core license
costs are important, IBM's year-old POWER7 750 uses half the number
of cores for about the same benchmark performance as the just-released
Oracle SPARC T3-1 server. This Oracle benchmark
announcement demonstrates that the latest SPARC T3-1 server, at a
minimum, carries a business application suite licensing cost of 2X over
that for IBM POWER7 750 servers. Oracle is telling us that SW costs
based on cores could double by using their HW over POWER7 servers from IBM.

Oracle should not
assume that the readers of its benchmark results will believe their
claims without investigation!

IBM Watson not only succeeded in
subjugating its human opponents on Jeopardy, but Watson might someday
be the motivator for Sonny and the re-programmed robots
(http://en.wikipedia.org/wiki/I,_Robot_%28film%29)
to move from Chicago, USA to Toronto, CA :-)

All kidding aside, industry pundits
have been seriously speculating about alternate uses for IBM Watson.
While dramatic extrapolations to HAL9000, SkyNet, and I, Robot
abound, less has been said about Watson's immediate usefulness.
Rather than dwell on the Jeopardy “version” of Watson with its
ninety IBM POWER7 servers
(http://www-03.ibm.com/systems/power/advantages/watson/) clustered together with an aggregate
memory size of 15TB, it may be more useful to look at non-game show
“versions”. We generally know what Watson had to accomplish
within the three-second (see previous blog) Jeopardy response time
rule. What may be more interesting is to consider classes of problems
where response time demands are on the order of minutes, where a 15TB
data set is not required, or where a Watson-like construct can aid in
narrowing down solutions to sets of possibilities, rather than a
single exact response – and let humans decide and execute on a
critical choice. Versions of IBM Watson/DeepQA can be architected and
have their data digested as a function of the problem being addressed.
Alternate, larger data set versions, or versions requiring a response
in less than a second, can be designed. Let's look at some examples
from the 50,000-foot level.

Medical and Health Services

A researcher is confronted with a set
of patient symptoms never learned in school; some symptoms look
familiar to seasoned colleagues, but not all of them at the same
time. Is this a new disease, a mutation, something that may have
always existed but was never categorized in a way that was recognized?
Traditional hospital databases can be scanned, results correlated as
best can be, but still nothing definite. This is what might take
place today, assuming these databases were constructed for more
than just billing purposes. A Watson-like derivative could be
designed to ingest patient data with specific annotations allowing
correlations that would greatly enhance the chance of narrowing down
the knowledge required to identify and eventually treat what appears
at first to be an uncategorized disease. This capability may be vital
for health services located in rural areas where a Watson-like system
has the proven knowledge of millions of medical experts and studies.
Imagine making a query on a surgical procedure and finding out that a
technique abandoned twenty years ago has a better chance of being
successful than what is used today because of a heretofore
unsuspected interaction from combinations of patient symptoms or new
hormonal balances resulting from subsets of prescribed modern
medicines. Replace bacterial, microbial, or oncological diseases,
etc., with determining patterns between psychiatric symptoms and
effectiveness of classes of past treatments – this is another
variant of a Watson solution. Would Watson completely replace a
doctor? Probably not, but it could start off as a trusted advisor, and
the role of a doctor may be changed forever.

Financial and Economic Analysis

Pumping through piles of financial and
economic data looking for patterns, uncovering relationships between
seemingly related events already consumes vast amounts of computing
power in such places as Wall St, London, and Hong Kong. The ability
to act on distilled, structured information is generally left to
analysts – except for programmed trading. Programmed trading
systems react faster than humans to prevailing conditions, but lack
the capability to respond to exogenous events outside of their
rule-based models. They act more like a chess program than Watson on
Jeopardy. The experts on Wall St cannot possibly program all the
rules of the particular game with the hope that combinations of
dynamic market and economic data will hit one of them. A system
designed to dynamically digest unstructured data (examples include
libraries of texts on economics, university lecture notes, radio and
TV programs, blogs, etc.), create relationships with static data, and
purposely distribute this information across processing nodes to
minimize redundancy and maximize processing is much more capable of
efficiently ascertaining risk than having a rule for every possible
combination of known financial and economic data. Having a machine
with near-instantaneous access to machine-learned data from the world's
leading economists and financial analysts might make a nice companion
on the trading floor, considering it appears reluctant to bet the
house if it is not confident of a position. Investors Business Daily
has an interesting take on Watson-like capabilities
(http://www.investors.com/NewsAndAnalysis/Article.aspx?id=562978).

And Briefly...

Tech Support and Help Desks

IBM Watson could cannibalize most forms
of current consumer technical support. Could it be worse than what
goes for telephone-based tech support today?

Law, Patent and Trademarks

Not only could a Watson-like capability
minimize sifting through existing databases of laws, prior cases,
rulings, hearings, and opinions, it could also be used as a method of
testing witness questions, or to suggest a series of inquiries and
questions for litigation. It could be used to simulate certain
judges, prosecution and defense lawyers, based on prior cases.

A Watson-like system could generate
questions for a prospective patent claim based on its ingestion of
the entire patent and trademark database.

National Defense Planning and
Intelligence

The amount of structured data and
especially dynamic unstructured data that can be associated with
military and defense planning is enormous and expanding rapidly.
Imagine military decision support systems augmented by a system that has data
on all previous military campaigns, past and current international
relationships, and archives of all international military school
generated data including books, theses, lectures, military doctrines,
etc. Continuous ingestion of real-time data would expand
existing relationships. An inquiry that might be forwarded to such a
system might include, “What would Sun Tzu do given the immediate
crisis?”

Other areas include manufacturing,
homeland security, local law enforcement agencies, etc.

A clear pattern is emerging here. Tasks
that traditionally involve humans remembering, making intelligent
guesses and informed estimates, even if backed up by sifting through
mountains of data, could be greatly enhanced or even replaced by
Watson and its derivatives.

As with most new technologies,
something is gained but something is lost. Most people reading this
used to remember important telephone numbers. Today, with perhaps
hundreds stored in a cell phone, the ability to recall telephone
numbers is almost a lost talent. However, nobody seems to be
complaining!

Today, 2/14/2011, the first of three Jeopardy! sessions between the top two Jeopardy! champions and IBM Watson will air on national TV. As each question is asked, a lot will be taking place, and many of us will be wondering just what is going on inside IBM Watson. Just what is going on?

While IBM Watson's entire execution infrastructure has not been
published, we do know that each compute element consists of a
commercially available IBM POWER 750 server.

The entire interconnected cluster looks like a set of library shelves.

Many of us will wonder, each time a question is asked, what is going on in the three seconds given to the contestants. As humans, we can more or less understand being a Jeopardy! contestant. Many people will invariably not know the answer within three seconds but will retort after the correct response is made -- "Oh, I knew that!". Watson is not doing that, although it has been reported that IBM Watson has a good idea of the types of questions and answers that have been previously asked on Jeopardy! In contrast, Watson's POWER7 processors are pumping through 15 TB of data (equivalent to about 200 million pages of text) at a rate of 500 GB/s each, concurrently. But first, Watson has to understand the question. It has to determine verbs, nouns, objects and, moreover, nuances in the English language not generally part of the standard English 101 class. Next, Watson must look for the best answer. What might be the basic applications that are used to accomplish this massive task?

It has been reported that Watson runs on Linux, and that DeepQA (Watson's SW application stack) uses Hadoop and UIMA applications. UIMA stands for Unstructured Information Management Architecture, and according to Wikipedia, "UIMA is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies developed by IBM. The source code for a reference implementation of this framework has been made available on SourceForge, and later on Apache Software Foundation website." This is an application that intelligently digests and correlates information that otherwise appears amorphous.

Upon reviewing my previous blog entry, https://www-950.ibm.com/blogs/davidian/entry/what_runs_watson_and_why16?lang=en_us, IBM Watson has 4 TB of storage, but has 16 TB of system-wide memory. Such an architecture suggests an in-memory database or at least in-memory data structures. Indeed, Watson uses Apache's Hadoop framework to facilitate preprocessing the large volume of data in order to create in-memory datasets. To provide effective CPU scheduling, the file system includes location awareness, that is, the physical location of each node, rack & network switch. Hadoop applications can use this information to schedule work on the node where the data is, and, failing that, on the same rack/switch, reducing backbone traffic. The Hadoop file system uses this when replicating data, trying to keep different copies of the data on different racks.
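
As a sketch of the locality preference just described (node first, then rack, then anywhere), consider the toy scheduler below. The "/rack/node" location strings are a simplification of mine, not Hadoop's actual API:

    # Toy sketch of Hadoop-style rack-aware task placement.
    def cost(data_loc, worker_loc):
        """0 = data-local, 1 = rack-local, 2 = off-rack (backbone traffic)."""
        if data_loc == worker_loc:
            return 0
        if data_loc.split("/")[1] == worker_loc.split("/")[1]:
            return 1
        return 2

    def schedule(data_loc, idle_workers):
        # Prefer the idle worker closest to the data block.
        return min(idle_workers, key=lambda w: cost(data_loc, w))

    workers = ["/rack1/nodeA", "/rack1/nodeB", "/rack2/nodeC"]
    print(schedule("/rack1/nodeB", workers))   # /rack1/nodeB: data-local wins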

"Watson’s DeepQA UIMA annotators were deployed as mappers in the Hadoop map-reduce framework, which distributed them across processors in the cluster. Hadoop contributes to optimal CPU utilization and also provides convenient tools for deploying, managing, and monitoring the data "analysis process." For more information see: http://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=POW03061USEN&attachment=POW03061USEN.PDF&appname=STGE_PO_PO_USEN_WH

When watching Jeopardy! tonight try to keep in mind that for every question, IBM Watson has to, as a minimum, within 3 seconds:

Take the stated question and parse its components

Determine relationships between grammatical elements

Create items that it must look for or relationships that may expand its search

Have Hadoop dispatch work to access information that UIMA has intelligently digested and annotated

Have 2880 POWER7 cores process through TBs of data looking for the best set of results

Have DeepQA determine what it considers the best response, and

Press a mechanical button as do the human contestants and express the answer in English.

IBM Watson http://www-03.ibm.com/innovation/us/watson/index.shtml, the computer that will compete against the top two Jeopardy! champions on February 14-16, 2011, is constructed using a commercially available computing platform from IBM. The IBM Watson is a massively parallel system based on the IBM POWER7 750 in a standard rack mounted configuration.

IBM Watson comprises ninety IBM POWER 750 servers, 16 Terabytes of memory, and 4 Terabytes of clustered storage. This is enclosed in ten racks including the servers, networking, shared disk system, and cluster controllers. Each of these ninety POWER 750 servers has four POWER7 processors, each with eight cores. IBM Watson has a total of 2880 POWER7 cores.

Watson runs IBM DeepQA software, http://www.research.ibm.com/deepqa/deepqa.shtml, which scales out with and searches vast amounts of unstructured information. Effective execution of this software, corresponding to a less-than-three-second response time to a Jeopardy! question, is not just based on raw execution power. Effective system throughput includes having available data to crunch on. Without an efficient memory sub-system, no amount of compute power will yield effective results. A balanced design comprises main memory, several levels of local cache, and execution power. IBM's POWER 750's scalable design is capable of filling execution pipelines with instructions and data, keeping all the POWER7 processor cores busy. At 3.55 GHz, each of Watson's POWER7 processors has an on-chip bandwidth of 500 Gigabytes per second. The total on-chip bandwidth for Watson's 360 POWER7 processors is an astounding 180,000 Gigabytes per second! It is no accident that an IBM POWER7-based technology serves as the basic hardware building block for IBM Watson.
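
Those aggregate figures check out from the per-unit numbers, and they put the three-second budget in perspective (the streaming calculation is an upper bound of mine, not a claim about the real workload):

    # Sanity-checking Watson's aggregate figures.
    servers, chips_per_server, cores_per_chip = 90, 4, 8
    chip_bw = 500                          # GB/s on-chip bandwidth per POWER7

    chips = servers * chips_per_server
    print(chips)                           # 360 POWER7 processors
    print(chips * cores_per_chip)          # 2880 cores
    print(chips * chip_bw)                 # 180,000 GB/s aggregate

    # Streaming the ~15 TB corpus at that aggregate rate:
    print(15000.0 / (chips * chip_bw))     # ~0.083 s -- well inside 3 seconds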

If there were an industry standard performance benchmark for playing Jeopardy!, such as specJeopardy!2011, there would be only one published result.

The Register's
Timothy Prickett Morgan in
http://www.theregister.co.uk/2011/01/14/ibm_watson_jeopardy_dry_run/
referred to IBM Watson's avatar as the evil Skynet. In that article,
Watson beats humans in Jeopardy! dry run,
Morgan noted that Watson is a Linux cluster of IBM POWER7-based p750
servers. For a pundit who suggested the Watson Jeopardy! event is
perhaps a veiled marketing ploy, he gratuitously added, “Watson QA
software is running on 10 racks of these machines, which have a total
of 2,880 Power7 cores and 15 TB of main memory spread across this
system. The Watson QA system is not linked to any external data
sources, but has a database of around 200 million pages of "natural
language content," which IBM says is roughly equivalent to the
data stored in 1 million books.”

It was stated in several reports that
Watson finds language ambiguity a challenge.
Perhaps this makes Watson more human-like than we think, as
interpreting ambiguity in speech is generally a learned ability.

Myth has it that the name of the Discovery One
spacecraft's HAL9000 computer in 2001: A Space Odyssey
is a one-letter shift from the letters IBM, as in IBM9000. I suspect
we should start worrying when the next generation of IBM lip-reading,
human-like technology argues with us as in the classic “open the
pod bay doors”... http://www.youtube.com/watch?v=kkyUMmNl4hk

When Sun Microsystems' native SPARC
processors were sucking wind, Sun marketing began talking down single
threaded, high-clocked, large, fast cache-based execution environments
in favor of a mythical transformation of almost all applications into
thread-rich execution environments. Sun made the term Thread Level
Parallelism [TLP] prolific. Now that Oracle has purchased Sun, we read
that single threaded, high-clock rate execution is being demanded by
Oracle applications. Changing horses twice mid-stream does not
impress data center managers.

“Oracle has been promising a 3X
improvement in "single strand" performance, which everyone takes to
mean clock speed.”

and

“...Oracle might be overclocking the
Sparc chips to reach the 5 GHz stratosphere of chip clock speeds.
While this might not be the case, the question we need to be asking
Oracle - and remember, Oracle doesn't answer questions - is: if not,
why not?”

Throughout
the 2000s, Sun's customers were expecting explanations for its
traditional UltraSPARC processors' lack of performance. In reality,
Sun, via Texas Instruments [TI], was not able to successfully
fabricate traditional high-clocked, large-cache, state-of-the-art
processors. Traditional processors, such as IBM POWER or Intel x86,
were designed to maximize Instruction Level Parallelism [ILP] with
fast single thread execution.

In
mid-2002 Sun purchased Afara, the firm that designed processors with
slow-clocks and simple cores able to maximize the executions of many
threads. TI was able to fabricate these processors with simple cores
and small caches, and place identical copies on a single die. This
created Sun's Niagara processor line, known today as the UltraT1, 2,
3, etc. Sun began its CMT marketing campaign claiming that processor
clocks have reached an asymptote and memory performance was scaling
at 1/3 that of processor clocks, condemning traditional execution to
the dust bin of history. Sun's CMT technology was purported to save
the data center and do so at a low heat dissipation per thread
regime. Sun's argument was that ILP has reached the end of the line,
processor clocking had reached the point of creating unimaginable
power densities, and memory technology was never going to catch up.

Sun's
CMT contrarian market hype was taking place as the IBM POWER4, the first
commercial general purpose multi-core processor, was setting
performance records and Intel's Xeons were approaching 4 GHz.
IBM's POWER6 hit 5GHz several years ago, and today's IBM System z
(mainframe) processors run at 5.2 GHz. What Sun proclaimed as a
semiconductor technology wall was torn down with clever designs by
IBM, Intel and AMD. Sun sacrificed single thread performance as the
cost of keeping a processor line alive. Sun paid the price as it
lost market share. IBM and Intel today have multiple core
processors running multiple simultaneous threads, never having to
sacrifice single thread performance in the interim.

As
we enter this decade it appears that Sun+Oracle plans on cranking up
the clocks on their CMT processors while keeping the core count
constant. In addition, Sun+Oracle appears to be adopting the
capability to dynamically alter the number of threads per core
allowing more of the CPU core to execute the thread (contrary to its
CMT market hype) and enabling more cache per thread! Sound familiar?
It should, considering IBM introduced it earlier last year calling it
Intelligent Threading (see:
http://www.theregister.co.uk/2010/02/08/ibm_power7_chip_launch/page2.html
)

“Rather than butting heads with the
laws of physics in an attempt to quickly burn though a single
instruction stream (stumbling and stalling along the way), CMT
processors do more by allowing multiple threads to execute in
parallel.”

It wasn't that Sun's processors didn't
meet performance expectations due to the laws of physics. Rather,
Sun failed to meet the challenge of designing and fabricating
processors given the limits of solid state physics. It appears that
Sun+Oracle are playing catch up again against IBM and Intel –
neither of which waited around for the “laws of physics” to ease
up :-).

For
almost the entire decade following Y2K, Sun Microsystems claimed the
TPC-C benchmark was irrelevant, not representative of the modern data
center and, moreover, could not be used for sizing. Subsequently, Sun
didn't publish any TPC-C results. This benchmark alienation came just
after Sun claimed its final world record E10K TPC-C results with
UltraSPARC-II processors and just before Sun introduced the
UltraSPARC-III, circa 2001. These actions were not accidents, nor was
the recent Oracle+Sun claim of a TPC-C result of 30,249,688 tpmC
(see: http://blogs.sun.com/BestPerf/entry/20101202_sparc_t3_4_tpc).

The
UltraSPARC-III had a blocking L1 cache, designed to optimize SPEC
CPU95 benchmark execution. The UltraSPARC-III was late enough that
the SPEC CPU95 was retired and replaced by SPEC CPU2000. SPEC CPU2000
had a larger footprint and a different execution pattern than its
predecessor. Throughout the last decade, Sun's UltraSPARC processors
were plagued by poor single processor industry-standard benchmark
results. For Sun, publishing any TPC-C results would have been very
embarrassing (I know; I was a member of Sun's benchmark council). When
industry standard benchmark results were good, Sun would publish them.
When results turned out poor, the benchmark was attacked. When
results became good “again”, they were published, as they were by
Oracle+Sun on December 2, 2010.

While
the TPC-C benchmark could be characterized by light-weight thread
processing representing, “... the principal activities
(transactions) of an order-entry environment. These transactions
include entering and delivering orders, recording payments, checking
the status of orders, and monitoring the level of stock at the
warehouses” (see: http://www.tpc.org/tpcc/default.asp),
this benchmark does provide a relative measure of the ability of a
system to move data, with processing capability secondary (the handful
of SQL statements is rather trivial). Rapid data movement with
low-quality processing is a forte of Sun's T1, T2, T3, and T4
processors. Interestingly, it was only after Oracle purchased Sun
that TPC-C benchmarks on Sun SPARC were published again. It was known
as far back as 2005 that the UltraT1 generated relatively good TPC-C
results, but because the TPC-C benchmark was deemed worthless, Sun
could not publish them lest it be called on the carpet for blatant
duplicity. Oracle must think today's customers have no medium-term
memory, a poor assumption for a database software company.

TPC-C
results come in two flavors, single or clustered. A single result
represents the capability of a single server with its storage. A
clustered result approximates a cumulative sum of all the machines in
the cluster. The larger the cluster, the better the result. Of course
clustering like this has its mechanical and networking asymptotes,
but generally you can pick a desired tpmC and then cluster servers
and storage until that result is achieved. Sun made this argument a
decade ago as a reason to avoid the TPC-C clustered results. In fact,
Sun used to claim that IBM and others had to cluster their servers to
get even publishable results.
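
In sketch form, the clustering argument is just division (the per-server figure below is a hypothetical, chosen only to show the shape of the exercise):

    import math
    # "Pick a tpmC, then cluster until you hit it" (hypothetical figures).
    per_server_tpmc = 1200000      # assumed contribution of one server + storage
    target_tpmc = 30000000         # desired headline number

    print(int(math.ceil(target_tpmc / float(per_server_tpmc))))   # 25 servers, plus storage to match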

Clustered
TPC-C results can be used for certain comparisons. For example: The
latest Sun+Oracle TPC-C result was achieved using a cluster of
twenty-seven servers with 1726 SPARC processor cores. They then
compared the results with the best IBM result which is a cluster of
three, p780 servers with 192 POWER7 cores. Sun+Oracle has a 3X better
result than IBM with 9X cores and 9X servers. The quotient is left as
an exercise for the reader!
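
For the impatient, the quotient works out as follows:

    # The "exercise for the reader", using the figures above.
    oracle_tpmc, oracle_cores = 30249688, 1726
    ibm_cores = 192
    ibm_tpmc = oracle_tpmc / 3.0           # Oracle's result is ~3X IBM's

    print(round(oracle_tpmc / float(oracle_cores)))   # ~17,526 tpmC per SPARC core
    print(round(ibm_tpmc / ibm_cores))                # ~52,517 tpmC per POWER7 core
    # Per core, the SPARC cluster delivers about one third the throughput.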

Offhand, one could ask: to whom has
Oracle been selling its database software? One might wonder what
credit card, supply chain, etc., OLTP database systems have been
doing the past 20 or 30 years!

Given Ellison's theatrics and
hyperbole, it is worth a peruse of industry standard On Line
Transaction Processing (OLTP) as well as Data Warehousing benchmark
results to determine at least the relative “extreme performance”
of Oracle's Exadata product.

The accepted independent industry
standard benchmark for OLTP database systems is the Transaction
Processing Performance Council's TPC-C benchmark. TPC-C is one of two
TPC's OLTP benchmarks. “TPC-C simulates a complete computing
environment where a population of users executes transactions against
a database. The benchmark is centered around the principal activities
(transactions) of an order-entry environment. These transactions
include entering and delivering orders, recording payments, checking
the status of orders, and monitoring the level of stock at the
warehouses. While the benchmark portrays the activity of a wholesale
supplier, TPC-C is not limited to the activity of any particular
business segment, but, rather represents any industry that must
manage, sell, or distribute a product or service.” (see:
http://tpc.org/tpcc/default.asp).

Considering Oracle claims Exadata is characterized by “extreme
performance”, one would expect to see Exadata results among the Top
Ten Results by Performance. However, Oracle has no published TPC-C
Exadata results. Looking at All Results, under Oracle, again,
there are no Exadata results. The only Oracle or Sun-related TPC-C
result in almost a decade is for a Sun SPARC Enterprise T5440 Server
Cluster – no Exadata TPC-C results. In contrast, IBM has
over 50 published benchmark results.

Sun Microsystems, sold to Oracle last year, rejected the
applicability of TPC-C in representing real OLTP a decade ago. It
championed the development of an alternative OLTP database benchmark,
the TPC-E. “The TPC-E benchmark uses a database to model a
brokerage firm with customers who generate transactions related to
trades, account inquiries, and market research. The brokerage firm in
turn interacts with financial markets to execute orders on behalf of
the customers and updates relevant account information.

The benchmark is “scalable,” meaning that the number of
customers defined for the brokerage firm can be varied to represent
the workloads of different-size businesses. The benchmark defines the
required mix of transactions the benchmark must maintain. The TPC-E
metric is given in transactions per second (tps). It specifically
refers to the number of Trade-Result transactions the server can
sustain over a period of time.” (see:
http://tpc.org/tpce/default.asp).
Oracle has no published TPC-E Exadata results either. IBM has 9
results out of a total of 39 published TPC-E results.

There is no relative measure of Oracle's OLTP claims.

The independent TPC also provides the industry standard Data
Warehousing benchmark, TPC-H. “The TPC Benchmark™H (TPC-H) is a
decision support benchmark. It consists of a suite of business
oriented ad-hoc queries and concurrent data modifications. The
queries and the data populating the database have been chosen to have
broad industry-wide relevance. This benchmark illustrates decision
support systems that examine large volumes of data, execute queries
with a high degree of complexity, and give answers to critical
business questions.” (see: http://tpc.org/tpch/default.asp)

Since the TPC-H benchmark tests Data Warehousing characteristics,
there are multiple database size results, ranging from 100GB to
30,000GB. (see:
http://tpc.org/tpch/results/tpch_results.asp?orderby=hardware).
Oracle has no published Exadata TPC-H results. The latest
Oracle (Sun) result was published over a year ago, and that was for a
single Sun Fire x4600.

While the absence of published industry standard database benchmarks
for Oracle's Exadata does not preclude this server and storage
combination from having “extreme performance”, it means we simply
have to take Ellison's word for it!

To claim comparability
between the available IBM POWER7 and Oracle's yet-to-be-released
UltraSPARC T3 (Niagara 3) is like juxtaposing a BMW X6 and a school
bus, respectively. Certainly both vehicles transport people, both are made of
metal and burn hydrocarbon fuel – but this is where the comparison
ends. Interestingly, Oracle makes a school bus analogy for its Chip Multi
Threading (CMT) architecture, saying it represents computing
requirements in today's data center. Oracle says it is more efficient
to transport, say, 40 students in a school bus at one time, although
slowly, than to transport 8 groups of 5 students in an X6 running
back and forth at lightning speed. Unfortunately – we don't have
40 students to transport, but perhaps fewer than 5. A school bus is
an application-specific vehicle, as is Oracle's CMT
application-specific processor architecture.

Oracle's CMT argument also
claims that single, heavy weight thread performance (the BMW X6)
is not as important as the ability to execute multiple,
low-performance threads (the school bus). In contrast, IBM's POWER
and Intel's x86 are designed for general purpose computing
requirements: heavy weight thread processing (ability to execute the
maximum number of instructions/clock) with fast clocks, large
low-latency local caches, branch prediction, and out of order
execution. Today, these general purpose processors also execute many
HW threads simultaneously without having been designed to sacrifice thread execution
quality for thread quantity. One of the few widespread application-specific execution environments demanding the efficient execution
of scores of low-demanding threads is a web server under heavy load.
Another is shuffling around streams of data. UltraSPARC T3-based
systems are good web servers, but are architecturally challenged in heavy processing of that data. Real-life benchmarks speak for
themselves – see my previous blog entry.

Oracle claims that by
doubling the HW thread context count in the UltraSPARC T3 over its
predecessor, the UltraSPARC T2 (Niagara 2), overall performance will
double. Any increase in performance could only occur if the
execution environment was thread starved. Conversely, since few
applications spawn scores of threads, executing such an application on a
processor that has double the thread contexts of its predecessor
will not provide any more performance. This is similar to designing a
new school bus that now holds 80 students, but is still only
transporting 5.
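
A toy throughput model captures the school-bus point: adding hardware thread contexts only helps while runnable threads outnumber the contexts already present. The per-thread rate is an arbitrary unit, not a measured figure:

    # Toy model: throughput vs. hardware thread contexts (arbitrary units).
    def throughput(runnable_threads, hw_contexts, per_thread_rate=1.0):
        # Only as many threads as there are contexts (and threads) make progress.
        return min(runnable_threads, hw_contexts) * per_thread_rate

    for contexts in (64, 128):             # T2-like vs. T3-like context counts
        for threads in (5, 40, 200):
            print(contexts, threads, throughput(threads, contexts))
    # Doubling contexts changes nothing at 5 or 40 runnable threads; it only
    # pays off when the machine was thread-starved (200 runnable threads).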

Oracle's UltraSPARC T3 and
IBM's POWER7 are both two billion transistor processors and dissipate
about the same amount of heat. As seen in the UltraSPARC T3 die
photograph just below, the processor has sixteen (1.6GHz) cores, each
holding 8 HW thread contexts, providing a total of 128 HW thread
contexts per socket. However, each core only executes one thread at
any given time, if a thread is actually available for that core.
Literature on this topic tends to be blurred – giving the
impression that at any given time all 128 thread contexts are
executing simultaneously. In fact, each core is so simple that even
branch prediction is non-existent, forcing a thread switch on any
cache miss. Cores communicate with a shared 6MB L2 cache via crossbar
switches. The processor has on-board memory, PCIe, Ethernet, and SMP
coherency controllers. With all pumping at full blast, a theoretical
maximum BW of 2.4Tb/sec is achieved but can only be sustained with a
large number of available threads and full bore I/O running.

Oracle UltraSPARC T3

In contrast, the POWER7
(see die below) has eight cores, each with 4 fully simultaneously
executing threads. The POWER7 can execute twice the number of threads
simultaneously as can the UltraSPARC T3. In order to decrease memory
latency and insure the cores are fed with instructions and data, the
POWER7 has a huge, on-board 32MB L3 cache feeding eight dedicated
256KB, 8-cycle latency L2 caches, pumping data into 2-cycle latency
32KB L1 data caches. Combined dual memory and SMP coherence
controllers aggregate 2.9 Tb/sec of BW. The POWER7 has as many
floating point units as threads.

IBM POWER7

IBM's POWER7 is the latest
product in a successful road map of general purpose processors
designed with the horsepower to pound through heavy-weight,
compute-intensive tasks at nearly 4GHz.

It is worth noting that
Oracle's UltraSPARC T3 is curiously missing from this month's Hot
Chips Conference agenda (see:
http://www.hotchips.org/printableprogram.php),
even though its general availability is set for later this year. At
least two IBM POWER-related sessions are scheduled at Hot Chips.

On August 17, 2010, IBM continues its roll-out of new POWER7-based systems,
software, and solutions. Register for webcasts: http://www-03.ibm.com/systems/power/advantages/

A clear classification
of processing capabilities is beginning to take shape across CPU vendors, at
least in the enterprise space. Sometimes comments in the blogosphere tend to
overshadow simple benchmark comparisons. We are not discussing some exotic
benchmark run on a hot box, or a world record claimed in some edge case
due to an application-specific characteristic of a processor. Look
at what the established benchmark suites SPECint_rate2006 and SPECfp_rate2006 (see www.spec.org) as of April 13, 2010 tell us
about a clear class performance distinction forming between enterprise processors
from different vendors:

* IBM POWER7, leading the field,

* Nehalem EX, at about a 2x performance capability over Itanium, Sun, and Fujitsu, and

* Itanium, Sun UltraT, and Fujitsu SPARC64.

One might
argue that one should not look at simply one benchmark or suite to estimate
performance. Note that the SPEC CPU benchmark is a very good initial indicator
of overall systems performance centered on the CPU, its caches, and
interconnects, looking outward.

This clear
three house race is very significant for many reasons. Market pressure will
build on the lowest performance class forcing either price cuts or engaging in
risky design and fabrication activity in an attempt to make up for the
performance shortfall. In the case of Itanium, one would expect to see it sent
to pasture soon, as even Windows has dropped support for most of its products. Itanium’s
architecture tried unsuccessfully to address the issue of not enough execution
capability in RISC architectures, an issue subsequently solved by several generations
of RISC processors. Sun (Oracle) is attempting to modify its current
application specific (highly threaded, low ILP) UltraSPARC-T
architecture that only runs well in lightweight, highly threaded applications
such as web servers or simple OLTP. Sun assumed thread level parallelism would
supplant instruction level parallelism and has had five years to prove it. We
are still waiting. Fujitsu’s current generation of SPARC64 processors were
built on what remained of Amdahl’s s390 clone processor. The rest is history
and solid state physics.

There
appear to be two distinct leaders in the CPU race: IBM POWER and Intel’s x86, both
pulling away from the pack and from each other.

As IBM announces the POWER7 today, Sun-Oracle will be telling their fleeing customer base that IBM finally, albeit after five years, validates Sun’s Chip Multi Threading (CMT) processor architecture.

Sun-Oracle will attempt to equate their five-year-old, 8-core, 4-threads-per-core CMT processor called the UltraSPARC T1 (Niagara-1) with today’s IBM POWER7. Indeed, the IBM POWER7 has 8 cores and 4 threads per core - but that is where the numerical similarities end and Sun-generated FUD (Fear, Uncertainty, and Doubt) begins.

Sun-Oracle will argue that not only does it market a second-generation CMT processor, the UltraSPARC T2 (Niagara-II), with 8 cores and 8 threads per core, but that its next incarnation will have 16 cores with similar threading. If one were to set the number of cores on a piece of silicon as the test of greatness, one might note, among the multitudes, Intel’s 16-core, multi-threaded network processors, the IXP2400/2800, available well before Sun’s CMT, or Cavium's Octeon encryption processor with 16 MIPS64 cores. Alas, Sun would say that both these processors, and many others like them, are application specific.

However, Sun-Oracle’s current CMT processors are also application specific. They only perform well in thread-rich environments. Such environments are typically web servers and lightweight databases where strong single thread performance is not necessary. One need only note the types of benchmarks Sun publishes - and those it does not - for confirmation of Sun’s CMT application specific resonance.

Sun attempted to design, and Texas Instruments to manufacture, a CMT processor that would address both thread-rich and heavy single-threaded execution requirements. Internally it was called the ROCK processor. That project ended in failure. Its chief architect left Sun and joined Microsoft last year. The reason Sun attempted to design such a CMT processor is that many of today’s applications still require swift execution of heavy single threads. Sun’s available CMT processors are so poor at executing single-threaded code that it doesn’t even publish industry-standard single core benchmarks for its processors. POWER7’s published industry-standard benchmarks speak for themselves.

Both IBM and Intel could have easily designed, manufactured, and marketed processors that were both highly multi-cored and multi-threaded but would have done so by sacrificing the execution quality on a vast array of standard single-threaded data center applications. In contrast, it was not necessary for IBM to sacrifice the execution quality of existing data center applications. IBM evolved its multi-core RISC architecture beginning with its dual core POWER4 in 2001 to today’s 8-core POWER7 with a continual positive impact on data center execution quality and price-performance.

There is no better indication on how divergent multi-core and multi-threaded processor architecture and performance can be than to note that DARPA selected IBM’s POWER7 for its Supercomputing Grand Challenge (see: http://www-03.ibm.com/press/us/en/pressrelease/20671.wss). Sun was dropped from the competition - based on the broken promise of the ROCK processor by 2010. (See: http://m.channelregister.co.uk/2006/11/21/darpa_petascale/). Today’s announced POWER7 is part of DARPA’s Petascale Challenge, not Sun’s non-existent ROCK and certainly not its little brother, Sun’s UltraSPARC T2 processor.