Intel Server Strategy Shift with Sandy Bridge EN & EP

The arrival of the Sandy Bridge EN and EP processors, expected in early 2012, will mark
the completion of a significant shift in Intel server strategy.
For the longest time 1995-2009, the strategy had been to focus on producing a premium processor
designed for 4-way systems that might also be used in 8-way systems and higher.
The objective for 2-way systems was use the desktop processor that later had a separate brand
and different package & socket to leverage the low cost structure in driving volume.
The implication was that components would be constrained by desktop cost requirements.

The Sandy Bridge collection will be comprised of one group for single processor systems
designed for low cost, and one premium processor.
The premium processor will support both the EN and EP product lines,
the EN limited to 2-way, and the EP for both 2-way and 4-way systems,
with more than adequate memory and IO in each category.
The cost structure of both 2-way and 4-way increased from Core 2 to Nehalem, along with a significant boost in CPU, memory and IO capability.
With quad-core available in 1P, the more price sensitive environments should move
down to single processor systems.
This allows 2 & 4-way systems to be built with balanced compute, memory and IO unconstrained by desktop cost requirements.

In other blogs, I had commented that the default system choice for database server
for a long time had been a 4-way system should now be a 2-way since the introduction of Nehalem in mid-2009.
Default choice means in the absence of detailed technical analysis, basically a rough guess.
The Sandy Bridge EP, with 8 cores, 4 memory channels and 80 PCI-E lanes per socket
in a 2-way system provides even stronger support for this strategy.

The glue-less 8-way capability of the Nehalem and Westmere EX line is not continued.
One possibility is that 8-way systems do not need to be glue-less.
The other is that 8-way systems are being abandoned,
but I am inclined to think this is not the case.

The Master Plan

The foundation of the premium processor strategy, even though it may have been forgotten in the mists of time,
not to mention personnel turnover, was that a large cache improves scaling at the 4-way multi-processor level
for the shared bus SMP system architectures of the Intel Pentium to Xeon MP period.
The 4-way server systems typically deployed with important applications that could easily
justify a far higher cost structure than that of desktop components, but required critical capabilities
not necessary in personal computers.
Often systems in this category were fully configured with top line components whether needed or not.

Hence the Intel large cache strategy was an ideal match between premium processors and high budget
systems for important applications.
One aspect that people with an overly technical point of view have difficulty fathoming
is that the non-technical VP's don't want their mission critical applications running on a cheap box.
In fact, more expensive means that is must be better, and the most expensive is the best, right?
From the Intel perspective, a large premium is necessary to amortize the substantial effort necessary to
produce even a derivative processor in volumes small relative to desktop processors.

The low cost 2-way strategy was to explore demand for multi-processor systems in the desktop market.
Servers were expected to be a natural fit for 2-way systems.
Demand for 2-way servers exploded to such an extent
that it was thought for a brief moment there would be no further interest for single processor servers.
Eventually, the situation sorted itself out, in part with the increasing power of processors.
Server unit volume settled to a 30/60/10 split between single, dual and quad processors
(this is old data, I am not sure what the split is today).
The 8-way and higher unit volume is low,
but potentially of importance in having a complete system lineup.

AMD followed a different strategy based on the characteristics of thier platform.
The Hyper-Transport (HT) interconnect and integrated memory controller architecture
did not have a hard requirement for large cache to support 4-way and above.
So AMD elected to pursue a premium product strategy on the number of HT links.
Single processor systems require one HT to connect the IO hub.
Two HT is required in a 2-way system, one HT connecting to IO, and another to the second processor.
Three HT could support 4-way and higher with various connection arrangements.
The pricing structure is based on the number of HT links enabled,
on the theory that the processor has higher value in big systems than in small systems.

What Actually Happened

Even with the low cost structure Intel enabled in 2-way, desktop systems remained
and actually became defined as single processor.
Instead, the 2-way systems at the desk of users became the workstation category.
This might have been because the RISC/UNIX system vendors sold workstations.
The Intel workstations quickly obliterated RISC workstations,
and there have been no RISC workstations for sometime?
Only two RISC architectures are present today, having retreated to the very high-end server space,
where Intel does not venture.

Itanium was supposed to participate in this space, but the surviving RISC vendors optimized at 8-way and higher.
Intel would not let go of the 4-way system volume and Itanium was squeezed by Xeon at 4-way and below,
yet could not match IBM Power in high SMP scaling.
To do so would incur a high price burden on 4-way systems.
One other aspect of Intel server strategy of the time was the narrow minded focus on optimizing for a single platform.

Most of the time, this was the 4-way server.
There was so much emphasis on 4-way that there actually 2 reference platforms, almost to the exclusion of all else.
For a brief period in 1998 or so, there was an incident of group hysteria that 8-way
would become the standard high volume server.
But this phase wore off eventually.
The SPARC was perhaps the weakest of the RISC at the processor level.
Yet the Sun strategy to design for a broad range of platforms from 2-way to 30-way,
(then with luck 64-way via acquisition of one of the Cray spin-offs) was successful
until their processor fell too far behind.

After the initial implementation of the high volume 2-way strategy,
desktop systems became intensely price sensitive.
The 2-way workstations and server system were in fact not price sensitive even though it was thought they were.
It became clear that desktops could not incur any burden to support 2-way capability.
The desktop processor for 2-way systems was put into a different package and socket,
and was given the Xeon brand.

Other cost reduction techniques were implemented over the next several generations
as practical on timing and having the right level of maturity.
The main avenue is integration of components to reduce part count.
This freed 2-way system from desktop cost constraints, but as with desktops,
it would take several generations to evolve into a properly balanced architecture.

The 4-way capable processors remained on a premium derivative,
given the Xeon MP brand in the early Pentium 4 architecture (or NetBurst) period.
To provide job security for marketing people, 2-way processors were then became the Xeon 5000 series,
and 4-way the Xeon 7000 series in the late NetBurst to 2010 period.
In 2011, the new branding scheme is E3 for 1P servers, E5 for 2-way and E7 for 4-way and higher.
Presumably each branding adjustment requires changes to thousands of slidedecks.

At first, Intel thought both 2-way and 4-way systems had high demand versus cost elasticity.
If cost could be reduced, there would be substantially higher volume.
Chipsets (MCH and IOH) had overly aggressive cost objectives that limited in memory and IO capability.
In fact, 4-way systems had probably already fallen below the boundary of demand elasticity.

The same may have been true for 2-way systems, as people began to realize that single processor
systems were just fine for entry server requirements.
For Pentium II and III 2-way systems, Intel only had a desktop chipset.
In 2005-6, Intel was finally able to produce a viable chipset for 2-way systems (E7500? or 5000P) that provided
memory and IO capability beyond desktop systems.
Previously, the major vendors elected for chipsets from ServerWorks.

It was also thought at the time that there was not a requirement for premium processors in 2-way server systems.
The more correct interpretation was that the large (and initially faster) cache of premium processors
did not contribute sufficient value for 2-way systems.
A large cache does improve performance in 2-way systems, but not to the degree that it does at the 4-way level.
So the better strategy by far on performance above the baseline 2-way system with standard desktop
processors was to step up to a 4-way system with the low-end premium processors instead of a 2-way system with
the bigger cache premium processors.

And as events turned out, the 4-way premium processors lagged desktop processors in transitions to
new microarchitectures and manufacturing processes by 1 full year or more.
The 2-way server on the newer technology of the latest desktop processors
was better than a large cache processor of the previous generation,
especially one that carried a large price premium.
So the repackaged desktop processor was the better option for 2-way systems

The advent of multi-core enabled premium processors to be a viable concept for 2-way systems.
A dual-core processor has much more compute capability than a single core and the same for a quad-core over dual-core
in any system, not just 4-way,
provided that there not too much difference in frequency.
The power versus frequency characteristics of microprocessors clearly favors multiple cores for code
that scale with threads, as in any properly architected server application.

However, multi-core at the dual and quad-core level was employed for desktop processors.
So the processors for 2-way servers did not have a significant premium in capability relative to desktops.
The Intel server strategy remained big cache processors. There was the exception of Tigerton,
when two standard desktop dual-core processor die in the Xeon MP socket was employed for the 4-way system,
until a large cache variant was readied in the next generation Dunnington processor incorporated a large cache.
This also happened for the Paxville and Tulsa.

System Architecture Evolution from Core 2 to Sandy Bridge

The figure below shows 4-way and 2-way server architecture evolution relative to single processor desktops (and servers too)
from 45nm Core 2 to Nehalem & Westmere and then to Sandy Bridge. Nehalem systems are not shown for space considerations,
but are discussed below.

System architecture from Penryn to Westmere to Sandy Bridge, (Nehalem not shown)

The Core 2 architecture was the last Intel processor to use the shared bus, which allows
multiple devices, processors and bridge chips, to share a bus with a protocol to arbitrate
for control of the bus. It was called the front-side bus (FSB) because there was once a back-side bus for cache.
When cache was brought on-die more than 10 years ago, the BSB was no more.
By the Core 2 period, to support higher bus frequency, the number of devices was reduced to 2,
but the shared bus protocol was not changed.
The FSB was only pushed to 1066MHz for Xeon MP, 1333MHz for 2-way servers, and 1600MHz for 2-way workstations.

Nehalem was the first Intel processor with a true point-to-protocol, Quick Path Interconnect (QPI),
at 6.4GHz transfer rate, achieving much higher bandwidth to pin efficiency than possible over shared bus.
Intel had previously employed a point-to-point protocol for connecting nodes of an Itanium system back in 2002.
(AMD implemented point-to-point with HT for Opteron in 2003? at an initial signaling rate of 1.6GHz?)
Shared bus also has bus arbitration overhead in addition to lower frequency of operation.
The other limitation of Intel processors up to Core 2, was the concentration of signals on the
memory controller hub (also known as North Bridge) for processors, memory and PCI-E.
The 7300 MCH for the 4-way Core 2 has 2013-pins, which is at the practical limit,
and yet the memory and IO bandwidth is somewhat inadequate.

Nehalem and Westmere implement a massive increase in memory and PCI-E bandwidth (number of channels or ports) for
the 2-way and 4-way systems compared to their Core 2 counterparts.
Both Nehalem 2-way and 4-way systems have significantly higher cost structure than Core 2.
Previously, Intel had been mindlessly obsessed with reducing system to the detriment of balanced memory and IO.
This shows Intel recognized that their multi-processor systems were already below the price-demand elasticity point,
and it was time to rebalance memory and IO bandwidth, now possible with point to point interconnect
and the integrated memory controller.

QPI in Nehalem required an extra chip to bridge the processor to PCI-E.
This was not an issue for multi-processor systems,
but was undesirable for the hyper sensitive cost structure of desktop systems.
The lead quad-core 45nm Nehalem processor with 3 memory channels and 2 QPI ports in a LGA 1366 socket
was followed by a quad-core, 2-memory channel derivative (Lynnfield) with 16 PCI-E plus DMI replacing QPI in a LGA 1156 socket.
The previously planned dual-core Nehalem on 45nm was cancelled.
Nehalem with QPI was employed in the desktop extreme line,
while the quad-core without QPI was employed in the high-end of the regular desktop line.

The lead 32nm Westmere was a dual-core with the same LGA 1156 socket (memory and IO) as Lynnfield.
Per the desktop and mobile objective, cost structure was reduced with integration,
with 1 processor die and potentially a graphics die in the same package,
and just 1 other component the PCH.

The follow-on Westmere derivative was a six-core using the same LGA 1366 socket as Nehalem,
i.e., 3 memory channels and 2 QPI.
This began the separation process of desktop and other single processor systems from
multi-processor server and workstation systems.
Extreme desktops employ the higher tier components designed for 2-way, but are still single-socket systems.
I suppose that a 2-way extreme system is a workstation.
Gamers will have settle for the mundane look of a typical workstation chassis.

With the full set of Sandy Bridge derivatives, the server strategy transition will be complete.
Multi-processor products, even for 2-way, are completely separated from desktops without the requirement
to meet desktop cost structure constraints.
With desktops interested only in dual and quad-core,
a premium product strategy can be built for 2-way and above around both the number of cores and QPI links.

The Sandy Bridge premium processor has 8 cores, 4 memory channels, 2 QPI, 40 PCI-E lanes and DMI
(that can function as x4 PCI-E).
The high-end EP line in a LGA 2011 socket will have full memory, QPI and PCI-E capability.
The EN line in LGA 1356 socket will have 3 memory channels, 1 QPI and 24 PCI-E lanes plus DMI
to supports up to 2-way systems, and will be suitable for lower priced systems.
Extreme desktops will use the LGA 2011 socket, but without QPI.

What is interesting is that the 4-way capable Sandy Bridge EP line is targeted at both 2-way and 4-way systems.
This is a departure from the old Intel strategy of premium processors for 4-way and up.
Since the basis of the old strategy is no longer valid, of course a new strategy should be formulated.
But too often, people only remember the rules of the strategy, not the basis.
And hence blindly follow the old strategy even when it is no longer valid (does this sound familiar?)

This element of a premium 2-way system actually started with the Xeon 6500 line based on Nehalem-EX.
Nehalem-EX was designed for 4-way and higher with eight-cores,
4 memory channels supporting 16 DIMMs per processor and 4 QPI links.
A 2-way Nehalem-EX with 8 cores, 16 DIMMs per socket might be viable versus Nehalem at 4 cores, 9 DIMMs per socket,
even though the EX top frequency 2.26GHz versus 2.93GHz and higher in Nehalem.
The more consequential hindrance was that Nehalem-EX did not enter production until Westmere-EP was also in production,
with 6 cores per socket at 3.33GHz.
So the Sandy-Bridge EP line will provide a better indicator for premium 2-way systems.

The Future of 8-way and the EX line

There is no EX line with Sandy Bridge.
Given the relatively low volume of 8-way systems, it is better not to burden the processor used by 4-way systems
with glue-less 8-way capability.
Glue-less means that the processors can be directly connected without the need for additional bridge chips.
This both lowers cost and standardizes multi-processor system architecture,
which is probably one of the cornerstones for the success Intel achieved in MP systems.
I am expecting that 8-way systems are not being abandoned,
but rather a system architecture with "glue" will be employed.

Since 8-way systems are a specialized very high-end category,
this would suggest a glued system architecture is more practical in terms of effort than a subsequent 22nm Ivy Bridge EX.
Below are two of my suggestions for 8-way Sandy Bridge or perhaps Ivy Bridges depending on when components could be available.
The first has two 4-port QPI switch, or cross-bar or routers connecting four nodes with 2 processors per node.

The second system below has two 8-port QPI switches connecting single processor nodes.

The 2 processor node architecture would be economical, but I am inclined to recommend building the 8-port QPI switch.
Should the 2 processor node prove to be workable,
then a 16-way system would be possible.
Both are purely speculative as Intel does not solicit my advice on server system architecture and strategy,
not even back in 1997-99.

In looking at the HP DL980 diagram, I am thinking that the HP node controllers
would support Sandy Bridge EP in an 8-way system.

There are cache coherency implications (Directory based versus Snoop) that are beyond the scope for database server oriented topic.
There was an IBM or Sun discussion transactional memory.
I would really like to see some innovation on handling locks.
This is critical to database performance and scaling.
For example, the database engine ensures exclusive access to a row, i.e., memory, before allowing access.
Then why does the system architecture need to do a complex cache coherency check when the application has already done so?
I had also previously discussed SIMD instructions to improve handling of page and row base storage,
SIMD Extensions for the Database Storage Engine
(same here).

If that were not enough, I had also called for splitting the memory system.
Over the period of Intel multi-processor systems 1995 to 2011,
practical system memory has increased from 2GB to 2TB.
Most of the new memory capacity is used for data buffers.
The exceptionally large capacity of the memory system also means that it cannot be
brought very close to the processor, as into to the same package/socket.

So the memory architecture should be split into a small segment
that needs super low latency byte addressability.
The huge data buffer portion could be changed to block access.
If so, then perhaps the database page organization should also be changed to make the metadata
access more efficient in terms of modern processor architecture to reduce
the impact of off-die memory access by making
full use of cache line organization.
The NAND people are also arguing for Storage Class Memory, something along the lines
of NAND used as memory.

Comment Notification

Comments

To the extent that you seem to be suggesting that Intel might replace the E7 line (Westmere-EX) with high-end implememtations of Sandy-Bridge EP (E5), I think you're mistaken and missed the main point of E7. In a world where computational power is no longer the bottleneck for the overwhelming majority of enterprise apps (and has not been for years), 'In-memory database' technologies and techniques represent the future.

Intel has replaced the E7 with the E5. The E5 will take the majority of 4-way system, which is the large majority the E7 volume. I am further inclined to think 2-way E7 volume (both system and processor) is higher than 8-way. So what else defines E7 once common 4-way moves to E5.

I really don't think people understand the E7 MCA + RAS to be relavent.

Now that E5 has 4 MC, the difference in memory capacity is not important.

My opinion is that 8-way could be a glued system, ie, a QPI cross-bar.

If MCA+high RAS are important, it could be a once every other year or even every third year release. So E7 on Westmere skips Sandy Bridge, and gets a refresh with Ivy Bridge or even Haswell.

Sure, database engine really needs a complete architecture overhaul, which would reach into both processor and system level. But I do not agree that a 4-way system needs more than 32-64 DIMM sockets. If MS or some one else can do IMDB, they can also do clustered IMDB, which means memory capacity is distributed across many nodes. The general idea is $20K of CPU is paired with about $20K of memory, not 20/200 split.