Intel launched the Xeon 5600 series (Westmere-EP, 32nm) six-core processors on 16 March 2010 without any TPC benchmark results. In the performance world, no results almost always means bad or mediocre results. Yet there is every reason to believe that the six-core Xeon 5600 series (X models only) will perform exactly as expected for a 50% increase in the number of cores at the same frequency as the 5500, with no system-level bottlenecks. The expectation is that a six-core Xeon 5600 should provide a 30%+ improvement over the comparable quad-core Xeon 5500 in throughput-oriented tests, particularly OLTP-type workloads. Single-stream parallel execution plans will probably show less gain, as scaling via parallelism is not a simple matter.

Then two weeks later, on 30 March 2010, Intel launched the Xeon 7500 series 8-core processors for 4-way+ systems (and the Xeon 6500 for high-end 2-way systems) with TPC-E results on 4-way and 8-way systems, but no TPC-H results. The TPC-E results were exactly what Intel said they would be last September at IDF: 2.5X over the previous-generation Xeon 7400 series and 2.5X over the contemporary 2-way Xeon 5500 series.

My guess is that Intel wanted it to be clear that the 4-way Xeon 7500 achieved the stated performance objective of 2.5X over the 2-way Xeon 5500, just in case some slide decks did not mention which 2-way system the 2.5X claim referred to. Of course, the Intel statement of 2.5X for Xeon 7500 was most probably made with performance measurements already run on prototype systems. It was probably also felt that the Xeon 5600 series is such a natural choice to supersede the 5500 series that TPC benchmarks were not essential, as there were sufficient other benchmarks to support the claims.

Benchmark Omissions

Earlier, I had commented on benchmark omissions from the quad-core generation on. Below is a summary of processors and systems for which TPC results are published. The Intel Xeon 7500 Processor Product Brief shows 3.03X relative to the 7400 for OLTP Brokerage Database, which is TPC-E, but 2022 over 729 is 2.77X. (One of us is on medication.)
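The arithmetic is easy to check, using the rounded scores cited above:

```python
# Rounded tpsE scores as cited above: 4-way Xeon 7500 vs. 4-way Xeon 7400
xeon_7500_score = 2022
xeon_7400_score = 729

ratio = xeon_7500_score / xeon_7400_score
print(f"{ratio:.2f}X")  # 2.77X, not the 3.03X shown in the product brief
```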

In brief, the Intel Core 2 architecture processors avoided comparison against AMD Opteron in TPC-H, except for the 16-way Unisys system, for which there is no comparable Opteron system.

Opteron, on the other hand, avoided comparison with the Core 2 architecture in 2-way systems and in the TPC-C/E OLTP benchmarks across the board. In 2-way systems, the old Intel FSB technology was still adequate, and the powerful Core 2 architecture core was enough to beat a 2-way Opteron. There were respectable 4-way TPC-C and TPC-E results for Shanghai. When AMD announced the HT-Assist feature in Istanbul, one might have thought AMD was finally going to be able to compete in 4-way OLTP. But zero benchmarks have been published to date.

When the 2-way Intel Xeon 5500 processor, based on the Nehalem architecture, came out in early 2009, outstanding results were published for both the OLTP-oriented TPC-E and the DW/DSS-oriented TPC-H. In February 2010, a TPC-C result was published as well, even though Microsoft had previously said all new OLTP benchmarks were going to be TPC-E. This result was with SQL Server 2005 for some reason.

There was every expectation with the Xeon 7500 Nehalem-EX that there would be both OLTP and DW/DSS benchmark results, as the Xeon 7500 should produce world-class (and world-record) results in both. It is possible that performance problems were encountered in trying to achieve good scaling over 32 cores and 64 threads in a 4-way Xeon 7500 system. If this is identified as something that can be fixed in the Windows operating system or the SQL Server engine, then a change request would be made. I seriously doubt that another processor stepping would be done for this, as the Xeon 7500 is already D-step at release.

TPC-H Scaling

It is also quite possible Intel will have to face the fact that 2.5X over the 2-way Xeon 5500 TPC-H SF100 result of 51,000 QphH is not going to be achieved no matter how good the Xeon 7500 is at DW. This is because the TPC-H score is a geometric mean of the 22 queries. There are several small queries in TPC-H, two of which already run in under 1 second on the 2-way 8-core Xeon 5570 at SF100, and several more that run near or under 2 seconds. There is limited opportunity to continue improving the performance of small queries with increasing degree of parallelism, as the overhead to set up each thread becomes larger compared to the actual work done by each thread, especially if one also has to give up frequency, dropping from 2.93 to 2.26GHz. It would be helpful to know what the actual frequency is during a performance run with the turbo-boost feature.
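A quick sketch shows how the geometric mean caps the composite gain. The query times below are made up purely for illustration (only the shape matters): even if every large query improves by a full 2.5X while the small queries stay flat, the composite falls well short of 2.5X.

```python
from math import prod

def geomean(xs):
    return prod(xs) ** (1.0 / len(xs))

# Hypothetical per-query run times (seconds) for a 22-query TPC-H power run:
# a few sub-2-second small queries plus a spread of larger ones.
baseline = [0.8, 0.9, 1.5, 2.0, 3.0, 5.0] + [10.0] * 8 + [30.0] * 8

# Suppose the new system makes every query over 2 seconds 2.5X faster,
# but the small queries see no gain (thread-startup overhead dominates).
improved = [t if t <= 2.0 else t / 2.5 for t in baseline]

speedup = geomean(baseline) / geomean(improved)
print(f"composite speedup: {speedup:.2f}X")  # well short of 2.5X
```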

It is possible that some marketing putz does not understand this and denied permission to publish perfectly good Xeon 7500 TPC-H results because they did not meet the 2.5X goal. (Along with making a negative ranking and review entry for the person responsible for TPC-H benchmarking, for failing to achieve the 2.5X goal. But let's not grind axes here. Besides, who said life was fair? It takes exceptional talent to accomplish the impossible. A clever person anticipates impossible problems, and transfers to another group to avoid a sticky wicket.)

Achieving 2.5X in the big queries is a more meaningful goal. Achieving 50% better than the 8-way Opteron 6-core TPC-H SF300 or SF1TB results would also be a worthwhile accomplishment, if the Xeon 7500 were up to the task.

TPC-E Scaling

Finally, a quick comment on Xeon 7500 scaling from 4-way (32 cores, 64 threads) to 8-way (64 cores, 128 threads). In the past, achieving 1.5X scaling with this number of cores would have been a triumph. Given the announcement Microsoft made for Windows Server 2008 R2, on reworking the thread scheduler and removing other impediments to high-end scaling, we were expecting 1.7X scaling. It could be that scaling beyond 64 threads is tricky because of the 64-thread limit per processor group. Hopefully the 4-way to 8-way to 16-way scaling will improve over time as problems are solved one at a time, while the task master whips his/her draft horses (again, I digress).

Intel Xeon 5600 (Westmere-EP) and 7500 (Nehalem-EX) SKUs

Let's take a look at the Xeon 5600, 7500, and 6500 SKUs. The low-voltage, low-power SKUs are omitted. These are fine products for high-density environments, web servers, and utility databases. Line-of-business and DW databases should be on the X models.

Xeon 5600 SKUs

| Model | Cores | Threads | GHz | L3 | QPI GT/s | Memory | Price* |
|-------|-------|---------|------|-----|----------|--------|--------|
| X5680 | 6 | 12 | 3.33 | 12M | 6.4 | 1333 | $1,663 |
| X5670 | 6 | 12 | 2.93 | 12M | 6.4 | 1333 | $1,440 |
| X5660 | 6 | 12 | 2.80 | 12M | 6.4 | 1333 | $1,219 |
| X5650 | 6 | 12 | 2.66 | 12M | 6.4 | 1333 | $996 |
| E5640 | 4 | 8 | 2.66 | 12M | 5.86 | 1066 | $774 |
| E5630 | 4 | 8 | 2.53 | 12M | 5.86 | 1066 | $551 |
| E5620 | 4 | 8 | 2.40 | 12M | 5.86 | 1066 | $387 |
| X5677 | 4 | 8 | 3.46 | 12M | 6.4 | 1333 | $1,693 |
| X5667 | 4 | 8 | 3.06 | 12M | 6.4 | 1333 | $1,440 |

* Intel 1k pricing

Xeon 7500 SKUs

| Model | Cores | Threads | GHz | L3 | QPI GT/s | Memory | Price* |
|-------|-------|---------|------|-----|----------|--------|--------|
| X7560 | 8 | 16 | 2.26 | 24M | 6.4 | 1066? | $3,692 |
| X7550 | 8 | 16 | 2.00 | 18M | 6.4 | ? | $2,729 |
| E7540 | 6 | 12 | 2.00 | 18M | 6.4 | ? | $1,980 |
| E7530 | 6 | 12 | 1.86 | 18M | 5.86 | ? | $1,391 |
| E7520 | 4 | 8 | 1.86 | 18M | 4.8 | ? | $856 |

Xeon 6500 SKUs

| Model | Cores | Threads | GHz | L3 | QPI GT/s | Memory | Price* |
|-------|-------|---------|------|-----|----------|--------|--------|
| X6550 | 8 | 16 | 2.00 | 18M | 6.4 | ? | $2,461 |
| E6540 | 6 | 12 | 2.00 | 18M | 6.4 | ? | $1,712 |
| E6510 | 4 | 8 | 1.73 | 12M | 4.8 | ? | $744 |

Before commenting, recall the main differences between the Xeon 5600 and Xeon 7500/6500 series. The Xeon 5600 series (32nm process) has 2 QPI links and 3 memory channels. The Xeon 7500 series (45nm process) has 4 QPI links, 4 memory channels, larger cache per core (for the 24M version, 3M vs 2M), plus extensive reliability features. The 2 QPI links on the 5600 series allow a 2-way (socket) system. The 4 QPI links on the 7500 series allow glueless 4-way and 8-way systems. My understanding is that the 6500 series is the 7500 with only 2 QPI links enabled, for 2-way systems with 16 cores and 8 memory channels total, at lower frequency than the 5600 with 12 cores and 6 memory channels total, plus the 7500 RAS features.

Intel Xeon 5600 (Westmere-EP) and 7500 (Nehalem-EX) Systems

Now let's look at system pricing for the 2-way Dell PowerEdge T710 (Xeon 5600), the R810 (either 7500 or 6500), and the 4-way R910 (7500). All systems are configured with redundant power supplies and 2x73GB 15K 2.5in 6Gb/s SAS drives; the 4-way has 4 power supplies.

Dell PowerEdge T710 Systems with 2 Xeon 5600 processors

| System | Processor | GHz | Cores | L3 | QPI | MHz | Memory | Price |
|--------|-----------|------|-------|-----|-----|------|--------|-------|
| T710 | X5680 | 3.33 | 6 | 12M | 6.4 | 1333 | 72GB 18x4G | $9,974 |
| T710 | X5660 | 2.80 | 6 | 12M | 6.4 | 1333 | 72GB 18x4G | $8,634 |
| T710 | X5650 | 2.66 | 6 | 12M | 6.4 | 1333 | 72GB 18x4G | $8,154 |
| T710 | E5640 | 2.66 | 4 | 12M | 5.86 | 1066 | 72GB 18x4G | $7,474 |
| T710 | E5630 | 2.53 | 4 | 12M | 5.86 | 1066 | 72GB 18x4G | $6,934 |

For some reason, Dell does not offer the T710 with the second-from-top X5670 2.93GHz.

Previously, I had argued that processors and systems today are so powerful that the standard practice of buying 4-way systems for critical database servers by default should be changed to 2-way. By default, I mean in lieu of a proper system sizing analysis.

It may seem strange that I suggest not doing a proper sizing analysis (one of my services as a consultant). But in the sizing analyses I have seen done by other people, the quality of the work was poor and the effort cost more than a pair of 4-way systems.

What this means is that the practical solution used to be to buy a 4-way system. Try it out. If it is not sufficient, then hire someone (there are many people who can do this) to make it work on a 4-way. If that does not work, consider pruning features until it does work.

So why not just move up to an 8-way or larger system? Because 8-way and larger are mostly NUMA systems. Technically, all Opteron systems 2-way and up are NUMA. But by NUMA, I really mean systems where there is a large discrepancy between local and remote node memory access. There are very, very few people who can do performance analysis on a NUMA system (not counting those who merely claim to be able to). Do a search on SQL NUMA to see who has published meaningful material on this matter.

Default System Choice: Intel Xeon 5600

Anyway, the default choice today should be a 2-way system. However, since this is a critical system, perhaps there are features from the high-end that we want. I believe this is the rationale for the Xeon 6500 from Intel, and the PowerEdge R810 from Dell.

In looking over the T710, R810, and R910, I am inclined to say the effort was not entirely successful, as with many first iterations. The effort definitely deserves merit and is the proper direction for the future; it just needs further refinement. Of course, the true measure is whether people actually buy the R810 in volume, not just one person's opinion.

The R810 with either the X7560 or X6550 just gives up too much frequency for the extra 2 cores per socket and fourth memory channel. Some environments might want the 7500/6500 RAS features despite this. And there is only a $1,400 price difference between the R810 and R910 with 2 sockets populated.

The amount of $1,400 is very small for having two extra sockets available, even though most people never populate sockets after system purchase. It would be nice if you could buy the R910 with 4 sockets populated, but not have to pay the per-socket software licensing until they are turned on, as in the RISC world. (In the RISC world, you don't pay for the $25K+ processors until they are activated either. I do not think this is necessary for the Intel Xeon 7500 at $5K each.)

True, the R810 is a 2U form factor compared with 4U for the R910, allowing much higher density. But the assumption was that this is a critical database server, for which an extra 2U is not a show stopper. (There are people who get hung up on the latest industry jargon/fads, and forget that job one is making sure your business is running.)

Late Addition - AMD Magny-Cours

AMD Opteron 6176 (Magny-Cours) 2-way 12-core results have just been published, with the HP ProLiant DL385G7. I will add more detail later. The 2-way TPC-E result is 887.38 and the TPC-C result is 705,652. Interestingly, both the HP ProLiant DL370G6 with the Xeon W5580 and the DL385G7 Opteron TPC-C results are on SQL Server 2005. Perhaps the Microsoft mandate to use TPC-E applies to SQL Server 2008, hence the TPC-C on 2005 was allowed? Also of interest is that the Opteron 6176 TPC-C result uses 125 SSDs instead of hard disks (versus 1,300 HDs in the Xeon W5580 result).

Before comparing the 12-core Opteron with the Xeon 5500, let us first compare against the previous-generation quad-core Xeon 5400. The 2-way 12-core Opteron 6176 achieved OLTP results higher than the Xeon 5460 by 2.5X on TPC-C and 2.8X on TPC-E. These are very good results for a 3X increase in the number of cores. In comparison against the quad-core Xeon 5500 series, however, the 12-core Opteron is just marginally higher. I am inclined to think much of this is due to the Hyper-Threading capability in the Xeon 5500 series. HT was much maligned in the NetBurst architecture generation. Some people today still blindly regurgitate the advice to disable HT, not realizing this advice applied to the old NetBurst and not the new Nehalem architecture processors. At some point AMD may have to admit that implementing HT will be a necessity.
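Normalizing the cited gains by the increase in core count shows the per-core picture, using the 2.5X/2.8X figures above:

```python
# Generational gain per core (figures as cited above)
core_ratio = 12 / 4          # Opteron 6176 12-core vs. Xeon 5460 quad-core
tpcc_gain, tpce_gain = 2.5, 2.8

print(f"per-core TPC-C: {tpcc_gain / core_ratio:.2f}X, "
      f"per-core TPC-E: {tpce_gain / core_ratio:.2f}X")
# Each individual core delivers somewhat less than its predecessor,
# which is a normal trade when tripling the core count.
```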

The price for the DL385G7 with 2x6176 processors from the TPC reports is $1,511 for the system chassis, $1,799 for each processor, $990 for each 8GB kit, and perhaps another $1K for a configuration comparable to those above. This is very reasonable, except for the memory, which seems high. Each 8GB kit should be around $500.

Magny-Cours comprises two six-core Istanbul dies(?), each with 6x0.5M L2 cache and 6M L3. The Istanbul die size is 346mm2, versus 684mm2 for Nehalem-EX with 8 cores and 24M L3. The images below were adjusted to match the die sizes closely, but there is no assurance that the aspect ratios are correct.

Note: For some reason, I thought I saw the Nehalem-EX die size listed at 540mm2. The Intel press release actually says 684mm2, so the scaling below is more appropriate:
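For the image scaling, linear dimensions go as the square root of the area ratio, so the correction from 540mm2 to 684mm2 changes the picture noticeably:

```python
from math import sqrt

istanbul_area = 346.0     # mm^2, six-core Istanbul die
nehalem_ex_area = 684.0   # mm^2, Nehalem-EX per the Intel press release

# To render both dies at a common scale, the linear image dimensions
# should differ by the square root of the area ratio.
linear_scale = sqrt(nehalem_ex_area / istanbul_area)
print(f"Nehalem-EX edges ~{linear_scale:.2f}X longer")
# With the erroneous 540mm^2 figure, the edge ratio would be only
# sqrt(540/346), about 1.25X instead of roughly 1.41X.
```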

Late Addition - Dell PowerEdge R910 TPC-E result

Dell published a 4-way Xeon 7560 TPC-E result on their PowerEdge R910. In comparison with the result for the IBM x3850 X5:

Both systems run Windows Server 2008 R2 EE. The IBM system is also on SQL Server 2008 R2 EE, while the Dell is on SQL Server 2008 EE. R2 is about $9.5K more than RTM per processor socket, for a total contribution difference of $38K. The IBM system uses 64x16GB DIMMs at $2K each, compared with 64x8GB on the Dell at $500 each, for a total difference of $96K.
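A quick check of the two cost deltas cited above:

```python
# Software delta: SQL Server 2008 R2 EE premium over RTM, per socket (approx.)
r2_premium_per_socket = 9_500
sockets = 4
software_delta = r2_premium_per_socket * sockets

# Memory delta: 64 x 16GB DIMMs (IBM) vs. 64 x 8GB DIMMs (Dell)
ibm_memory = 64 * 2_000    # $2K per 16GB DIMM
dell_memory = 64 * 500     # $500 per 8GB DIMM
memory_delta = ibm_memory - dell_memory

print(software_delta, memory_delta)  # 38000 96000
```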

Both systems have over 1,000 disk drives. IBM uses 300GB 15K 3.5in drives ($709 each), while Dell employs a mix of 73GB ($329 each) and 146GB ($479 each) 15K 2.5in drives. Also, the IBM storage enclosures total $268K vs $120K for Dell. Technically, the 300GB drives are not necessary to meet TPC-E requirements, so this difference should not be considered in comparing the results.

It is unclear whether SQL Server 2008 R2 has any advantages over 2008 SP1, and whether the cost increase is justified. The performance difference of 5% between the IBM and Dell systems could be explained by the 2X difference in memory. The 16GB DIMMs are $125 per GB versus $62.50 per GB for the 8GB DIMMs.

Update 2010-05-12

HP has published TPC-C and TPC-E results for the 2-way AMD Opteron 6176 12-core 2.3GHz and for the Intel Xeon X5680 6-core 3.33GHz. The Xeon X5680 scored 13.8% higher in TPC-C and 25.1% higher in TPC-E. The individual physical cores in Westmere are faster than the Opteron cores based on SPEC CPU 2006 Integer base (adjusted to exclude parallel components). There is no meaning in comparing frequency between completely different processor architectures.

Comments

Nice post, Joe. For most OLTP workloads, it is pretty hard to argue against a two-socket system with the Xeon X56xx and 72GB of RAM vs. a four-socket system (with two sockets populated) with the Xeon X75xx and 64GB of RAM.

If you are able to exceed the CPU and RAM capacity of the X56xx system without running out of I/O capacity first, it would be better to get a second X56xx system than to populate the empty sockets in the X75xx system (especially since the system vendor will usually gouge you on the additional CPUs bought as parts).

Of course this assumes that you are willing and able to do some engineering work on your applications, with something like data-dependent routing or vertical partitioning, so that you can scale out your database servers.

I used to open up systems all the time, plugging in processors and even pulling them. However, over the last several years, the connectors have become really dense, and I have concerns about damaging them. A few people have told me they could not get the system to boot after upgrading from 2 to 4 sockets populated. Hence, I am no longer a fan of buying a 4-way with 2 sockets populated. Considering the value of a line-of-business server, the cost of a 4-way system should not be a deal breaker. So I think the main question is: should the customer pay the per-processor licensing on all 4? I believe the MS SQL Server license requires this. So I would want to be able to disable processors in the BIOS so the OS is never aware of them.

I really don't think reworking the app to run on a pair of systems is feasible. Recall that one of my arguments for the 2-way is that it is cheap enough (and reusable) that if it turns out to be insufficient, just move up to the 4-way, and don't get hung up on having bought the 2-way.

One thing has to be said about the Dell R810: if you populate all 4 sockets, the memory bandwidth is effectively cut in half by Dell's stupid "FlexMem" controller design. Even if you use 2 sockets, half of the memory channels are actually routed through an intermediary chip sitting in the 3rd and 4th sockets, acting as memory channel routers. That adds memory latency.

For Nehalem-EX, I think the Dell R910 is the only choice. The 8-socket Nehalem-EX systems from HP (DL980G7), NEC, and Fujitsu are pretty nice too. (I think the DL980G7 is probably going to be the cheapest 8-way.)

Do you have any documents on the DL980G7? Wouldn't it be the 780, as the Intel Xeon 8-way comparable to the AMD Opteron-based DL785? It would be nice if HP were to forward me advance copies of new system information. It would also be nice if HP sent me systems to put out performance assessments.

OK, when the Dell R810 came out, I thought it was a 2-way system for the Xeon 6500 series. I was puzzled that there was an option to use the Xeon 7500 series processors as well. So I was surprised to find that the R810 is actually a 4-way, and annoyed at the goofy FlexMem Bridge architecture.

Is there a niche for a 2-way 8-core 32-DIMM 2U server? Definitely yes.

Is there a niche for a 4-way 8-core 32-DIMM 2U with less than optimal memory bandwidth? Yes; VMs are big memory users, but do not really need memory bandwidth.

What about databases, particularly SQL Server? I have always found that a thoroughly tuned (by me) transaction processing database could run fine on 16-32GB, even if the DB is multi-TB. Of course, by tuning I also mean getting rid of any clustered indexes on a unique identifier. DW can also run fine on a powerful storage system with reasonable memory. Of course, very, very few people actually bother to configure a proper storage system. Most elect to have the SAN vendor provide a recommendation, usually with seriously crippled performance. So in general, DW needs to run in memory for the typical SAN storage system.

So my first choices in the Dell PowerEdge line for SQL Server are the T710 and R910. There might be limited situations where the R710 or R810 are ok choices, despite the funky FlexMem in the R810.

In fact, it is a hybrid 2- or 4-way. Of course, the Dell Online Store does not let me configure it as a 4-way.

There are two questions:

1) How does the 2-way R810 compare with the Xeon 5600-based 2-way T/R-710?

2) How does the 4-way R810 compare with the 4-way R910?

Aggregate compute:

T710: 2 x 6 cores x 3.33GHz = 40 core-GHz

R810: 2 x 8 cores x 2.26GHz = 36 core-GHz

Memory bandwidth: 2x3 = 6 vs. 2x4 = 8 memory channels

Memory capacity: 18 vs 32 DIMMs

Reliability: Xeon 6500/7500 series has the advanced MCA features.

Had the Xeon 6500/7500 series come out earlier, the 2-way R810 would have had a definite performance edge over the 2-way quad-core Xeon 5500 series (8 cores x 2.93GHz = 23.4 core-GHz, excluding the W models), but it came out nearly one year after the 45nm 5500 series, so it is compared against the 32nm 5600 series.
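The crude aggregate-compute metric used above can be expressed as a tiny helper (core_ghz is my name for it; it deliberately ignores IPC, turbo, and architecture differences, so it is only a rough first-order comparison):

```python
def core_ghz(sockets, cores_per_socket, ghz):
    """Crude aggregate-compute metric: total cores x base frequency.

    Ignores IPC, turbo, and architecture differences by design."""
    return sockets * cores_per_socket * ghz

print(round(core_ghz(2, 6, 3.33)))  # T710, 2 x X5680: 40
print(round(core_ghz(2, 8, 2.26)))  # R810, 2 x X7560: 36
print(round(core_ghz(2, 4, 2.93)))  # 2-way Xeon 5500, 2 x X5570: 23
```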

The Xeon 6500/7500 series does have the advantage in both memory bandwidth (I do not consider the Xeon 5600's single-DIMM-per-channel 1333MHz capability, because database servers are rarely populated with just 1 DIMM per channel) and capacity.

It used to be true that more memory (bandwidth and capacity) was better for databases, but today this is really not true anymore. I am of the opinion that most databases would run just fine on 16-32GB of memory with proper tuning. Of course, buying more memory is cheap and easy, whereas proper tuning is more expensive, and a very difficult skill to find (cough, cough).

On reliability, specifically with regard to the MCA features in the Xeon 75xx series: everybody says reliability is hugely important. Well, how much money, effort (and validation) was put into this?

When was the last time you had an outage that would have been prevented by MCA? When was the last outage? Did the world end?

For external-facing databases, the thought is that you do not want any outage, because lost business will never be recovered. For internal databases, productivity may be lost, employees will have to rearrange their work schedules, etc.

As a 4-way, there is a definite niche for the higher-density R810 (2U vs 4U for the R910), as few applications (really big databases) actually need the 64 DIMM sockets on the R910. However, I really do not want to give up 50% of my memory bandwidth to get this, as in the funky R810 memory architecture.

Note: transaction processing databases emphasize memory transaction throughput rather than memory bandwidth. So what matters is really memory latency combined with the number of independent memory channels, plus accounting for the number of overlapping memory accesses.
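A toy model makes the point: random-access transaction throughput is governed by latency and the number of overlapped requests in flight, not by peak streaming bandwidth. All figures below are illustrative assumptions, not measured values for any of these systems.

```python
# Toy model of random-access memory transaction throughput.
# All parameter values here are illustrative assumptions.
def mem_transactions_per_sec(latency_ns, channels, overlap_per_channel):
    """Transactions/sec = independent requests in flight / latency."""
    return channels * overlap_per_channel / (latency_ns * 1e-9)

local = mem_transactions_per_sec(latency_ns=100, channels=4, overlap_per_channel=8)
remote = mem_transactions_per_sec(latency_ns=200, channels=4, overlap_per_channel=8)
print(f"local: {local/1e6:.0f}M/s, remote: {remote/1e6:.0f}M/s")
# Doubling latency (remote NUMA node) halves transaction throughput,
# even though the streaming bandwidth of the channels is unchanged.
```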

Now there are applications other than databases, such as virtualization, which want extra memory capacity (over the 18 DIMM sockets on the 710) but do not really need the full 4-channels-per-socket memory bandwidth of the Xeon 7500/6500. Then of course, VMs also do not need the extra compute capability of a 4-way eight-core system either.

Most applications that need high-density compute also need memory bandwidth.

So I really think there should have been two separate systems: a 2-way 32-DIMM system, and a 4-way 32-DIMM system with all 4x4 memory channels wired symmetrically.

How is having half the bandwidth and half the DIMM count in a 4S R810 configuration not "stupid"? The R810 should have been designed as a strict 2S machine without memory-channel rerouting for the 6500 series. Save the cost of the 2 extra useless sockets, and the 2S R810 should have been priced between the 2S R710 and the 4S R815.