Intel Xeon 5600 (Westmere-EP) and AMD Magny-Cours Performance Update

HP has just released TPC-C and TPC-E results for the ProLiant DL380G7 with 2 Xeon 5680 3.33GHz 6-core processors, allowing a direct comparison with their DL385G7 with 2 Opteron 6176 2.3GHz 12-core processors. Last month I complained about the lack of performance results for the Intel Xeon 5600 6-core 32nm processor line for 2-way systems. This may have been deliberate, to avoid complicating the message for the Xeon 7500 8-core 45nm (for 4-way+ systems) launch two weeks later. http://sqlblog.com/blogs/joe_chang/archive/2010/04/07/intel-xeon-5600-westmere-ep-and-7500-nehalem-ex.aspx

The Xeon 5680 scores 13.8% higher in TPC-C and 25.1% higher in TPC-E. The individual physical core in Westmere is faster than the Opteron core on SPEC CPU 2006 integer base (adjusted to exclude the parallelized components). Comparing frequency between completely different processor architectures is meaningless.

The purpose of excluding libquantum is to compare single-core performance. The Intel compiler can parallelize libquantum, so it is not a single-core result(?). I am somewhat inclined to also exclude hmmer, because the Intel 11.1 compiler made a substantial improvement over the 11.0 compiler. The AMD results are with the PGI 8.0 compiler, which may have neither optimization.

Comments

question about these for OLTP systems. we're looking to buy a pair of HP ProLiant DL380 G6's with 5650's sometime in the next few months. they are going to replace G5's with 5345's. this is going to be for a single instance cluster where most of the workload is small 1-second queries. mostly selects, with some updates and inserts. best guess is 20-30 million queries per day.

currently we're on max degree of parallelism 0. i had the idea to increase it a little to force each connection/query/thread onto its own core. only problem is that we can't test the workload in QA or any other system.

i tried testing it on a system with a different workload type and didn't like the results.

so my question is, will we see a performance increase or decrease? not looking to see query execution cut from 1 second to half a second but will we be able to process more queries/threads at once if they go through separate cores?

Very Important: Get a decent QA system. I would say that the 5345 is a decent processor and not due for replacement. But if you do not currently have a QA system, then by all means get the latest 2-way system, for example the X5650, so that you do have a QA system.

Next: a 1-second query today is not a small query; it's 2-3 billion CPU-cycles per core. Small is a query that runs in 10 milliseconds. It is important to specify both the average CPU and the duration (elapsed time) per query. If your average query consumes 1 CPU-sec, then each core can do 86,400 per day, assuming uniform traffic, all at max load. On an 8-core system, that's 691,200 per day, not 20-30M. So I suppose you meant 1-sec duration and the CPU is less. Presumably more than 30X less to support 20-30M per day.
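The arithmetic above can be sketched as a quick back-of-envelope check (a minimal sketch, assuming uniform traffic and every core fully busy; the function name is mine for illustration, not from any tool):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def max_queries_per_day(cpu_sec_per_query, cores):
    """Upper bound on queries/day when each query burns
    cpu_sec_per_query of one core, all cores at max load."""
    return cores * SECONDS_PER_DAY / cpu_sec_per_query

print(max_queries_per_day(1.0, 1))   # 86,400 on one core
print(max_queries_per_day(1.0, 8))   # 691,200 on 8 cores

# CPU budget per query that would allow 30M/day on 8 cores:
print(8 * SECONDS_PER_DAY / 30_000_000)  # ~0.023 CPU-sec per query
```

The last line shows why the average CPU per query must be well over 30X below 1 CPU-sec to sustain 20-30M queries per day on an 8-core box.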

Anyways, 30M queries per day means 1.25M per hour. Assuming you have peak hours, your load might actually need to hit 3.6M per hour, or 1,000 per sec. On a single-core system, the average query cost would then need to be 1 milli-sec; on a 12-core system, you could afford a 12-ms average.
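The same budget at peak load, worked out in Python (the roughly 3X peak-to-average factor is an assumption for illustration, not a measured number):

```python
daily = 30_000_000
avg_per_hour = daily / 24             # 1.25M per hour, flat average
peak_per_hour = 3_600_000             # assumed peak, roughly 3x the average
peak_per_sec = peak_per_hour / 3600   # 1,000 queries per second

# Average CPU cost per query that N cores can sustain at peak:
for cores in (1, 12):
    budget_ms = cores / peak_per_sec * 1000
    print(f"{cores} core(s): {budget_ms:.0f} ms CPU per query")
```

This is why the per-query CPU budget scales linearly with core count: 1 ms on one core, 12 ms on twelve.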

Assuming a range of query costs, a few might benefit from a parallel query plan, possibly maxdop 2 or 4. I would run tests on a QA system, find out which queries could generate a parallel plan, then explicitly allow those to have a parallel plan, leaving everything else non-parallel.

Forcing each connection to a specific core can help if your average CPU cost is well below 1 milli-sec. Above that I would not bother. Besides, this is a really complicated subject that is best left to experts.

Our QA servers are actually faster than production. we bought them last year, the last of the G5's. I think they are 5450 Xeons. no problem restoring the databases there, but getting the workload simulated is pretty much impossible. we might end up testing in production.

only reason we're replacing them is we need to upgrade our DR servers, and it doesn't make sense to buy new hardware just to sit around.

so what you need to do is figure out the average CPU per SQL call, i.e., the top-level stored procedure called from the client/app server. the average duration (1 sec) has no meaning on its own, other than it's kind of a long time on modern systems.

unless your average CPU per RPC is well under 1 milli-second (my estimate is under 0.3 ms), there will not be much gain in employing thread affinity.

You should have read enough of my work by now to know I do not waste time with worthless crap.

As far as I know, I am the only one to have published quantitative analysis on high-call-volume chatty apps, NUMA, and affinity tuning. If you think my data above is not correct, test it out for yourself.

For reference, the TPC-C and TPC-E benchmark queries average on the order of 0.5 CPU-ms on the Xeon 5500/5600/7500 series (based on physical core CPU-seconds, not logical). On the 2-way systems, there is some benefit from port-thread affinity alignment. On the bigger systems the benefit becomes more substantial.

In SAP-type applications, where the average call is less than 0.1 CPU-ms, port-affinity alignment is absolutely crucial; without it, hard NUMA systems can show negative scaling.

A few years ago, you did a performance eval of an 8-way(+?) Intel hard NUMA system versus a 4-way dual-core Opteron (soft-NUMA) system, finding almost no scaling beyond 4-way(?). Published benchmarks showed moderate scaling. The benchmarks used port-affinity tuning and you did not. The benchmark test you were using was on the order of 0.5-1.0 CPU-ms per call, which really needed affinity tuning on NUMA systems. If the average call were 1000 CPU-ms, it would not have mattered.