The SAP SD (Sales and Distribution, 2-tier internet configuration) benchmark is an interesting benchmark as it is a real-world client-server application. We looked at SAP's benchmark database for these results. The results below were all obtained on Windows 2003 Enterprise Edition with the MS SQL Server 2005 database (both 64-bit). Every 2-tier Sales & Distribution benchmark was performed with SAP's latest ERP 6 enhancement package 4. These results are NOT comparable with any benchmark performed before 2009: the new 2009 version of the benchmark produces scores that are 25% lower. We analyzed the SAP benchmark in depth in one of our earlier articles. The profile of the benchmark has remained the same:

Very parallel, resulting in excellent scaling

Low to medium IPC, mostly due to "branchy" code

Somewhat limited by memory bandwidth

Likes large caches (memory latency!)

Very sensitive to sync ("cache coherency") latency

There is no doubt here: the Westmere-EX Xeon delivers, with 30% higher performance than the previous x86 quad-CPU record. The 40-core, 80-thread quad Xeon server cannot beat the 32-core, 128-thread IBM Power 750, the RISC champion; however, the high-end IBM servers start at $100,000, two to three times more than a comparable Xeon system.

The 30% extra performance that the new 32 nm Xeon delivers over its predecessor also increases the gap with AMD: the best quad Xeon now offers 50% more performance than the best quad Opteron. The ERP market is one where RAS, scalability, and performance are the top priorities, and hardware pricing is only a secondary consideration. There is little doubt in our mind that Intel will continue to dominate the x86 ERP server market.

Comments

I'd be more interested in seeing how they perform in slightly more "generic" and non-GPU-optimizable workloads. If I'm running Linpack or other FPU-heavy operations, particularly those that parallelize exceptionally well, I'd rather invest time and money into developing algorithms that run on a GPU than into a fast CPU. The returns for that work are generally astounding.

Now, that's not to say that general-purpose problems work well on a GPU (and I understand that). However, I'm not sure that measuring the "speed" of a single processor (or even a massively parallelized load) would tell you much, other than "it's pretty fast, but if you can massively parallelize a computational workload, figure out how to do it on a commodity GPU and blow through it orders of magnitude faster than any CPU could."

However, I can't see running any virtualization work on a GPU anytime soon!
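To make the "massively parallelize on a commodity GPU" idea above concrete, here is a minimal CUDA sketch of the kind of workload that maps well to a GPU: a SAXPY loop where every element is independent. The array size, launch configuration, and use of managed memory are arbitrary illustrative choices, not anything taken from the article or the benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SAXPY: y[i] = a*x[i] + y[i]. Every element is independent, so the GPU can
// give each element its own lightweight thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                         // the last block may be partially filled
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;             // 1M elements -- arbitrary illustrative size
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // managed memory keeps the sketch short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // One thread per element: 4096 blocks of 256 threads.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);       // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

On a CPU this is a single loop; the GPU version simply gives each element its own thread, which is exactly the structure that branchy, coherency-sensitive code like the SAP SD workload lacks.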

But sometimes (actually, every single time in my experience) the "expensive software" that's been bought to run on these servers lacks a GPU option. I'm thinking of electromagnetic or finite element analysis code.

Finite element engines are the sort of thing that companies make a lot of money selling. They are complicated. The commercial ones probably have >10 programmer-years of work in them, and even if they weren't fiercely protected closed source, porting and re-optimising for a GPU would be additional years of work, again requiring high-level programmers with a lot of mathematical expertise.

(There might be some decent open-source alternatives around, but they lack the front ends and GUI that most engineers are comfortable using.)

If you think fixing the above issues is "easy", go ahead. You'll make millions.

I agree with you. In my experience GPU computing for scientific applications is still in its infancy, and in some cases the performance gains are not so high.

There's still a big performance penalty when using double precision for the calculations. In my lab we are porting some programs to the GPU; we started with a matrix multiplication library that uses the GPU on a GTX 590. Using one of the 590's GPUs it was 2x faster than a Phenom X6 1100T, and using both GPUs it was 3.5x faster. So not that huge a gain; with a Magny-Cours processor we could reach the performance of a single GPU, but of course at a higher price.

Usually scientific applications can use hundreds of cores, and they are tuned to get good scaling. But I don't know how GPU calculations scale with the number of GPUs: from 1 to 2 GPUs we got that 75% boost, but I don't know how it will perform once inter-node communication is involved; even with an InfiniBand connection there may be a bottleneck for real-world applications. That's why people still invest in computers with thousands of cores; GPUs still need a lot of work to be a real competitor.
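The comment above doesn't name the matrix multiplication library it used, so purely as an illustration, here is a hedged sketch of a double-precision GEMM through cuBLAS (one common GPU BLAS, assumed here rather than confirmed). The matrix size is made up and error checking is omitted; double-precision GEMM is also where consumer cards like the GTX 590 pay the precision penalty mentioned above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 2048;                                   // arbitrary matrix size
    const size_t bytes = (size_t)n * n * sizeof(double);
    std::vector<double> hA((size_t)n * n, 1.0), hB((size_t)n * n, 1.0), hC((size_t)n * n, 0.0);

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha*A*B + beta*C in double precision (DGEMM). On consumer cards the
    // double-precision units run at a small fraction of single-precision peak,
    // which is one source of the penalty described in the comment above.
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0][0] = %f\n", hC[0]);                      // all-ones inputs: expect 2048

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Built with something like `nvcc -lcublas`; a real comparison would time this against a multithreaded CPU BLAS on the same matrices.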

Single vs. double precision isn't the only limiting factor for GPU computing. The amount of cache you have per thread is far smaller than on a traditional CPU, and if your working set is too big to fit into the tiny amount of cache available, performance is going to nose dive. This is further aggravated by the fact that GPU memory systems are heavily optimized for streaming access, so random I/O (like cache misses) suffers in performance.

The result is that some applications which can be written to fit the GPU model very well will see enormous performance increases vs CPU equivalents. Others will get essentially nothing.

Einstein@Home's gravitational wave search app is an example of the latter. The calculations are inherently very random in memory access (to the extent that it benefits by about 10% from triple-channel memory on Intel quads; Intel has said that for quads there shouldn't be any real-world app benefit from the third channel). A few years ago, when CUDA launched, NVIDIA worked with several large projects on the BOINC platform to try to port their apps to CUDA. The E@H CUDA app ended up no faster than the CPU app and didn't scale at all with more CUDA cores, since all they did was increase the number of threads stalled on memory I/O.
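As a minimal sketch of the streaming-versus-random-access point made earlier in this comment, the two CUDA kernels below do identical arithmetic, but one reads memory in the coalesced pattern GPUs are built for while the other gathers through an index table, a stand-in for the random access pattern described for Einstein@Home. The sizes and random indices are illustrative assumptions; timing the two launches with a profiler would typically show the gather version running far slower.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Identical arithmetic, two memory access patterns.
__global__ void stream_read(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;          // neighbouring threads read neighbouring addresses (coalesced)
}

__global__ void gather_read(const float *in, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]] * 2.0f;     // scattered reads: poor coalescing, threads stall on memory
}

int main()
{
    const int n = 1 << 20;              // 1M elements -- arbitrary
    float *in, *out;
    int *idx;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    for (int i = 0; i < n; ++i) {
        in[i]  = 1.0f;
        idx[i] = rand() % n;            // effectively random gather pattern
    }

    const int block = 256, grid = (n + block - 1) / block;
    stream_read<<<grid, block>>>(in, out, n);       // streaming: what GPU memory systems are built for
    gather_read<<<grid, block>>>(in, idx, out, n);  // random: the Einstein@Home-style pattern
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(out); cudaFree(idx);
    return 0;
}
```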