62 Comments

I'd be more interested in seeing how they perform in slightly more "generic", non-GPU-optimizable workloads. If I'm running Linpack or other FPU-heavy operations, particularly ones that parallelize exceptionally well, I'd rather invest time and money into developing algorithms that run on a GPU than into a fast CPU. The returns on that work are generally astounding.

Now, that's not to say that general-purpose problems work well on a GPU (I understand that they generally don't). However, I'm not sure that measuring the "speed" of a single processor (or even a massively parallelized load) tells you much, other than "it's pretty fast, but if you can massively parallelize a computational workload, figure out how to run it on a commodity GPU and blow through it orders of magnitude faster than any CPU could."

However, I can't see running any virtualization work on a GPU anytime soon!

But sometimes (actually, every single time in my experience) the "expensive software" that's been bought to run on these servers lacks a GPU option. I'm thinking of electromagnetic or finite element analysis code.

Finite element engines are the sort of thing companies make a lot of money selling. They are complicated. The commercial ones probably have >10 programmer-years of work in them, and even if they weren't fiercely protected closed source, porting and re-optimising for a GPU would be additional years of work, again requiring high-calibre programmers with a lot of mathematical expertise.

(There might be some decent open-source alternatives around, but they lack the front ends and GUI that most engineers are comfortable using.)

If you think fixing the above issues is "easy", go ahead. You'll make millions.

I agree with you. In my experience GPU computing for scientific applications is still in its infancy, and in some cases the performance gains are not that high.

There's still a big performance penalty for using double precision in the calculations. In my lab we are porting some programs to the GPU; we started with a matrix multiplication library running on a GTX 590. Using one of the 590's GPUs, it was 2x faster than a Phenom X6 1100T, and using both GPUs it was 3.5x faster. So not that huge a gain: with a Magny-Cours processor we could reach the performance of a single GPU, though of course at a higher price.

Usually scientific applications can use hundreds of cores, and they are tuned to get good scaling. But I don't know how GPU calculations scale with the number of GPUs. From 1 to 2 GPUs we got that 75% boost, but how will it perform with inter-node communication? Even with an InfiniBand connection, I don't know if there'll be a bottleneck for real-world applications. That's why people still invest in thousand-core clusters; the GPU still needs a lot of work to be a real competitor.
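The scaling figures quoted above can be sanity-checked with a few lines of arithmetic (a sketch using only the speedups reported in this comment, not new measurements):

```python
# Speedups reported above, normalized to the Phenom X6 1100T (= 1.0).
single_gpu_speedup = 2.0   # one GPU of the GTX 590 vs. the Phenom
dual_gpu_speedup = 3.5     # both GPUs of the GTX 590 vs. the Phenom

# Boost from adding the second GPU: 3.5 / 2.0 = 1.75x, i.e. the 75% mentioned above.
second_gpu_boost = dual_gpu_speedup / single_gpu_speedup - 1.0

# Scaling efficiency of the second GPU: an ideal doubling would give 4.0x total.
scaling_efficiency = dual_gpu_speedup / (2.0 * single_gpu_speedup)

print(f"boost from 2nd GPU: {second_gpu_boost:.0%}")     # 75%
print(f"dual-GPU efficiency: {scaling_efficiency:.1%}")  # 87.5%
```

Extrapolating that 87.5% per-doubling efficiency to multi-node setups is exactly where the interconnect question above comes in.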

Single vs. double precision isn't the only limiting factor for GPU computing. The amount of cache you have per thread is far smaller than on a traditional CPU. If your working set is too big to fit into the tiny amount of cache available, performance is going to nosedive. This is further aggravated by the fact that GPU memory systems are heavily optimized for streaming access, so random I/O (like cache misses) suffers in performance.

The result is that some applications which can be written to fit the GPU model very well will see enormous performance increases vs CPU equivalents. Others will get essentially nothing.

Einstein@Home's gravitational wave search app is an example of the latter. The calculations are inherently very random in memory access (to the extent that it benefits by about 10% from triple-channel memory on Intel quads; Intel has said that for quads no real-world app should benefit from the third channel). A few years ago when NVIDIA launched CUDA, it worked with several large projects on the BOINC platform to try to port their apps. The E@H CUDA app ended up no faster than the CPU app and didn't scale at all with more CUDA cores, since all they did was increase the number of threads stalled on memory I/O.

HP makes the BL620c G7 blade server, which is a 2P Nehalem-EX (soon to offer Westmere-EX). And believe it or not, the massive HP DL980 G7 (8-processor Nehalem/Westmere-EX) is actually running four pairs of EX CPUs. HP has a custom ASIC bridge chip that ties them all together. This design MIGHT actually support running the 2P models, as each pair goes through the bridge chip.

Dell makes the R810, and while it's a 4P server, the memory subsystem actually runs best when it's used as a 2P server. That would be a great platform for the 2P CPUs as well.

2P E7 looks like a product for a very, very small niche to me. In terms of pure performance, a 2P Westmere-EP pretty much makes up for the deficit in cores with the higher clock - sure it also has less cache, but for the vast majority of cases it is probably far more cost effective (not to mention at least idle power consumption will be quite a bit lower). Which relegates the 2P E7 to cases where you don't need more performance than 2P Westmere-EP, but depend on some of the extra features (like more memory possible, RAS, whatever) the E7 offers.

If anyone wants an explanation of what changed between these memory types, simmtester.com has a decent writeup with illustrations. Basically each LR-DIMM has a private link to the buffer chip, instead of each DIMM having a very-high-speed buffer daisy-chained to the next DIMM on the channel.

The main difference isn't the removal of the point-to-point connections but the reversion to a parallel configuration similar to classic RDIMMs. The issues with FB-DIMMs stemmed from their absurdly clocked serial bus, which required an operating frequency 4x greater than the actual DRAM clock.

It must be very, very difficult to generate a review for server equipment. Once you get into this class of hardware it seems as though there aren't really any ways to test it unless you actually deploy the server. Anyway, kudos for the effort in trying to quantify something of this caliber.

Correct me if I'm wrong, but isn't an Opteron 6174 just $1000? And it is beating the crap out of this "flagship" Intel chip by a factor of 3:1 in performance per dollar, and beats it in performance per watt as well? And this is the OLD AMD architecture? That means Interlagos could pummel Intel by something like 5:1. At what point does any of this start to matter?

You know it only costs $10,000 for a quad Opteron 6174 server with 128GB of RAM?

Software licensing is a part of the overall picture (particularly if you have to deal with Oracle), but the point is well taken that AMD delivers much better bang for the buck than Intel. An analysis of performance/$ would be an interesting addition to this article.

The analysis isn't too hard. If you're licensing things at a per-core cost (hello, Oracle, I'm staring straight at you), how much does the licensing have to cost per core before you've made up that $20k difference in price (assuming AMD = $10k, Intel = $30k)? Simple: the AMD box has 8 more cores per server, so $20k/8 = $2500 per core. Now, if you factor in that on a per-core basis the Intel server is between 50 and 60% faster, things get worse for AMD. Assuming you could buy an AMD server that was 50% more powerful (via linearly increasing core count), that would be 50% more server, but remember each server has 20% more cores, so it's really about 60% more cores. Now you're talking about an approximately 76.8-core server. That's 36 more cores than Intel. So what does the licensing cost have to be before AMD isn't worth it at this performance level? Well, $20k/36 = $555 per core.

OK, fair enough. Maybe things are licensed per socket instead. You still need 50% more sockets to get equivalent performance, so that's 2 more sockets (give or take) for the AMD to equal the Intel. Assuming price scales linearly, that "server" will cost roughly $15k for AMD. Licensing costs now have to be more than $7.5k per socket (the $15k price difference between the AMD and Intel servers divided by 2 extra sockets) to make Intel the "better deal" per performance. Do you know how much an Oracle suite of products costs? I'll give you a hint: $7.5k isn't that far off the mark.
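A minimal sketch of the per-core break-even math above, using the comment's assumed prices and core counts (these are the comment's assumptions, not vendor quotes):

```python
# Assumed hardware: quad Opteron 6174 (~$10k, 48 cores) vs. quad E7 (~$30k, 40 cores).
amd_price, amd_cores = 10_000, 48
intel_price, intel_cores = 30_000, 40
hw_gap = intel_price - amd_price                           # $20k

# Equal per-core performance: AMD carries 8 more cores than the Intel box,
# so the per-core license premium must exceed $20k / 8 = $2500.
break_even_equal_perf = hw_gap / (amd_cores - intel_cores)

# If Intel is ~60% faster per core, matching it needs ~48 * 1.6 = 76.8 AMD cores,
# i.e. roughly 36 more cores than Intel (rounding down, as the comment does).
amd_cores_needed = amd_cores * 1.6                         # 76.8
break_even_faster_intel = hw_gap / 36                      # ~$555 per core

print(break_even_equal_perf, break_even_faster_intel)
```

Either way, the conclusion holds: with Oracle-scale per-core or per-socket fees, the license delta dwarfs the hardware delta.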

There are so many factors in the cost of a server that it's difficult to compare on price and performance alone. RAS is a huge one -- the Intel server targets that market far more than the AMD does. Show me two servers identical in all areas other than CPU type and socket count, and then compare the pricing. For example, here are two similar HP ProLiant setups:

Now I'm sure there are other factors I'm missing, but seriously, those are comparable servers and the Intel setup is only about $3000 more than the AMD equivalent once you add in the two extra CPUs. I'm not quite sure how much better/worse the AMD is relative to the Intel, but when people throw out numbers like "OMG it costs $20K more for the Intel server" by comparing an ultra-high-end Intel setup to a basic low-end AMD setup, it's misguided at best. By my estimates, for relatively equal configurations it's more like $3000 extra for Intel, which is less than a 20% increase in price--and not even factoring in software costs.

Yeah, but what are you gonna do with those two extra Xeons? You can't just bolt them to the side of that two-socket server. There is a huge price divide between 2- and 4-socket servers. Your numbers are totally disingenuous. You'd need to drop an extra 6 grand just to move onto the 4-socket platform; you can see that right on HP's site. For $12,000 more you get 2 extra sockets, and all four chips get upgraded to 4850s. That chip upgrade is worth 4 grand. You also get a memory upgrade to 128GB. When you also subtract a couple grand for that 64GB of RAM, you're left with 6 grand for the dual-to-quad socket upgrade.

So why is the 4-socket server $6000 more expensive, after factoring out the added parts? Is that what they charge just to install two processors? Is there really that much waste in the IT world? If so, then it's no wonder IT is being outsourced at a breakneck pace. Any IT professional who would pay HP 6 grand just to install a couple of CPUs needs to be "downsized" immediately.

As L. points out below, the Intel setup is also using the E7-4830, which is 8 core instead of 10. And then the upgraded Intel setup with 10-core CPUs also bumps up to 128GB RAM and 4 x 1200W PSUs and ends up at $26819 (with 4 x E7-4850) -- note that the AMD setup already had 4 x 1200W PSUs.

So once again, we're back to comparing apples and pears -- similar in many ways, but certainly not identical. And for that very reason, you can't even begin to make statements like L.'s "AMD wins on perf/watt/dollar" because we don't have any figures for how much power such an AMD setup actually consumes compared to the Intel setup. It might be more power than our review servers, or it might be less, but it will almost certainly be different.

My main point is that we're not even remotely close to paying 2x as much for an Intel server vs. AMD server. If you want to compare the cheapest quad-Opteron 6174 to a higher quality quad-Xeon, yes, the pricing will be vastly different; that's like pointing out that a luxury sedan costs more than the cheapest 4-door midsize sedan.

I think I saw that comment quite a few times, but I only just realized there is a big problem with your numbers: the processors listed here are E7-4830s.

Those processors have 8 cores (not 10 like the 4870) and run at 2.13GHz (not 2.4GHz like the 4870).

Assuming linear scaling (although that is absolutely not the case), you would get the same perf/watt as the model above, and a total SAP score of 52,518 vs. 47,420 for the AMD 6174 (10% more).
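For reference, here's the linear-scaling factor implied by those specs (the baseline score below is a hypothetical placeholder to illustrate the calculation, not a number from the article):

```python
# E7-4830 relative to E7-4870: 8/10 of the cores at 2.13/2.4 of the clock.
core_ratio = 8 / 10
clock_ratio = 2.13 / 2.4
factor = core_ratio * clock_ratio        # = 0.71

# HYPOTHETICAL quad E7-4870 SAP baseline; substitute the article's measured
# score here to reproduce the 52,518 estimate quoted above.
baseline_score = 74_000
estimated_4830 = baseline_score * factor

print(f"scaling factor: {factor:.3f}")   # 0.710
print(f"estimated E7-4830 score: {estimated_4830:,.0f}")
```

So the down-clocked 8-core part is modeled at roughly 71% of the flagship's throughput, before accounting for the sub-linear scaling the comment itself flags.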

And the price is 18% more... looks like the perf/watt/dollar crown goes to AMD again.

Other tests are clearly impossible to guesstimate, and clearly the SAP test was where the E7 had a bigger advantage compared to the vApus Mark II test.

So yes, the argument stands: even though Intel has higher-priced extreme components, anything they offer in AMD's performance range is more expensive than AMD's option - quite logical, with AMD the underdog so far.

BUT, as we're shown benchmarks of flagship vs. flagship, there tend to be misconceptions about the rest of the product line, just like here: "20k more expensive", "so much better CPU from Intel", and other random bullcrap.

What software does the average business need to run on a ridiculous number of cores? The only common application that comes to mind is internet-facing Linux/Apache servers, and Linux/Apache are free (anyone dumb enough to pay for and use Windows/IIS deserves what they get).

Most businesses just need a lot of VMs running their various low-intensity apps, and dedicated NAS or SAN devices. Magny-Cours Opterons do that just as well as or better than Xeon, and using VMs to run licensed apps won't incur a software-licensing penalty.

No, this is just another CPU that *could* serve most businesses. You could use it in the "lots of non-intensive apps running in VMs" scenario (or any other scenario); it's just an exceptionally poor value, and will probably not outperform Magny-Cours or Bulldozer in most people's real-world use. This CPU excels at benchmarking, and that's about it.

AMD got server CPUs right: it's all about how many cores you can fit in a rack.

Your example of internet-facing apps on Linux/Apache is the EXACT opposite of how things are designed in the real world.

In the real world, internet apps run on the cheapest of the cheap servers and companies just use a ton of them behind a Load Balancer.

Now the database serving those web servers in the background, running Oracle RAC or MS SQL or even MySQL, will on the other hand make use of all these cores and memory, assuming you have a large database.

The examples given RIGHT IN THE ARTICLE, like SAP, are probably the most common things run on these big-iron boxes.

If it helps prove my point any further: over at HP, on the sales side of things, the guys who have been selling RISC-based machines under the HP Integrity/Superdome name for something like a decade are now also being paid commission when they sell the DL580/DL980 G7 servers. Those two models use the Nehalem-EX and will soon be using the Westmere-EX. So the types of apps running on these CPUs are often the same things Fortune 100 companies used to run on Integrity/Sun/IBM Power/etc.

When you spend $100,000+ on the software running on it, the hardware costs don't matter. Recently I was in a board meeting for launching a new website that the company I work for is going to be running. These guys don't know/care about detailed specs. They simply said, "Cost doesn't matter, just get whatever is fastest."

Many of the uses for this class of server involve software that won't scale across multiple boxes due to network latency or monolithic design. The VM farm test was one example that would, but the lack of features like ECC support would preclude it from consideration by 99% of the buyers of godbox servers.

I think more and more people are realizing that the issue is about failing to scale linearly, more than anything like ECC. Buying a bulletproof server turns out to cost way too much money (I mean ACTUALLY bulletproof, not "so far, this server has been rock solid for me").

I read an interesting article about "design for failure" (note: NOT the same thing as "design to fail") by Jeff Atwood the other day, and it really opened my eyes. Each extra 9 in 99.99% uptime starts costing exponentially more money. That raises the question: should you be investing more money in a server that shouldn't fail, or should you be investigating why your software is so fragile that it can't accommodate a hardware failure?
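To put numbers on "each extra 9", here's a quick sketch of the downtime budget each availability target allows per year:

```python
# Allowed downtime per year for successive availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

for avail in (0.99, 0.999, 0.9999, 0.99999):
    downtime = (1 - avail) * MINUTES_PER_YEAR
    print(f"{avail:.3%} uptime -> {downtime:8.1f} min/year of downtime")

# Each added 9 shrinks the budget 10x:
# 99%     -> ~5256 min  (~3.65 days)
# 99.9%   -> ~525.6 min (~8.76 hours)
# 99.99%  -> ~52.6 min
# 99.999% -> ~5.3 min
```

Which is the heart of Atwood's point: past three or four nines, the cheaper way to buy the next 9 is usually software that tolerates failure, not hardware that never fails.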

I dunno. Designing and developing software that can work around hardware failures is a very difficult thing to do.

I did not see anything in the article about RAS, or at least my understanding of the acronym as it's used in IT. Are you using it to mean "Reliability, Availability, and Serviceability"? If so, where was that addressed in the article? If not, what was RAS supposed to mean?

I second this comment. You mention that the new Xeons have excellent RAS features but do not describe a single one.

How about an article on that topic? And a comparison to Opteron and Itanium while you're at it? I have no clue about IBM or SPARC chips (Itanium is my daily bread), so I'd be very interested in such a comparison.

The last thing I saw from a Nehalem Xeon was that it threw an MCA and rebooted the box. The only benefit was that it enabled some diagnostics. An Itanium system would deconfigure the CPU and boot stable with one less socket. The Xeon system just kept rebooting at the same point over and over again.

Go back and read the reviews of the Nehalem-EX from 9 months ago. There are no major new RAS features in Westmere-EX that I'm aware of, as it's a die shrink rather than a major feature change.

One of the things I remember was the ability to identify and disable a bad DIMM, or even a bad memory chip within a DIMM, in such a way that (if the OS supports it) the machine wouldn't crash and could keep running. It also supports memory sparing, so you can even load some extra memory in there to take over for the bad DIMM.

Well... if that's all the Intel 32nm process has to offer, I believe I can say there's blood in the water.

The "crappy" old phenom-2 based Opterons are in fact keeping up in perf/watt WITH ONE LESS DIE SHRINK.

This is just huge... it means that unless AMD manages to fuck up Bulldozer extremely badly (as in making it worse than the Phenom II), the die shrink alone will give them a clear perf/watt advantage.

Add in the speed gained through the new process and the Xeons will look like power-hungry, overpriced pieces of junk... and that's before assuming the Bulldozer architecture is any better than the Phenom II.

Intel CPUs, especially the Nehalem/Westmere families, just outright sell themselves. For whatever reason, and I can't explain it myself, the AMDs just don't sell as well.

Personally I love the new AMD line for servers. They use the same CPUs for high-end 2P and all 4P servers. All the CPUs have the same memory speeds and loading rules. Quad-channel memory even on 2P. They give you cores-o-plenty (this can be a downside in the world of Oracle).

Then they have a much cheaper 1P/2P option with half the cores and dual-channel memory. Each CPU family only has like 5-6 CPUs as well. It's such a simple lineup that it's easy for an enterprise customer to standardize a large cross-section of the DC.

Now look at Intel. 1P is the 3000 family, 2P is the 5000 family, 4P is both the 6000 and 7000 families, and 8P is usually the 7000 family. 1P/2P and 4P/8P have different memory designs, including tri- vs. quad-channel, and on 1P/2P you get different memory speeds depending on what model CPU you buy. Which is really fun, because they have like a dozen or more CPU models in each of the 1P and 2P lines.

So even though AMD seems like the better choice, Intel is still dominating the market. Sandy Bridge 2P servers will be out before the end of the year. Right now it looks like Bulldozer might beat them to market by a matter of a few months. If AMD slips that date, Intel will still have quite a competitive product, and BD had better be basically FLAWLESS.

So for the next gen servers, I think the purchasing habits of most companies will not change unless AMD pulls a major rabbit out of their hat.

I have trouble understanding you: Sandy Bridge 2P servers will be out before the end of the year?

Aren't they out yet ?

And even if they're there, they will NOT compete with the AMD chips. As I said above, a 45nm Phenom II-based Opteron is as power efficient as a 32nm Xeon... lolwut?

The only thing that could be bad for Bulldozer is 22nm Ivy Bridge, IF it comes out as Intel planned it - and even then, it's only a repeat of the same core arch.

If Bulldozer is no more efficient than the Phenom, AMD will still win in perf/watt/dollar until Ivy Bridge is out, and then the only advantage will be the 3D gate, which Intel said would amount to a dozen percent improvement over standard 22nm.

In summary: if the Bulldozer architecture is 12% more efficient than the Phenom II, then Bulldozer will destroy Westmere-EX at the same process and face Ivy Bridge as an equal.

Considering the design options AMD picked for Bulldozer, I'm quite confident it'll be at least 12% more efficient through architecture alone.

And even if Intel is good at marketing, AMD has been gaining share and will gain more in the future.

Intel said this? "With their latest chip, Intel promises up to 40% better performance at slightly lower power consumption."

Well, that means shrinking from 45nm to 32nm yields a 30% (pinch of salt ;) ) improvement.

Make no mistake, Bulldozer will totally kill the Sandy Bridge-based offerings, by at least a 30% margin on perf/watt/dollar, and I would expect it to be in the 40-50% range with the architecture changes.

Nobody ever got fired for buying IBM. Or, these days, Intel and Microsoft.

By the time you price out an HP ProLiant with AMD CPUs, it's the same price as or more than an Intel-based server, or maybe just a little cheaper. And the AMD CPUs do a lot worse on benchmarks that test more real-world performance, like database OLTP and other common server tasks.

Err... no, it's not the same price. Besides, "a lot worse on benchmarks" is a huge pile of shit; if that were the case, why would Cray and others pick Opteron for supercomputers? Why, in fact, would anyone go out of their way to buy a chip from the underdog?

Believe me, even if you see a lot favoring Intel, there's a lot favoring AMD that's less visible but there regardless.

As I said: the same perf/watt in the Anand benchmarks for two chips that are a die shrink apart... this is ludicrous.

Looking at the power consumption and results, it is clear to me that AMD is better in perf/watt. Even with an outdated platform they manage to perform better with the same current draw. (Why no tests with Magny-Cours, again?)

Why aren't there any perf/watt figures? If you look at the data, it is clear that an old AMD platform offers superior perf/watt. I also noticed that tests with Magny-Cours as competition seem to be missing?

The Opteron 6174 and 6176 are Magny-Cours processors, so they were indeed tested. I believe the choice in using the 6174 for the majority of the review would be down to the 6176's higher TDP as discussed at the following link:

I have very much enjoyed reading this article, as well as the previous one about Intel's 4P systems, which are in fact very interesting, I must say, because benchmarks of such systems are very rare to find on the net.

I would only like to ask anandtech.com if it is possible to see any rendering benchmarks on such a system. I am using some 3D software, and I would very much like to see how the AMD system with the 48 cores does when rendering (with mental ray and V-Ray, preferably).

I focus on the AMD server system because it really is a very good price-per-performance example. Intel is indeed ahead performance-wise, but the prices for an Intel 4P system are astronomical, to say the least. I personally believe that very few companies need such a thing, while most of them can do well with an AMD system.

And since you are talking about virtualization: if a company needs more power, just buy another 4P AMD system. The overall result will be a faster system (by far) than a single Intel one, while having the price of a single Intel server! ($3K+ for a single Intel CPU is just outrageous; Intel charges as if there were no one else on this planet with an equivalent product, at least for the x86 market.) Granted, two AMD systems will use a little more power than a single Intel one (~1150W for 2 AMD servers instead of ~900W for a single Intel, based on the info in this article, which is not that much more if you think of the performance you gain).

Then again, there are the infrastructure costs: more 10-gig ports for the extra system, extra UPS load and thus more UPS capacity, extra rack space, and of course extra cooling for the second system. I think these issues are the real deal and will hence make the final decision.

Anyway, that's all I wanted to say. Again, I only wanted to ask for some rendering benches; if it's such a hassle, then at least a mere Cinebench 11.5 run would be fine.

Hey, I'm thinking of building a multicore box for RAW processing. I'm wondering if you could benchmark Bibble on these many-core systems, preferably with an A900 or a 5D Mk II, and also using the wavelet denoise and wavelet sharpening plugins, as those are what I use most often. I'm wondering about import and preview speed, and also the speed of exporting to JPEG. Let me know if it's possible to do these benchmarks; if you need source files and config sets, I have some 8-16 gig sets.