Post Your Comment

46 Comments

We have a number of these in our data center and they have been a disappointment. Single threaded performance is low and the memory sharing performance under VMWare is poor. That leaves it only competitive for DB and web servers work which is OK but it doesn't make a compelling case for the architecture as similar Intel offerings perform well in all tasks. AMD still has a small price advantage, but once you add the VMWare licensing and data center costs the percentage difference is negligible.Reply

Typical answers, I debate such results all the time with many IT departments all over the world so called "standardized on and runs bette on" statements. So you have a large number of these new Opteron 6200 series already in datacenter and already got this info out of it, yeah right. Our virtual datacenters with approx 1000 servers exist out of AMD based systems, the only time i thought about swapping to intel after performance/price/power review was when intel released the Nehalem (oh and perhaps the Socket R but can't disclose that yet, neither do I want to swap that fast knowing that abu dhabi is about to go in Beta samples already). In the low price 2s-4s (not talking about the over expensive EX versions) the AMD are still owning the virtualization with there multicore and more memory channels with lower price.

Poor single thread has indeed been an issue for magny cours to a certain extend but not that it is noticable on a normal level of applications (you sound like a superpi user that only looks at theoretical results), poor memory sharing, care that to explain :). I suggest you have a look at general best practices on power settings for virtualized environments before complaining about response and throughput.These response time result measured here are not noticable in any general application and once adding some kernel overhead from NFS/iscsi or whatever in medium - high load servers forcing these tasks to HT cores will show quite a bit different result in platform performance. Anandtech Vapp results from are nice but are still not a full reflection of datacenter performance and the results are heavy influenced by the webapp which clearly seems to favor the Intel architecture. Neither does have the Vapp testing any iscsi/nfs kernel related tasks which many sites do have (to reduce infrastructure cost)

Vmware licensing cost more for AMD? only for the enterprise edition, the abandoned vmware license version which was introduced again for v4 because of OEM pressure... time to investigate more before buying anything... advanced - enterprise plus has no difference for any core you would select and neither does v5 have.Reply

NFS/iSCSI: you seem to ignore the fact that besides OLTP, many apps (especially OLAP and web) run mostly out of memory. The whole idea of good data management is to make sure that your requests come out of memory. We have webcaches, database caches, file system caches, RAID caches and on disk caches...all these caches are made to make sure that the response time is NOT dominated by NFS/iSCSI.

We have 5 years of experience building our vAPUS stresstesting client (not vApp) so don't discard our results so quickly. Reply

I won't discard them as I said they are a of great value just like these testing posted today, but that doesn't change the fact that in current vApus stresstesting there is no use of NAS/ISCSI datastores which is very common these days now that 10G is affordable and creates additional overhead on systems, mostly not accounted for when selecting a platform. Sure for the review real performance conclusion are needed from cpu architectures and then this setup remains totally applicable. But you get people which i tried to answer before who take this final score result for granted and leverage it over a total platform as if that is the best choice.

Second remains the fact that the final vApus calculation is based on all scores and the web based VM score is unbalancing the final score. I mentioned that years ago when the vApus1 and 2 were introduced, back in the old days.Reply

In general I think it would be added value to mention what exact BIOS and power mngmnt settings have been set. Since the option exist of using PCC controlled power (BIOS - OEM) or through the OS, also settings like CE6 etc does influence the final results a lot on turbo etc, mainly towards your preliminary review and this one it's not always clear what exactly you have been using.

result wise it is very strange that the 16core does not scale further then the 12core in SQL, for the reference testing that would give clear results when you could have tested the 6234-6238 which are also 12core versions. It is hard to believe that these 16core do not scale further, that they lack about 10-15% IPC @ same ghz sure but not raw performance in core count.

Debatable is the fact that while using Dual rank 4MC on AMD and using 2 rows Dual rank 3MC on the intel, dual row dual rank will give more bandwidth, its not AMD fault they have 4MC vs intel only 3MC.... even Intel next gen will have a serie of 3MC and 4MC..... but this should not result in major differences.

Last point, which IS a fact --> the price compare. While listed price might be comparable, reallity is quite a bit different in retail sales price. For large volume handling the discount between vendor is huge, the discount that Intel allows on it's CPU is way less then AMD, this changes the final cost price a lot.

"dual row dual rank will give more bandwidth"You can easily reverse that argument: if I use 4 GB DIMMs on the AMD, the clockspeed of the DIMMs will throttle back to 1333. The AMD IMC can only use 1600 MHz with 1 DIMM per channel. So this is really the best case for AMD.

"For large volume handling the discount between vendor is huge"ok. 1. those people are probably 0.001% of our readership. 2. Those prices are a moving and unknown target. Reply

yes I have seen those settings, but during the review with so many back and forth testing showing issues on the power and bios settings from a reader perspective it is no longer clear what exact bios settings have been used (ms os - hypervisor) and what has been used and if results were updated with the right BIOS settings.

I am not yet convinced that the 1600 is really added value for the 6200 series, perhaps it will be more added value on piledriver with enhancements. While 8GB ram prices did drop it is still not a default selection certainly not on 1600 speed. I did mention debatable memory :) did you test the difference in the Database benchmarks?

Its not only large volume where the discount is even greater, just look at the HP website and order 3 "similar" designs from the same vendor:

I know that you and I have spoke a little bit offline about possibly doing more HPC and HPC-related testing.

I, for one, would still like to see more of that because I think that's one area that a) is underserved by hardware review sites (sometimes with good reason), b) I think that it stresses the CPUs more/harder, and c) you can create/do/use a consistent benchmark test cases or suite of applications (like you mentioned about the SPEC OpenMP (although they have an MPI one as well, which I think is probably going to be even better).

I think that the biggest downside is that the HPC applications DO take a fairly significant time to run. (Some of them runs for days on end - just to do one pass).

And you can always throw more Hypervisors onto these systems, but I don't think that they're nearly as taxing as when you're running a computationally heavy/demanding application like simulating a car crashing into a wall at 35 mph. :oD

And it's quite possible that you might be able to script the entire benchmarking process...Reply

I too would love to see more HPC related benchmarks. Finite Element Analysis (FEA) or Computational Fluid Dynamic (CFD) programs scale very well with increased core count, and are something that is highly CPU dependent. I've found it very difficult to find good performance information for CPUs under this load.

I'd be happy to help out developing some benchmark problems if need be.Reply

These would indeed be interesting benchmarks to see. These workloads are very floating point heavy so I imagine that the new Opterons will perform poorly. 16 modules won't matter when they only have 8 FPUs. Of course, I am speculating here.

Going forward, these types of workloads should be moving toward GPUs rather than CPUs, but I understand the burden of legacy software.Reply

GPUs are pretty poor for general purpose HPC. If someone wants to fork out tons of $$$ to hack their problem onto a gpu (or they get lucky and somehow their problem fits a gpu well) that's fine but not really smart considering how short release cycles are, etc.

I have access to a quad socket magny cours built mid last year. In december I put together a sandy-e 3930k portable demo system. Needless to say the 3930k had at least 10% more throughput on heavy processing tasks (enabling all intel sse dropped in another 15%). It also handily beat our dual xeon nehalem development system as well. With mixed IO and cpu heavy loads the advantage dropped but was still there.

I'd love to be able to test these new amds just to see but its been much easier telling customers to stick with intel, especially with this new amd cpu.Reply

I think I'd rather see some benchmarks based around Java EE6 and an appropriate container such as Jboss AS 7. I'd also like to see some Java 7 application benchmarks ( server oriented ).

I'd also like to see some custom Java benchmarks using Akka library so we can see some Software transactional memory benchmarks. Possibly a node.js benchmark as well to see if these new technologies can scale.

What I've seen here is that the enterprise circa 2006 has a love hate relationship with AMD. I'd also like to see some benchmarks of the Intel vs AMD vs SPARC T4 in both virtualized and non virtualized J2EE environments. But this article does have some really interesting data.Reply

MySQL has to be the absolute worst possible choice for testing multi-core CPUs (as evidenced in this review). It just doesn't scale beyond 4-8 cores, depending on CPU choice and MySQL version.

A much better choice for "alternative SQL database" would be PostgreSQL. That at least scales to 32 cores (possibly more, but I've never seen a benchmark beyond 32). Not to mention it's a much better RDBMS than MySQL.

MySQL really is only a toy. The fact that many large websites run on top of MySQL doesn't change that fact.Reply

This is a very good point. While it can be done, it's very fiddly to get MySQL to scale to many CPUs, much simpler to just shard the database and run multiple instances of MySQL. (And replication is single-threaded anyway, so if you manage to get one MySQL instance running with very high inserts/updates, you'll find replication can't keep up.)

Same goes for MongoDB and, of course, Redis, which is single-threaded.

We have ten large Opteron servers running CentOS 6, five 32-core and five 48-core, and all our applications are sharded and virtualised at a point where the individual nodes still have room to scale. Since our applications are too large to run un-sharded anyway, and the e7 Xeons cost an absolute fortune, the Opteron was the way to go.

The only back-end software we've found that scales smoothly to large numbers of CPUs is written in Erlang - RabbitMQ, CouchDB, and Riak. We love RabbitMQ and use it everywhere; unfortunately, while CouchDB and Riak scale very nicely, they start out pretty darn slow.

We actually ran a couple of 40-core e7 Xeon systems for a few months, and they had some pretty serious performance problems for certain workloads too - where the same workload worked fine on either a dual X5670 or a quad Opteron. Working out why things don't scale is often more work than just fixing them so that they do; sometimes the only practical thing to do is know what platform works for what workload, and use the right hardware for the task at hand.

You missed something: it does scale beyond 12 Xeon cores, and I estimate that scaling won't be bad until you go beyond 24 cores. I don't see why the current implementation of MySQL should be called a toy.

PostgreSQL: interesting several readers have told me this too. I hope it is true, because last time we test PostgreSQL was worse than the current MySQL.Reply

Unless I'm mistaken, the Xeon 5650 is a 1.17B transistor chip, where the Interlagos 6276 is a 2.4B transistor chip.In that light, doesn't that make Intel's SMT implementation a lot better than CMT?I mean, yes CMT may give more of a performance boost when you increase the threadcount. But considering the fact that AMD spends more than twice the number of transistors on the chip... well, that's pretty obvious.AMD might as well just have used conventional cores.The true strength of SMT is not so much that it improves performance in multithreaded scenarios, but that it does so at virtually no extra cost in terms of transistors (and with little or no impact on the single-threaded performance either).Reply

Interlagos is 1.2 billion chip (maybe 1.3 but anyway). Most of those transistors are spend on the L3 cache: about 0.5 billion. Only 213 million transistors are in a module and each module contains a 2 MB L2-cache, probably good for 120 million transistors. That leaves 90 million transistors to the core, and it has been stated that the second cluster added 12%. So that second cluster costs about 12 million transistors, or 48 million on the total 4 module die. That is less than 5% of the total transistor count but you get a 30-90% performance boost!

So for AMD, this was clearly a great choice.

SMT is perfect for Intel, as the Intel architecture puts all instructions in one big ROB.

For very low IPC serverworkloads, I think the CMT approach gives better results. Unfortunately AMD lowered some of the CMT benefits by keeping the datacache so small and the low associativity of the Icache. Reply

Uhhh, I think you're wrong here... the 4-module Bulldozer is a 1.2B chip (Zambezi). But you tested the 8-module Interlagos (16 threads), which is TWO Zambezi dies in one package.Hence 2*1.2 = 2.4B transistors.Reply

Not in the article, because you did not factor in transistor count (which is the flaw I tried to point out in the first place... comparing two chips, where once is twice the transistor count of the other, is quite the apples-to-oranges comparison. One would expect a chip with twice the transistorcount to be considerably better in multithreading scenarios, not 'catching up' to the smaller chip).

But in your above post, I think it changes everything about your analysis. All your figures have to be done times two.Which makes it a very poor comparison, not only to Intel, but also to AMD's own previous line of CPUs.The 6174 Magny Cours is actually beating Interlagos, with 'only' 12 threads, no kind of CMT/SMT, and 'only' 1.8B transistors.

What we can tell from this article is that if you want to run a single instance of MySQL as fast as possible and don't want to get involved with subtle performance tuning options, the Opteron 6276 is not the way to go.

I'm not sure if you bothered to read the entire article, because MySQL was not the only database that was tested.There were also various tests with MS SQL, and again, Interlagos failed to impress compared to both Magny Cours-based Opterons and the Xeon system.Reply

Again, it is not CMT that makes AMD's transistor count explode but the combination of 2x L3 caches and 4x 2M L2-caches. You can argue that AMD made poor choices concerning caches, but again it is not CMT that made the transistor count grow.

I am not arguing that AMD's performance/billion transistors is great. Reply

I think you are looking at it from the wrong direction.You are trying to compare SMT and CMT, but contrary to what AMD wants to make everyone believe, they are not very similar technologies.You see, SMT enables two threads to run on one physical core, without adding any kind of execution units, cache or anything. It is little more than some extra logic so that the OoOE buffers can handle two thread contexts at the same time, rather than one.

So the thing with SMT is that it REDUCES the transistorcount required for running two threads. By nearly 100%.CMT on the other hand does not reduce the transistorcount nearly as much. So if you are merely looking at an 'exposion of transistor count', you are missing the point of what SMT really does.

Other than that, your argument is still flawed. Even an 8-thread Bulldozer has a higher transistor count than the 12-thread Xeon here. It's not just cache. CMT just doesn't pack as many threads per transistor as SMT does... and to make matters worse, CMT also has a negative impact on single-threaded performance (which again, if you are looking at it from the wrong direction, may look like better scaling in threadcount... but effectively, both with low and high threadcounts, the Xeon is the better option... and this is just a midrange Xeon compared to a high-end Interlagos. The Xeon can scale to higher clockspeeds, improving both single-threaded and multithreaded performance for the same transistorcount).

So what your article says is basically this:CMT, which is nearly the same as having full cores, especially in integer-only tasks such as databases, since you have two actual integer cores, has nearly the same scaling in threadcount as conventional multicore CPUs.Which has a very high 'duh'-factor, since it pretty much *is* conventional multicore.It does not reduce transistorcount, nor does it improve performance, so what's the point?Reply

No, it improves throughput, assuming we are talking from improvement going from 1 physical core to 2 logical cores.Clearly two logical cores (on the same physical core) have less throughput than two physical cores, but that's obvious since you only have half the hardware.

And that, together with the fact that Intel's SMT chips have far better single-threaded performance to begin with, results in very good performance per die area (you know, that thing that people used to praise AMD GPUs for).

"Yes, it does, via the implementation of all that shared hardware on the chip."

You can't say that, since there is no non-modular version of Bulldozer (just as there is no non-HT version of the Intel architectures).However, if you compare a 4-core HT architecture with a non-HT architecture, be that a Core2 Quad or a Phenom X4, Intel's transistorcount is clearly in the same ballpark, so HT does not add much in terms of transistorcount.

With CMT we see little or no indication of reduced transistorcount. AMD's 4-module chips are MUCH larger than regular 4-core chips have been. In fact, AMD"s 4-module design is even larger than Intel's 6-core design with HT.

"Two different approaches to the same idea."

I disagree. SMT is a very different idea from CMT (which is a bogus marketing term invented by AMD anyway). CMT is more of a marketing excuse for not having proper SMT, and shows no merit in actual silicon.

"but I don't think we can label one as inherently better than the other yet."

Well clearly we disagree on that then.I think SMT is clearly inherently better than CMT. SMT has far more flexible sharing of resources than AMD's half-baked approach.And any theoretical disadvantages (fighting over resources and all that) can be put to bed with benchmarking such as in this review: the disadvantages may exist, but the net performance is unbeatable anyway. A midrange Xeon schools a CMT-based chip of twice the size.Reply

Well, there are a lot of factors affecting single-threaded performance in real life. So CMT indeed has its scaling advantages as tests suggested. At least most of the things should be constant when comparing CMT-on and CMT-off, while comparing SMT and CMT on different implementations is not. Lack of single-threaded performance is not a valid point of blaming CMT.

If you want to *proof* CMT is a half-baked marketing crap while SMT is the only solution, what you need is a SMT-based AMD BD monolithic core or a CMT-based Intel monolithic module for comparison.

For the transistors counting, well, that's their choice of making such a cache and uncore configuration. You can keep telling 4-module chip is blahblahblah, but in some cases it beats a 4C8T Xeon chips. Transistors is not a big matter from customer viewpoint but just the producer viewpoint. If you want to argue with GPU's performance metrics, GPU is a data-parallel processor with bunch of logic units, while CPU is a latency-sensitive girlfriend of caches. Large amount of cache can make your Performance/mm^2 or Performance/transistors look worse. So trade-offs on the amount of cache should have been done before they started to design the chip.Reply

Well, one of the reasons why AMD's current CPUs have such poor single-threaded performance is because they moved from 3 ALUs per thread to 2 ALUs per thread.This is part of the whole CMT design.So in that sense, CMT can be blamed for the poor single-threaded performance at least.And since single-threaded performance is so bad, it is only logical that scaling to more threads is relatively good.On a CPU with faster single-threaded performance, you run into IO limits sooner (memory, disk etc), so it is more difficult to maintain similar scaling with increased thread count.

The strength of SMT is that Intel did not have to cut any ALUs when implementing HT. Pentium 4 Northwood with HT still had two double-pumped ALUs, like the non-HT Willamette that went before it.Likewise, Core i7 still has 3 ALUs, like Core2.Another strength of SMT is that even with one less ALU per 2 threads than CMT, it still reaches similar performance in multithreaded scenarios. CMT can not share these ALUs between threads, while SMT can.Conclusion: CMT is nonsense.For the full version, see: http://scalibq.wordpress.com/2012/02/14/the-myth-o...Reply

read the article, the baseline they use for price/performance is based on spec results....lots of companies still use these results to decide on a platform.

but then again, benchmarks don't always show the real world value or even hard to compare since many have in house applications that don't scale or scale different like the ones benchmarked in reviews. 90% of the datacenters don't even require more then any midrange cpu, anything above midrange is wasted money and both vendors provide more then adequate solutions to that. It's the human mind that is often blocking sanity. Investing that wasted money in other solutions often provide a better total performing solution.Reply

But if u really want to see what the true story is, have a look at AMD's stock price lately, and their server wins. They absolutely smoke intel on virtualization, and anything that requires a lot of threads. It's not even close.Reply