My last post generated quite a bit of discussion, some of it based on misunderstandings. In this post I'll try to make a few things clearer. In a previous post, I pointed out that there are good indications that a dual Nehalem EP has a 40 to 100% advantage over Shanghai (depending on the application, based on the SAP and Core i7 workstation benchmarks).

Because a SAP benchmark gives a hyperthreaded core a 100% advantage, you already think that it will scale in all applications. You should know better than that before posting such nonsense. Why don't you wait for the real performance charts before you post? Right now you have zero backup for your comments. If it's true then let it be, but if you can't back this up then just shut up. You just sound like paid blue marketing.

Let's see: choosing between 12 real cores or 8 real + 8 virtual in virtualization is not going to cut it for Nehalem. Although VMware changed the code to work better with HT, they also believed this should not be seen as a real core!

"In well threaded applications, the best a "hex-core Shanghai" can do is give about a 30-40% boost to performance compared to the current Shanghai, which is most likely not enough to close the gap with the upcoming Nehalem CPU (let alone the 32 nm hex-core version). However, Istanbul is more than a hex-core Shanghai. The improved memory controller and HT-assist can lower the latency of inter-CPU syncing and increase the effective memory bandwidth. For that reason, Istanbul will do better than just "a shanghai with 2 added cores" in many applications such as SAP, OLTP databases, Virtualization scenario's and HPC. Depending on the application, Istanbul might prove to be competitive with the quad-core Nehalem. It is clear that the hex-core "Westmere" which will have a slightly improved architecture will be a different matter."

So you begin the argument by saying "which is most likely not enough to close the gap with the upcoming Nehalem CPU", and then you close the argument with "Depending on the application, Istanbul might prove to be competitive with the quad-core Nehalem"? Which one are you on? Oh I see, the operative word here is "might". Yeah, it is hard to speculate about performance numbers when you don't have Istanbul silicon in hand, isn't it? I have been very generous when extrapolating Istanbul performance by assuming "linear scalability", which is probably the best-case situation for AMD. It still would not overcome the 100% advantage Nehalem-EP has over Shanghais.

Another funny thing is that the 41GB/sec Stream Bandwidth benchmark you posted, I am not sure that it is from the current Nvidia 3600 chipset. Theoretically, quad socket dual channel DDR2-800 cannot produce that amount of bandwidth in the first place.

The only thing that you presented that is true is the 32DIMM argument. But you failed to point out the 32 DIMM disadvantage on Opterons. When you use 8 DIMMs per socket, the memory bus downclocks to DDR2-533, which isn't much of a loss on Barcelona, where the default is 667 MHz, but it is a huge downclock for Shanghai's and Istanbul's DDR2-800. All benchmarks published on websites are done with 4 DIMMs per socket, which operate at full speed. When you push 8 DIMMs per socket, you get dual channel DDR2-533, which is close to a 33% performance degradation, something you fail to mention.
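The size of that downclock can be checked with simple arithmetic. The following is a minimal sketch, assuming a 64-bit (8-byte) data bus per channel and nominal transfer rates; the function name is made up for illustration. Note that it computes theoretical peak bandwidth per channel, which is not necessarily the same thing as application performance.

```python
# Peak per-channel bandwidth for a 64-bit (8-byte) DDR2 channel.
# Assumption: nominal transfer rates (533 rather than 533.33 MT/s).
# This is theoretical peak bandwidth, not measured application performance.

def peak_channel_bandwidth_gbs(mega_transfers_per_s, bus_bytes=8):
    """Return the peak bandwidth of one memory channel in GB/s."""
    return mega_transfers_per_s * bus_bytes / 1000.0

full_speed = peak_channel_bandwidth_gbs(800)   # 4 DIMMs per socket
downclocked = peak_channel_bandwidth_gbs(533)  # 8 DIMMs per socket, per the comment
reduction = 1 - downclocked / full_speed

print(f"DDR2-800: {full_speed:.3f} GB/s per channel")
print(f"DDR2-533: {downclocked:.3f} GB/s per channel")
print(f"Peak bandwidth reduction: {reduction:.1%}")
```

Under these assumptions the reduction in peak bandwidth comes out at about 33%, which matches the figure in the comment. How much of that shows up in a real workload depends on how bandwidth-bound the workload is.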

Now, I don't know many virtualization customers here who push 8 DIMMs per socket on their 4S servers. Database people are different, because even with a slower memory bus, memory is still faster than a disk seek.

Then your argument on "scaling up" vs "scaling out". It isn't a philosophical question. People without software expertise to "scaling out" would have to "scale up". This is particularly true for databases because they are hard to "shard". Another reason is that when you buy expensive commercial licenses like Windows Server Enterprise, Oracle, SQL Server Enterprise or VMware ESX Server, the licenses themselves force you to purchase the largest box money can buy.

Given that 2S Nehalem is likely to cannibalize Intel's own 4S Dunningtons, I expect to see Beckton released very soon. I don't see how Intel can allow Nehalem to eat away at its own fat cash cow.

BTW, your HP SSD denial article is still missing, and March 25, 2009 isn't much better than March 1, 2009.

[quote]
So you begin the argument by saying "which is most likely not enough to close the gap with the upcoming Nehalem CPU", and then you close the argument with "Depending on the application, Istanbul might prove to be competitive with the quad-core Nehalem"? Which one are you on?
[/quote]

How hard can it be to understand that a theoretical Sixcore version of Shanghai can not keep up with Nehalem, while the extra improvements (HT assist, mem controller) of Istanbul might bring the Istanbul CPU closer?

[quote]
It still would not overcome the 100% advantage Nehalem-EP has over Shanghais.
[/quote]

So by your reasoning Nehalem always has a 100% advantage? It is the best Server CPU Intel has brought out in years, but let us keep it sensible, shall we?

[quote]
Another funny thing is that the 41GB/sec Stream Bandwidth benchmark you posted, I am not sure that it is from the current Nvidia 3600 chipset. Theoretically, quad socket dual channel DDR2-800 cannot produce that amount of bandwidth in the first place.
[/quote]

It is that funny because it shows you are extremely critical for someone else, but for your own posting you are pretty sloppy.
4 Sockets x 2 Channels x DDR800 * 8 Byte/channel= 51.2 GB/s is the theoretical maximum.

[quote]
Then your argument on "scaling up" vs "scaling out". It isn't a philosophical question. People without software expertise to "scaling out" would have to "scale up".
[/quote]

In case of virtualization (and that is what I was talking about), it is a lot easier to make software scale out or scale up. For example, a badly scaling PHP site (hard to scale up) could be divided into several VMs, and be one large NLB cluster. In this way you can both scale out (few VMs on many servers) and up (many VMs on few 4S-8S servers).

[quote]
"The only thing that you presented that is true is the 32DIMM argument."
[/quote]

And you are the ultimate judge on that right? And yet I have already demonstrated 4 serious errors in your reasoning. And I didn't even talk about your completely off 37 GB/s (6/4 * 25 GB/s) calculation in your first post. As if Stream scales with the number cores.... (Remember Dunnington??)

I hope that you can keep the discussion more respectful instead of always jumping the gun. I have been in this business for 10 years, and I have always learned a lot from good debates. So I have no problem with people pointing out in a respectful manner that I made a technical or reasoning error.

But constantly shouting that "you are so wrong" while you build up posts full of factual errors is simply a waste of time.

[Quote]
How hard can it be to understand that a theoretical Sixcore version of Shanghai can not keep up with Nehalem, while the extra improvements (HT assist, mem controller) of Istanbul might bring the Istanbul CPU closer?
[/Quote]

Since when isn't Istanbul a six-core version of Shanghai? The extra improvements such as HT assist are there because otherwise the six cores won't experience linear scaling. The memory controller is another issue I will point out later. All available data shows that 2S Nehalem is equivalent to 4S Shanghai right now. The problem I have with you is your constant usage of the word "might", which indicates that you don't have data to back up your assumption that the extra improvements like HT assist can overcome the 100% performance per watt advantage of Nehalem-EP vs Shanghai.

[Quote]
So by your reasoning Nehalem always has a 100% advantage? It is the best Server CPU Intel has brought out in years, but let us keep it sensible, shall we?
[/Quote]

No, not always, but in anything memory related, yes, triple channel DDR3-1333 will have double the bandwidth of dual channel DDR2-800 per Socket.
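The per-socket comparison can be made concrete. This is a minimal sketch of the theoretical peak figures only, assuming an 8-byte data bus per channel and the nominal transfer rates named in the comment; the helper name is hypothetical, and measured Stream results will be lower than these peaks.

```python
# Theoretical peak memory bandwidth per socket.
# Assumption: 64-bit (8-byte) data bus per channel, nominal transfer rates.

def socket_peak_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Return the peak bandwidth of one socket in GB/s."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0

nehalem_socket = socket_peak_gbs(3, 1333)  # triple channel DDR3-1333
opteron_socket = socket_peak_gbs(2, 800)   # dual channel DDR2-800
ratio = nehalem_socket / opteron_socket

print(f"Triple channel DDR3-1333: {nehalem_socket:.1f} GB/s")
print(f"Dual channel DDR2-800:    {opteron_socket:.1f} GB/s")
print(f"Ratio: {ratio:.2f}x")
```

On paper the ratio is roughly 2.5x per socket rather than exactly double, so "double" is, if anything, a conservative reading of the peak numbers.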

[Quote]
It is that funny because it shows you are extremely critical for someone else, but for your own posting you are pretty sloppy.
4 Sockets x 2 Channels x DDR800 * 8 Byte/channel= 51.2 GB/s is the theoretical maximum.
[/Quote]

Divide by two please!
Anyone can go to Wikipedia and find out the theoretical memory bandwidth, but only the hardware designer and software engineers know that for each memory operation, it needs both a 64bit address and 64bit data. So the maximum theoretical DATA throughput, as measured by something like the Stream benchmark, is half of the theoretical memory bandwidth, because the address goes over the same memory bus. Look at the quad Shanghai's Stream benchmark result of 25GB/sec, which is precisely the maximum quad socket dual channel DDR2-800 can offer. What magic did AMD pull off to suddenly get an extra 17GB/sec of memory bandwidth out of dual channel DDR2-800? I have thought about this more, and I think I have found out what game AMD is playing. Just read the comments below after I finish defending myself here.

[Quote]
In case of virtualization (and that is what I was talking about), it is a lot easier to make software scale out or scale up. For example, a badly scaling PHP site (hard to scale up) could be divided into several VMs, and be one large NLB cluster. In this way you can both scale out (few VMs on many servers) and up (many VMs on few 4S-8S servers).
[/Quote]
Using VMs to scale up because you can't write a PHP program that scales up is a stupid, temporary solution. I really don't want to hear the argument that AMD is better at virtualization anymore. They WERE better, but not anymore since the release of Dunnington and Nehalem. AMD HAD a stronghold in VMs simply because of the cost of VMware ESX licensing, which made 4S AMD the only viable hardware to deploy on. But Dunnington already changed that. So will Nehalem-EP and, soon, Nehalem-EX.

[Quote]And you are the ultimate judge on that right? And yet I have already demonstrated 4 serious errors in your reasoning. And I didn't even talk about your completely off 37 GB/s (6/4 * 25 GB/s) calculation in your first post. As if Stream scales with the number cores.... (Remember Dunnington??)
[/Quote]
What 4 errors? You mean 4 facts you thought were errors? How is my 37GB/sec calculation wrong when it is only 10% off the AMD's internal Stream Benchmark? That number assumes linear scaling, which is pretty close. Stream does not scale with cores, but it does scale with the memory controller.

[Quote]
I hope that you can keep the discussion more respectful instead of always jumping the gun. I have been in this business for 10 years, and I have always learned a lot from good debates. So I have no problem with people pointing out in a respectful manner that I made a technical or reasoning error.
[/Quote]

I have no trouble being rude to people who compare 4S AMD to 2S Nehalem and claim that 4S Istanbul is OMG winning against 2S Nehalem. It is simply human stupidity. The fact that you have been in this business for 10 years only amplifies how true that statement is. You should know why you did something wrong.

Now that I have defended my position, I have something more to add that you, Johan, aren't going to enjoy reading.

The 41GB/sec Stream benchmark isn't and can't be from Nvidia 3600 chipset. It can't be done on Quad Socket Dual Channel DDR2-800. The only possible explanation you can find is that this benchmark is done on Dual Channel DDR3-1333. That is, Istanbul is a Phenom II with a dual channel DDR3-1333 controller and two extra cores. The fact that the Phenom II design had both DDR2 and DDR3 controllers makes this all possible.

The benchmark is done with Dual Channel DDR3-1333. So quad Istanbul gives you 8 channels of DDR3-1333, compared to dual Nehalem-EP, which has 6 channels of DDR3-1333 in total. That's why you get 41GB/sec vs Nehalem-EP's 34GB/sec. The math would then tell you that AMD's implementation of the DDR3 controller is 10% worse in performance than Intel's. Anyway, that also means the benchmark was done on AMD's own HT3.0-enabled chipset supporting DDR3. So Istanbul is likely going to be a complete platform change rather than a drop-in replacement, which is a major risk for AMD.
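The "10% worse" figure follows from dividing each quoted Stream result by the channel count the comment assumes. A minimal sketch of that arithmetic is below. The DDR3 configuration for Istanbul is the commenter's guess, not a confirmed fact, and the variable names are made up for illustration.

```python
# Per-channel Stream throughput implied by the figures quoted in the comment.
# Assumption (the commenter's, unverified): the 41 GB/s result came from
# 4 sockets x 2 channels, and the 34 GB/s result from 2 sockets x 3 channels.

istanbul_stream_gbs = 41.0
nehalem_stream_gbs = 34.0

istanbul_per_channel = istanbul_stream_gbs / (4 * 2)
nehalem_per_channel = nehalem_stream_gbs / (2 * 3)
deficit = 1 - istanbul_per_channel / nehalem_per_channel

print(f"Istanbul:   {istanbul_per_channel:.3f} GB/s per channel")
print(f"Nehalem-EP: {nehalem_per_channel:.3f} GB/s per channel")
print(f"Per-channel deficit: {deficit:.1%}")
```

The per-channel gap works out to a little under 10%, so the arithmetic is consistent with itself. It says nothing about whether the underlying DDR3 assumption is correct.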

Now, Istanbul is also a drop-in replacement for Socket F on the DDR2-800 platform because it also has a DDR2 controller, similar to how Phenom II can use both DDR2 and DDR3. But memory performance with 4 DIMMs per socket of DDR2-800 will be limited, let alone with the 8 DIMMs per socket DDR2-533 downclock.

The game AMD is playing is that it wants the review sites to publish benchmarks using a complete DDR3 platform, which nobody right now knows is mature enough yet. But AMD really wants to sell to the existing Barcelona Socket F upgrade market. See? Advertise DDR3-based performance, but then try to fudge that performance onto the DDR2-based platform, where memory bandwidth will be cut in half. And that is the Dubai game.

Of course no one knows how it is going to play out yet, and my opinion, Johan, is that you should only push benchmarks when you have retail CPUs in hand.

One final point I want to make is this: adding cores isn't a solution to the performance per watt deficiency of the AMD CPU design. Even if performance scales linearly with the cores, so does the power they take. So performance per watt actually stays the same. In fact, performance per watt per dollar actually goes down because a 6-core part will cost more than a 4-core part. Are you going to choose a 6-core Istanbul with a 120W TDP over a 95W TDP Nehalem-EP knowing that it is also 50% slower? I wouldn't. Performance per watt tells you that you shouldn't. Anyone can CTRL-C and CTRL-V a whole bunch of cores, but that does not solve the finer point that AMD's cores are at half the performance/watt of Nehalem cores. Remember that Intel had to compete on performance/watt against AMD when it had the FSB/FB-DIMM handicap. Now that the handicaps are removed, that's how you get a 100% boost in performance/watt. This isn't something AMD can engineer overnight.
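The performance per watt argument above can be written out as a small calculation. This is a sketch under the commenter's own assumptions: performance and power both scale linearly with core count, TDP stands in for actual power draw, and "50% slower" is read as half the performance. The per-core numbers and function name are purely illustrative, not measurements.

```python
import math

# Assumption (from the comment): performance and power both scale linearly
# with core count, ignoring shared uncore power. Units are arbitrary.

def perf_per_watt(cores, perf_per_core, watts_per_core):
    """Return performance per watt for a chip with identical cores."""
    return (cores * perf_per_core) / (cores * watts_per_core)

quad = perf_per_watt(4, perf_per_core=1.0, watts_per_core=20.0)
hexa = perf_per_watt(6, perf_per_core=1.0, watts_per_core=20.0)
print(f"Quad-core: {quad:.3f}  Hex-core: {hexa:.3f}")

# The hypothetical choice posed in the comment: a 120 W part at half the
# performance of a 95 W part (TDP used as a stand-in for power draw).
istanbul_ppw = 0.5 / 120.0
nehalem_ppw = 1.0 / 95.0
relative = istanbul_ppw / nehalem_ppw
print(f"Istanbul perf/watt relative to Nehalem-EP: {relative:.2f}")
```

Under strictly linear scaling the core count cancels out, which is the commenter's point. In practice shared components such as the memory controller do not scale with core count, so the real picture can differ from this idealized model.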

Johan, in the bigger picture, you are just getting paid to write "notes" to the hardware community based on what AMD's PR department wants the public to believe. Having been in the hardware scene for that long, you should have the intelligence to tell which benchmarks are true, and which benchmarks are fluff.

If you want to continue discussing, go straight ahead. I would actually contact your AMD PR representative and ask how they got the 41GB/sec Stream benchmark. Or did you get that information from Youtube?

Whether you are right or wrong, this isn't the way to convince someone. It tends to make people think that you aren't thinking clearly and they tend to just assume you are wrong.

[Quote] ... which indicates that you don't have data to back up your assumption that the extra improvements like HT assist can overcome the 100% performance per watt advantage of Nehalem-EP vs Shanghai. [/Quote]

I seem to remember reading that it might be able to compete with the more common 40% advantage. The reason he said "So by your reasoning Nehalem always has a 100% advantage?" is because that's the case you are constantly bringing up.

[Quote] ... only the hardware designer and software engineers know that for each memory operation, it needs both a 64bit address and 64bit data [/Quote]

First, there are separate address and data lines (generally of different bit widths) going to each DDR chip. If you'd ever designed a hardware memory interface you'd know this.

Second, the way memory works in practice (simplified) is it takes a single memory address and bursts out data sequentially starting from that address. The reason you can't hit theoretical rates is that there is access latency associated with each operation. (Do a search on Column Access Strobe and keep reading about associated memory timings ... read again ... come back in a month) Bursting data was an idea implemented long long ago to partially hide latencies like this. It should be noted, though, that the number of data bits you can burst is still limited (particularly if the data required isn't stored linearly).

Third, making absolute generalizations like "only the hardware designer and software engineers know ..." is presumptuous. In my experience, most software engineers I know don't really know much about the hardware outside of how it affects their code. I.E. they do an exceptional job of designing code to take advantage of n cores with a maximum memory bandwidth of X GB/s and an average latency of Z ns, but they couldn't look at a system and tell you its bandwidth and latency characteristics. There are obviously exceptions. Are you by chance a software engineer?
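For reference, the theoretical peak under discussion, and the fraction of it that each quoted Stream figure represents, can be worked out directly. This is a minimal sketch that assumes an 8-byte data bus per channel with separate address lines, as in the 51.2 GB/s calculation quoted earlier in the thread. The function name is made up for illustration.

```python
# Theoretical peak for a quad socket, dual channel DDR2-800 system, and the
# share of that peak represented by each Stream figure quoted in the thread.
# Assumption: 64-bit (8-byte) data bus per channel, separate address lines.

def peak_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_s,
                       bytes_per_transfer=8):
    """Return the aggregate theoretical peak bandwidth in GB/s."""
    return (sockets * channels_per_socket * mega_transfers_per_s
            * bytes_per_transfer / 1000.0)

peak = peak_bandwidth_gbs(4, 2, 800)
efficiency = {measured: measured / peak for measured in (25.0, 41.0)}

print(f"Theoretical peak: {peak:.1f} GB/s")
for measured, fraction in efficiency.items():
    print(f"{measured:.0f} GB/s is {fraction:.0%} of peak")
```

Neither quoted figure exceeds the 51.2 GB/s peak. The 25 GB/s result sits at roughly half of it and the 41 GB/s result at roughly 80%, which is the kind of gap that access latency, rather than a shared address bus, would explain.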

[Quote] How is my 37GB/sec calculation wrong when it is only 10% off the AMD's internal Stream Benchmark? [/Quote]

Your 37GB/s is a theoretical maximum. If they are indeed getting 41GB/s in a real world application with the system in the article, then being off by only 10% isn't any less wrong as it exceeds the maximum.
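The extrapolation being disputed is the (6/4 * 25 GB/s) calculation mentioned earlier in the thread. A short sketch of it follows, with variable names chosen for illustration; it only restates the arithmetic and does not settle whether Stream bandwidth scales with core count at all.

```python
# The linear extrapolation quoted in the thread: scale the quad-core
# Shanghai Stream result by the core count ratio.

shanghai_stream_gbs = 25.0   # quad Shanghai figure quoted in the thread
amd_istanbul_gbs = 41.0      # AMD's figure quoted in the thread

extrapolated = 6 / 4 * shanghai_stream_gbs
overshoot = amd_istanbul_gbs / extrapolated - 1

print(f"Linear extrapolation: {extrapolated:.1f} GB/s")
print(f"AMD's figure exceeds it by {overshoot:.1%}")
```

If the extrapolated number is treated as a best case, a measured result about 9% above it means the model is missing something, not that the model was nearly right.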

[Quote] The 41GB/sec Stream benchmark isn't and can't be from Nvidia 3600 chipset. It can't be done on Quad Socket Dual Channel DDR2-800. The only possible explanation you can find is that this benchmark is done on Dual Channel DDR3-1333. [/Quote]

This is contradicted by the following.

[Quote] To better understand this, we combined our own stream benchmarking with the one that AMD presented. All AMD systems are using DDR-2 800. [/Quote]

Am I to understand that your real issue is the honesty of not just AMD but the author? While I don't have the same misgivings about the author, I do agree with the idea that suppositions should be backed with evidence. What is presented here is hardly more than theory.

However, this is only a blog post. The only real purpose of this is to let people know that Istanbul is more significant than adding two cores to Shanghai. Detailed analysis should be saved for the detailed review. Whether AMD is trying to pull the wool over our eyes or not will be obvious then.

I'm not sure who you think you are, but you have no idea what you are talking about on this STREAM BW / DRAM topic. Johan's math is right. Yours is not. I won't be surprised if you respond to this post claiming (yet again) that you are right . . . but it just isn't so. Save yourself some embarrassment, admit you made a mistake, and learn something from the valuable feedback you've received from this discussion so that future dialogue can be enhanced.

A single error in math is understandable, everyone makes mistakes. But when you continue to make the same error over and over (even when it's been pointed out) you just look like a fool.

Yeah, the 4-socket 32-core "Beckton" Xeon platform will be absolutely insane! It has quad channel memory and FB-DIMMs, so it should support an enormous amount of memory. Coming in H1 2010, I believe.

Your comment "Yes, 4S blades exist, but as IBM's numbers show, they are not on the radar when it comes to consolidating VMs." has no merit.

The Chart simply says "Mostly Racks". That means there are some blades (and "some" in the grand scale of things is still fairly mind boggling).

The fact is I saw a quote come across my desk for another group of IT people at my office that was for 16 x HP BL680c maxed out with each server containing 4 x Xeon 7450s and 128GB of RAM. They were also able to put 8 Gig NICs and 2 4Gbps FC ports in each blade. This was 1 order of a large group of orders for a new DataCenter that has been designed from the ground up to be treated as a hosting facility and VMs are a large portion of that technology stack.

So while I would tend to agree that 4P Servers are "mostly" Rack Models, to say they are "not on the radar" is a fallacy.
You also need to give the blades a little more time. Rack servers have been around for a very long time. Blades only reached a really mature level a few years ago. Companies usually replace their infrastructure, especially the big servers hosting databases, SAP, VMware and such as you mentioned, only every 4-5 years.

So if you have some real numbers, show them.
But keep the conjecture to a minimum if you don't.

Are you trying to say that although AMD's 4S system may be somewhat faster than the Intel 2S, the main reason that you foresee people picking the 4S is due to a vastly increased amount of memory capacity and design continuity with current installations?

I thought that the newest Intel server chips were using the same architecture as the Core i7? Wouldn't that make it fairly trivial for Intel to use those same chips (with their overwhelming per socket performance) in a 4S design with similar memory capacity?

1. If the six-core Istanbul is launched early, it will have a window of opportunity to keep up with the quad-core Nehalem. This will allow AMD to be competitive (depending on the app) even in 2S.

2. Dual Nehalem (at 2.93 GHz) will be as fast or almost as fast as quad Shanghai (at 2.7 GHz) in some apps. Quad Istanbul will open up a gap, which is important if AMD is to keep its market share in 4S, a very profitable market. This will make AMD very competitive in 4S until Beckton arrives (H1 2010).

1. Don't start your statement with "if", especially "if" the if is about AMD launching a product early.

2. You still don't get it. Comparing 4S to 2S Nehalem is flawed in the first place. The fact that 2S is holding up is the miracle of the decade. Quad Istanbul is unlikely to open up the gap significantly to make AMD's 4S solution attractive.

I really want to find out what you would say "if" AMD went bankrupt in 2009 or early 2010, which is highly probable.