SOLORI’s Take: The most interesting aspect of the EX benchmark is its clock-adjusted scaling factor: between 70% and 91% versus a 2P/8-core Nehalem-EP reference (Cisco UCS, B200 M1, 25.06@17 tiles). The unpredictable nature of Intel’s “turbo” feature – varying with thermal loads and per-core conditions – makes an exact clock-for-clock comparison difficult. However, if the scaling factor is 90%, the EX blows away our previous expectations about the platform’s scalability. Where did we go wrong when we predicted a conservative 44@39 tiles? We’re looking at three things: (1) a bad assumption about the effectiveness of “turbo” in the EP VMmark case (setting Ref_EP_Clock to 3.33 GHz), (2) underestimating EX’s scaling efficiency (assumed 70%), and (3) assuming a 2.26GHz clock for EX.

Correcting for the as-tested clock/turbo numbers, and using AMD’s 2P-to-4P VMmark scaling efficiency of 83%, and shifting to the new UCS baseline (with newer ESX version) the Nehalem-EX prediction factors to:

Clearly, this approach either overestimates the scaling efficiency or underestimates the “turbo” mode. IBM claims that a 2.93 GHz “turbo” setting is viable where Intel suggests 2.67 GHz is the maximum, so there is a source of potential bias. Looking at the tiles-per-core ratio of the VMmark result, the Nehalem-EX drops from 2.13 tiles per core on EP/2P platforms to 1.5 tiles per core on EX/4P platforms – about a 30% drop in per-core loading efficiency. That indicator matches well with our initial 75% scaling efficiency moving from 2P to 4P – something that AMD demonstrated with Istanbul last August. Given the high TDP of EX and IBM’s 2.93 GHz “turbo” specification, it’s possible that “turbo” is adding clock cycles (and power consumption) and compensating for a “lower” scaling efficiency than we’ve assumed. Looking at the same estimation with 2.93GHz “clock” and 71% efficiency (1.5/2.13), the numbers fall in line with VMmark:

This gives us a good basis for evaluating 2P vs. 4P Nehalem systems: a scaling factor of 71% and the ability to push clock towards the 3GHz mark within its thermal envelope. Both of these conclusions fit typical 2P-to-4P norms and Intel’s process history.
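The tiles-per-core comparison above can be checked with a few lines of Python. This is only a sketch using the figures quoted in the text (17 tiles on 8 EP cores, 1.5 tiles per core on EX); the variable names are ours, and it does not attempt to reproduce the full clock-adjusted prediction.

```python
# Figures quoted in the text above; names are illustrative only.
EP_TILES, EP_CORES = 17, 8      # 2P Nehalem-EP reference (25.06@17 tiles)
EX_TILES_PER_CORE = 1.5         # 4P Nehalem-EX, as quoted


def tiles_per_core(tiles, cores):
    """Per-core loading: VMmark tiles divided by physical cores."""
    return tiles / cores


ep_ratio = tiles_per_core(EP_TILES, EP_CORES)
efficiency = EX_TILES_PER_CORE / ep_ratio   # 2P-to-4P per-core efficiency
drop = 1.0 - efficiency                     # loss in per-core loading

print(f"EP tiles/core: {ep_ratio:.3f}")
print(f"EX/EP efficiency: {efficiency:.1%}, drop: {drop:.1%}")
```

Using the unrounded 2.125 tiles per core instead of 2.13 shifts the efficiency only slightly, and it still rounds to the 71% used above.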

That’s nowhere near good enough to top the current 8P, 48-core Istanbul VMmark at 53.73@35 tiles, so we’ll likely have to wait for faster 6100 parts to see any new AMD records. However, assuming AMD’s proposition is still “value 4P,” about 200 VM’s at under $18K/server gets you around $90/VM or less.

It’s my second day at the beautiful Mandalay Bay in Las Vegas, Nevada and VMware PartnerExchange 2010. Yesterday was filled with travel and a generous “Tailgate Party” with burgers, dogs, beverages and lots of VMware geeks! I managed to catch the last quarter of the game from the Mandalay Bay Poker Room where I added to my chip stack at the 1/2 No-Limit Texas Hold ‘Em tables. Then it was early to bed – about 9PM PST – where I studied for the upcoming VCP410 exam.

Today (Monday) was occupied with a partners-only VMware Certified Professional, Version 4, Preparation Course which outlined the VCP4 Blueprint, question examples and test-taking strategies. The “best answer,” multiple-choice format of the VCP410 exam promises to offer me some challenges as I apply black-and-white logic to a few shades-of-grey questions. The best strategy to overcome such an obstacle: read the question in its entirety, eliminate all wrong answers, then choose the answer(s) that best satisfy the entire question. A key example is this from the on-line “mock-up” exam:

What is the maximum number of vNetwork switch ports per ESX host and vCenter Server instance?

Well, it might have been obvious that “c” is the “correct” answer, but “a” is right off of Page 6 of the vSphere Configuration Maximums guide. Both are solidly “correct” answers; it’s just that “c” speaks to both the ESX question and the vCenter question, making it more correct. However, neither is completely correct since vDS ports are bound by vCenter and ESX host, while vSS ports are bound only by ESX host. Since neither answer “a” nor “c” specifies which limitation it is answering – host or vCenter – it is left to subjective reasoning to infer the intent. According to Jon Hall (VMware, Florida) the most ports any vNetwork switch can have in any one host is 4,088 – regardless of type. Therefore, to reach the “total virtual network ports per host (vDS and vSS ports)” maximum, at least one switch of each type must exist. Alone, each type can only reach 4,088 ports; however, the Configuration Maximums document never spells this out for the vNetwork Distributed Switch. Hopefully this exception will be foot-noted in the next revision of the document. [Note: the additional information about vDS type vNetwork switches that Jon provided logically invalidates “a” as a response.]

Following the VCP4 Prep Course, I “recharged” in the Alumni Lounge. VMware had snacks and drinks to quell the appetite and lots of power outlets to restore my iPhone and laptop. While I waited, I contacted the wife and got the 4-1-1 on our baby, checked e-mail and ran through the “mock-up” exam a couple of times. Then it was off to the Welcome Reception at the VMware Experience Hall where sponsors and exhibitors had their wares on display.

iPhone Screen Capture of the ESX Host Running Nehalem-EX, 4P/32C/64T

Just inside the Hall – across from the closest beverage station – was Intel’s booth and the boys in blue were demonstrating vMotion over 10GE NICs. Yes, it was fast (as you’d expect) but the real kick was the “upcoming” 10GE Base-T adapters to challenge the current price-performance leader: the 10GE Base-CR (also supporting SFP+). At under $400/port for 10GE, it’s hard to remember a reason for using 1Gbps NICs… Oh yes, the prohibitive per-port cost of 10GE switches. Arista Networks to the rescue???

Intel was also showing their “modular server” system. Unfortunately, the current offering doesn’t allow for SAS JBOD expansion in a meaningful way (read: running NexentaStor on one/two of the “blades”), but after discussing the issue of SAS/love with the guys in the blue booth, interests were piqued. Evan, expect a call from Intel’s server group… Seriously, with 14x 2.5″ drives in a SAS Expander interconnected chassis, NexentaStor + SSD + 15K SAS would rock!

Last but not least, Intel was proudly showing their 4P Nehalem-EX running VMware ESX with 512GB of RAM (DDR3) and demonstrating 64 active threads (pictured). This build-out offers lots of virtualization goodness at a heretofore unknown price point. Suffice to say, at 1.8GHz it’s not a screamer, but the RAS features are headed in the right direction. When you rope together 64 threads (about 125-250 VM’s) and 1TB worth of VM’s (yes, 1TB RAM – about $250K worth using “on-loan Samsung parts”) you are talking about a lot of “eggy in the basket.” By enhancing the RAS capabilities of these giant systems, component failure mitigation is becoming less catastrophic – eventually allowing only a few VM’s to be impacted by a point failure instead of ALL running VM’s on the box.

vCenter ESX Host Status Showing 512GB of RAM

In case you haven’t seen an ESX host with 512GB of available RAM, check out this screen capture (excuse the iPhone quality) to the right. That’s about $33K worth of DDR3 memory sitting in that box and, assuming that the EX processors run $2K apiece and giving $6K for the remainder of the system, that’s nearly $6K/VM in this demo: fantastically decadent! Of course – and in all due fairness to the boys in blue – VM density was not the goal in this demonstration: RAS was, and the 2-bit error scrubbing – while as painful as watching paint dry – is pretty cool and soon to be needed (as indicated above) for systems with this capacity.

Other vendors visited were Wyse and Xsigo. The boys in yellow (Wyse) were pimping their thin/zero clients with some compelling examples of PCoIP (Wyse 20p) and MMR (Wyse r90lew). The PCoIP demos featured end-to-end hardware Teradici cards displaying clips from Avatar, while the MMR demo featured 720p movie clips from an IMAX cut of dog fight training. While the PCoIP was impressive and flawless, the upcoming MMR enhancements – while flawed in the beta I saw – were nothing short of impressive.

Considering that the MMR-capable thin client was running a 1.5GHz AMD Sempron, the 720p Windows Media stream looked all the better. Looking back at the virtual machine from the ESX console, only about 10-15% of a core was being consumed to “render” the video. But that’s the beauty of MMR: redirect the processor-intensive decoding to the end-point and just send the stream un-decoded. While PCoIP is a win in LANs with knowledge workers and call center applications, the MMR-based thin clients look pretty good for Education and YouTube-happy C-level employees looking to catch up on their Hulu…

“wPrime uses a recursive call of Newton’s method for estimating functions, with f(x)=x^2-k, where k is the number we’re sqrting, until Sgn(f(x)/f'(x)) does not equal that of the previous iteration, starting with an estimation of k/2. It then uses an iterative calling of the estimation method a set amount of times to increase the accuracy of the results. It then confirms that n(k)^2=k to ensure the calculation was correct. It repeats this for all numbers from 1 to the requested maximum.”
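For readers who want to see the quoted procedure in motion, here is a minimal single-threaded Python sketch. It is not wPrime’s actual code: the function names, the number of refinement passes, the iteration cap and the verification tolerance are all our assumptions, and the original’s recursion is written here as a plain loop.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def newton_sqrt(k, refine_steps=5, max_steps=100):
    """Estimate sqrt(k) with Newton's method on f(x) = x^2 - k."""
    x = k / 2.0                          # starting estimate, as quoted
    previous = None
    for _ in range(max_steps):           # cap is our own safety net
        step = (x * x - k) / (2.0 * x)   # f(x) / f'(x)
        current = sign(step)
        if current == 0 or (previous is not None and current != previous):
            break                        # sign changed (or hit the root)
        x -= step
        previous = current
    for _ in range(refine_steps):        # fixed extra passes for accuracy
        x -= (x * x - k) / (2.0 * x)
    return x


def verify_range(maximum, tolerance=1e-9):
    """Check that n(k)^2 = k, within a tolerance, for k = 1..maximum."""
    for k in range(1, maximum + 1):
        root = newton_sqrt(float(k))
        if abs(root * root - k) > tolerance * k:
            return False
    return True


print(newton_sqrt(2.0))
print(verify_range(1000))
```

The real benchmark spreads the 1-to-maximum range across threads, which is why core count and clock dominate the results discussed here.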

SOLORI’s Take: As a “reality check” we can compare the reigning quad-socket, quad-core Opteron 8393 SE result in wPrime 32M and wPrime 1024M at 3.90 and 89.52 seconds, respectively. Adjusted for clock and core count versus its Shanghai cousin, the Magny-Cours engineering samples – at 3.54 and 75.77 seconds, respectively – turned in times about 10% slower than our calculus predicted. While still “record breaking” for 2P systems, we expected the Magny-Cours/Istanbul cores to out-perform Shanghai clock-per-clock – even at this stage of the game.
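One plausible way to make that clock-and-core adjustment is to scale the reference time by aggregate core-GHz. The sketch below is an illustration rather than the exact calculus used here: it assumes perfect scaling, a 3.1GHz clock for the 16-core Opteron 8393 SE system, and 24 Magny-Cours cores all holding 2.6GHz, none of which is confirmed by the test notes.

```python
def predicted_time(ref_time, ref_cores, ref_ghz, new_cores, new_ghz):
    """Scale a reference time by aggregate core-GHz (perfect scaling assumed)."""
    return ref_time * (ref_cores * ref_ghz) / (new_cores * new_ghz)


# Assumptions, not figures from the test notes:
REF_CORES, REF_GHZ = 16, 3.1   # 4P quad-core Opteron 8393 SE
NEW_CORES, NEW_GHZ = 24, 2.6   # 2P Magny-Cours, 12 cores per socket

# (reference seconds, measured Magny-Cours seconds), as quoted in the text
results = {"32M": (3.90, 3.54), "1024M": (89.52, 75.77)}

gaps = {}
for name, (ref_time, measured) in results.items():
    expected = predicted_time(ref_time, REF_CORES, REF_GHZ, NEW_CORES, NEW_GHZ)
    gaps[name] = measured / expected - 1.0   # positive means slower than predicted
    print(f"wPrime {name}: expected {expected:.2f}s, "
          f"measured {measured:.2f}s, gap {gaps[name]:+.1%}")
```

Under these assumed clocks the two gaps bracket the roughly 10% shortfall mentioned above; a different as-tested clock would move both numbers.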

Due to the multi-threaded nature of the wPrime benchmark, it is likely that the HT Assist feature – enabled in a 2P Magny-Cours system by default – is the cause of the discrepancy. By reducing the available L3 cache by 1MB per die – 4MB of L3 cache total – HT Assist actually could be creating a slow-down. However, there are several things to remember here:

These are engineering samples qualified for 1.7GHz operation

Speed enhancements were performed with tools not yet adapted to Magny-Cours

The author indicated a lack of control over AMD’s Cool ‘n Quiet technology which could have made “as tested” core clocks somewhat lower than what CPUz reported (at least during the extended tests)

It is speculated that AMD will launch Magny-Cours at 2.2GHz (top bin), making the 2.6+ GHz results non-typical

[Ed: The 10% difference is likely due to the fact that the author was unable to get “more than one core” clocked at 3.0GHz. Likewise, he was uncertain that all cores were reliably clocking at 2.6GHz for the longer wPrime tests. Again, this engineering sample was designed to run at 1.7GHz and was not likely “hand picked” to run at much higher clocks. He speculated that some form of dynamic core clocking linked to temperature was affecting clock stability – perhaps due to some AMD-P tweaks in Magny-Cours.]

Using the same 8-processor HP ProLiant DL785 G6 platform as in the previous run – complete with 2.8GHz AMD Opteron 8439 SE 6-core chips and 256GB DDR2/667 – the new score comes with significant performance bumps in the javaserver, mailserver and database results achieved by the same system configuration as the previous attempt – including the same ESX 4.0 version (164009). So what changed to add an additional 5 tiles to the team’s run? It would appear that someone was unsatisfied with the storage configuration on the mailserver run.

Given that the tile ratio of the previous run ran about 6% higher than its 24-core counterpart, there may have been a small indication that untapped capacity was available. According to the run notes, the only reported change to the test configuration – aside from the addition of the 5 LUNs and 5 clients needed to support the 5 additional tiles – was a notation indicating that the “data drive and backup drive for all mailserver VMs” were repartitioned using AutoPart v1.6.

The change in performance numbers effectively reduces the virtualization cost of the system by 15% to about $257/VM – closing-in on its 24-core sibling to within $10/VM and stretching-out its lead over “Dunnington” rivals to about $85/VM. While virtualization is not the primary application for 8P systems, this demonstrates that 48-core virtualization is definitely viable.
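The cost-per-VM arithmetic can be sketched as follows. The assumptions are ours: six VMs per tile (the VMmark 1.x tile layout), an unchanged system price between runs, and a previous run of 30 tiles inferred from the 5 added tiles and the 35-tile result.

```python
VMS_PER_TILE = 6              # assumption: VMmark 1.x tile of six VMs
NEW_TILES = 35                # tiles in the new run
OLD_TILES = NEW_TILES - 5     # inferred: five tiles were added
NEW_COST_PER_VM = 257.0       # dollars per VM, as quoted above

# With a fixed system price, cost per VM scales inversely with VM count.
system_cost = NEW_COST_PER_VM * NEW_TILES * VMS_PER_TILE
old_cost_per_vm = system_cost / (OLD_TILES * VMS_PER_TILE)
reduction = 1.0 - NEW_COST_PER_VM / old_cost_per_vm

print(f"Implied system cost: ${system_cost:,.0f}")
print(f"Previous cost per VM: ${old_cost_per_vm:,.0f}")
print(f"Reduction: {reduction:.1%}")
```

The reduction depends only on the tile ratio (30/35), so it lands a little over 14% whatever the system price, in line with the rounded 15% quoted above.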

SOLORI’s Take: HP’s performance team has done a great job tuning its flagship AMD platform, demonstrating that platform performance is not just related to hertz or core-count but requires balanced tuning and performance all around. This improvement in system tuning demonstrates an 18% increase in incremental scalability – approaching within 3% of the 12-core to 24-core scaling factor, making it actually a viable consideration in the virtualization use case.

In recent discussions with AMD about the SR5690 chipset applications for Socket-F, AMD re-iterated that the mainstream focus for SR5690 has been Magny-Cours and the Q1/2010 launch. Given the close relationship between Istanbul and Magny-Cours – detailed nicely by Charlie Demerjian at Semi-Accurate – the bar is clearly fixed for 2P and 4P virtualization systems designed around these chips. Extrapolating from the similarities and improvements to I/O and memory bandwidth, we expect to see 2P VMmarks besting 32@23 and 4P scores over 54@39 from HP, AMD and Magny-Cours.

SOLORI’s 2nd Take: Intel has been plugging away with its Nehalem-EX for 8-way systems and – delivering 128-threads – promises to deliver some insane VMmarks. Assuming Intel’s EX scales as efficiently as AMD’s new Opterons have, extrapolations indicate performance for the 4P, 64-thread Nehalem-EX should fall between 41@29 and 44@31 given the current crop of speed and performance bins. Using the same methods, our calculus predicts an 8P, 128-thread EX system should deliver scores between 64@45 and 74@52.

With EX expected to clock at 2.66GHz with 140W TDP and AMD’s MCM-based Magny-Cours doing well to hit 130W ACP in the same speed bins, CIO’s balancing power and performance considerations will need to break-out the spreadsheets to determine the winners here. With both systems running 4-channel DDR3, there will be no power or price advantage given on either side to memory differences: relative price-performance and power consumption of the CPU’s will be major factors. Assuming our extrapolations are correct, we’re looking at a slight edge to AMD in performance-per-watt in the 2P segment, and a significant advantage in the 4P segment.

What does this mean for AMD and the only 6-core shipping today? Since Intel’s still projecting Q2/2010 for the server part, AMD has a decent opportunity to grow market share for Istanbul. Intel’s biggest rival will be itself – facing a wildly growing number of SKUs across its i-line from i5, i7, i8 and i9 “families” with multiple speed and feature variants. Clearly, the non-HT version would stand as a direct competitor to Istanbul’s native 6-core SKUs. Likewise, Istanbul may be no match for the 6-core Nehalem with HT and “turbo core” feature set.

However, with an 8-core “Beckton” Nehalem variant on the horizon, it might be hard to understand just where the Gulftown fits in Intel’s picture. Intel faces a serious production issue, filling fab capacity with 4-core, 6-core and 8-core processors, each with speed, power, socket and HT variants from which to supply high-speed, high-power SKUs and lower-speed, low-power SKUs for 1P, 2P and 4P+ destinations. Doing the simple math with 3 SKU’s per part Intel would be offering the market a minimum of 18 base parts according to their current marketing strategy: 9 with HT/turbo, 9 without HT/turbo. For socket LGA-1366, this could easily mean 40+ SKUs with 1xQPI and 2xQPI variants included (up from 23).
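The “simple math” behind the 18 base parts can be enumerated directly. In this sketch the three bin labels are hypothetical placeholders for “3 SKUs per part”; only the core counts and the HT/turbo split come from the text.

```python
from itertools import product

core_counts = (4, 6, 8)              # 4-core, 6-core and 8-core dies
bins = ("high", "mid", "low")        # hypothetical labels: 3 SKUs per part
ht_turbo = (True, False)             # with and without HT/turbo

# Every combination of core count, bin and HT/turbo setting is one base part.
base_parts = list(product(core_counts, bins, ht_turbo))

with_ht = sum(1 for _, _, enabled in base_parts if enabled)
without_ht = len(base_parts) - with_ht

print(f"Base parts: {len(base_parts)} "
      f"({with_ht} with HT/turbo, {without_ht} without)")
```

Adding one more axis, such as 1xQPI versus 2xQPI or socket type, multiplies the count again, which is how an LGA-1366 line-up could plausibly pass 40 SKUs.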

SOLORI’s take: Intel will have to create some interesting “crippling or pricing tricks” to keep Gulftown from cannibalizing the Gainestown market. If they follow their “normal” play book, we predict the next 10 months will play out like this:

Initially there will be no 8-core product for 1P and 2P systems (LGA-1366), allowing for artificially high margins on the 8-core EX chip (LGA-1567), slowing the inevitable cannibalization of the 4-core/2P market, and easing production burdens;

Gulftown will remain high-power (90-130W TDP) and be positioned against AMD’s G34 systems and Magny-Cours – plotting 12-core against 12-thread;

Intel creates a “socket refresh” (LGA-1566?) to enable “inexpensive” 2P-4P platforms from its Gulftown/Beckton line-up in 2H/2010 (ostensibly to maintain parity with G34) without hurting EX;

Revised, lower-power variants of Gainestown will be positioned against AMD’s C32 target market;

Intel will cut SKUs in favor of higher margins, increasing speed and features for “same dollar” cost;

Non-HT parts will begin to disappear in 4-core configurations completely;

Intel’s AES enhancements in Gulftown will allow it to further differentiate itself in storage and security markets.

It would be a mistake for Intel to continue growing SKU count or provide too much overlap between 4-core HT and 6-core non-HT offerings. If purchasing trends soften in 4Q/09 and remain (relatively) flat through 2Q/10, Intel will benefit from a leaner, well differentiated line-up. AMD has already announced a “leaner” plan for G34/C32. If all goes well at the fabs, 1H/2010 will be a good ole fashioned street fight between blue and green.

We expect to hear more news about Istanbul’s availability in keeping with Tyan’s upcoming announcement next week. Based on current technology and economic trends, Istanbul and G34 could offer AMD a solid one-two punch to counter Intel’s relentless “tick-tock” pace. With Nehalem server sales weak despite early expectations and compounding economic pressures, market timing may be more ideally suited for AMD’s products than Intel’s for a change. As Gartner puts it, “the timing of Nehalem is a bit off, and it probably won’t make much of an impact this year.”

In the meantime, Phil Hughes at AMD has posted a personal reflection on Opteron’s initial launch, starting with the IBM e325 in 2003, and ending with Opteron’s impact on the Intel Itanium market by year-end (while resisting a reference to “the sinking of the Itanic“). Phil acknowledges Sun’s influence on Opteron and links to some news articles from 2003. See his full post, “The Sun Also Rises,” here… As 64-bit processors go, 2003 was much more the year of the Opteron rather than “the year of the Itanium” (as predicted by Intel’s Paul Otellini.)

Speaking of Itanium, TechWorld has an article outlining how Intel’s upcoming Nehalem-EX – with the addition of MCA technology derived from Itanium – could bring an end to the beleaguered proprietary platform. TechWorld cites Insight 64 analyst Nathan Brookwood as saying the new Xeon will finally break Intel’s policy of artificially crippling the x86 processor, which has prevented Xeon from being competitive with Itanium. The 8-core, SMT-enabled EX processor was being demonstrated by IBM in an 8-socket configuration.

