Most important for virtualization systems architects is how vCPU scheduling affects “measured” performance. The telling piece comes from the difference in comparison results where vCPU scheduling is equalized:

AnandTech's Quad Sockets v. Dual Sockets Comparison. Oct 6, 2009.

When comparing the results, De Gelas hits on the I/O factor which chiefly separates VMmark from vAPUS:

The result is that VMmark with its huge number of VMs per server (up to 102 VMs!) places a lot of stress on the I/O systems. The reason for the Intel Xeon X5570’s crushing VMmark results cannot be explained by the processor architecture alone. One possible explanation may be that the VMDq (multiple queues and offloading of the virtual switch to the hardware) implementation of the Intel NICs is better than the Broadcom NICs that are typically found in the AMD based servers.

This is yet another issue that VMware architects struggle with in complex deployments. The latency in “Dunnington” is a huge contributor to its downfall and to why the Penryn architecture was a dead-end. Combined with 8 additional threads in the 2P form factor, Nehalem delivers twice the number of hardware execution contexts as Shanghai, resulting in significant efficiencies for Nehalem where small working data sets are involved.

When larger sets are used – as in vAPUS – Istanbul’s additional cores allow it to close the gap to within the clock speed difference of Nehalem (about 12%). In contrast to VMmark, which implies a 3:2 advantage to Nehalem, the vAPUS results suggest a closer performance gap in more aggressive virtualization use cases.

“wPrime uses a recursive call of Newton’s method for estimating functions, with f(x)=x^2-k, where k is the number we’re sqrting, until Sgn(f(x)/f'(x)) does not equal that of the previous iteration, starting with an estimation of k/2. It then uses an iterative calling of the estimation method a set amount of times to increase the accuracy of the results. It then confirms that n(k)^2=k to ensure the calculation was correct. It repeats this for all numbers from 1 to the requested maximum.”
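The procedure in the quote can be sketched in a few lines of Python. This is only an illustration of the described steps, not wPrime's actual source: the function names, the refinement count, the iteration cap and the floating-point tolerance are all assumptions made here, and the sketch is single-threaded while the real benchmark spreads the range across threads.

```python
import math


def newton_sqrt(k, refinements=4, max_search=200):
    """Estimate sqrt(k) with Newton's method on f(x) = x^2 - k.

    Hypothetical helper: refinements and max_search are illustrative values.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if k == 0:
        return 0.0
    x = k / 2.0  # starting estimate, as described in the quote
    prev_sign = 0
    # Phase 1: step until the sign of f(x)/f'(x) differs from the previous one.
    # The iteration cap guards against floating-point stalls (an assumption).
    for _ in range(max_search):
        step = (x * x - k) / (2.0 * x)  # f(x) / f'(x)
        sign = (step > 0) - (step < 0)
        if sign == 0 or (prev_sign != 0 and sign != prev_sign):
            break
        x -= step
        prev_sign = sign
    # Phase 2: a fixed number of extra Newton steps to tighten the estimate.
    for _ in range(refinements):
        x -= (x * x - k) / (2.0 * x)
    return x


def verify_range(maximum, rel_tol=1e-9):
    """Check that root^2 is (approximately) k for every k in 1..maximum."""
    verified = 0
    for k in range(1, maximum + 1):
        root = newton_sqrt(k)
        if not math.isclose(root * root, k, rel_tol=rel_tol):
            raise AssertionError(f"verification failed for k={k}")
        verified += 1
    return verified


if __name__ == "__main__":
    print(newton_sqrt(2))
    print(verify_range(10000), "values verified")
```

Because every k is independent, the loop in `verify_range` is trivially divisible across cores, which is why a benchmark built this way scales with core count and clock speed.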

SOLORI’s Take: As a “reality check” we can compare the reigning quad-socket, quad-core Opteron 8393 SE result in wPrime 32M and wPrime 1024M at 3.90 and 89.52 seconds, respectively. Adjusted for clock and core count versus its Shanghai cousin, the Magny-Cours engineering samples – at 3.54 and 75.77 seconds, respectively – turned in times about 10% slower than our calculus predicted. While still “record breaking” for 2P systems, we expected the Magny-Cours/Istanbul cores to out-perform Shanghai clock-per-clock – even at this stage of the game.

Due to the multi-threaded nature of the wPrime benchmark, it is likely that the HT Assist feature – enabled in a 2P Magny-Cours system by default – is the cause of the discrepancy. By reducing the available L3 cache by 1MB per die – 4MB of L3 cache total – HT Assist actually could be creating a slow-down. However, there are several things to remember here:

These are engineering samples qualified for 1.7GHz operation

Speed enhancements were performed with tools not yet adapted to Magny-Cours

The author indicated a lack of control over AMD’s Cool ‘n Quiet technology which could have made “as tested” core clocks somewhat lower than what CPUz reported (at least during the extended tests)

It is speculated that AMD will launch Magny-Cours at 2.2GHz (top bin), making the 2.6+ GHz results non-typical

[Ed: The 10% difference is likely due to the fact that the author was unable to get “more than one core” clocked at 3.0GHz. Likewise, he was uncertain that all cores were reliably clocking at 2.6GHz for the longer wPrime tests. Again, this engineering sample was designed to run at 1.7GHz and was not likely “hand picked” to run at much higher clocks. He speculated that some form of dynamic core clocking linked to temperature was affecting clock stability – perhaps due to some AMD-P tweaks in Magny-Cours.]

Using the same 8-processor HP ProLiant DL785 G6 platform as in the previous run – complete with 2.8GHz AMD Opteron 8439 SE 6-core chips and 256GB DDR2/667 – the new score comes with significant performance bumps in the javaserver, mailserver and database results achieved by the same system configuration as the previous attempt – including the same ESX 4.0 version (164009). So what changed to add an additional 5 tiles to the team’s run? It would appear that someone was unsatisfied with the storage configuration on the mailserver run.

Given that the tile ratio of the previous run ran about 6% higher than its 24-core counterpart, there may have been a small indication that untapped capacity was available. According to the run notes, the only reported change to the test configuration – aside from the addition of the 5 LUNs and 5 clients needed to support the 5 additional tiles – was a notation indicating that the “data drive and backup drive for all mailserver VMs” were repartitioned using AutoPart v1.6.

The change in performance numbers effectively reduces the virtualization cost of the system by 15% to about $257/VM – closing-in on its 24-core sibling to within $10/VM and stretching-out its lead over “Dunnington” rivals to about $85/VM. While virtualization is not the primary application for 8P systems, this demonstrates that 48-core virtualization is definitely viable.
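The size of that cost reduction follows from simple arithmetic: with the system price held fixed, cost per VM falls in proportion to the added tiles. The sketch below assumes the previous run scored 30 tiles (inferred from the 62.5% tile-to-core figure quoted for the 48-core system elsewhere in this article, not stated in the run notes here); the function name is hypothetical.

```python
def cost_reduction(old_tiles, new_tiles):
    """Fractional drop in $/VM when tiles rise and system price is unchanged.

    $/VM is price / (tiles * VMs-per-tile), so the price and the VMs-per-tile
    constant cancel out and only the tile ratio matters.
    """
    if old_tiles <= 0 or new_tiles <= 0:
        raise ValueError("tile counts must be positive")
    return 1.0 - old_tiles / new_tiles


if __name__ == "__main__":
    # Assumed inputs: 30 tiles before, 5 tiles added by the new run.
    print(f"{cost_reduction(30, 35):.1%} lower cost per VM")
```

Under these assumptions the reduction works out to a little over 14%, in line with the "about 15%" quoted above.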

SOLORI’s Take: HP’s performance team has done a great job tuning its flagship AMD platform, demonstrating that platform performance is not just related to hertz or core-count but requires balanced tuning and performance all around. This improvement in system tuning demonstrates an 18% increase in incremental scalability – approaching within 3% of the 12-core to 24-core scaling factor, making it actually a viable consideration in the virtualization use case.

In recent discussions with AMD about the SR5690 chipset applications for Socket-F, AMD re-iterated that the mainstream focus for SR5690 has been Magny-Cours and the Q1/2010 launch. Given the close relationship between Istanbul and Magny-Cours – detailed nicely by Charlie Demerjian at Semi-Accurate – the bar is clearly fixed for 2P and 4P virtualization systems designed around these chips. Extrapolating from the similarities and improvements to I/O and memory bandwidth, we expect to see 2P VMmarks besting 32@23 and 4P scores over 54@39 from HP, AMD and Magny-Cours.

SOLORI’s 2nd Take: Intel has been plugging away with its Nehalem-EX for 8-way systems and – delivering 128-threads – promises to deliver some insane VMmarks. Assuming Intel’s EX scales as efficiently as AMD’s new Opterons have, extrapolations indicate performance for the 4P, 64-thread Nehalem-EX should fall between 41@29 and 44@31 given the current crop of speed and performance bins. Using the same methods, our calculus predicts an 8P, 128-thread EX system should deliver scores between 64@45 and 74@52.

With EX expected to clock at 2.66GHz with 140W TDP and AMD’s MCM-based Magny-Cours doing well to hit 130W ACP in the same speed bins, CIOs balancing power and performance considerations will need to break out the spreadsheets to determine the winners here. With both systems running 4-channel DDR3, there will be no power or price advantage given on either side to memory differences: relative price-performance and power consumption of the CPUs will be major factors. Assuming our extrapolations are correct, we’re looking at a slight edge to AMD in performance-per-watt in the 2P segment, and a significant advantage in the 4P segment.

SOLORI’s Take: While the September timing of the release might imply a G6 with AMD’s SR5690 and IOMMU, we’re doubtful that the timing is anything but a coincidence, even though such a pairing would enable PCIe 2.0 and highly effective 10Gbps solutions. The modular design of the DL785 series – with its ability to scale from 4P to 8P in the same system – mitigates the economic realities of the dwindling 8P segment, and HP has delivered the pinnacle of performance for this technology.

We are also impressed with HP’s performance team and their ability to scale Shanghai to Istanbul with relative efficiency. Moving from DL785 G5 quad-core to DL785 G6 six-core was an almost perfect linear increase in capacity (95% of theoretical increase from 32-core to 48-core) while performance-per-tile increased by 6%. This further demonstrates the “home run” AMD has hit with Istanbul and underscores the excellent value proposition of Socket-F systems over the last several years.

Unfortunately, while they demonstrate a 91% scaling efficiency from 12-core to 24-core, HP and Istanbul have only achieved a 75% incremental scaling efficiency from 24-cores to 48-cores. When looking at tile-per-core scaling using the 8-core, 2P system as a baseline (1:1 tile-to-core ratio), 2P, 4P and 8P Istanbul deliver 91%, 83% and 62.5% efficiencies overall, respectively. However, compared to the 58% and 50% tile-to-core efficiencies of Dunnington 4P and 8P, respectively, Istanbul clearly dominates the 4P and 8P performance and price-performance landscape in 2009.

In today’s age of virtualization-driven scale-out, SOLORI’s calculus indicates that multi-socket solutions that deliver a tile-to-core ratio of less than 75% will not succeed (economically) in the virtualization use case in 2010, regardless of socket count. That said – even at a 2:3 tile-to-core ratio – the 8P, 48-core Istanbul will likely reign supreme as the VMmark heavy-weight champion of 2009.

SOLORI’s 2nd Take: HP and AMD’s achievements with this Istanbul system should be recognized before we usher-in the next wave of technology like Magny-Cours and Socket G34. While the DL785 G6 is not a game changer, its footnote in computing history may well be as a preview of what we can expect to see out of Magny-Cours in 2H/2010. If 12-core, 4P system price shrinks with the socket count we could be looking at a $150/VM price-point for a 4P system: now that would be a serious game changer.

NEC’s venerable Express5800/A1160 tops the 48-core VMmark category today with a score of 34.05@24 tiles to wrest the title away from IBM who established the category back in June, 2009. NEC’s new “Dunnington” X7460 Xeon-based score represents a performance per tile ratio of 1.41 and a tile to core efficiency of 50% using 128GB of ECC DDR2 RAM.

Compared to the leading 24-core “Dunnington” results – held by IBM’s x3850 M2 at 20.41@14 tiles – the NEC benchmark sets a scalability factor of 85.7% when moving from 4-socket to 8-socket systems. Both servers from NEC and IBM are scalable systems allowing for multiple chassis to be interconnected to achieve greater CPU-per-system numbers – each scaling in 4-CPU increments – ostensibly for OLTP advantages. The NEC starts at around $70K for 128GB and 48-cores resulting in a $486/VM cost to VMmark.
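The ratios quoted for the NEC and IBM results can be reproduced with a short calculation. The sketch below uses only figures from the text (24 tiles on 48 cores at about $70K for NEC, 14 tiles on 24 cores for IBM) plus one assumption not stated in the article: that a VMmark 1.x tile consists of 6 VMs. Function names are hypothetical.

```python
VMS_PER_TILE = 6  # assumption: VMmark 1.x tile size, not stated in the article


def tile_to_core(tiles, cores):
    """Tiles sustained per physical core (1.0 would be one tile per core)."""
    return tiles / cores


def scaling_factor(small_tiles, large_tiles, size_ratio=2):
    """How much of the ideal tile gain is kept when the system grows.

    size_ratio is the growth in sockets/cores (2 for 4-socket to 8-socket).
    """
    return large_tiles / (small_tiles * size_ratio)


def cost_per_vm(system_price, tiles, vms_per_tile=VMS_PER_TILE):
    """System price divided by the number of VMs in the benchmark run."""
    return system_price / (tiles * vms_per_tile)


if __name__ == "__main__":
    print(f"NEC tile-to-core: {tile_to_core(24, 48):.0%}")
    print(f"4P to 8P scaling: {scaling_factor(14, 24):.1%}")
    print(f"NEC cost per VM: ${cost_per_vm(70_000, 24):.0f}")
```

With those inputs the three values come out at 50%, 85.7% and roughly $486/VM, matching the figures above.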

If AMD’s Istanbul scales to 8-socket at least as efficiently as Dunnington, we should be seeing some 48-core results in the 43.8@30 tile range in the next month or so from HP’s 785 G6 with eight AMD 8439 SE processors. You might ask: what virtualization applications scale to 48-cores when $/VM is doubled at the same time? We don’t have that answer, and judging by Intel and AMD’s scale-by-hub designs coming in 2010, that market will need to be created at the OEM level.

Based on the performance we’re seeing in 8-socket systems relative to 4-socket and the upcoming “massively multi-core” processors in 2010, the law of diminishing returns seems to favor the 4-socket system as the limit for anything but massive OLTP workloads. Even then, we expect to see 48-core in a “4-way” box more efficient than the same number of cores in an 8-way box. The choice in virtualization will continue to be workload biased, with 2P systems offering the best “small footprint” $/VM solution and 4P systems offering the best “large footprint” $/VM solution.

Today AMD published pricing for 5 new Istanbul SKUs – two designated as 105W ACP high-performance SE and three as 55W ACP low-power HE models.

In the SE category, the 2439SE and 8439SE at 2.8GHz replace the top-bin 2435/8435 Istanbul which occupies the 2.6GHz, 75W ACP bin. Besides the clock frequency changes, maximum CPU temp is reduced from 76C to 71C. As with all other Istanbuls so far, these are HT3 bus parts running at 4.8GT/s. Price per socket has been announced at $1,019 and $2,649 for the 2439SE and 8439SE, respectively.

While the new SE parts do little to help the Opteron surpass the X5560 in raw performance, they fit well into the price-performance picture for AMD so long as street prices for the X5560 continue to hover in the $1,200-1,300 range.

SPECint_rate2006 - AMD Istanbul SE SKU's

In the HE category, the 2425HE/8425HE and 2423HE are new clock speed bins running at 2.1GHz and 2.0GHz, respectively. These parts maintain the same 76C maximum CPU temp as the normal 75W ACP parts, but are selected to consume just 55W ACP. Again, these SKU’s also carry the 4.8GT/s HT3 bus of their Istanbul brethren. Pricing per socket has been announced at $523 and $1,514 for the 2425HE and 8425HE, respectively, with the 2423 HE targeted at $455 each.

SPECint_rate2006 - AMD Istanbul HE SKU's

Here, AMD’s lower power target and pricing help the chip maker do some profit-taking as the price-performance of the HE parts appears to offer a measurable advantage over the L5506 (60W TDP) which is circling the $475 region (street price). See AMD’s official press release about High Energy Efficiency and the Processing Power of Six-Cores for more details.

SOLORI’s Take: AMD has expanded the Istanbul line with both high-performance and low-power SKUs as promised. With DDR3 prices inching downward, AMD’s price-performance position is eroding slowly as Q3/2009 approaches. However, the 2-to-1 price penalty for top-bin Xeon/Nehalem platforms will take a lot more time to overcome, leaving AMD the solid choice for budget conscious virtualization.

What’s perhaps more exciting for AMD followers – especially in the good-enough performance market – is sitting in the HE bin. The HE shows weakness in the 2P space, however, against the 2.26GHz L5520 part from Intel which sports 8 threads per CPU and can burst core speeds in excess of 3GHz with its “turbo” feature. This places the 2P 2425 HE somewhere in-between L5506 and L5520 in performance-per-watt, with 2425 HE maintaining a reasonable price-performance advantage.

In the unchallenged 4P space, the 8425 HE, at 2.1GHz and $1,580 (est. street price), offers nearly 3:2 power savings over the standard part, delivering 24 cores at a little over 200W ACP (4P configurations). This savings will help scale-out clouds both private and public.
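The "nearly 3:2" figure can be checked from the per-socket ACP ratings given above (55W for HE parts, 75W for standard parts). The sketch below is a back-of-the-envelope comparison of CPU power only; it ignores memory, chipset and other platform power, and the function name is hypothetical.

```python
def cpu_acp_total(sockets, acp_per_socket_w):
    """Aggregate CPU ACP in watts for a multi-socket configuration."""
    return sockets * acp_per_socket_w


if __name__ == "__main__":
    he_total = cpu_acp_total(4, 55)   # 4P with 55W ACP HE parts
    std_total = cpu_acp_total(4, 75)  # 4P with 75W ACP standard parts
    print(f"HE: {he_total}W, standard: {std_total}W, "
          f"ratio {std_total / he_total:.2f}:1")
```

Four HE sockets total 220W against 300W for the standard parts, a ratio of about 1.36:1, which is where the "nearly 3:2" and "a little over 200W" figures come from.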

SPECfp_rate2006 (SE SKU)

SPECint_rate2006 (SE SKU)

SPECfp_rate2006 (HE SKU)

SPECint_rate2006 (HE SKU)

(Note: SPEC CPU results gathered from published tables at http://spec.org.)

Thanks to a tweet from @ErikBussink and the quick thinking of Charlie Demerjian at SemiAccurate we’ve been treated to a picture of the upcoming Tyan S8212 (2-way) based on AMD’s new line-up of motherboard chipsets. While we see one x16 and three x8 PCIe slots, 6 SATA and 8 SAS ports, there is (conspicuously) no 10GE LOM – just 1GE.
