
Odd that Intel went the three-die route with Ivy Bridge-EP. It was no surprise that the low end would be a variant of the 6-core Ivy Bridge-E found in the Core i7-4900 series, and Apple leaked that the lineup would scale to 12 cores. The surprise is a native 10-core part and the differences between it and the 12-core design.

Judging from the diagrams, Intel altered its internal ring bus for connecting cores. One ring orbits all three columns of cores while another connects only two of them. Thus the cores in the middle column have better latency for coherency, as they have fewer stops on the ring bus to reach any core. The outer columns should have latency similar to the native 10-core chip: fewer cores to stop at, but longer traces on the die between columns.
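To make the hop-count intuition concrete, here is a toy Python model. It is purely illustrative (Intel's actual ring topology, stop placement, and routing policies are not public): it just computes the average shortest-path distance between stops on an idealized bidirectional ring, showing why adding stops raises average coherency latency.

```python
def avg_hops_on_ring(n_stops):
    """Average shortest-path hop count between two distinct stops
    on an idealized bidirectional ring with n_stops stops."""
    total = 0
    for src in range(n_stops):
        for dst in range(n_stops):
            if src != dst:
                d = abs(src - dst)
                total += min(d, n_stops - d)  # go whichever way is shorter
    return total / (n_stops * (n_stops - 1))

# Hypothetical rings with one stop per core, ignoring cache/agent stops:
print(round(avg_hops_on_ring(10), 2))  # 10-core die, single ring
print(round(avg_hops_on_ring(12), 2))  # 12-core die, outer ring only
```

Under these (assumed) conditions the 12-stop ring averages about half a hop more than the 10-stop ring, which is consistent with the comment's point that ring position, not just core count, drives the latency differences.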

Not disclosed is how the 12-core chip divides its cache. Previously each core had a 2.5 MB slice of L3 cache that was more local than the rest of the L3. The middle column may have access to L3 cache on both sides.

The usage of dual memory controllers on the 12-core die is interesting. I wonder what measurable differences it produces. I'd fathom tests with a mix of reads/writes (i.e. databases) would show the greatest benefit, as a concurrent read and write may occur. In a single-socket configuration, enabling NUMA may produce a benefit. (Actually, how many single-socket 2011 boards have this option?)

The memory controller interfaces are different between Ivy Bridge-EP and Ivy Bridge-EX. The EP uses DDR3 in all of its forms (vanilla, ECC, buffered ECC, LR ECC), whereas the EX version is going to use a serial interface similar in concept to FB-DIMMs. There will be two types of memory buffers for the EX line, one for DDR3 and later another that will use DDR4 memory. No changes need to be made to the new EX socket to support both types of memory.

I would have expected this newest Intel 12-core CPU to perform better. For instance, in Java SPECjbb2013 benchmarks it gets 35,500 and 4,500, while the Oracle SPARC T5 gets 75,700 and 23,300, which totally demolishes the x86 CPU. Have the x86 CPUs not improved that much in comparison to SPARC? Does x86 still lag behind?
https://blogs.oracle.com/BestPerf/entry/20130326_s...

What do you mean by inflated for marketing purposes? SPECjbb2013 is clearly a real-world, recent benchmark that's fully audited by all vendors on the SPEC committee. If you make such claims, surely you have some evidence?

TDPs are indeed higher on the SPARC side, but not as radically as you indicate. Generally they do not consume more than 200W. (Unfortunately Oracle doesn't give a flat power consumption figure for just the CPU; this is just an estimate based upon their total system power calculator. For reference, the POWER7 is 200W and the POWER7+ is 180W.)

@Kevin G: "...The results on the SPARC side are often broken..." What do you mean by that? The Oracle benchmarks are official and anyone can see how they did it. Regarding SPARC T5 performance, it is very fast, typically more than twice as fast as Xeon CPUs. Just look at the official, accepted benchmarks on the site I linked to.

@Brutalizer: There is a SPEC subtest whose result on SPARC is radically higher than on other platforms, and the weight of this one test affects the overall score. It has been a few years since I read up on this, and SPARC as a platform has genuinely become performance competitive again.

x86 was clearly ahead of SPARC until about the SPARC T4 timeframe, when Oracle took over R&D on SPARC. The SPARC T4 allowed Oracle to equalize the playing field, especially in areas like database where the SPARC T4 really shines, as the many world-record benchmarks released at the time show. When the SPARC T5 came out last year, it increased performance by 2.4x, clobbering practically every other CPU out there. Today you'll be hard pressed to find a real-world benchmark, one that is fully audited, where the SPARC T5 is not in a leadership position, whether Java based like SPECjbb2013 or SPECjEnterprise2010, or database like TPC-C or TPC-H.

I know that EX will be using the (scalable) memory buffer, which is probably the main reason for the separate pin-out. I guess they could still keep both memory controllers in, and fuse off the appropriate one depending on whether it is an EX or EP SKU, if this would still make sense from a production perspective.

It wouldn't make much sense, as the EX lineup moves the DDR3 physical interface off to the buffer chip. There is a good chunk of die space used for the 256-bit-wide memory interface in the EP line. Going the serial route, the EX line essentially doubles the memory bandwidth while using the same number of IO pins (though at the cost of latency).

The number of PCIe lanes and QPI links also changes between the EP and EX lineups. The EP has 40 PCIe lanes whereas the EX has 32. There are 4 QPI links in the EX lineup, making those chips ideal for 4- and 8-socket systems, whereas the EP line has 2 QPI links, good for dual-socket or a poor quad-socket configuration.

Well, I guess without Intel openly saying, or somebody laser-cutting the die, it would not be possible to know exactly whether the HCC EP and EX share a die.

However, there are lots of "hints" that the B-package Ivy Bridge EP hides more cores, like the ones you linked. If that is the case, it is really a shame that Intel did not enable all the cores in the EP line. There would still be plenty of room for differentiation between the EX and EP lines, since EX contains RAS features without which the target enterprise customers would probably not even consider EP, even if it had the same number of cores.

How about adding turbo frequencies to the SKU comparison tables? That would make comparing the SKUs a bit easier, as turbo is sometimes the more representative figure depending on the load these babies run.

I added Turbo speeds to all SKUs as well as linking the product names to the various detail pages at AMD/Intel. Hope that helps! (And there were a few clock speed errors before that are now corrected.)

"Once we run up to 48 threads, the new Xeon can outperform its predecessor by a wide margin of ~35%. It is interesting to compare this with the Core i7-4960X results, which is the same die as the "budget" Xeon E5s (15MB L3 cache dies). The six-core chip at 3.6GHz scores 12.08."

What I find most interesting here is that the Xeon manages to show a factor of 23 between multi-threaded and single-threaded performance, very good scaling for a 24-thread CPU. The 4960X only manages a factor of 7 with its 12 threads. So it is not merely a question of "cores over clock speed"; rather, hyperthreading seems to not work very well on the consumer CPUs in the case of Cinebench. The same seems to be true for the Sandy Bridge and Haswell models as well.

Do you know why this is? Is hyperthreading implemented differently for the Xeons? Or is it caused by the different OS used (Windows 2008 vs Windows 7/8)?

However, the HCC version of IvyTown has two separate memory controllers and more features enabled (direct cache access, different prefetchers, etc.), so it might scale better.

I am achieving a 1.41x speed-up with a dual Xeon 2697 v2 setup, compared to my old dual Xeon 2687W setup. This is so close to the "ideal" 1.5x scaling that it is pretty amazing. And the 2687W was running at a slightly higher clock in all-core turbo.

Actually, if you run the numbers, the scaling factor from 1 to 48 threads is 21.9. I'm curious what the result would have been with Hyper-Threading disabled, as that can actually decrease performance in some instances.

Oops, you are perfectly right of course. In that case the 4960X actually gets slightly better efficiency (12.08 is 0.28 per thread and GHz) than the dual 2697s (33.56 is 0.26 per thread and GHz), which makes perfect sense.
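The per-thread-per-GHz figures above can be recomputed directly from the Cinebench scores, thread counts, and clocks quoted in the thread:

```python
def efficiency(score, threads, ghz):
    """Cinebench points per thread per GHz, as used in the comment above."""
    return score / (threads * ghz)

# Core i7-4960X: 12.08 points, 12 threads, 3.6 GHz
print(round(efficiency(12.08, 12, 3.6), 2))   # -> 0.28
# Dual E5-2697 v2: 33.56 points, 48 threads, 2.7 GHz nominal
print(round(efficiency(33.56, 48, 2.7), 2))   # -> 0.26
```

Note this uses the base clocks from the discussion; actual all-core turbo clocks would shift both numbers slightly.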

It also indicates the 4960X gets about 70% of the performance of a single 2697 at 38% of the cost. Then again, a 1270v3 gets you 50% of the performance at 10% of the price. So when talking farms (i.e. more than one system cooperating), four single-socket boards with 1270v3s will get you almost the power of a dual-socket board with 2697v2s (minus communication overhead), will likely have similar power demands (plus communication overhead), and save you $4400 in the process. Since you use 32 instead of 48 threads, but 4 installations instead of 1, software licensing costs may vary strongly in either direction.
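Setting aside the communication overhead and cost questions, the raw throughput side of that comparison follows from the relative-performance figures quoted above (0.5 = one 1270v3 delivers about half the performance of a single 2697 v2; both ratios are the comment's estimates, not measured here):

```python
# Relative performance, normalized to a single E5-2697 v2 (per the comment)
perf_single_2697 = 1.0
perf_dual_2697_board = 2 * perf_single_2697       # one dual-socket board
perf_1270v3 = 0.5 * perf_single_2697              # one E3-1270 v3 box
perf_four_1270v3_boards = 4 * perf_1270v3         # four single-socket boxes

print(perf_four_1270v3_boards, perf_dual_2697_board)  # -> 2.0 2.0
```

On these assumptions the two farms land at the same aggregate figure, which is why the comment says "almost the power": the clustering overhead is what separates them in practice.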

Would be interesting to see this tested. Anybody willing to send AT four single-socket workstations?

Won't the much higher price of a 4 socket board kill any CPU cost savings?

In any event, the 1270v3 is a unisocket chip so you'd need to do 4 boxes to cluster.

Poking around on Intel's site, it looks like all 1xxx Xeons are uniprocessor, 2xxx is dual socket, 4xxx quad, 8xxx octo socket. But the 4xxx series is still on 2012 models and 8xxx on 2011 releases. The 4-way chips could just be a bit behind the 2-way ones being reviewed now; but with the 8-way ones not updated in two years, I'm wondering if they're being stealth discontinued due to the minimal number of cases where two smaller servers aren't a better buy.

I'd fathom the bigger benefit of Haswell is found in the TSX and L4 cache for server workloads. The benefits of AVX2 would be exploited in more HPC-centric workloads. Now if Intel would just release a socketed 1200v3 series CPU with L4 cache.

A 4-socket board is expensive, but that's not the comparison I was making. A Xeon E5-4xxx is not likely to be less expensive than the E5-2xxx part anyway.

The question was specifically how four single-socket boards (with 4 cores each, at 3.5GHz, and Haswell technology) would position themselves against a dual-socket board with 24 cores at 2.7GHz and Ivy Bridge EP tech. Admittedly, the 3 extra boards will add a bit of cost (~$500), and extra memory and communication cards etc. can also add something depending on the usage scenario. Then again, a single 4-core might get the work done with less than half the memory of a 12-core, so you might save a little there as well.

E5-46xx v2 is coming in a few months; qualification samples are already available and for all intents and purposes it is ready - Intel just needs to ramp up production.

E7-88xx v2 is coming in Q1 2014, so it is definitely not discontinued, and the platform (Brickland) will be compatible with both Ivy Bridge EX (E7-88xx v2 among others) and Haswell EX (E7-88xx v3 among others) CPUs and will also be able to take DDR4 RAM. It will require a different LGA 2011 socket, though.

The EX platform will come with up to 15 cores in the Ivy Bridge EX generation.

The E5-46xx is simply a rebranded E5-26xx with official support for quad socket. The dies are going to be the same between both families. Intel is just doing extra validation for the quad-socket market, as that market tends to favor more reliability features as socket count goes up.

While not socket compatible, Brickland as a platform is expected to be used for the next (last?) Itanium chips.

So I work for HP and your comments about 4x1P instead of 2x2P make me wonder if you have been sneaking around our ProLiant development lab in Houston.

I was there 6 weeks ago and a decent sized cluster of 1P nodes was being assembled on an as yet unannounced HP platform. I was told the early/beta customer it was for had done some testing and found for their particular HPC app, they were in fact getting measurably better overall performance.

The interesting thing about this design was they put 2 x 1P nodes on a single PCB (Motherboard) in order to more easily adapt the 1P nodes to a system largely designed with 2P space requirements in mind.

Pretty sure the chips were Haswell based as well, but I can't recall for sure.

There's a minor error on the Cinebench single-threaded graph. It lists the clock speed of the E5-2697 v2 as 2.9 instead of 2.7, as it should be. That is semi-confusing on that graph, since the lower clock is what explains its lower single-threaded performance versus the E5-2690.

Please make some gaming-related tests. I'm planning on upgrading from 2x 2609 to 2x 2690 v2, now that I know for sure that the 10-core, 25 MB cache part is a complete die. I don't much trust the design of the 12-core die; it's not how I would design the CPU. Besides, the 2690 v2 is 3 GHz base and 3.6 GHz boost, perfect for gaming.

I would have liked to see how a 2690 v2 compares with a 2687W v2 in gaming-related tests, seeing as the latter has a 3.4 GHz base and 4 GHz boost but two fewer cores.

Anyway, I'm not paying 3000+ euros for a disabled die (like the one in the 2687W v2), so the 10-core is my choice, but I still would have liked to see how higher frequency and lower core count impact gaming performance!

I can tell you now that the 8-core is going to kick the 10-core's ass for gaming. The higher clock will win here. So as you are going to pay 3000 euros, you may as well get the best, even if it does have two cores disabled. But I do agree: for me a more interesting comparison would have been 12 vs 10 vs 8, all v2s, all fastest-clock-available versions...

IMO for gaming you'd be better off with a used OC'd 2700K. I just bought one for 160 UKP, fitted with a used TRUE (cost 15), two new Coolermaster Blademaster fans, Q-fan active (ASUS M4E board, used, cost 130); it runs at 5GHz no problem, silent when idle. See:

The vast majority of games gain the most from a sensible middle ground between multiple cores and a high clock. Few will properly exploit more than 4 cores with HT. Using a multi-core Xeon for gaming is silly. You would see far greater gaming performance by getting a much cheaper 4/6-core and spending the saved cash on more powerful GPUs, like two 780s or Titans in SLI, or two 7970s in CF, etc. A 4-core Z68 should be just fine, though if you do want oodles of PCIe lanes for high-end SLI/CF then I'd get X79 and a 3930K (don't see the point of IB-E).

Trust me, a 5GHz 2700K, or a 4.7GHz 3930K, paired with two much better GPUs via the saved money, will be massively better for gaming vs. what you could afford having spent thousands on two 10- or 12-core CPUs with much lower clocks. Most 2600Ks will OC pretty nicely too.

Bytales, what GPU(s) do you have in your system atm?

Ian.

PS. IB/HW are a waste of time. They don't OC as well as SB. I bought a 2500K for 125; it only took 3 mins to get it running 4.7 stable on a used Gigabyte Z68 board (which cost a mere 35).

The reason I'm looking at Xeons is the motherboard I own, the Z9PE-D8 WS, which I bought because I need the PCI Express lanes two Xeons provide. No other motherboard could have gotten me what this one does, and I have looked everywhere. That's the reason I need these Xeons. I originally bought two 2609 CPUs and a CrossFire Tahiti LE setup (one card burned out due to Bitcoin mining); their purpose was to make the PC usable until the new Xeons and the new Radeons become available.

I know I won't be getting the best possible CPUs for gaming on this platform; I just want some decent performers. The 2609s I have now are 2.4 GHz, no boost, no HT, and have done their job well so far. I'm expecting decent gaming performance out of a 3 GHz chip with multiple cores. Sure, I could get the 2687W v2 for the same price, but I hate disabled things. Why the hell didn't they make a 10-core chip with 25 MB cache, 3.5 GHz base, 4 GHz boost and a 150-160 W TDP? I would have bought such a CPU. But as it is, I'll have to make do with two 2690s. Maybe, just maybe, if I see some gaming benchmarks between the two CPUs, I will consider the 2687W v2. Until then, my first choice is the 2690.

Hopefully the people from AnandTech will test this aspect of the CPUs, gaming that is, because all they tested was server/enterprise stuff, which was to be expected after all. Gaming was not what these CPUs were built for. But I like having strong CPUs that will have my back if I decide to do some other stuff as well. I do a bunch of converting, compressing, AutoCAD, Photoshop, etc. That's why the more cores, the better.

I would think you can get the PCIe lanes you want with a motherboard that has a PLX bridge chip, such as the ASUS P9X79-E WS, without needing to resort to a two-socket motherboard. As far as gaming, I think the E5-1620 v2 gives good bang for the money, and if you need more cores, the E5-1650 v2 does well, too. If you need a little better performance, you can get the E5-1680 v2, but at a price. Too bad Intel doesn't sell single-socket CPU versions with more than 6 cores, though.

The Xeon 2660 v2 could in theory be what Ivy-E should have been for enthusiasts: something at least a bit more worth spending big $ on. The mainboard would have to let us enable multi-core turbo and OC the bus, though.

The situation with Ivy Bridge EP is absolutely the same as with Sandy Bridge EP:

- No BCLK "straps" (or ratios) for the Xeon line - only 100 MHz allowed
- No unlocked multipliers
- BCLK overclocking works - your mileage may vary. I can get up to 105 MHz with a dual Xeon 2697 v2 setup on a Z9PE-D8 WS

So, Ivy Bridge EP Xeons do not overclock particularly well - the best you can get out of the 2S parts (26xx v2) is an extra 100-150 MHz, depending on the max turbo multiplier your SKU has.
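Since only BCLK moves on these locked parts, the achievable gain follows directly from core clock = BCLK x multiplier. A quick sketch (the x30 multiplier is an assumed all-core turbo figure for illustration, not an official spec):

```python
def core_clock_mhz(bclk_mhz, multiplier):
    """Core clock in MHz: base clock times the active multiplier."""
    return bclk_mhz * multiplier

# Locked Xeon: multiplier fixed, only BCLK moves (100 -> ~105 MHz per above)
stock = core_clock_mhz(100, 30)   # e.g. ~3.0 GHz all-core turbo
oced  = core_clock_mhz(105, 30)   # same multiplier, 105 MHz BCLK
print(oced - stock)               # -> 150 (MHz gained)
```

A SKU with a lower max multiplier gains proportionally less from the same 5% BCLK bump, which is where the quoted 100-150 MHz range comes from.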

Johan, what do you mean by "...over four NUMA nodes" in the last sentence on the Compression And Decompression page?

My understanding is that for both Opteron and Xeon, a NUMA node is a complete CPU package (with all its cores) and the associated RAM directly connected to that CPU's memory controllers. In the charts, all of the Opterons are listed as "2x Opteron XXXX". Are you considering each die within the Opteron MCM package to be a separate NUMA node - or how else are you coming up with "four" above?

The data in this article is incomplete. The JVM tuning used is targeted at throughput alone, basically ignoring GC pause times. The critical-jOPS metric is intended to measure throughput under response time constraints, and the results posted here are most likely highly variable and definitely dreadfully low because of the poor tuning choices.

Actual customers care more about response time/latency these days. Throughput is often solved by scaling horizontally, response time is not. Commercial benchmarking should try to reflect that desire by focusing on response time and the SPECjbb2013 critical jOPS in order to influence hardware and software vendors to compete.

Finally, to Kevin G: I think it's also likely that SPARC T-series systems have been focusing on customer metrics more than competitive benchmarks, and now there's a benchmark that takes response time into consideration.

Surely it's more interesting to see whether the 12-core is faster than the 10- and 8-core v2s. It's not obvious to me that the 12-core can outperform the 2687W v2 in real-world measures rather than in synthetic benchmarks. The higher sustained turbo clock is really going to be hard to beat.

There will be a follow-up with more energy measurements, and this looks like a very interesting angle too. However, do know that the maximum turbo does not happen a lot. In the case of the 2697 v2, we mostly saw 3 GHz, hardly anything more.

Yes, based on bin specs 3 GHz is what I would expect from the 2697 v2 if 6 or more cores are in use. With 5 or more cores active, the 2687W v2 will run at 3.6 GHz, while the 2690 v2 will run at 3.3 GHz with 4 or more cores. So flat out the 12-core will be faster than the 10-core, which will be faster than the 8-core - but in reality it is hard to run these flat out with real-world tasks, so usually the faster clock wins. Looking forward to you sharing some comparative benchmarks.
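The "flat out" ordering claimed above can be sketched as aggregate core-GHz under full load, using the approximate all-core turbo bins from this comment (not official Intel turbo tables):

```python
# (cores, approximate all-core turbo in GHz) per the comment above
skus = {
    "E5-2697 v2 (12c)": (12, 3.0),  # ~3.0 GHz with 6+ cores active
    "E5-2690 v2 (10c)": (10, 3.3),  # ~3.3 GHz with 4+ cores active
    "E5-2687W v2 (8c)": (8, 3.6),   # ~3.6 GHz with 5+ cores active
}

for name, (cores, ghz) in skus.items():
    # crude throughput proxy: total core-GHz when fully loaded
    print(name, round(cores * ghz, 1))
```

By this crude proxy the 12-core (36.0) edges out the 10-core (33.0) and the 8-core (28.8) when fully loaded, while at low thread counts the higher-clocked parts reverse the ordering, which is exactly the trade-off the comment describes.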

E5-2697 v2 vs. E5-2690: +30% performance at +50% cores? I am a bit disappointed. Don't get me wrong, I am aware of the 200 MHz difference, and the overall performance-per-watt ratio is great, but I noticed something similar with the last generation (X5690 vs. E5-2690). There are still some single-threaded applications out there and yes, there is a turbo, but it won't be aggressive on an averagely loaded ESXi server which might host VMs with single-threaded applications. I somehow do not like this development; my guess is that the hex- or octa-core CPUs with higher clocks are still a better choice for virtualization in such a scenario.

I think the 12-core diagram and description are incorrect! The mainstream die is indeed a 10-core die with 25 MB L3 that most SKUs are derived from. But the second die is actually a 15-core die with 37.5 MB. I am guessing (I know I am right :-)) that they put half of the 10-core section - 5 cores and 12.5 MB of L3, with its QPIs and memory controllers - on top, and connected the two sections using an internal QPI. From the outside it looks like a 15-core part, currently sold as a 12-core part only. A full 15-core SKU would require too much power, well above the 130 W TDP that current platforms are designed for. They might sell the 15-core part to high-end HPC customers like Cray! The 12-core SKU should have roughly 50% more die area than the 10-core die!