IBM’s 8-core POWER7: twice the muscle, half the transistors

IBM's 8-core POWER7 crams an amazing amount of hardware into about half the …

IBM's Hot Chips presentation on its forthcoming 45nm POWER7 server processor had a wealth of information on the chip, which, at 1.2 billion transistors and 567mm², is actually quite svelte considering what it offers. The secret is the first use of a special cache technology that IBM has been touting since 2007, but more on that in a moment.

POWER7 will come in 4-, 6-, and 8-core varieties, with the default presumably being the 8-core and the lower-core variants being offered to improve yields. Each core features 4-way simultaneous multithreading, which means that the 8-core will support a total of 32 simultaneous threads per socket. POWER7 is designed for multisocket systems that scale up to 32 sockets, which means that a full 32-socket system of 8-core parts would support 1024 threads.
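The thread arithmetic above is easy to check. Here is a minimal sketch in Python that uses only the figures quoted in this article; the function name is made up for illustration.

```python
def hardware_threads(cores_per_socket, smt_ways=4, sockets=1):
    """Total hardware threads = cores x SMT ways x sockets."""
    return cores_per_socket * smt_ways * sockets


# The three announced core counts, per socket and for a full 32-socket system.
for cores in (4, 6, 8):
    per_socket = hardware_threads(cores)
    full_system = hardware_threads(cores, sockets=32)
    print(f"{cores}-core: {per_socket} threads/socket, "
          f"{full_system} threads at 32 sockets")
```

The 8-core row reproduces the 32 threads per socket and 1024 threads per system mentioned above.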

Feeding eight cores in a single socket is quite a challenge, which is why each POWER7 has a pair of four-channel DDR3 controllers that can support up to 100GB/s of sustained memory bandwidth. Also helping the situation is a whopping 32MB of on-die L3 cache—IBM was able to cram this much cache on there by using a special embedded DRAM (eDRAM) design that cuts the transistor cost of its large cache pool roughly in half.

To see how dramatic the transistor savings are with this eDRAM cache scheme, compare the 8-core, 32MB-cache POWER7's 1.2 billion transistor count with the 2 billion transistor count of the 4-core, 30MB-cache "Tukwila" Itanium from Intel. Sure, POWER7's eDRAM is almost certainly a bit slower than Tukwila's SRAM, but in today's power-sensitive age that level of transistor savings is impressive. Also consider how POWER7 stacks up to the eight-core Nehalem EX, which has 24MB of cache and weighs in at over 2.2 billion transistors; again, IBM did more with less.
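To put rough numbers on that comparison, the sketch below computes the ratios from the transistor counts quoted above, then estimates the size of a 32MB data array under two textbook cell designs. The cell sizes (6 transistors per SRAM bit, 1 per eDRAM bit) are my assumption, not figures from IBM's presentation, and a real cache also needs tags, ECC, and peripheral logic, so treat the second half as an order-of-magnitude illustration only.

```python
# Transistor counts as quoted in the article, in billions.
# Nehalem EX is described as "over 2.2 billion", so 2.2 is a lower bound.
CHIPS = {
    "POWER7 (8 cores, 32MB eDRAM L3)": 1.2,
    "Tukwila (4 cores, 30MB cache)": 2.0,
    "Nehalem EX (8 cores, 24MB cache)": 2.2,
}


def cache_cell_transistors(megabytes, transistors_per_bit):
    """Transistors in the data array alone (no tags, ECC, or peripheral logic)."""
    bits = megabytes * 1024 * 1024 * 8
    return bits * transistors_per_bit


power7 = CHIPS["POWER7 (8 cores, 32MB eDRAM L3)"]
for name, billions in CHIPS.items():
    print(f"{name}: {billions:.1f}B transistors; "
          f"POWER7 uses {power7 / billions:.0%} as many")

# ASSUMPTION: textbook cells, 6 transistors per SRAM bit and 1 per eDRAM bit.
sram = cache_cell_transistors(32, 6)
edram = cache_cell_transistors(32, 1)
print(f"32MB data array as 6T SRAM:  {sram / 1e9:.2f}B transistors")
print(f"32MB data array as 1T eDRAM: {edram / 1e9:.2f}B transistors")
```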

Note that the four-way SMT design is another trick that helps with the problem of feeding all that hardware by acting as a latency-hiding mechanism for each core's back end. If one thread stalls waiting on memory, the core can (ideally) find instructions from another running thread to feed to the execution units in order to keep them busy. This bandwidth issue is probably one reason behind IBM's decision to go with such a high level of SMT.
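The latency-hiding effect can be illustrated with a toy simulation. This is a deliberately simplified model, not a description of POWER7's pipeline: it assumes a single issue slot per cycle, and every thread repeats the same hypothetical pattern of 4 cycles of work followed by a 12-cycle memory stall.

```python
def utilization(n_threads, compute_cycles=4, stall_cycles=12, total_cycles=1600):
    """Fraction of cycles in which the (single) issue slot does useful work."""
    ready_at = [0] * n_threads                  # cycle at which each thread can issue again
    work_left = [compute_cycles] * n_threads    # work remaining before the next stall
    busy = 0
    for cycle in range(total_cycles):
        for t in range(n_threads):
            if ready_at[t] <= cycle:
                # Issue one instruction from the first thread that is not stalled.
                busy += 1
                work_left[t] -= 1
                if work_left[t] == 0:
                    # The thread now waits on memory for stall_cycles cycles.
                    ready_at[t] = cycle + 1 + stall_cycles
                    work_left[t] = compute_cycles
                break
    return busy / total_cycles


for n in (1, 2, 4):
    print(f"{n} thread(s): {utilization(n):.0%} of cycles busy")
```

With these made-up parameters a single thread keeps the slot busy a quarter of the time, and four threads are enough to cover each other's stalls completely. Real workloads are far less regular, which is why the article hedges with "ideally."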

Speaking of a POWER7 core's back end, each core contains a very robust suite of execution resources. There are 12 execution units in total, broken down as follows:

2 integer units

2 load-store units

4 double-precision floating-point units

1 branch unit

1 condition register unit

1 vector unit

1 decimal floating-point unit

Those of you who've read my past microprocessor articles or my book will know what most of the above units are for, with the possible exceptions of the PPC-specific condition register unit (that was present on the 970) and the decimal floating-point unit, which accelerates math functions commonly found on mainframe workloads.

My only real comment about the above is that four DP floating-point units is a lot of floating-point power. This makes sustained streaming bandwidth from memory critically important for POWER7's FP performance, so it's a good thing that it has plenty of it.

I'm told that the POWER7 continues the "group dispatch" scheme that has been a part of the POWER line since the POWER4 days. I described in detail how this works in my first article on the PowerPC 970 (a.k.a. G5). In a nutshell, it cuts down on the amount of bookkeeping logic needed to track in-flight instructions by dispatching and tracking the instructions in bundles. On the POWER4 and 970, instructions were dispatched from the instruction queue to the back end in bundles of 5 each, but the dispatch groups have now been widened to 6 slots.
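The bookkeeping savings from group dispatch can be sketched in a few lines. This example only counts tracking entries. It packs a hypothetical instruction stream into groups in program order and ignores the slot restrictions that real group formation imposes, so the names and the stream length are illustrative assumptions.

```python
def form_groups(instructions, slots_per_group):
    """Pack instructions, in program order, into fixed-width dispatch groups."""
    return [instructions[i:i + slots_per_group]
            for i in range(0, len(instructions), slots_per_group)]


# Hypothetical stream of 120 in-flight instructions.
instructions = [f"insn{i}" for i in range(120)]

for width in (5, 6):
    groups = form_groups(instructions, width)
    print(f"{width}-slot groups: {len(groups)} entries to track "
          f"instead of {len(instructions)}")
```

Tracking one entry per group rather than one per instruction is the saving described above, and widening the group from 5 to 6 slots means the same number of entries covers more in-flight instructions.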

In all, IBM has produced a very impressive 32-thread monster of a chip with a ton of cache and plenty of memory bandwidth, and done so with half the transistors of the competition. This is quite an achievement, and it reiterates just how strong IBM remains in the very lucrative mainframe market.

64 Reader Comments

IBM still plans on MCM configurations with up to 4 dies per package. That'd be 32 cores per socket. High-end systems using 32 sockets would potentially have 1024 coherent cores and 4096 concurrently running threads. That is a lot of processing power for one logical system: 32 TFLOPS at 4 GHz.
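The figures in this comment can be reproduced with a quick calculation. The sketch below assumes that each of the four double-precision units completes one fused multiply-add per cycle, counted as 2 floating-point operations. That is my guess at how the 32 TFLOPS number was derived, and the 4 GHz clock is the commenter's figure, not a confirmed spec.

```python
def peak_dp_tflops(cores, fpus_per_core=4, flops_per_fpu_cycle=2, clock_ghz=4.0):
    """Theoretical peak double-precision TFLOPS (never reached in practice)."""
    # ASSUMPTION: 2 flops per FPU per cycle, i.e. one fused multiply-add.
    return cores * fpus_per_core * flops_per_fpu_cycle * clock_ghz / 1000.0


dies_per_package, cores_per_die, sockets, smt_ways = 4, 8, 32, 4
cores = dies_per_package * cores_per_die * sockets
threads = cores * smt_ways

print(f"{cores} cores, {threads} threads, "
      f"{peak_dp_tflops(cores):.1f} TFLOPS theoretical peak")
```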

"This is quite an achievement, and it reiterates just how strong IBM remains in the very lucrative mainframe market."

By mainframe, do you mean Z-server, P-server running AIX, or P-server running Linux? The mainframe market is not nearly as lucrative as it used to be. But IBM does seem to have succeeded in maintaining a good business rooted in its classic mainframe architecture System 3x0. Certainly, a decade ago, nobody would have thought that the Power architecture based RS6000 was a mainframe. Today, I suspect IBM is using the Power based chips for all of its proprietary servers. The old names were (AS400, System 390, RS6000). They surely run Linux along with MVS on z-server. But the center of their mainframe business is presumably still large customers running much the same MVS workloads that they ran 20 years ago on System 380. Still it would be interesting if IBM has managed to build a business running Linux based large system workloads on a Power based P-server architecture.

Originally posted by dnjake:"This is quite an achievement, and it reiterates just how strong IBM remains in the very lucrative mainframe market."

By mainframe, do you mean Z-server, P-server running AIX, or P-server running Linux? The mainframe market is not nearly as lucrative as it used to be. But IBM does seem to have succeeded in maintaining a good business rooted in its classic mainframe architecture System 3x0. Certainly, a decade ago, nobody would have thought that the Power architecture based RS6000 was a mainframe. Today, I suspect IBM is using the Power based chips for all of its proprietary servers. The old names were (AS400, System 390, RS6000). They surely run Linux along with MVS on z-server. But the center of their mainframe business is presumably still large customers running much the same MVS workloads that they ran 20 years ago on System 380. Still it would be interesting if IBM has managed to build a business running Linux based large system workloads on a Power based P-server architecture.

Z series runs on specialized, expensive-to-manufacture custom logic. If you want rock solid performance and massive uptime, Z series is where it's at. There's a lot of work, such as I/O, taken on by dedicated hardware. The MVS MIPS market is tightly controlled; I don't think we are going to see a revolution in MIPS for the $$$ in this space.

with the 2 billion transistor count of the 8-core, 30MB-cache "Tukwila" Itanium from Intel.

Tukwila is only four core. Though each core does have 2 threads. Also, Intel only claims 30 MB of "total on-die" cache. To me, that includes L1 and L2, but most of the press seems to interpret that number as the amount of L3. Possibly because of a presentation of some kind.

@Digraph: Apple made it clear that since the start of OS X, 4+ years before the intel transition, OS X and all applications were coded against IBM PPC, Intel and AMD technology. They also made it clear should IBM ever offer a new compelling line, OS X might very well transition back, and the transition would be as smooth as moving to Intel was. We can expect Apple has continued to support both the PPC and intel code bases on all forthcoming technology. I would not be surprised at all to find out that in the bowels of Apple's data centers, P6.5 processors were running the core systems and databases with a custom build of OS X...

Originally posted by Tundro Walker:So how much liquid nitrogen does it take to keep this thing running cool?

Since it has almost half the number of transistors of a competing 8-core Intel CPU, I would imagine it runs fairly cool.

Except that the vast majority of the difference in transistor count was in the L3 cache from the use of eDRAM instead of SRAM. L3 typically consumes a few orders of magnitude less power per transistor than CPU core logic so the difference in device level transistor count is basically meaningless in predicting relative device level power.

Originally posted by zelannii:@Digraph: Apple made it clear that since the start of OS X, 4+ years before the intel transition, OS X and all applications were coded against IBM PPC, Intel and AMD technology. They also made it clear should IBM ever offer a new compelling line, OS X might very well transition back, and the transition would be as smooth as moving to Intel was. We can expect Apple has continued to support both the PPC and intel code bases on all forthcoming technology. I would not be surprised at all to find out that in the bowels of Apple's data centers, P6.5 processors were running the core systems and databases with a custom build of OS X...

Given the abandonment of the IBM Power architecture by Apple, it seems more than likely that POWER7 has abandoned all the Apple specific baggage holding back the architecture. So the chances of OS XI on POWER7 would be fanciful at best.

Originally posted by chipguy:Except that the vast majority of the difference in transistor count was in the L3 cache from the use of eDRAM instead of SRAM. L3 typically consumes a few orders of magnitude less power per transistor than CPU core logic so the difference in device level transistor count is basically meaningless in predicting relative device level power.

Oops! Thanks for the clarification. I was going off of this line in the article:

quote:

...but in today's power-sensitive age that level of transistor savings is impressive.

Would love to see power and heat measurements from both Intel and IBM here.

Originally posted by kperrier:The POWER processors are not used in the mainframe line. They are used in the p and i series (AIX and OS/400 line.)

There is a gradual push by IBM to standardise on a single architecture - it is taking its time but it'll eventually get there. I'm looking for the road map but it is something like within the next 2-5 years when the transition will be complete.

Originally posted by zelannii:@Digraph: Apple made it clear that since the start of OS X, 4+ years before the intel transition, OS X and all applications were coded against IBM PPC, Intel and AMD technology. They also made it clear should IBM ever offer a new compelling line, OS X might very well transition back, and the transition would be as smooth as moving to Intel was. We can expect Apple has continued to support both the PPC and intel code bases on all forthcoming technology. I would not be surprised at all to find out that in the bowels of Apple's data centers, P6.5 processors were running the core systems and databases with a custom build of OS X...

Given the abandonment of the IBM Power architecture by Apple, it seems more than likely that POWER7 has abandoned all the Apple specific baggage holding back the architecture. So the chances of OS XI on POWER7 would be fanciful at best.

No more fanciful than OS X on x86 was five years ago. There's no Apple-specific baggage in x86, there's no technical reason Apple couldn't internally maintain a POWER build of OS X, at least xnu and toolchain (and several good technical and strategic reasons TO maintain such a build).

quote:

Originally posted by Shizam:

quote:

Originally posted by protomech:"on-die L3 cache"

Where's the on-die L3 cache in that floorplan?

Power efficiency between this, nehalem, and magny-cours will be quite interesting.. four DP units is nice

WOW... still, the move to Intel by Apple was for the best overall, but I cannot help but drool over the Power7. At least it is very likely that some successor of Power7 will end up in the next Xbox... That should be cool.

Hmmm... I can't help but think, however, how hot does this thing run? Looking at both the G5 Macs and the Xbox 360, it seems that IBM is unable or unwilling to do very much about heat characteristics.

With regards to Apple and the Power7, the fundamental problem still remains: Apple does not sell enough units to make it worth IBM's time to produce a custom family of processors.

If it was a matter of designing a chip to a single specification that is static for several years (like they use in game consoles), then it would be affordable. Unfortunately Apple would need speed boosts / power reductions every 6-12 months. This type of work could not be done for anywhere near the cost of Intel, due to Intel's economies of scale.

My only esoteric point/question is the degree to which IBM has kept the "group" retirement schema ... it was never clear to me exactly to what degree Power6 kept it/implemented it, and at this point what microarchitectural evolutions Power7 gets from Power6 microarch haven't been detailed?

I think that one major reason for retirement groups is RAS however.

The eDRAM is slick, something I didn't expect. IBM rather seemed to be derogating eDRAM, nice to see it pop up with a vengeance in an IBM-signature product. IIRC there were "some issues" with soft error rates in eDRAM so I assume IBM has them implemented with a lot of error-correction bits/capability.

WRT all the OOOOHHHHH-APPLE-PowerN ... not going to happen, for all the exactly same reasons it never happened back when Apple was PPC ... except for the development of the 970 ... aka G5. There are three issues here:

* Power7 is really not optimized for typical PC work-loads, duh. Even a two core (8 thread) Power7 derivative would be a lot more threads and still not-great single-thread PC "performance."

* G5 was a market-situation anomaly ... very difficult to imagine any circumstances which would create the like today, particularly difficult to imagine circumstances which would wean Apple off x86 now. IBM did the G5 for Apple (Apple contracted with IBM for the G5 ... whichever way you want to say it) because the Motorola G5 project died when Intel hired off a lot of Moto's engineers in Austin in 2000 (and Intel paid Moto some undisclosed settlement thereafter ... but of course that did nothing to bring back the moto G5). Think something like this could happen to Intel? And then if it did, AMD represents a much simpler alternative vendor.

* Apple is heading toward ARM now, not PPC

Apple has very good reasons to keep all code ARM portable, and one might suppose they keep PPC ports perhaps ... but the "OMFG zowie CPU for Apple" never got anywhere before and there is even less likelihood now.

I believe that's not a "1 decimal" floating point unit, but a single "decimal floating point unit" (as opposed to a binary floating point unit). Someone correct me if I'm wrong, but I believe it's used by financial institutions for calculating monetary transactions.
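The difference between binary and decimal floating point is easy to show in software. The sketch below uses Python's standard `decimal` module, which does in software the kind of base-10 arithmetic that a hardware decimal floating-point unit accelerates. It illustrates the number format only and says nothing about how POWER7 implements it.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly, so the sum drifts.
binary_sum = 0.1 + 0.2
print(binary_sum == 0.3)                # False
print(binary_sum)                       # 0.30000000000000004

# Decimal floating point stores base-10 digits, so cent amounts stay exact.
decimal_sum = Decimal("0.10") + Decimal("0.20")
print(decimal_sum == Decimal("0.30"))   # True
print(decimal_sum)                      # 0.30
```

For monetary calculations, where results must match what an accountant would get on paper, that exactness is the whole point.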

Originally posted by Kevin G:IBM still plans on MCM configurations with up to 4 dies per package. That'd be 32 cores per socket. High-end systems using 32 sockets would potentially have 1024 coherent cores and 4096 concurrently running threads.

Don't assume you can connect 32 MCMs.

quote:

Originally posted by TheFerenc:OK, now how stratospheric will the pricing be, and when can I get my new Power7 based workstation?

POWER workstations? LOL.

quote:

Originally posted by tigas:What about the rumours that POWER7 would use (at least in one version) the G34 socket and interconnect scheme of Magny-Cours?

Between Intel, AMD, and POWER everybody seems to pretty much have all their bases covered and I VERY much doubt that SPARC will be competitive with Intel and AMD in terms of price or with POWER in terms of performance.

OK, now how stratospheric will the pricing be, and when can I get my new Power7 based workstation?

POWER workstations? LOL.

There have been plenty of POWER workstations... jeez ... Power5 workstations were pretty cheap too. I don't think there is a Power6 workstation ... yet???

quote:

Originally posted by Wes Felter:

quote:

Originally posted by Pradeep:...it seems more than likely that POWER7 has abandoned all the Apple specific baggage holding back the architecture.

Heh. IIRC the G5 had the IBM-specific baggage disabled.

You know, it depends on what "baggage" he is talking about... that comment could be pure tin-foil-hat (Anti-apple variety?) 'perspective' ... in which case it is pretty funny ... but there is one piece of "Apple specific baggage" the G5 was stuck with ... Apple's 925 northbridge (memory controller, which Apple designed in house, but part of the deal over the 970 was that IBM could use it for blades).

The 925 northbridge didn't support critical-word-first (which the CPU and its FSB do) ... and the result of that in combination with the rather puny cache the 970 had, the large cacheline size (128 bytes) and the MP/fsb topology is that the 970 had distinctly worst-of-class pseudo-random latency ... which was/is killing in server applications.
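For readers unfamiliar with critical-word-first, here is a rough sketch of why it matters with a 128-byte line. The 16-byte bus beat is a hypothetical figure chosen for illustration, not a property of the 970's FSB or the 925 northbridge, and the model assumes the core can resume as soon as the chunk it asked for arrives.

```python
LINE_BYTES = 128   # cache line size mentioned in the comment above
BEAT_BYTES = 16    # ASSUMPTION: bytes delivered per bus beat (hypothetical)


def beats_until_word(offset, critical_word_first):
    """Bus beats before the byte at `offset` within the line has arrived."""
    if critical_word_first:
        return 1                       # the requested chunk is sent first
    return offset // BEAT_BYTES + 1    # the line is sent in address order


def average_beats(critical_word_first):
    """Average wait over accesses spread evenly across the line."""
    offsets = range(0, LINE_BYTES, BEAT_BYTES)
    waits = [beats_until_word(o, critical_word_first) for o in offsets]
    return sum(waits) / len(waits)


print("in-order fill:      ", average_beats(False), "beats on average")
print("critical-word-first:", average_beats(True), "beats on average")
```

Under these assumptions a randomly placed access waits several times longer when the line is filled in address order, which is the kind of pseudo-random latency penalty described above.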

Had that not been the case, the G5 might have found a wider market, might have survived ... but IMO probably not. But the latency problem also made the G5 a dog for a lot of 'PCish' problems where the data can't be streamed ... and Apple had nobody to blame but itself for that.

Pradeep, that what you talking about? Or maybe it is altivec aka VMX? If so, got news for you, Power7 does altivec ... does it just fine

Wes -- I presume your comment is wrt the "Amazon" memory stuff? You are certainly correct G5 didn't have it. The degree to which Power7 has this (and I presume it does) affects the degree to which Power7 can do a variety of Z-family workloads... I don't know how far down the "Eclipse" path IBM is or plans to go ....

OSX is based on BSD which is one of the most portable OS's ever developed. Snow Leopard could be recompiled for Power7 in an afternoon, the biggest technical hurdle would be the QA.

The reason there won't be any Power7 Macs will not be technical, it will be financial. Intel makes more chips, cheaper. They are good enough, and they give Steve Jobs the fat profit margin he needs. The bottom line here is the financial bottom line.

But one could imagine what it would be like to run an OS on that kind of silicon. You would need a lot of apps and threads to really put this beast through its paces. A couple copies of Mozilla, PhotoShop, and WinAmp aren't going to do it. This CPU is architected for heavy server loads.

yeah, except for that whole "Snow Leopard doesn't run on PowerPC" thing.

I'd suspect that Apple still builds OSX against whatever modern CPU arch is available. Just because they are choosing not to make PPC box or ship PPC code, doesn't mean the OS isn't compiled and tested on other arch from time to time to "future proof" it.

And FWIW, Apple's switch to Intel was mainly about Intel's ability to provide fast, power efficient chips for Apple's notebook line. A market IBM had no interest in. POWER7s are probably great, and might have been usable in workstations with some minor fiddling, but there is no way they could get one into a laptop. Heck, they could never even get a G5 into a laptop.

A couple years back Apple claimed 60% of their sales were notebooks. I wouldn't be surprised if it's even more now.

One interesting aspect of the Power7 no one has really emphasized is that it is by far the biggest monolithic processor IBM will have ever brought to market, 567 mm². That is much bigger than the 341 mm² of Power6 and just a bit smaller than the 596 mm² of Montecito/Montvale. Tukwila is the biggest processor yet disclosed at 699 mm² but keep in mind that it is made in a very mature Intel 65 nm bulk CMOS process so defectivity will be absolute rock bottom.

The Power7 will be made in IBM's new 45 nm SOI process with AFAIK very little volume production under its belt. IBM has said it will release 4 and 6 core variants of Power7 in products as well as the 8 core flagship. Careful watch of which parts get offered in which systems and at what price might provide a clue of how that yield thing is working out for big blue.
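The link between die area and yield can be sketched with the simple Poisson yield model, in which the fraction of defect-free dies falls off exponentially with area. The defect density below is a made-up number used for every chip alike. Real densities are not public and differ between processes, which is exactly the point made above about Intel's mature 65 nm process.

```python
from math import exp


def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under a simple Poisson model."""
    area_cm2 = area_mm2 / 100.0
    return exp(-area_cm2 * defects_per_cm2)


# Die areas quoted in the comment above.
DIE_AREAS_MM2 = {
    "Power6": 341,
    "Power7": 567,
    "Montecito/Montvale": 596,
    "Tukwila": 699,
}
DEFECT_DENSITY = 0.2  # ASSUMPTION: hypothetical defects per cm^2, not a real process figure

for name, area in DIE_AREAS_MM2.items():
    print(f"{name}: {area} mm^2 -> "
          f"{poisson_yield(area, DEFECT_DENSITY):.0%} defect-free")
```

The model also hints at why 4- and 6-core parts are attractive: a die with a defect in one core need not be thrown away if that core can be disabled.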