A Brief Look at the PowerPC 970

During a session yesterday at the Microprocessor Forum 2002, IBM gave analysts and the press their first public peek at the internals of the newly announced PowerPC 970. I've been following the reporting all day and collecting information, so this article will serve as a brief wrap-up of the day's revelations that will hopefully answer some of the questions about the processor that I still see popping up in press articles and on newsgroups. I'll reserve a detailed analysis of the PPC 970's microarchitecture for a later, more in-depth technical article.

The basic stats

I'll begin with the basic stats, for which I am indebted to RealWorldTech's David Wang. He posted this information on the RWT forums shortly after the session had ended and before it was picked up on most other news sites.

When the PowerPC 970 first ships in the second half of 2003, it should clock in at around 1.8GHz on a 0.13 micron, 8-layer SOI process with copper interconnects. On this process and at this clockspeed it should consume 42W of power at 1.3V. The processor will have 52 million transistors.

Now for some architectural details:

L1 data cache: 32 KB

L1 instruction cache: 64 KB

L2 cache: 512 KB

Registers:

32 64-bit general purpose registers

32 64-bit floating-point registers

32 128-bit vector registers

Unfortunately, I haven't yet been able to find a good breakdown of the functional units in this chip. I do know that it has at least one dedicated SIMD unit, and I've seen some reports that it has two. This dedicated SIMD unit, which implements the Altivec instruction set (IBM calls it "VMX" because "Altivec" is a Moto trademark), is good news for those of us who were afraid that IBM would take the less desirable approach of re-using the two 64-bit floating-point units as a 128-bit vector unit (a la the P4). How this chip performs on Altivec code will be a big factor in how it stacks up to Motorola's G5 (whatever and whenever that is).

Instruction issue slots and width

Even before the session yesterday, the PPC 970 had been widely rumored to be an 8-wide superscalar machine. This led some reporters and online discussion participants to confuse instruction dispatching with instruction issuing. I'll briefly attempt to clear this up here, but a better explanation with diagrams and such will have to wait for my upcoming tech article.

The 970 fetches eight instructions per cycle from its 64KB instruction cache into an instruction queue. These instructions then move through a series of pipeline stages that IBM calls "decode, crack, and group formation." (I'll explain why these stages are so called in a moment.) From the "decode, crack, and group formation" phase, the 970 dispatches five instructions per clock (4 instructions + 1 branch) in program order to a set of issue queues. The out-of-order execution logic then pulls instructions from these issue queues out of program order to feed the chip's eight functional units.

The mechanics behind this five-wide dispatch design are fascinating, and I'll touch on them briefly here. Both the 970 and the Power4, much like the Pentium 4 and the Athlon, convert instructions in their "native" ISA into a special internal instruction format for execution. Just as the P4 decomposes x86 instructions into smaller, simpler micro-ops (uops), the 970 "cracks" PowerPC instructions into sets of smaller, simpler "iops". It is these iops that are actually executed by the 970's functional units. Most PPC instructions decode into only one iop, but some decode into more.
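To make the idea of cracking concrete, here is a minimal Python sketch. The crack table, the choice of "lwzu" as a multi-iop instruction, and the iop names are all hypothetical and chosen purely for illustration; they are not IBM's actual decode rules. The only behavior taken from the description above is that most instructions map to a single iop while a few expand into several.

```python
# Hypothetical crack table: PowerPC mnemonic -> list of internal iops.
# These entries are illustrative assumptions, not IBM's real decode rules.
CRACK_TABLE = {
    "lwzu": ["lwzu.load", "lwzu.update"],  # assumed to split into two iops
}


def crack(instructions):
    """Expand a program-ordered list of mnemonics into a list of iops.

    Instructions missing from CRACK_TABLE are treated as single-iop
    instructions, which matches the common case described in the text.
    """
    iops = []
    for insn in instructions:
        iops.extend(CRACK_TABLE.get(insn, [insn]))
    return iops


program = ["add", "lwzu", "mullw"]
iop_stream = crack(program)
```

Note that program order is preserved: the iops for each instruction appear in the stream exactly where the original instruction sat.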

Unlike the P4, the 970 does one more trick after it has cracked the PPC instructions down into iops. The 970 divides up the iop stream into "groups" of five iops apiece. So first it cracks the PPC instructions down into iops, then it collects the iops back together into groups. The iops are placed into the group's five slots in program order with the stipulation that all branch instructions must go in slot 4 (the last slot). Furthermore, slot 4 can hold only branch instructions and nothing else. It is these groups of five iops that are dispatched in-order to the issue queues. (I haven't yet seen a functional diagram of the 970's core, so I'm not sure how many issue queues there are.)

The 970 regroups the iops so that they're easier to keep track of. Since the iops move through the machine in groups of 5, it requires less logic to track these larger groups than it would to track individual iops.
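The grouping rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the slot rules (five slots, program order, branches only in slot 4), but what happens to unfilled slots is my guess. Here a branch closes its group and any unused slots are left empty (None), and a fifth non-branch iop starts a new group with slot 4 of the previous group left empty. The Iop type and the form_groups function are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class Iop(NamedTuple):
    name: str
    is_branch: bool = False


GROUP_SLOTS = 5   # slots 0-4, from the text
BRANCH_SLOT = 4   # only branches may occupy the last slot, from the text


def form_groups(iops: List[Iop]) -> List[List[Optional[Iop]]]:
    """Pack a program-ordered iop stream into five-slot dispatch groups.

    Assumption: empty slots are padded with None, and a branch always
    terminates the group it lands in.
    """
    groups: List[List[Optional[Iop]]] = []
    pending: List[Iop] = []  # non-branch iops waiting for slots 0-3

    def close_group(branch: Optional[Iop] = None) -> None:
        padding = [None] * (BRANCH_SLOT - len(pending))
        groups.append(pending + padding + [branch])
        pending.clear()

    for iop in iops:
        if iop.is_branch:
            close_group(iop)          # branch goes in slot 4, group ends
        else:
            if len(pending) == BRANCH_SLOT:
                close_group()         # slots 0-3 full, slot 4 stays empty
            pending.append(iop)

    if pending:
        close_group()                 # flush a partially filled last group
    return groups


stream = [Iop("a"), Iop("b"), Iop("br1", True),
          Iop("c"), Iop("d"), Iop("e"), Iop("f"), Iop("g")]
groups = form_groups(stream)
slot_names = [[s.name if s else None for s in g] for g in groups]
```

With this stream the eight iops end up in three groups, so the tracking logic only has to follow three entries rather than eight, which is the bookkeeping saving described above.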

The 900MHz frontside bus

One of the most important and least-discussed features of the PowerPC 970 is its 900MHz DDR frontside bus. This bus physically runs at 450MHz, but it's double-pumped. Its architecture is interesting in that the bus actually consists of two 32-bit unidirectional point-to-point links. David Wang described it in a post to comp.arch as follows:

It's two 32 bit links: one from CPU to "companion chip" [the northbridge], and one back from that chip to the CPU. Each link runs at 900 MHz (1.8 GHz CPU core. the interface link runs at integer fraction of the CPU core, in this case 1/2)

So 4 bytes to, 4 bytes from, at 900 MHz that's 3.6 GB/s raw BW each way. The link multiplexes command and address info over the same pins, so it's some sort of packet based protocol. The math gets you 7.2 GB/s of raw bandwidth, but after subtracting out command and address overhead, raw peak data bandwidth is supposed to be about 6.4 GB of that 7.2 GB/s.
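For anyone who wants to check Wang's arithmetic, here is a short Python sketch. All of the inputs come from the figures quoted above (two 32-bit links, a 450MHz physical clock, double-pumped, roughly 6.4 GB/s usable); the function and variable names are my own, and I'm assuming 1 GB means 10^9 bytes.

```python
LINK_WIDTH_BITS = 32        # each unidirectional link is 32 bits wide
PHYSICAL_CLOCK_MHZ = 450    # physical bus clock
TRANSFERS_PER_CLOCK = 2     # double-pumped, so 900 MHz effective
NUM_LINKS = 2               # one link to the companion chip, one back
USABLE_DATA_GB_PER_S = 6.4  # quoted figure after command/address overhead


def link_bandwidth_gb_per_s(width_bits, clock_mhz, transfers_per_clock):
    """Raw bandwidth of a single link in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits // 8
    transfers_per_second = clock_mhz * 1_000_000 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


per_link = link_bandwidth_gb_per_s(
    LINK_WIDTH_BITS, PHYSICAL_CLOCK_MHZ, TRANSFERS_PER_CLOCK)
total_raw = per_link * NUM_LINKS

# Share of the raw bandwidth consumed by command and address traffic,
# implied by the quoted 6.4 GB/s figure.
overhead_fraction = 1 - USABLE_DATA_GB_PER_S / total_raw

print(f"per link: {per_link:.1f} GB/s, total raw: {total_raw:.1f} GB/s")
print(f"implied protocol overhead: {overhead_fraction:.0%}")
```

The implied overhead works out to about one ninth of the raw bandwidth, which seems plausible for a packet-based protocol that multiplexes commands and addresses over the data pins.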

This high-bandwidth link to the northbridge is one of the elements that's going to make this chip succeed as a media machine; it's exactly what Apple's current bandwidth-starved G4 systems lack, and it's going to be a major selling point for systems based on the PPC 970. When coupled with the right memory in an SMP configuration, the 970 should do quite well in bandwidth-intensive applications.

Conclusions: why Apple won't be talking about the 970 anytime soon

The PowerPC 970 represents a substantial leap over Motorola's existing G4 offerings in just about every conceivable way. It has the bandwidth, the clockspeed and the floating-point prowess to make a fine media workstation. And its power consumption is low enough that Apple can continue to do the kinds of innovative industrial designs that quite frankly account for most of the appeal of their current offerings (OS X accounting for the rest). Furthermore, IBM has repeatedly said that the 970 is intended for use in SMP systems, as well it should be considering its Power4 design legacy. I would expect Apple to debut the 970 in a dual-processor workstation, and if they don't, an SMP box should follow shortly after the initial 970-based offerings.

All of the above, when combined with the fact that we won't see the PowerPC 970 in an Apple system until at best early 2004, means that Apple won't be publicly announcing any official plans to use this chip for quite some time. The reason for this should be obvious: with the 970 looming on the horizon and the G4 apparently stuck again around the 1GHz mark, nobody in their right mind would shell out for a new PowerMac any time after mid-2003. Apple certainly doesn't want their sales to grind to a halt as potential buyers await the Next Big Thing, a phenomenon that occurs often enough in the Mac world. So I'd expect Apple to keep quiet about any official plans until they're just about ready to launch.

The estimated SPEC INT and SPEC FP numbers (937 and 1051) would allow the 970 to clearly dominate the desktop scene were it released tomorrow, but by the time we see this chip in a shipping system the performance landscape will look significantly different in both the 32-bit (P4 at 4GHz+ with SMT) and 64-bit (AMD's Hammer) desktop markets. I won't try to predict exactly how it will stack up to the x86 and x86-64 offerings in late 2003/early 2004, but when it finally ships the 970 certainly won't be spanking anything from Intel or AMD in the SPEC benchmarks. It should, however, enable Apple to avoid the kind of overpriced embarrassment (from a hardware perspective, at least) that is their current "pro" desktop line. And in fact a dual- or quad-970 system could potentially compare quite nicely in terms of price/performance to a single-processor Prescott or Hammer machine.