POWER5, UltraSparc IV, and Efficeon: a look at three new processors

Introduction

The recent Microprocessor Forum produced some great details on forthcoming processors from a variety of companies for a whole range of market segments. This article originally started life as an MPF CPU roundup, but it has evolved into more of an overview of three specific upcoming processors: IBM's POWER5, Sun's UltraSparc IV, and Transmeta's Efficeon. Actually, the article focuses mostly on IBM's POWER5 and Transmeta's Efficeon, but I also cover Sun's UltraSparc IV because it's relevant to the "big picture" that I want to paint with this report. Hopefully, by the time you're done reading you'll have a better sense of what these three designs say about the direction the industry is headed in, and why it's headed that way.

IBM's POWER5

Of great interest to the Mac community was IBM's disclosure of more details on the upcoming POWER5. POWER5 is an evolutionary advance over POWER4, retaining many of the same characteristics and adding some new features into the mix. I'll just give a brief rundown of the basic stats before talking in more detail about the aspects of the design that interest me specifically. For more info on things like die size and process technology, see MacWorld's coverage for some of the details that I've left out.

Like POWER4, POWER5 is a dual-core design, but the POWER5 sports some changes to its caches as well as an integrated DDR controller. However, neither of these is really the big news for Mac fans, because neither of them would necessarily be included in a stripped-down design for Apple. Besides, my suspicion is that we'll see a version of the 970 with an integrated DDR controller well before we see a full-blown POWER5 derivative for the Mac.

No, the big news is that POWER5 is an SMT (a.k.a. "hyperthreading") design, with each core capable of running two threads at once. This means that a single dual-core POWER5 chip will look like four logical processors to the OS. The rest of the big changes to the core are related to the addition of SMT, which IBM claims increased the size of each core by 24%. (This increase in die size is another reason why an SMT-capable POWER5 derivative for the Mac is a ways off.)

Let's take a look at a few of the core enhancements made for SMT.

The first order of business is to clear up any potential confusion that might arise from the following quote in the MacWorld coverage of the POWER5 talk:

The Power4 collects a group of up to five instructions per clock cycle and can complete one group of instructions per clock cycle. The Power5 doubles that throughput by collecting two groups of up to five instructions per clock cycle and completing two groups per clock cycle. Sinharoy said that it was not uncommon to see a "40 percent improvement for SMT (Symmetric Multithreading) instructions," a key performance characteristic for server processors, over the Power4.

Now, the POWER5 can fetch two groups per cycle, for a total of 10 instructions per cycle (two groups of five instructions each). However, it cannot decode and dispatch a total of 10 instructions per cycle. This would, of course, be pointless, since it only has 8 execution units anyway. Instead, the fetch groups are alternately* passed one at a time into the decode/dispatch stages, so that only five instructions per cycle, all from a single thread, are dispatched to the core's issue queues. Since the issue logic is group-agnostic, I'd assume that it's thread-agnostic as well, so that up to eight instructions per cycle can issue out-of-order to the various execution units from either of the two threads in any possible combination (e.g., three from one thread and two from the other; one from one thread and four from the other; and so on).

*(I say that they're sent along "alternately" from the fetch stage, but this is probably not quite the case. More on that in a moment, though.)
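To make the dispatch/issue distinction concrete, here's a toy Python model. This is my own sketch, not anything from IBM's documentation: one fetch group of up to five instructions enters the issue queues each cycle from alternating threads, while the issue stage drains up to eight instructions per cycle without caring which thread they belong to.

```python
from collections import deque

GROUP_SIZE = 5   # max instructions per fetch group on POWER4/POWER5
ISSUE_WIDTH = 8  # POWER5 has eight execution units

def simulate(thread_a, thread_b, cycles):
    # Toy model: each cycle, one fetch group from a single thread is
    # dispatched into the (shared, thread-agnostic) issue queues, and
    # then up to eight instructions issue regardless of thread.
    streams = [deque(thread_a), deque(thread_b)]
    issue_queue = deque()
    issued_per_cycle = []
    for cycle in range(cycles):
        t = cycle % 2  # alternate the dispatching thread (a simplification)
        group = [streams[t].popleft()
                 for _ in range(min(GROUP_SIZE, len(streams[t])))]
        issue_queue.extend(group)
        n = min(ISSUE_WIDTH, len(issue_queue))
        issued_per_cycle.append([issue_queue.popleft() for _ in range(n)])
    return issued_per_cycle
```

Run it with two ten-instruction threads and the model dispatches, and therefore issues, only five instructions per cycle, which illustrates the point above: in this scheme the front end, not the eight execution units, is the per-thread bottleneck.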

Since increasing execution unit utilization is one of the main goals of SMT, the increase in issue bandwidth utilization as described above is going to be key, especially for the POWER5. I say this because in my first articles on the G5 I suggested that the POWER4/970's group-based dispatch scheme and issue queue configuration probably constrain issue flexibility, and therefore execution unit utilization, in some peculiar ways under certain worst-case scenarios (i.e., one execution unit of a pair is overloaded while the other is starved, due to a combination of poor instruction ordering on the part of the programmer/compiler and the group dispatch limitations). Apple's recently-released G5 programming manual talks about this phenomenon explicitly. The fact that the issue queues will be populated by instructions from two different threads probably lessens the chances of this degenerate case occurring, thus providing a boost to execution unit utilization.

Another very nifty feature of POWER5's SMT implementation is the support for dynamic thread prioritization. The POWER5 can dynamically, under software control, assign each thread what we might call "priority points" from an eight-point "priority pool." So thread zero might have a score of 6, meaning that thread one would have a score of 2. Or, thread zero might have a score of 3, so that thread one would have a score of 5. The higher a thread's priority relative to that of the other thread, the more of the processor's resources it can monopolize.
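The arithmetic here is simple enough to model. A minimal sketch (the function name and interface are my own invention, and I'm assuming each thread keeps at least one point, which the slides don't confirm): points granted to one thread necessarily come out of the other thread's share.

```python
PRIORITY_POOL = 8  # total points shared between the core's two threads

def set_priority(thread0_points):
    # Hypothetical helper: because the two priorities draw from one
    # fixed pool, they always sum to the pool size. I'm assuming each
    # thread retains at least one point.
    if not 0 < thread0_points < PRIORITY_POOL:
        raise ValueError("each thread needs at least one point")
    return thread0_points, PRIORITY_POOL - thread0_points
```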

The way this priority scheme is implemented is fairly simple, at least going from the PDF of IBM's Hot Chips presentation (I've been led astray by reading too much into presentation slides before, so take that into account when reading the following). The chip simply controls the decode rate of instructions from the two threads, decoding more of the instructions from the higher-priority thread than from the lower-priority thread on each cycle. So the higher-priority thread has higher instruction density in the issue queues, and hence takes up more of the core's execution resources.

I suspect that there's more to things than just controlling the decode rate, though. For instance, I'm sure that the fetch rate is affected along with the decode rate, so that higher-priority threads have more instructions fetched over the course of a number of cycles. Anyway, when the tech docs come out, it'll be nice to have more detail.
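In the meantime, here's one plausible way to picture priority-weighted decode. IBM hasn't published the actual policy, so this toy model uses a simple deficit-counter scheme as a stand-in: over many cycles, each thread receives decode slots roughly in proportion to its share of the eight-point priority pool.

```python
DECODE_WIDTH = 5   # one five-instruction group decoded per cycle
PRIORITY_POOL = 8  # points shared between the two threads

def decode_slots(thread0_points, cycles):
    # Toy model, not IBM's mechanism: each cycle, every thread earns
    # credit in proportion to its priority share, and each of the five
    # decode slots goes to whichever thread currently has more credit.
    share = [thread0_points / PRIORITY_POOL,
             (PRIORITY_POOL - thread0_points) / PRIORITY_POOL]
    credit = [0.0, 0.0]
    counts = [0, 0]
    for _ in range(cycles):
        for t in (0, 1):
            credit[t] += share[t] * DECODE_WIDTH
        for _ in range(DECODE_WIDTH):
            t = 0 if credit[0] >= credit[1] else 1
            credit[t] -= 1
            counts[t] += 1
    return counts
```

With a 6/2 priority split, thread zero winds up with roughly three-quarters of the decode slots over time, mirroring its share of the pool, and hence a correspondingly higher instruction density in the issue queues.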

For workloads that don't benefit much from SMT, and there are a number of application types that fall into this category, POWER5 can dynamically turn off SMT altogether and devote its expanded menu of resources to a single thread. I'm sure these would include the POWER5's increased number of rename registers (up to 120 from 80 on the POWER4), increased issue queue depths, and so on.

In all, POWER5's SMT implementation looks more refined and mature than Intel's implementation on the P4. In fact, with the dynamic thread prioritization and other such features that give the OS better control over execution, I think that IBM has probably given us a glimpse of the kinds of stuff that we can expect to see in future iterations of Intel's SMT implementations.

At this point, I could talk about the need for SMT in an Apple system, but I'll just leave off that sort of commentary for now and observe only that Apple's long-standing and ongoing affinity for SMP designs has resulted in two things: 1) a huge potential for wasted execution resources on the current crop of non-SMT-capable G5s and 2) a body of natively-developed and -ported applications that have been subjected to years of pressure to use multithreading wherever possible in order to wring the best performance out of Apple hardware. I think both of these factors will converge to make SMT a significant improvement for the Mac platform.