Server Platforms Today

In this theoretical article we will take a closer look at the multiprocessor architectures used in contemporary servers and dwell on a few server processors that are not yet well known to the public. Namely, we are going to reveal the secrets of the Intel Itanium and Hewlett-Packard PA-8700, as well as the Alpha, MIPS, Sun UltraSPARC and, of course, IBM Power4 microprocessors.

Traditionally, processors are classified into three broad leagues: chips for the home PC, for the mainstream, and for server systems. Today I’m going to talk about the latter category.

What is a server processor? Taking the term literally, it is a processor that works in a server. Or serves in a server, if you like. Either way, this chip largely determines the consumer qualities of a server. First of all, a server, however overwhelmed with work it is, must respond quickly to users’ requests. It must be fast at performing the tasks its owner has set for it. Thus, processor performance and, of course, reliability become the most important factors when choosing a server platform. In other words, servers are speed plus reliability. Cost is a secondary matter, although an important factor too (money is always important).

Besides performance, which I will discuss shortly, other characteristics determine the value of a processor for integrators of server systems. I think the most important factors are:

Reliability;

Performance;

Scalability;

Software architecture (i.e. the instruction set of the processor and the spread and popularity of software for this platform);

Cost;

Heat dissipation;

Availability.

We will see below what the performance of a server processor depends upon. Right now, let me add a few words on each of the items of the list.

Reliability is of the highest priority for server manufacturers, since a server usually deals with very valuable data. Frankly speaking, the properties of the processor don’t usually affect the overall reliability of the system much, as hard disk drives and fans are far more likely to fail. In other words, if we don’t count possible errors made during CPU design and manufacture (like the notorious FDIV bug in the Pentium’s floating-point unit), the reliability of nearly all modern processors is more than sufficient. Provided you follow the usage rules, of course…

Scalability means the platform’s readiness for performance growth, both in the speed of a single processor and in the number of processors in the system. Of course, the processors themselves and the technical properties of the platform limit the growth opportunities: starting from some CPU frequency (and from a certain number of processors per system), performance practically stops changing. The level of this saturation depends on too many parameters, and I will discuss this problem below, too.

Software architecture, or the instruction set, reminds us that x86 is not the only instruction set but one of many. Moreover, this set has no standing in the sector of highest-performing servers, as x86 systems used to lose to top-end RISC systems in performance and capabilities. They lose even today, although many interesting and inexpensive solutions have appeared – as long as a RISC server can be called “inexpensive”, of course. In other words, x86 systems don’t occupy the apex of the performance Olympus.

Cost. Engineers may spurn the base metal, but the end customer does care about the price tag on the solution. The very history of x86 systems bears witness to that: they were always losing in performance to all their competitors, but they always cost much less! And it is the x86 platform that survived, rather than, say, the NeXT platform with its sparkling but very expensive hardware. No manufacturer of server systems ignores the price factor today.

Heat dissipation is a factor I have included on purpose. Of course, the purchaser of a 64-processor Sun Fire 15K machine can rest assured that however hot the processors may be, the server will cool them efficiently, so absolute heat dissipation numbers don’t matter much there. On the other hand, there are areas where heat becomes a suffocating constraint on performance growth. For example, it is heat dissipation that restricts the number of CPUs in a blade server. Blade servers are a peculiar market, by the way: performance proper is only a matter of third importance there, after heat dissipation and physical dimensions.

Availability is always important. Every server processor represents a compromise between the wishes of the engineering department and the capabilities of the manufacturing one. Considering that the production output of server processors is always lower than that of desktop CPUs, the availability of such solutions becomes critical. Moreover, chance sometimes interferes with the manufacturing process: a low yield of processors at the desired frequencies and a high percentage of defective cores may leave the manufacturer unable to produce the necessary number of dies. Well, sometimes they can, but the cost becomes sky-high, hitting the sales severely.

Anyway, in most cases, server manufacturers and developers handle the situation the right way and come up with ready products. In the next section we will try to discover what the performance of a server processor depends on.

Processor Performance

Let’s try to determine what the performance of a processor may depend upon, without sticking to any particular implementation. In the long run, software for any processor reduces to a set of elementary arithmetic and logical operations. Thus, the processor is faster if it can perform more such operations in a given period of time.

The amount of calculation a processor can do depends on two basic things: how many elementary operations it performs in one clock cycle and what frequency it works at. The number of calculations is then estimated by multiplying the frequency by the number of operations per clock cycle. Accordingly, we have two ways to higher performance: either create an architecture oriented at doing more work per clock cycle (processors with a “heavy” clock) or boost the frequency to the maximum. In an ideal world we would do both at once, but that seldom happens in reality. As a result, the architecture development team usually faces the problem of choosing the priority direction.
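This estimate is simple enough to put into a couple of lines. The two example cores below are invented, purely for illustration:

```python
# Rough model from the text: calculations per second = clock frequency
# x operations per cycle. Both cores are hypothetical examples.
def ops_per_second(frequency_hz, ops_per_cycle):
    return frequency_hz * ops_per_cycle

heavy_clock = ops_per_second(1.5e9, 6)  # "heavy" clock: 1.5GHz, 6-wide
high_freq = ops_per_second(3.0e9, 3)    # frequency-oriented: 3GHz, 3-wide
print(heavy_clock, high_freq)  # both 9e9: two roads to the same peak
```

In this idealized model the two design philosophies reach the same peak throughput; the rest of the section explains why neither road is free in practice.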

Many developers of server processors chose the variant with a “heavy” clock. In other words, the priority is to execute as many operations within a clock cycle as possible. The manufacturers go for various tricks to achieve this. For example, they integrate two processor cores into one die as IBM did in its Power4. Intel chose two directions in its two server processor series: the Xeon is rather a representative of frequency-growth-oriented CPUs, while the Itanium family includes processors that should do a lot of work per cycle. The frequencies as well as performance of CPUs from the two series differ greatly and the lower-clocked Itanium has an advantage. Both directions involve certain difficulties, though.

There are serious obstacles on the way to further increasing the work performed per cycle. Many algorithms (in some areas, the majority) are not easily parallelized, being sequential in nature. Thus, it takes some effort – recompilation of software – to keep as many of the processor’s execution units busy as possible. Technologies like Intel’s Hyper-Threading serve the same purpose (they load idle units of the CPU with work). We will see below that a majority of RISC processors stopped at six pipelines – more pipelines don’t provide any performance gains. The only exception is the Itanium family, but at a high cost: the performance of this processor depends heavily on the quality of the compiler. Let’s summarize: the increase in work performed per clock cycle is limited by the nature of many existing algorithms.
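The limit imposed by sequential code is commonly formalized as Amdahl’s law (the article doesn’t name it, but it expresses exactly this saturation). A small sketch with an invented 75%-parallel workload:

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Amdahl's law: the speedup from n parallel execution units when
    only a fraction of the work can be parallelized at all."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_units)

# Even with an unbounded number of units, a workload that is 75%
# parallel tops out at a 4x speedup:
for n in (2, 6, 100, 10**9):
    print(n, round(amdahl_speedup(0.75, n), 2))
```

The same saturation applies whether the “units” are extra pipelines inside one core or extra processors in a system.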

Frequency growth also brings a series of unpleasant surprises along with the predictable consequences. It results in higher power consumption and heat dissipation. Then, any circuit – especially the sophisticated circuitry of a modern processor – has a frequency ceiling. This ceiling exists mainly because of the need to synchronize different processor units. There is always a tiny difference between the operational timings of different units of the CPU, and at a certain frequency the units start working out of sync. This frequency is the limit imposed by the architecture of the particular processor.

System Bus

Besides the limitations described in the previous section, there are several even more severe ones. The processor can only process data when it has them at its disposal, so we need a system bus to serve data to the CPU. The caching mechanism helps the bus, too. We will discuss caching later; for now, let’s deal with the bus.

The bus bandwidth (and the bandwidth of the memory subsystem) is a factor that limits system performance growth: as the CPU frequency increases, performance saturates. So the bandwidth of the system bus is one of the main characteristics of a server platform. It’s also obvious that the bandwidth of the system memory should correspond to the demands of the system bus – otherwise it wouldn’t make sense to create a fast system bus. In other words, the system bus serves data to the CPU and takes them from it; the faster this process goes, the more time the CPU has for processing the data, resulting in higher performance. In many cases, the system bus links processors with each other and with memory. The latter is not always true: there are processors, like the Opteron from AMD and the UltraSPARC IIi from Sun, that have an integrated memory controller. Memory then connects directly to the processor rather than via the chipset, so such processors only need the system bus for connecting to each other and to the I/O system.

Now let’s dwell upon the ways to organize a multiprocessor system. There are several principal methods: a shared bus, “point-to-point” and switch architecture (the last is in fact a hybrid of the first two variants).

A shared bus is often employed for building 2-way systems for several reasons: simpler wiring (the mainboard topology is less complex), fewer contacts, lower development cost. The point of this approach is linking the processors and memory with one and the same bus, as the illustration below shows. The system is simpler and easier to build, but all processors have to share the bus bandwidth. For example, all systems based on CPUs from Intel have this architecture (the Xeon and, with certain reservations, the Xeon MP and Itanium).

In fact, there are no principal limitations to building a system with more than two processors (for example, four-processor Xeon MP-based systems are built on the same principle). In theory, we could shape up an eight-processor system with a shared bus, too. But the system has one disadvantage, as you may have guessed: the bus bandwidth is simply not enough for so many processors. In fact, large caches have to be included in the CPUs even in a 4-way system, so that while one CPU is controlling the bus, the others can do something useful rather than wait idly. Another problem in multiprocessor systems is arbitration: since requests and data usually travel on the same bus in such a system, the latency between the CPU request and the system response becomes even more important than the bandwidth. Of course, lower latencies are better. Considering that we also need to keep track of the processor status and the cache data, many manufacturers add another auxiliary bus. For example, Intel introduced the so-called Snoopy Bus for system monitoring.

Now let’s review another way of organizing a 2-way system, which is called “point-to-point”. The following illustration shows its principal difference from the previous variant:

You can see that each processor is now connected to the chipset only, instead of sharing one bus. As a result, each processor can use the entire bus bandwidth. All the highs and lows of this multiprocessor organization ensue from this fact: the system is more efficient, as the processors don’t directly compete for the data bus. Besides, such a system, unlike the previous one, puts less strict demands on the processors’ electrical parameters. The shared bus needs the processors to have the same electrical parameters, or better, to be simply identical. In the point-to-point scheme the processors don’t necessarily have to be the same, although following this recommendation helps avoid potential problems. On the other hand, this system is harder to implement, as you actually have to wire two buses instead of one. More contacts are also necessary; such systems usually require complex multi-layer PCBs (to avoid electromagnetic interference between the two buses) and, accordingly, cost more. We should also note that such systems have higher competition for memory access, i.e. a higher memory load. Examples of such systems are the Alpha platform from DEC, the company that was the first to use this organization, and the modern Athlon MP platform from AMD.

The last architecture is switch-based. We can explore it by the example of the 8-way ProFusion platform from Intel.

As you see, this variant looks like two 4-way platforms “stuck” together. The chipset, which is the joint piece, works as a switch. In fact, there can be several levels of such switches: these eight processors can be attached to another eight with one more switch. The processors don’t have to be eight either. A modification is possible when several CPUs on a daughter card are united with one switch and are then attached to a higher-level switch and so on.

In fact, all truly multiprocessor machines (8 and more CPUs) follow this switch architecture. Each “brick” can be based on the shared bus architecture (as in multiprocessor Itanium-based systems) or on point-to-point, as in the Opteron-based Red Storm from Cray. Either way, this method allows building systems with more processors than would otherwise be possible.

Types of Multiprocessor Architectures

Before getting deeper into technological details, I would like to make one important reservation. Systems that include a lot of processors differ from those that have just a few in one thing: memory hierarchy. Imagine we have a really big system with, say, several hundred CPUs. Let them be united in fours, and four fours are joined into a group with a switch; several such groups are joined with a higher-level switch. We will find certain peculiarities in the operation of this system, which we haven’t noticed before. The memory access time will now differ depending on where the accessed memory module is. For example, it will take less time to access the local memory, situated on the processor PCB, than the memory that belongs to another four-processor unit – the switch has to spend some time to process the request and find the necessary address. And if the necessary address belongs to another group, the access time will be much higher than with the local memory.

The simplest case is a symmetric multiprocessor system (SMP) where there is no memory hierarchy at all. Figures 1 and 2 are examples of such systems: they show that each processor is absolutely the same as any other as far as memory access is concerned. The memory is accessed through the memory controller, which resides in the chipset, and the request that came earlier is served first. Nearly all 2-way systems follow this principle, as well as a majority of 4-way and a certain number of 8-way systems. For example, the above-mentioned ProFusion, although it has a switch inside, is an SMP system, since every processor has equal rights with regard to memory access.

The Non-Uniform Memory Access (NUMA) concept is a more complex case – I described it in the example with a several-hundred-processor machine. I have already stressed that the memory access time will differ greatly depending on which address we require. Accordingly, to reach maximum performance, the operating system must allocate the closest memory for the processor that performs a task. The operating system therefore has to allocate memory resources efficiently and know about the architecture of the system. Such operating systems are difficult to develop and very expensive, like the hardware itself (it’s clear that a system of several hundred processors can’t be cheap by definition). This architecture is mostly applied in supercomputers and mainframes (we can distinguish between these types of computers mainly by the ratio of memory performance to processor performance).
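As a rough illustration, the cost of a memory access in such a machine grows with each level of the switch hierarchy a request must cross. The cycle counts below are invented for the sketch, not taken from any real system:

```python
# Toy NUMA latency model with invented cycle counts: the farther up the
# switch hierarchy a request must travel, the longer the access takes.
LATENCY_CYCLES = {
    "local": 100,        # memory on the processor's own board
    "same_group": 300,   # memory of another four-processor unit
    "other_group": 900,  # memory behind a higher-level switch
}

def access_cost(n_accesses, placement):
    return n_accesses * LATENCY_CYCLES[placement]

# A NUMA-aware OS that allocates local memory wins big:
print(access_cost(1_000_000, "local"))        # 100000000 cycles
print(access_cost(1_000_000, "other_group"))  # 900000000 cycles
```

The nine-fold spread in this toy model is exactly why a NUMA operating system must know where each memory module lives.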

Again, as I have already said, a majority of heavily multiprocessor servers (from 8 CPUs on) are NUMA systems. Of course, the manufacturers support such servers with NUMA-optimized operating systems.

There are subtypes of multiprocessor platforms in between these two types (SMP and NUMA). I mean the Opteron platform from AMD first of all. This processor is remarkable for its integrated memory controller. This is a chart of a 4-way Opteron-based platform:

Evidently, the P0 processor can access its local memory faster than the memory of the other processors. It seems like we are dealing with a NUMA architecture, no doubt. Yes, but not exactly. The NUMA architecture is characterized by a big difference between the access times of different levels of the memory hierarchy, but in the case pictured above it takes only about 30% more time to access the memory of the diagonally opposite processor. This difference is small enough to neglect, so we can write programs just as we do for an SMP platform – I already said that programming for NUMA is much more difficult. And if we use programs optimized for NUMA systems, we can hope for a performance gain. AMD coined a new term for this memory organization – SUMO, or Sufficiently Uniform Memory Organization.

There is also such a form of multiprocessor organization as the cluster. A cluster is usually many two- (sometimes one-) processor platforms joined in some way. There are many ways, from exotic connections like SCSI to “classical” variants with Myrinet and Ethernet. The difference is that a cluster is in fact a conglomerate of computers, each of which runs its own operating system. These machines exchange data, performing tasks together. An important fact about such systems is that the data exchange rate between nodes is much lower than within one node of the cluster (a node is a single computer in a cluster).

Clusters are mostly engaged in executing easily parallelized algorithms that can run as independent (or slightly connected) tasks on each node. A typical example of such an algorithm is 3D graphics rendering, for example in 3D Studio Max, where each frame is rendered independently. A cluster performs this task efficiently, while costing much less than a single computer with the same number of processors. Of course, not all tasks suit a cluster. For example, a hydrodynamic problem is unsuitable, since it involves recalculating the model grid depending on the conditions in the neighborhood of each point, i.e. large amounts of data must be exchanged. As the data-exchange speed between nodes is much lower than within one node, the calculation is inefficient – the processors spend most of the time waiting for data from each other.
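The frame-rendering case can be sketched in a few lines. This is only an illustration of the “embarrassingly parallel” pattern: a thread pool stands in for cluster nodes, and `render_frame` is a dummy placeholder for real per-frame work:

```python
# Sketch: per-frame rendering needs no communication between tasks, so
# it maps well onto a cluster. A thread pool stands in for the nodes.
from concurrent.futures import ThreadPoolExecutor

def render_frame(frame_number):
    # Placeholder: a real renderer would produce an image here.
    return frame_number * frame_number

with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(render_frame, range(8)))

print(frames)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The hydrodynamic case is the opposite: each grid point would need its neighbors’ values every step, so the nodes would spend their time in `pool.map`-style data exchange rather than computation.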

Cache Memory

As you know, cache memory is a high-speed type of static memory that stores frequently accessed data. Why is this type of memory necessary at all?

That’s because the frequencies of memory and processor, originally similar, have long since diverged. The frequency of modern processors is far above the possible frequency of an ordinary random-access memory cell (and this situation has persisted for long), which is the basis of all existing memory types from EDO DRAM to DDR SDRAM. Of course, memory manufacturers keep conquering new heights by improving production technologies, but the current memory clock rate is still no match for the CPU clock rate. The maximum frequency of mass-produced server memory is 400MHz (200MHz plus the Double Data Rate effect, to be exact), while modern processors have already reached 3GHz and are set for more.
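Some back-of-the-envelope arithmetic on these figures. The 8-byte (64-bit) module width is an assumption for the bandwidth estimate, not a number from the article:

```python
# The 400MHz value is the effective DDR rate (200MHz clock x 2); a
# 64-bit (8-byte) module width is assumed for the bandwidth estimate.
effective_mhz = 400
module_width_bytes = 8
peak_mb_s = effective_mhz * module_width_bytes   # 3200 MB/s per module

cpu_hz = 3.0e9
clock_gap = cpu_hz / (effective_mhz * 1e6)       # 7.5x
print(peak_mb_s, clock_gap)
```

A 7.5x clock gap means that without a cache a 3GHz core would stall for several of its own cycles on every memory transfer, which is exactly the problem the next paragraph addresses.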

Thus arose the need for an intermediary – fast memory that the processor doesn’t have to wait tens or hundreds of clock cycles to access. This memory stores frequently accessed data and the results of calculations.

Today the cache subsystem of a processor is rather sophisticated, with several levels of cache. Server processors typically have two or three levels of cache memory, while systems on the ProFusion (Intel) or Summit (IBM) chipsets contain four (!) levels. This caching scheme is intended to make the processor wait as little as possible for data from memory.

Cache memory proved to be even more important for multiprocessor systems. Such machines use large caches, which often provide a bigger performance advantage than a much higher CPU frequency would. For example, the Xeon MP contains up to 2MB of L3 cache, while a system on the Summit chipset has up to 64MB of L4 cache! This amount of expensive static memory is amassed for one purpose: to keep the processor loaded with work so that it doesn’t sit idle. Besides, a large cache helps unload the system bus during peak loads. In other words, the more cache, the better. There’s only one downside: a large cache leads to a very tangible increase in system cost.
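The benefit of a cache hierarchy can be estimated with the standard average-memory-access-time formula. The cycle counts and miss rates below are hypothetical, chosen only to show the shape of the effect:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the hit time, and
    a miss additionally pays the cost of the next, slower level."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: main memory costs 200 cycles; L2 hits in 10
# cycles with a 10% miss rate; L1 hits in 1 cycle with a 5% miss rate.
l2_time = amat(hit_time=10, miss_rate=0.10, miss_penalty=200)   # 30.0
l1_time = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2_time)
print(l1_time)  # 2.5 cycles on average instead of 200
```

Each extra level multiplies away another chunk of the miss penalty, which is why four-level schemes like the Summit’s L4 pay off despite the cost of all that static memory.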

Cache Coherency

Besides everything else, there is one problem inherent in all multiprocessor systems, irrespective of their organization: cache coherency. When manipulating data, processors change the memory contents. For the system to process information correctly, all other processors must know that a particular memory cell has changed its contents. If no other processor needs this cell at the moment, there’s no problem: the processor can simply write the new value. But what happens if several processors need one and the same memory cell? When a CPU alters the cell, the other CPUs must learn its new value; otherwise they will use stale data in their further work. Thus, the cache coherency problem must be solved for a multiprocessor system to work correctly. It is among the most difficult problems the developer of a multiprocessor system faces.

It is solved by introducing special protocols that the processor caches use to exchange information. For example, Intel uses the MESI protocol, which describes four states for each cache line (Modified, Exclusive, Shared, Invalid). Other manufacturers (AMD and IBM) use the more advanced MOESI protocol, which adds a fifth, Owned, state. In any case, this exchange consumes some part of the system bus bandwidth, but that is unavoidable, at least today.
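A heavily simplified sketch of the MESI idea (this is an illustration of the state machine, not Intel’s actual implementation): each cached line sits in one of the four states, and bus “snoop” events move it between them.

```python
# Simplified MESI illustration: M(odified), E(xclusive), S(hared),
# I(nvalid). Snooped bus events change the state of our copy of a line.
def on_remote_write(state):
    return "I"                 # another CPU wrote: our copy is stale

def on_remote_read(state):
    return "S" if state in ("M", "E") else state

def on_local_write(state):
    return "M"                 # we now hold the only valid, dirty copy

ours, theirs = "E", "I"        # we loaded the line exclusively
ours = on_remote_read(ours)    # the other CPU reads it: both Shared
theirs = "S"
theirs = on_remote_write(theirs)  # then we write the line...
ours = on_local_write(ours)       # ...ours Modified, theirs Invalid
print(ours, theirs)  # M I
```

Every one of these transitions corresponds to a message on the bus, which is the bandwidth cost the paragraph above mentions.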

Modern Server Processors

Enough of theory! It’s time to see how the theoretical premises above are implemented in existing products. Let’s also limit the sphere we are going to cover. Evidently, one article can’t accommodate descriptions of all processors and platforms. So let’s put aside x86 CPUs – they get enough attention anyway – and talk about the architectures that are usually just glanced over.

So we will talk about RISC architectures and their potential successor - the Intel Itanium.

Intel Itanium Platform

This is a famed platform. Once it was supposed to replace the “out-dated and slow” x86 platform. Today there’s less certainty about the Itanium being the x86 killer, though.

The main idea of the Itanium is to make the processor perform more work per clock cycle. This is achieved by increasing the number of execution units operating in parallel. The processor follows the VLIW concept (Very Long Instruction Word). I won’t describe it in detail, just a few basic points:

The CPU performance is increased by making it perform more work per clock cycle. Optimized code is necessary for that, which is created by a special compiler;

Data and instructions are packed into long “words” and sent for execution;

There’s no (!) hardware parallelizing logic – the compiler must create the optimal and dense instruction stream. Thus, there’s no out-of-order execution: proper planning of the execution order is also the compiler’s job;

The processor consists of a set of execution units, buffers and cache. All other things being equal, it allows higher performance because it’s now possible to use the free space for enlarging cache or including more execution units;

The Itanium contains many registers, 328 in total: 128 general-purpose registers, 128 floating-point registers and 72 “predicate” registers (see below). A unique register rotation mechanism is employed to reduce the load on this unit and increase its efficiency.

So the general ideas of the VLIW concept are revealed above: we make the processor perform better by feeding it not chaotic code that the processor’s logic then tries to comb up “on the fly”, but pre-optimized code created by a special compiler. The problem of efficiency is thus solved beforehand. Intel terms this concept EPIC – Explicitly Parallel Instruction Computing. To some extent it can be considered a post-RISC concept.

Development of the Itanium architecture began about a decade ago, when Intel found itself obliged to offer an alternative to the leaders of the high-performance CPU sector of those times. It wanted a processor that could be used in top-end servers. Of course, that architecture had to be 64-bit – this requirement ensued from the need for a large address space and large amounts of supported memory. It had to be scalable both in frequency and in the number of processors. In perspective, if everything went right, this platform was to oust x86 CPUs (which were lagging behind all other processor architectures in performance). Thus, Intel conceived a smooth transition to the architecture of the future – an architecture where the compiler, rather than hardware, would play the crucial part, although hardware solutions would be important, too.

The estimated cost of developing and introducing this architecture was too high even for such a semiconductor giant as Intel, so the corporation teamed up with another big shark, Hewlett-Packard. That company had extensive experience in developing 64-bit CPUs (among other things, Hewlett-Packard is the developer and manufacturer of the HP PA-8xxx series processors, which I will discuss later) and a team of software developers – exactly what Intel lacked at that time.

The first version of the processor, the Itanium, came out with a long delay of two or three years (depending on the date from which you count). Intel had warned beforehand that this first CPU would be rather a sample, just to introduce the market to the new architecture. That’s exactly what we saw: there was no excitement about the first Itanium-based systems. Of course, big corporations showed up with announcements of such products, but this was definitely not a commercial product, if only because there was no software to run on it. Although Intel provided the option of running x86 code on the Itanium, this ability was rather theoretical: you could get the performance of a Pentium 90 out of an 800MHz Itanium and nothing more.

Architecture of Itanium-Based Systems

Intel Corporation is conservative in designing architectures for multiprocessor systems. Up to four Itanium CPUs can sit on a shared 128-bit bus (400MHz effective frequency, 6.4GB/s bandwidth). By the way, developing such a wide and at the same time high-speed bus is not a trivial task. In particular, the bus is divided into electrically independent 8-bit segments, each with its own synchronization signal, to avoid crosstalk effects and other surprises of physics.

Computers with more than four processors use switches to join several cells, each consisting of several processors on a shared bus. Manufacturers usually install no more than two CPUs per shared bus, as the bus bandwidth is divided between the processors: the 4-way variant means a 1.6GB/s chunk for each processor, while two processors on a shared bus get 3.2GB/s each. To supply the processors with data from the system memory at an appropriate rate, multi-channel memory interleaving is employed (typically eight-channel). PC1600 ECC registered memory is usually used.
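The arithmetic behind these figures is straightforward and worth spelling out once:

```python
# A 128-bit (16-byte) bus at a 400MHz effective clock yields 6.4GB/s,
# split among the CPUs sharing it in the worst case.
bus_width_bytes = 16
effective_mhz = 400
total_gb_s = bus_width_bytes * effective_mhz / 1000   # 6.4 GB/s

for cpus in (2, 4):
    share = total_gb_s / cpus
    print(cpus, share)  # 2 CPUs -> 3.2 GB/s each, 4 CPUs -> 1.6 GB/s each
```

This halving of each processor’s share is exactly why manufacturers stop at two CPUs per bus segment and join the segments with switches instead.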

Today, the number of processors in Itanium-based systems can reach as many as 128 (I mean mass-produced servers, while nothing prevents you from ordering a cluster with thousands of processors). The growth of this number is only limited by the fact that more processors would bring a small performance gain, but make the system much more expensive.

Itanium Micro-Architecture

Now let’s dwell upon the Itanium micro-architecture. As mentioned above, part of the logic that would otherwise take space in the die (and that scales worse in frequency than other chip units) has been moved out, into the compiler. This is the more appropriate since the Itanium doesn’t (yet?) use sophisticated performance-enhancing methods like out-of-order execution. The processor contains six (!) integer pipelines with three branch-prediction units, two FPU pipelines, one SIMD unit with SSE2 support, two load units, three branch processing units (see below) and two store units. The processor is designed to perform six operations per clock cycle. Accordingly, the die boasts impressive dimensions (up to 374 sq. mm in the Madison core) and a huge number of transistors: 221 million for a model with 1.5MB cache (and about half a billion for the Madison with 6MB L3 cache).

This construction is supported by a cache hierarchy of an appropriate size: a 16KB+16KB L1 cache for data and instructions (its latency is only 1 cycle!), 256KB L2 cache, up to 6MB L3 cache. By the way, there will be a variant of the processor with 9MB L3 cache! Larger caches will come after that: Intel promises to expand the cache to 12MB and then to 24MB per processor in the future!

By the way, the Itanium architecture may be the most cache-dependent of all. First, 64-bit code and data take more space than 32-bit; second, the shared-bus architecture (Intel clings to it even in high-end systems) shows better results when the processors have more cache memory. Besides, some systems (like those on the Summit chipset from IBM) also have an L4 cache (!) to increase overall system performance (the Summit contains up to 64MB of L4 cache).

Curiously enough, the two server processor series from Intel – Xeon and Itanium – use different means of increasing performance. The Xeon stresses frequency growth: its pipeline was lengthened from 10 to 20 stages on the transition from the Pentium III Xeon to the Xeon (on the Pentium 4 core). Intel went the opposite way with the Itanium: the pipeline shrank from 10 to 8 stages on the transition from the first version of the processor to the second! Cache latencies were also reduced.

Although the Itanium 2 (the official name of the second, currently shipping version) may be simpler than other processors in some respects (this follows from the very concept of EPIC: move as much processor logic out into the compiler as possible), it has some points of interest. The concept, by the way, implies that processor performance depends greatly on the work of the compiler – sometimes more than on architectural traits or frequency.

The first point of interest is the original mechanism of "register rotation". As I said above, the Itanium architecture has numerous architectural registers (328, to be precise). The speed of the register file is crucial for performance in any processor, since most operations engage registers, and a large register file is harder to access quickly. Intel solved this problem in an elegant way by making the register file rotate with a definite period - more precisely, by inventing a mechanism that maintains an acceptable access speed to a register file of this size. This also compensates for the missing out-of-order execution mechanism (which, by the way, actively employs the register renaming technique).
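The mechanism can be pictured with a toy model: logical register names map onto the physical file through a rotating base, so each software-pipelined loop iteration sees its predecessor’s results under shifted names. Here is a simplified sketch in Python (the register count and the rotation rule are my simplifications, not the actual IA-64 specification):

```python
# Toy sketch of IA-64-style register rotation (illustrative only; sizes and
# the rotation rule are simplified, not the real Itanium specification).
class RotatingRegisterFile:
    def __init__(self, size=96):
        self.phys = [0] * size   # physical rotating registers
        self.base = 0            # rotation base, bumped by loop-branch ops

    def _index(self, logical):
        # logical register N maps to physical slot (base + N) mod size
        return (self.base + logical) % len(self.phys)

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def rotate(self):
        # executed once per software-pipelined loop iteration: every logical
        # name now refers to the neighbouring physical slot, so the previous
        # iteration's result becomes visible under a different name
        self.base = (self.base - 1) % len(self.phys)

regs = RotatingRegisterFile()
regs.write(0, 42)    # iteration i writes logical r0
regs.rotate()        # the loop branch rotates the file
print(regs.read(1))  # iteration i+1 sees that value as logical r1: 42
```

The point is that the compiler gets a fresh set of register names every iteration without the renaming hardware an out-of-order core would need.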

The second point of interest is the processor’s way of dealing with branches. When the compiler finds a branch in the program code, it leaves marks in special "predicate" registers (there are 64 of them, each one bit wide). The processor can then use its abundance of execution units to execute both branches, without writing the results into the architectural registers for a while. After the situation clears up, the results of the wrong branch are discarded, and the results of the right one are written down. This interesting solution smoothes out the performance loss that hard-to-predict branches would otherwise cause.
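Conceptually, predication works like this toy Python sketch: both arms of a branch are computed, and a predicate bit decides which result is committed (the function and variable names are mine, for illustration - this is not IA-64 assembly):

```python
# Hedged sketch of predicated execution: both arms of a branch are computed,
# and results are committed only where the predicate bit allows it.
def predicated_abs(values):
    results = []
    for x in values:
        p = x < 0            # a compare instruction sets a predicate bit
        then_result = -x     # executed regardless, guarded by p
        else_result = x      # executed regardless, guarded by not p
        # "commit" stage: only the correctly-predicated result is kept
        results.append(then_result if p else else_result)
    return results

print(predicated_abs([-3, 5, -1]))  # → [3, 5, 1]
```

No branch is ever mispredicted here, because no branch is ever taken - the cost is that both arms consume execution slots, which the Itanium’s wide core can afford.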

Of course, the effort spent on developing the Itanium architecture should have its reward. And the reward came when the processor’s speed characteristics were revealed. Today the Itanium 2 at 1500MHz (6MB L3 cache) overcomes the barrier of 2000 points in the SPEC_fp base 2000 test (the leader, the HP Integrity Server rx2600 system, scored 2119 points). The same Itanium shows somewhat humbler results in SPEC_int base 2000 (1322 points, again in the HP Integrity Server rx2600), but is on a level with other processors: only the Pentium 4 and the Opteron surpassed it in this test.

In other words, the Itanium is the leader (with a big advantage) in algorithms that require a lot of floating-point calculations and one of the leaders in integer algorithms. These facts determine the spheres where Itanium-based systems are strong: scientific and CAD applications, and databases where a large address space and high per-processor performance are necessary. Systems based on this processor are also among the leaders in many server performance benchmarks.

The table below shows some info on Itanium processors: their frequencies, L3 cache size and so on:

CPU model                                           | Frequency              | L3 cache size*        | Core, die size    | Max. number of processors
----------------------------------------------------|------------------------|-----------------------|-------------------|--------------------------
Itanium 2 for MP and DP servers (workstations)      | 1.3GHz, 1.4GHz, 1.5GHz | 3MB, 4MB, 6MB (resp.) | Madison, 374mm2   | 4 (on a single bus)
Itanium 2 for DP servers (workstations)             | 1.4GHz                 | 1.5MB                 | Deerfield, 180mm2 | 2
Itanium 2 Low Voltage for DP servers (workstations) | 1.0GHz                 | 1.5MB                 | Deerfield, 180mm2 | 2

* The sizes of the L1 and L2 caches for all Itanium models are 32KB and 256KB, respectively.

The Itanium in Perspective

Let’s summarize. The Itanium is a good processor as it is - that is, from the engineering point of view. It just couldn’t have been otherwise: two big corporations invested, by different estimates, from $9 to $15 billion and countless working hours into developing this architecture. They couldn’t afford not to reach the planned result. The Itanium is the performance leader among server processors today. The EPIC architecture has grown mature and is at least as fast as its competitors: the current CPU, the Itanium 2, leads in SPEC_fp 2000 performance (as I have already mentioned above).

In other words, this is a worthy product - but there has been a kind of slip on the marketing front. The market was ready for the Itanium to be an expensive processor, but the software side drags the platform down. Of course, the situation is changing steadily - Intel and Hewlett-Packard are working with software developers and supply sample systems on demand to any interested individual or organization. The market of customers interested in Itanium systems is growing. On the other hand, this market is still much smaller than any other market sector: for example, IDC estimates Itanium sales at about $1 billion for Q3 of last year. Considering that the entire server market is noticeably larger (something like $28 billion), you may form your own opinion about this platform’s market share.

So there are numerous side factors that may affect the Itanium’s prospects in the market, notwithstanding its brilliant speed characteristics. Of course, both corporations have put a lot of effort into promoting the new platform - ad materials, compilers, evaluation systems generously (and practically free) distributed among potential customers - but there are objective reasons hindering the platform’s development.

First of all, the Itanium is an expensive processor. Its die is enormous, which lowers the chip yield: fewer dies are manufactured from each wafer, and all this results in a high production cost. Although Intel tries to improve the situation by issuing cheaper Itaniums on the Deerfield core (1GHz, 1.5MB L3 cache) priced at about $800 (and supporting 2-way configurations only), the performance of such “value” solutions is far from perfect. Moreover, the price of the processor is not the price of the whole platform, and the latter is high even for its market (servers, workstations and so on).

This problem is the smallest of all, though, as all servers and server processors are expensive anyway. The second and bigger problem is software. Today, the list of software for the Itanium platform is much shorter than the list of programs for x86 systems. Developing special software versions costs money. In the case of x86 systems, this cost is distributed among numerous clients, so the final price of x86 software is rather low (compared to versions for other architectures).

In the Itanium’s case, the whole development cost is passed on to a small audience, and the software product becomes very expensive, producing a kind of vicious circle: software won’t be low-cost until there are many Itanium-based platforms, but there won’t be many Itanium-based platforms until there are various low-cost programs for them. That’s why the Itanium platform lacks some popular programs in different fields (like SAP R/3, one of the most popular enterprise management systems). And the speed of x86 code emulation is so low that you can forget about the idea of running resource-hungry x86 code on the Itanium.

Of course, Intel is trying hard to change the situation for the better. For example, besides working with software developers, Intel is now working on a special software layer (integrated into Windows for IA64) that translates x86 code into IA64 instructions “on the fly”. Transmeta employs something similar in its Crusoe processor with the “Code Morphing” mechanism (a special software layer inside the CPU translates x86 commands - and, theoretically, instructions of any architecture - into the processor’s native VLIW instructions on the fly). The promised performance level is half the speed of native Itanium code; that is, the processor will execute x86 programs like an average x86 system does. That’s not much, but enough in many cases. And it is certainly a big step forward from the current hardware emulation, which delivers performance similar to a Pentium at 90-133MHz.
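The idea of on-the-fly translation can be illustrated with a toy dynamic binary translator: guest instructions are converted into native callables the first time they are seen and cached for reuse. The miniature guest instruction set below is invented for illustration; real translators work on whole basic blocks of x86 code:

```python
# Toy dynamic binary translator: "guest" instructions are translated into
# native Python callables on first use and cached, mimicking on-the-fly
# translation. The guest ISA here is invented for illustration.
translation_cache = {}

def translate(guest_op):
    # one-time translation cost, amortized by the cache on later executions
    if guest_op not in translation_cache:
        op, arg = guest_op.split()
        if op == "ADD":
            translation_cache[guest_op] = lambda acc, n=int(arg): acc + n
        elif op == "MUL":
            translation_cache[guest_op] = lambda acc, n=int(arg): acc * n
    return translation_cache[guest_op]

def run(program):
    acc = 0
    for guest_op in program:
        acc = translate(guest_op)(acc)  # cached after the first encounter
    return acc

print(run(["ADD 5", "MUL 3", "ADD 5"]))  # (0+5)*3+5 = 20
```

The cache is what makes the scheme practical: hot code is translated once and then runs at near-native speed, which is where the promised “half of native” figure comes from.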

Other processor architectures have a historical advantage in this respect, as they were forming up when all software was expensive. As a result, each existing processor architecture has accumulated numerous programs over the decades of its life. This is a buoy that keeps some of them afloat, by the way - we will see examples of such architectures later today. The Itanium platform hasn’t yet got this software baggage.

Actually, we might have waited for the new architecture to outgrow its childhood diseases and accumulate a critical mass of software, if new market realities hadn’t arisen. AMD’s launch of the 64-bit, fully x86-compatible Opteron processor puts Intel in dire straits. On the one hand, the Opteron is not a direct competitor to the Itanium and AMD doesn’t position it as such: the capabilities of systems with dozens of Itaniums and of (at most) 8 Opterons (such servers haven’t even been officially introduced yet) differ too much for that.

On the other hand, the 64-bit architecture removes many obstacles inherent in x86, while the Opteron’s lower floating-point performance (it is faster than the Itanium in integer calculations) is compensated by the much lower cost of the system. And the main advantage of an Opteron-based system is the ability to use existing software: you can decouple the purchase of hardware and software, effectively buying a 64-bit system on credit. You see, the times when customers’ IT budgets were infinite have gone, and it seems like forever.

As a result, Intel has to react to the sudden danger. The company has switched from open dislike of the 64-bit x86 idea to a more discreet stance like “if the market demands…” And the market does demand: over 10 thousand Opteron systems were sold in Q3 against 5 thousand Itanium systems. Of course, Itanium systems are ahead in monetary terms and in the number of processors - these 5 thousand systems contain nearly 100,000 processors, and the revenues differ by a factor of two. Nevertheless, this is a disturbing signal for Intel, considering that the Opteron started its market life not so long ago.

Actually, the Opteron poses a threat to the Itanium platform, rather than to Intel’s supremacy in the server market. Both above-mentioned numbers seem negligible against the following number: 1.18 million systems on Xeon CPUs (that is, about 2.25 million processors) were shipped in the Q3 of 2003.

There are several ways for Intel to react to AMD’s moves. They will certainly have to implement 64-bit extensions in the Xeon sooner or later; doing it too late would mean handing a time advantage to AMD, which is unacceptable. Thus, they have to do it now (or in the near future). There are three possible variants: introducing an AMD64-compatible architecture, an IA64-compatible one (in its instruction set), or a completely new architecture not compatible with anything.

The first variant is the less probable one, although it appeals to software developers (they wouldn’t have to split their products into versions for different platforms). This is mostly a political matter - Intel has never used AMD solutions, and this rule is unlikely to be broken for a momentary profit.

The second variant is good for the Itanium architecture, but unlikely for technical reasons. The concepts and micro-architectures of the Itanium and Xeon differ too much to create a common instruction set for them. Even if the engineers managed to build an IA64-compatible Xeon, the processor would be too slow.

The third variant means creating and supporting one more platform. Intel has the resources for that, but it would also mean that the Itanium would never become a mainstream platform: it would remain a platform for expensive servers and workstations in its relatively narrow niche.

Besides that, additional effort is necessary to keep the Itanium the performance leader - this is necessary for positioning it as a high-end solution. The competition between the Xeon and the Opteron will drive the performance of both processors up (x86 systems at large have grown in strength and overtaken previously unreachable RISC systems). As a result, the Itanium must maintain a noticeable performance lead to enjoy demand while remaining expensive.

Hewlett Packard PA8700

Hewlett-Packard teams up with Intel on the development of the Itanium platform, but it also has its own 64-bit architecture - the PA-8x00 processors and an appropriate platform! The current generation is called PA8700, and this is the fourth generation of HP’s 64-bit architecture. Moreover, after the merger with Compaq, HP found itself the owner of one more 64-bit architecture, the legendary Alpha microprocessor (Compaq, in its turn, had inherited the Alpha when it bought DEC, the original developer of this architecture). Thus, one and the same corporation now has three 64-bit architectures at once - a whim of fortune!

Of course, even the glorious marketing department of Hewlett-Packard was taken aback by the necessity to differentiate three competing architectures that target about the same market sector. It was all quite clear with the PA8700 and the Itanium: according to the official doctrine of HP, all modern PA8700-based servers are compatible with the Itanium, and HP will transition to the Itanium in the future. The transition is made easier by the fact that the HP-UX operating system is compatible with both processor architectures. Besides that, the Itanium understands the PA8700 instruction set, so the PA8700 can be considered something like a precursor of the Itanium. Moreover, I think it was the ability to make the Itanium binary-compatible with the PA8700 that made HP join the EPIC platform project. As for the Alpha microprocessor, it seems to be unnecessary anymore. Of course, HP will provide support for buyers of the Alpha platform, as it took on Compaq’s obligations along with its assets. This processor has no long-term perspective, though. Its development team has already been dismissed (or, rather, it joined the team of Itanium developers). In other words, after the lifecycle of the existing platforms comes to an end, customers will be offered a transition to the Itanium (there’s nothing bad about it, actually, as HP has devised a customer-loyal transition program).

We’ll talk about the Alpha soon, now let’s get back to the PA8700. This is an interesting processor, by the way, although it seldom catches the spotlight. This is a snapshot of its core:

You can notice some nontrivial characteristics of the chip right in this snapshot. For example, the PA8700 has a single-level cache - a strange solution compared to other processors. Architects from HP often stress that they find one large cache more useful than a multi-level system of caches that requires sophisticated internal arbitration. They also think that the commonly accepted scheme of “one small and fast L1 cache plus a big and slower L2 cache” suits benchmarks only. In real work, they argue, when the application constantly gasps for data and other applications and services run in the background, it is more effective to have a slower but large L1 cache with enough capacity for the processor to receive data from memory without halting its operation. Sadly, this argument will never be settled, as the snapshot above shows the last version of the PA8700 processor (the PA8700+ modification, clocked at up to 1GHz).

So, the die of 304 sq. mm is manufactured with 180nm+SOI technology and includes 186 million transistors, a big chunk of which makes up a two-port, four-way set-associative data cache of 1.5MB and a four-way set-associative instruction cache of 0.75MB. Each cache has a 128-bit bus connecting it to the other processor units. By the way, this processor has the biggest L1 cache today, and none is likely to surpass it in the near future. The core is a superscalar processor with the following execution units: two 64-bit integer addition/multiplication units, two shift/compare units, two floating-point addition/multiplication units, and two division/root extraction units. That is, there are four ALUs and four FPUs! The eight execution units are complemented by two load/store units. The instruction fetch unit can take up to four instructions from the instruction cache per clock cycle. The microprocessor can perform prefetch and contains (as the picture suggests) an out-of-order execution unit: a 56-entry window that tracks dependencies between instructions and data and sends ready-to-execute instructions to vacant execution units. By the way, the PA8700 has one more curious unit, the memory access reorder unit, which groups memory requests so as to minimize the resulting execution time. The branch-prediction unit keeps the history of 2K previous branches and uses dynamic branch prediction.
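The effect of such a reorder window can be shown with a toy dataflow model: each instruction starts as soon as its operands are ready, so an independent instruction overtakes an earlier one that is stalled behind a slow load. The register names and latencies below are invented for illustration:

```python
# Toy dataflow model of an out-of-order window: completion time depends only
# on operand availability, not on program order. Latencies are made up.
def completion_times(program, start_regs):
    ready = dict.fromkeys(start_regs, 0)  # cycle each register is available
    times = {}
    for name, sources, dest, latency in program:
        start = max(ready[s] for s in sources)   # wait for operands only
        ready[dest] = start + latency
        times[name] = ready[dest]
    return times

program = [
    ("load", {"r1"}, "r3", 3),         # cache miss: long latency
    ("add",  {"r3", "r2"}, "r4", 1),   # depends on the load
    ("sub",  {"r1", "r2"}, "r5", 1),   # independent: overlaps with the load
]
print(completion_times(program, {"r1", "r2"}))
# → {'load': 3, 'add': 4, 'sub': 1}
```

The "sub" finishes at cycle 1 even though it comes last in program order - exactly the kind of reordering the PA8700’s 56-entry window performs in hardware.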

Overall, this is an interesting processor. Let’s check its speed characteristics: it scored 642 in SPEC_int base 2000 (the PA8700+ model at 875MHz; I couldn’t find results for the 1GHz model) and 600 in SPEC_fp base 2000. Clearly it can’t compete with modern server processors - now we understand why Hewlett-Packard is so interested in transitioning to the Itanium platform. Moreover, the transition is easy to perform on PA8700-based systems, as they use the same bus architecture as Itanium-based ones; there are even chipsets, like the ZX1000, that support both processors (both use a 128-bit bus; only the Itanium’s works at a higher frequency).

Alpha Microprocessor

The Alpha processor had a cruel fate. A legend of the computer world and the first 64-bit platform, this processor was far ahead of its time in many respects. But sometimes revolutionary solutions don’t get what they deserve, and the Alpha platform is no exception. Digital Equipment Corporation, its developer, collapsed and was sold off, mostly because of marketing slips. Having the best microprocessor of the time, the developers thought little of promoting it in the market - they must have thought a good processor would sell by itself. Alas, the outcome was quite logical: however good your processor is, you first have to tell potential customers about it.

Compaq was the processor’s second owner. Things were more complicated here: when Compaq bought Digital, the then-existing Alpha platform was already somewhat obsolete, and development of the new generation was far behind schedule (partially because of personnel problems, as many developers of the first processor generations had quit the company). Besides that, Compaq didn’t use the international dealer network it had inherited from Digital. The platform was somewhat alien to Compaq; no one managed it, no one promoted it. The result was again predictable: the platform lost all its performance advantage and had no development prospects (the proposed minor innovations couldn’t change the overall situation).

Thus, when Compaq and Hewlett-Packard went for a merger, the Alpha platform found itself in a position of a stepdaughter: HP has no use for a third 64-bit platform and the corporation says this explicitly. So now we can have a look at the past grandeur - there’s nothing permanent in this world of change and microprocessor architectures grow and die, too.

So, the Alpha is a microprocessor capable of performing four operations per clock cycle. It has a two-port, two-way set-associative 64KB L1 data cache and a 64KB two-way set-associative instruction cache. The processor supports from 1MB to 16MB of external L2 cache (in CPUs of the EV68CB/EV68DC series, or Alpha 21264C), and the L2 tags are stored in the processor core for faster processing. The L2 cache is joined to the core by a 128-bit bus (plus error correction). The 21364-series Alpha has 1.75MB of L2 cache on-die.

The CPU core contains two ALUs (each with an address computation unit) and two FPUs (one responsible for multiplication, the other for addition, division, and root extraction). The processor has a branch-prediction unit. Speculative execution is possible, and the processor has 80 integer and 72 FPU registers (32 and 32 of them architectural) for that. All operations are pipelined except division and root extraction, but it is possible to launch other operations while root extraction or division runs in the background (with some restrictions).

The Alpha has the shortest pipeline - only 7 stages, both for integer and floating-point calculations. Thus, the processor feels great in branch-heavy algorithms where longer-pipelined CPUs would stumble. The same reason makes an easy clock-rate increase impossible: the processor units perform too much work per clock cycle to scale easily in frequency.

Well, this processor evidently loses to its competitors in the number of execution units as well as in their functionality. We should remember, though, that the Alpha micro-architecture appeared before all the others, serving as a kind of example for developers of future CPUs. Thanks to its well-designed and polished architecture, this processor can still show good results in benchmarks, although it cannot match modern CPUs. Its performance numbers are rather odd, by the way: you’d expect an older CPU generation to be slower than the younger one, but it’s not that simple. The AlphaServer GS1280 7/1150 based on the Alpha 21364 processor (1.75MB on-chip L2 cache) scores 795 in SPEC_int base 2000, while the AlphaServer ES45 68/1250 with the previous-generation Alpha 21264C, which has a higher frequency and 16MB of off-chip L2 cache, scores 845 points. The difference is probably caused by the gross difference in cache size. The same AlphaServer ES45 68/1250 leads the Alpha family in SPEC_fp base 2000 with 1019 points.

In fact, nothing has changed fundamentally in the Alpha architecture since then - there have been only minor innovations, like the 1.75MB of L2 cache and the RDRAM memory controller integrated into the core in the 21364 model.

Strangely enough, despite the outdated architecture, systems with Alpha CPUs show good performance; they are faster than the PA8700, for example. This is a great achievement of the Digital developers, who created a well-balanced processor, a good platform for it and an efficient compiler. Notably, the Alpha processor bus, also known as EV6, was licensed by AMD and employed in the Athlon MP. Besides that, the Alpha platform was the first to use the point-to-point topology - then AMD took the idea over.

Overall, we can put a full stop to the history of this processor. There’s no hope left that this platform will continue its evolution. It will serve its owners for some time yet and then go into museums of computer equipment.

MIPS Microprocessors

MIPS microprocessors are an unusual tale in the CPU world. First of all, MIPS, the developer, doesn’t manufacture them: the company develops the micro-architecture and licenses it out. As a result, there are far more processors with the MIPS architecture (mostly 32-bit ones, of course) than with x86! In particular, between a third and a half of all processors embedded in various household appliances are MIPS. But this article is not about embedded systems.

The fastest solution from MIPS is called the R16000. It is a 64-bit processor used, for example, by SGI, a firm known for its video-editing workstations.

The R16000 contains 32KB of two-way set-associative L1 instruction cache (curiously, the access time doesn’t depend on whether the instructions in the cache are aligned) and 32KB of two-way set-associative L1 data cache, the latter divided into two independent banks. The processor also contains two 64-bit ALUs and two FPUs, one responsible for addition and the other for multiplication (and some more complex operations). The R16000 supports out-of-order execution and register renaming. Its pipeline has only 6 stages.

The L2 cache is external, DDR SRAM, and works at half the CPU frequency, using the DDR protocol. That is, the data come in at the frequency of the processor. The bus of the L2 cache is 128 bits wide (plus 16 bits for error correction). The size of the L2 cache is up to 16MB, typically 4-8MB. The branch-prediction unit uses a 2K branch history table.
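As a back-of-the-envelope exercise (my arithmetic, not an official MIPS figure), the peak bandwidth of such an L2 interface on the fastest R16000 would be:

```python
# Rough estimate of the R16000 L2 interface bandwidth (my arithmetic):
# a 128-bit DDR bus at half the core clock transfers data on both clock
# edges, i.e. at the full core frequency, as the article describes.
core_mhz  = 700                  # fastest R16000 variant
bus_bytes = 128 // 8             # 128-bit data bus (ECC bits excluded)
sram_mhz  = core_mhz / 2         # the L2 SRAM runs at half the core clock
transfers = sram_mhz * 2         # DDR: two transfers per clock
bandwidth = bus_bytes * transfers * 1e6 / 1e9   # bytes/s -> GB/s
print(f"{bandwidth:.1f} GB/s")   # → 11.2 GB/s
```

Not bad for a 700MHz part - the wide, fast L2 interface is one reason the architecture stayed usable for so long.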

The system bus is 64 bits wide and works at 200MHz. The fastest variety of the R16000 works at a 700MHz clock rate. Architecturally, the R16000 (like its predecessors) has a large register file - 64 integer registers and 64 floating-point registers.

Unfortunately, I couldn’t find the performance results of the R16000 among the official results of the SPEC committee. I did find those of its predecessor, the R14000 model. They differ in the system bus frequency (200MHz against 100MHz). The maximum frequency of the R14000 is 600MHz.

As you see, the R14000 doesn’t show sparkling performance by today’s standards. Even if we multiplied the results by one and a half (the R16000 has a faster system bus), we wouldn’t get an impressive number. Anyway, the strong point of SGI’s workstations and servers has always been the NUMAflex architecture, which allows uniting up to 1024 processors into one computer under a single copy of the operating system (IRIX). Besides that, four processors can be attached directly to each other, without switches, but such systems never really took off.

Today SGI has switched to the Itanium, so there is little hope we will see the MIPS R16000 platform developing any further.

SUN UltraSPARC

We’ve reached the microprocessors manufactured by Sun Microsystems - a well-known and respected company occupying one of the top positions in the server and workstation field. Sun relies on 64-bit processors of its own development: the UltraSPARC III (and the IIIi model, which differs in its integrated four-way 1MB L2 cache).

The UltraSPARC III contains about 29 million transistors (the IIIi model of course has more due to its 1MB L2 cache - something like 85 million). The architecture is designed to perform four instructions per clock cycle (the maximum rate of fetching instructions from the cache). The processor contains a 64KB four-way L1 instruction cache and a 64KB four-way L1 data cache. The L1 cache is quite fast - the access latency is only two cycles, good for that size. The execution units: two 64-bit ALUs, one branch-prediction unit (its branch history table remembers as many as 16K previous branches), one load/store unit, and two floating-point units (one performs addition/subtraction, the other multiplication/division). There is also a special 2K buffer for storing preliminary results (I’ll explain its role shortly). The UltraSPARC III works with an external cache of up to 8MB, and the L2 tags are located in the processor for faster processing. The external cache works at one fourth of the CPU frequency (300MHz for a 1200MHz CPU).

The processor doesn’t support out-of-order execution. The instruction buffer is 16 instructions long and they are all waiting for the appropriate execution units to become free. Of course, the performance degrades without such a mechanism, but the UltraSPARC III has something else instead.

First of all, the result of an operation is available at the stage immediately after it is produced, rather than only after the instruction passes through the entire pipeline. Say we get the result of multiplying two numbers at stage 8: the next command that uses this result won’t wait 6 more cycles, but goes for execution in the next cycle. This is possible thanks to a register file hidden from programmers, whose auxiliary registers store the intermediate results.
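The saving can be expressed with a trivial model (the stage numbers follow the example above; the rest is my simplification):

```python
# Illustrative model of result forwarding: without a bypass, a dependent
# instruction waits until its producer drains the whole pipeline; with one,
# it waits only for the producing stage.
PIPELINE_DEPTH = 14     # UltraSPARC III pipeline length
RESULT_STAGE   = 8      # stage where the multiplication yields its result

def dependent_start(forwarding):
    if forwarding:
        return RESULT_STAGE + 1     # consume the value on the next cycle
    return PIPELINE_DEPTH + 1       # wait for the full writeback

saved = dependent_start(forwarding=False) - dependent_start(forwarding=True)
print(saved)  # → 6 cycles saved by the hidden register file
```

Six cycles per dependent pair is a substantial saving on a 14-stage pipeline, which is why Sun accepted the cost of the extra hidden registers.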

Secondly, when processing a branch, the processor follows the most probable path and saves up to four commands from the alternative path in a special buffer. Thus, after an incorrect prediction the processor can continue working without waiting for the code to be loaded from memory.

The pipeline is 14 stages long; the maximum frequency of the UltraSPARC III is 1200MHz, and of the IIIi model 1280MHz. The clock rate is not very high, by the way, considering the pipeline length and the 130nm technology with copper interconnects.

For example, the Athlon MP has a shorter pipeline (10 stages) and the same manufacturing technology – its maximum clock rate lies around 2.2GHz. I don’t even mention the Xeon with its 3.2GHz. Manufacturers of RISC processors could learn something from their x86 counterparts in this respect. Of course, the frequency is not the only factor that matters - let’s see what the UltraSPARC III offers us in the way of performance.

Both modifications of the CPU contain an on-die memory controller, which makes them look similar to the Opteron. The UltraSPARC III uses SDRAM clocked at 150MHz with a 128-bit bus and 2.4GB/s bandwidth – quite good even by the modern standards. Special 144-pin ECC SDRAM modules are used and they cost a lot. Each processor can support up to 16GB of memory. The UltraSPARC IIIi uses DDR SDRAM, but via a narrower (64 bits) bus, which makes it about equal to the previous variant in performance.
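The quoted 2.4GB/s figure is easy to verify (my arithmetic):

```python
# Sanity check of the quoted UltraSPARC III memory bandwidth:
# a 128-bit SDRAM bus at 150MHz, one transfer per clock.
bus_bits  = 128
clock_mhz = 150
bandwidth_gb_s = (bus_bits / 8) * clock_mhz * 1e6 / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s")  # → 2.4 GB/s
# the IIIi's 64-bit DDR bus gives the same figure: (64/8) * 150MHz * 2
```

As the comment shows, the IIIi’s narrower DDR bus arrives at the same peak number, which is why the two variants are about equal in memory performance.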

Processors are linked with a broad and fast Fireplane bus (128 bits, 150MHz). The UltraSPARC IIIi allows building systems with 1-4 processors. The UltraSPARC III supports over a thousand processors per system, of course in a switch-based architecture.

The performance numbers follow: 642 points in SPEC_int base 2000 and 1074 in SPEC_fp base 2000. By the way, there is a big difference between the “base” and “peak” modes of the SPEC_fp test (1344 for peak) - it seems the compiler from Sun is not perfect. There is also a gap between the floating-point and integer results, whereas other RISC architectures usually have these numbers about equal. The lack of out-of-order execution probably accounts for this, but that’s only my supposition.

Clearly, the UltraSPARC III doesn’t show a miraculous performance in SPEC_int base 2000. It does better in SPEC_fp base 2000, but only against other RISC architectures. The Itanium is the leader here, and no one is likely to challenge its superiority in the near future.

In fact, it is the system architecture, rather than processor performance, that’s the strong point of the Sun concept. Thanks to the intelligent switch architecture, a big external cache and broad buses, Sun systems don’t stagger under increasing workloads, and that’s why they enjoy success.

The SPARC V9 instruction set employed in the UltraSPARC III is licensed freely, so there are processor clones compatible with the UltraSPARC III in the instruction set. The most popular and successful of them is the SPARC64 GP from Fujitsu Siemens. It has a somewhat different cache topology: a 16KB zero-level cache instead of the L1, then a 256KB L2 cache (the L1 in the original topology; 128KB each for instructions and data), while the off-chip cache becomes an 8MB L3 cache.

There is an UltraSPARC IV processor project underway - essentially two UltraSPARC III CPUs in one die. So far there’s no info on its performance or availability, so I don’t include this solution in this article.

IBM Power4

You feel awkward applying a name like “microprocessor” to the IBM Power4. The multi-chip module is monstrous: an assemblage of four processors with L3 cache forms a square of 115x115mm - that’s 13,225 square millimeters! The “micro” has nothing to do with this microprocessor.

Well, if someone makes processors of that size, someone certainly needs them. Let’s see what it has inside. First of all, the Power4 contains two processor cores. You can see them in the following figure:

You see that the internal structure of the processor is nontrivial: the two processor cores are linked with a special high-speed switch. In fact, we have an SMP system within one CPU - the cores are joined by a bus working at 500MHz!

The other subsystems are impressive, too: the L2 cache uses three independent cache controllers and three banks (you can see them in the figure) with a total capacity of 1536KB, and delivers a bandwidth of over 100GB/s at 1.7GHz (the frequency of the flagship Power4+ model).

The processor core is curious in itself. First of all, the IBM Power4 decodes the external instruction set into internal micro-instructions, just like x86 CPUs do. The reason for this solution is obvious: too much software has been written for the previous CPU generations, and that software costs more than the hardware. IBM simply couldn't abandon that baggage. Thus, the same problem met the same solution.

The micro-architecture is designed to perform up to eight instructions per cycle – that’s an impressive degree of parallel execution.

Let's now see what a single core looks like. The decoder translates external instructions into a set of elementary operations (ops) that are then packed into groups. One command usually unfolds into two or three ops. A group contains five slots: the first four are filled freely, while the fifth always holds a branch instruction. Commands go for execution in such groups, moving along the pipeline.
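As a rough illustration of this grouping rule, here is a minimal sketch (the function, the op tuples and the padding convention are my own illustrative constructs, not IBM's actual dispatch logic): ops are packed four per group, with the fifth slot reserved for a branch, which also terminates its group.

```python
# Hypothetical sketch of Power4-style group formation: internal ops are
# packed five per group, with the fifth slot reserved for a branch.
# The data layout here is illustrative, not IBM's real structures.

def form_groups(ops):
    """ops: list of (name, is_branch) tuples, in program order."""
    groups = []
    current = []
    for op in ops:
        name, is_branch = op
        if is_branch:
            # A branch terminates the group and occupies slot 5.
            current += [None] * (4 - len(current))  # pad unused slots
            current.append(op)
            groups.append(current)
            current = []
        else:
            current.append(op)
            if len(current) == 4:        # four non-branch slots are full
                current.append(None)     # slot 5 stays empty (no branch)
                groups.append(current)
                current = []
    if current:                          # flush a partially filled group
        current += [None] * (5 - len(current))
        groups.append(current)
    return groups

ops = [("add", False), ("load", False), ("mul", False),
       ("store", False), ("cmp", False), ("beq", True)]
for g in form_groups(ops):
    print(g)
```

Note how the branch always lands in the fifth slot, exactly as described above.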

Each core has two ALUs, two FPUs (with slightly different functions; for example, division is performed only by FPU2), two load/store units, a branch execution unit and a condition-register unit – eight functional blocks overall. Out-of-order execution is supported: the Group Completion Table (an analog of the Reorder Buffer in Xeon processors) can hold up to 20 groups of elementary operations (i.e. about 100 ops), sending them to the execution units as they become ready. Overall, the processor can have as many as 215 instructions at various execution stages at any given moment.

Besides that, each FPU can launch a fused multiply-add operation every cycle – an operation that occurs frequently in real programs. Thus we get four floating-point operations per cycle, an absolute record among all processors (well, nearly every characteristic of the IBM Power4 aspires to be record-breaking). It's also possible to launch two floating-point additions or two multiplications at a time, which no other micro-architecture allows.
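Whatever the exact name of the combined operation, the arithmetic behind the "four FPU operations per cycle" claim is straightforward: two FPUs, each completing a combined two-in-one operation every cycle. A back-of-the-envelope estimate (1.7GHz is the Power4+ clock quoted earlier; the variable names are mine):

```python
# Back-of-the-envelope peak FP throughput for one Power4 core.
fpus = 2           # two floating-point units per core
flops_per_op = 2   # the combined operation counts as two FP operations
clock_hz = 1.7e9   # Power4+ flagship frequency quoted in the text

peak_flops = fpus * flops_per_op * clock_hz
print(f"{peak_flops / 1e9:.1f} GFLOPS per core")  # 6.8 GFLOPS
```

That is 6.8 GFLOPS peak per core, or 13.6 GFLOPS per two-core processor.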

The cache subsystem tries to match this record-setting trend. Each core has a 32KB two-way set-associative data cache (with an access latency of only 1 cycle!) and a 64KB instruction cache. Both caches use 128-byte lines; each data-cache line is organized as four 32-byte sectors that can be accessed independently (it's possible to write into one sector and read from two others without stalls). The instruction cache can write or read 32 bytes each cycle. The L2 cache is eight-way set-associative, with 128-byte lines and a capacity of 1536KB. Each processor also contains an L3 cache controller; the L3 cache can be up to 32MB per processor (i.e. per two cores). The processor also integrates a memory controller with a bandwidth of 11GB/s, and the maximum amount of memory supported by each processor is 16GB.
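The L2 geometry follows from the standard set-associative formula, sets = size / (ways × line size). A quick sanity check (the function name is my own; the parameters – eight ways, 128-byte lines, 1536KB – are the ones quoted above):

```python
# Deriving cache geometry from the parameters quoted in the text:
# number of sets = total size / (associativity * line size).

def cache_sets(size_bytes, ways, line_bytes):
    assert size_bytes % (ways * line_bytes) == 0
    return size_bytes // (ways * line_bytes)

# Power4 L2: 1536KB, 8-way set-associative, 128-byte lines
print(cache_sets(1536 * 1024, 8, 128))  # 1536 sets

# L1 data cache: 32KB, 2-way, 128-byte lines
print(cache_sets(32 * 1024, 2, 128))    # 128 sets
```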

The pipeline is 17 stages long, which is very long for a RISC processor. To keep this long pipeline from idling, the Power4 uses an advanced branch-prediction system based on three tables, each holding up to 16K entries of branch history. The first table is a traditional branch-history buffer recording whether each branch prediction was successful. The second table (16K entries too) is global rather than local: each of its entries is associated with an 11-bit history vector that records which way the branches went during the last eleven instruction fetches from the L1 cache (the fetch unit loads eight instructions from the L1-I at a time) and whether the prediction was correct. The results of processing this information become the basis for the next branch prediction.

Let me stress the difference between the two methods: the first tracks each branch instruction individually, without any connection to the others; the second does exactly the opposite, dealing with a sequence of outcomes without tying them to any particular instruction. That's why the two tables are called local and global. Now, there is also a third table that notices which method has been more efficient (caused fewer errors)! As a result, the Power4 can switch its branch-prediction method within a few hundred CPU cycles.
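This local-table/global-table/selector-table arrangement is what is nowadays called a tournament predictor. Here is a deliberately tiny sketch of the idea (1-bit entries and 16-entry tables for readability – the real Power4 tables hold 16K entries each, and its update logic is more elaborate):

```python
# Minimal tournament branch predictor sketch (illustrative, not the real
# Power4 logic): a local table indexed by branch address, a global table
# indexed by recent branch history, and a selector table that learns
# which of the two has been more accurate for each branch.

class TournamentPredictor:
    def __init__(self, size=16):
        self.size = size
        self.local = [0] * size    # 1-bit local predictions
        self.glob = [0] * size     # 1-bit global predictions
        self.select = [0] * size   # 0 = trust local, 1 = trust global
        self.history = 0           # global history register

    def predict(self, pc):
        l = self.local[pc % self.size]
        g = self.glob[self.history % self.size]
        return g if self.select[pc % self.size] else l

    def update(self, pc, taken):
        l = self.local[pc % self.size]
        g = self.glob[self.history % self.size]
        # The selector moves toward whichever table was right.
        if l != g:
            self.select[pc % self.size] = 1 if g == taken else 0
        self.local[pc % self.size] = taken
        self.glob[self.history % self.size] = taken
        self.history = ((self.history << 1) | taken) & 0x7FF  # 11-bit history

p = TournamentPredictor()
# A branch that is always taken: the predictor converges quickly.
for _ in range(5):
    p.update(pc=4, taken=1)
print(p.predict(4))  # 1 (predicted taken)
```

The selector table is exactly the "third table" described above: it doesn't predict branches itself, it only remembers which of the other two predictors to believe.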

This Leviathan of a processor has several system buses: a 32-bit I/O bus (working at one third of the CPU clock rate) and three 128-bit bidirectional buses (working at half the CPU clock rate) for linking to the other processors in the “assemblage”. A 64-bit bus for linking different assemblages crowns the structure. This abundance of buses and the advanced cache hierarchy serve one purpose: keeping the processors always busy with work. Thanks to the appropriate coherence protocol, the processors can access each other's caches (L2 and L3).

Let me explain what I mean by the word “assemblage”. IBM surprised us in the manufacturing aspect too, as they managed to combine four processors in one multi-chip module, with all their buses and 128MB of L3 cache. That's a manufacturing achievement – the area of the assemblage is 13,225 sq. mm! By the way, each processor (of the four) links to the others through point-to-point buses.

This technological miracle is of course priced accordingly – about $10,000. However, this is not a high price for a processor of this class.

The topology of systems that use this curious processor is also off the beaten track. IBM calls it the Distributed Switch. This topology has no clear center: the links between the processor assemblages close into two parallel rings. Thus, each processor can be reached in several ways, which eliminates congestion on the bus. The maximum number of assemblages is 4, or 32 processors in a system. The high efficiency of this organization allows such a system to perform as fast as 64- to 128-way systems from other manufacturers.
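The benefit of closing the links into rings can be shown with a toy model: in any ring, two nodes are always connected by two edge-disjoint routes, clockwise and counter-clockwise, so one busy link never isolates a node. A sketch, with hypothetical node numbers standing in for four assemblages:

```python
# Toy model of a ring of four processor assemblages: between any two
# nodes there are two edge-disjoint routes (clockwise and
# counter-clockwise), so a single congested link never blocks traffic.

def ring_routes(n, src, dst):
    """Return the two routes around an n-node ring from src to dst."""
    clockwise = [(src + i) % n for i in range((dst - src) % n + 1)]
    counter = [(src - i) % n for i in range((src - dst) % n + 1)]
    return clockwise, counter

cw, ccw = ring_routes(4, 0, 2)
print(cw)   # [0, 1, 2]
print(ccw)  # [0, 3, 2]
```

The real Distributed Switch uses two parallel rings, which doubles the number of available routes again.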

So we now only have to see what the Power4 shows in SPEC CPU 2000. Note, though, that this test measures a single processor and never focuses on the nuances of the system organization. In other words, the total performance in real applications will be relatively higher thanks to the remarkable system organization.

So, this processor scores 1077 points in SPEC_int base 2000. This is an average result. In any case, that’s more than any other RISC processor scored (save for the Itanium, which is not a pure RISC).

The result is better in SPEC_fp base 2000 – 1598 points. In fact, only the Itanium managed to outperform it. The Power4 is good at this kind of test.

Once again, this test cannot capitalize on the main advantages of Power4-based systems. In real applications (and in real systems), the Power4 is the world's fastest CPU in terms of transactions per processor.

Conclusion

This article contains brief reviews of the different processors you are likely to see in a modern server. I focused mostly on widespread and currently manufactured products and thus omitted the Pentium M and Crusoe, which are used in blade servers but are not very popular. Of course, it's impossible to cover everything in a single article – this one is already very long.

Let's not forget, too, that the performance of an end system depends not only on the processor, but also on the system topology, the way multiprocessor support is implemented and other nuances. That's why systems built on the same processor, but by different manufacturers, can show differing performance.

Well, that's a topic for another article. For now, I'd like to put all the performance data into one table, including data on x86 processors (Xeon, Athlon MP and Opteron) for the sake of comparison. This table makes it easy to compare the existing server processors.

CPU                        | Frequency | SPEC_int base 2000 | SPEC_fp base 2000
---------------------------|-----------|--------------------|------------------
Xeon DP with 1MB L3 cache  | 3.2GHz    | 1274               | 1200
Athlon MP 2600+            | 2.13GHz   | 751                | 602
Opteron 148                | 2.2GHz    | 1405               | 1505
Itanium with 6MB L3 cache  | 1.5GHz    | 1322               | 2119
PA8700+*                   | 875MHz    | 642                | 600
Alpha 21264 (16MB ext. L2) | 1200MHz   | 845                | 1019
MIPS R14000**              | 600MHz    | 483                | 499
SUN UltraSPARC III         | 1200MHz   | 642                | 1074
IBM Power4+                | 1700MHz   | 1077               | 1598

* I couldn't find info about the PA8700+ working at 1GHz.

** I couldn't find the results of the 700MHz R16000 either; this processor is not present among the official results of the SPEC committee.