Clearing up the confusion over Intel’s Larrabee, part II

Another Larrabee-related presentation from Intel has surfaced, this time with even more details on the first iteration of the company's forthcoming GPU/HPC processor and add-in board. Because the information presented in the slideshow dovetails perfectly with what I laid out in my previous article on Larrabee, I'm presenting this short article as a sequel to that one. I'll presume that you've read the first article, so instead of recapping it, I'll just dive right into the new details.

Update: The Larrabee-related slides have now been removed from the linked presentation. Clearly, they weren't supposed to be public, and Intel has now remedied the mistake.

Larrabee as a GPU

The new slides, which actually date to a March 7 presentation by Intel's Ed Davis, indicate that the first Larrabee products will have the following characteristics:

Package size: 49.5mm x 49.5mm

Process: 45nm

Clockspeed: 1.7-2.5GHz

Power: >150W

I mentioned in the previous article that Larrabee would have a "fixed-function" unit that, in its GPU incarnation, would contain some sort of raster hardware. One of the Larrabee slides, excerpted below, shows a texture sampler situated next to each of the chip's two memory controllers.

A Larrabee GPU

In a nutshell, a texture sampler represents one stage in the standard DirectX and OpenGL 3D rendering pipeline. The texture sampler loads texture maps from memory, filters them as necessary for level-of-detail, and feeds them into the pixel processing portion of the pipeline (i.e., shading and rendering).
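To make the filtering step concrete, here's a toy sketch of the bilinear filtering a texture sampler performs when a sample point falls between texels. This is purely illustrative Python, not anything from the slides; real hardware also selects among mipmap levels for level-of-detail, which this sketch omits.

```python
# Toy bilinear texture sampling: given normalized (u, v) coordinates,
# fetch the four nearest texels and blend them by distance. The
# texture is assumed to be a list of rows of RGB tuples.

def bilinear_sample(texture, u, v):
    h = len(texture)
    w = len(texture[0])
    # Map normalized coordinates into texel space.
    x = u * (w - 1)
    y = v * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0  # fractional position between texels

    def lerp(a, b, t):
        return tuple(ca + (cb - ca) * t for ca, cb in zip(a, b))

    # Blend horizontally along the top and bottom rows, then vertically.
    top = lerp(texture[y0][x0], texture[y0][x1], fx)
    bottom = lerp(texture[y1][x0], texture[y1][x1], fx)
    return lerp(top, bottom, fy)

# A 2x2 black/white checkerboard: sampling the exact center blends
# all four texels equally.
tex = [[(0, 0, 0), (255, 255, 255)],
       [(255, 255, 255), (0, 0, 0)]]
print(bilinear_sample(tex, 0.5, 0.5))  # → (127.5, 127.5, 127.5)
```

The point is that each sample is a small burst of dependent memory loads plus a little fixed arithmetic, which is why GPUs implement it in dedicated hardware rather than on the shader cores.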

Larrabee's "pixel/vertex shaders" are implemented by the in-order cores described in the previous article. In that article, I stated that a Larrabee GPU product would have at least 10 such cores; the new slide says that Larrabee products will have from 16 to 24 cores, and adds that these cores will operate at clockspeeds between 1.7 and 2.5GHz (at 150W minimum). The number of cores on each chip, as well as the clockspeed, will vary with each product, depending on its target market (mid-range GPU, high-end GPU, HPC add-in board, etc.).

These simple, in-order cores can do two double-precision scalar floating-point operations per cycle, in addition to the SIMD capabilities that I described in the previous article. Each core contains 32KB of split L1 cache, accessible with one clock cycle of latency. It also appears that each core will have 256KB of the chip's large, shared pool of L2 cache for private use (read-write for that core, read-only for the other cores), with that cache having a 10-cycle access latency. The cache line width is 64 bytes.
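As a rough illustration of what these figures imply, here's a back-of-envelope peak-throughput calculation using only the numbers from the slides. Note that this counts only the scalar units; the SIMD units described in the previous article would add far more in practice.

```python
# Back-of-envelope scalar peak: cores x FLOPs/cycle x GHz = GFLOPS.
# Figures from the slides: 16-24 cores, 2 double-precision scalar
# FLOPs per cycle per core, 1.7-2.5GHz clocks.

def peak_scalar_gflops(cores, flops_per_cycle, clock_ghz):
    return cores * flops_per_cycle * clock_ghz

low = peak_scalar_gflops(16, 2, 1.7)   # smallest configuration
high = peak_scalar_gflops(24, 2, 2.5)  # largest configuration
print(f"{low:.1f} - {high:.1f} scalar DP GFLOPS")
```

So even ignoring the vector units entirely, the slide's numbers put scalar double-precision throughput somewhere between roughly 54 and 120 GFLOPS across the product range.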

Returning to slide 16, all of Larrabee's components (cores, texture sampler, memory controllers) will be connected by a ring bus that will be familiar to students of IBM's Cell processor. This ring bus carries 256 bytes per cycle.
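That per-cycle figure translates directly into aggregate bandwidth if you pick a clock. The slides don't say what clock the ring runs at, so the assumption below that it runs at the core clock is mine, purely for illustration.

```python
# Ring bandwidth sketch: 256 bytes/cycle times the clock in GHz gives
# GB/s directly, since GHz = 1e9 cycles/s. The ring clock is assumed
# (hypothetically) to match the core clock range from the slides.

def ring_bandwidth_gbps(bytes_per_cycle, clock_ghz):
    return bytes_per_cycle * clock_ghz  # GB/s

print(f"{ring_bandwidth_gbps(256, 1.7):.1f} GB/s")  # at 1.7GHz
print(f"{ring_bandwidth_gbps(256, 2.5):.1f} GB/s")  # at 2.5GHz
```

Under that assumption, the ring would move on the order of 435 to 640 GB/s, which is the kind of on-chip bandwidth you'd want when two dozen cores share one pool of L2.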

In line with what I previously described, Intel makes clear in slide 16 that Larrabee will do some amount of ray-tracing with its in-order cores.

From GPUs to HPC

An "easter egg" on slide 16, accessible with any PDF editing software (I used Illustrator), reveals a board-level layout of a Larrabee GPU.

Board layout of a Larrabee GPU

The board shows a Larrabee chip connected to eight banks of GDDR and mounted on a PCIe 2.0-compatible daughterboard. The board has two power connectors: one rated for 150 watts and another for 75 watts. There are also two display outputs and what appears to be a video input. The package size given matches the 49.5mm x 49.5mm figure listed above.

In contrast to the daughtercard/GPU design shown in slide 16, a later slide gives a block diagram of a higher-end, HPC-oriented variant of Larrabee that uses Intel's forthcoming Common System Interface (CSI) to gang together four 24-core processors. It's fairly clear from the block diagram that this layout shows a four-socket server design where all four sockets contain Larrabee parts. Such a design would be one node in a compute cluster that would almost certainly contain general-purpose CPUs from Intel as well.

Interestingly, Intel sees a role for Turbo Memory to boost the performance of these nodes. I'll be talking about Turbo Memory (a.k.a., Robson technology) in a future article, so I'll save that discussion for later.

Overall, the presentation, which was originally unearthed (and subsequently misunderstood) by TGDaily, is an attempt to show part of Intel's developing vision for the era of many-core computing, though Intel prefers the term "tera-scale" to "many-core." The presentation also includes some discussion of the company's research from the 80-core "Polaris" prototype project. Recall that, unlike Larrabee, the Polaris chip has a 2D mesh interconnect that makes use of a crossbar switch. The fact that Polaris is as much about system-level interconnects as it is about on-chip interconnects is reinforced in slide 14, which shows a mix of optical fiber and hybrid lasers that are used to get data onto and off of the chip. (For more on Polaris and interconnects, see "Beyond the teraflops: Why Intel really put 80 cores on a single chip.")

The presentation also includes some details about Intel's 32nm "Gesher" CPU, due out sometime in 2009. In brief: 4-8 cores at 4GHz, 7 double-precision FLOPs/cycle (scalar + SSE), 32KB L1 (3-cycle latency), 512KB L2 (9 cycles), and 2-3MB L3 (33 cycles). The cores are arranged on a ring bus, just like Larrabee's, that transmits 256 bytes/cycle.
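Applying the same back-of-envelope arithmetic to Gesher's quoted figures gives a sense of scale, with the caveat that the slide doesn't say whether the 7 FLOPs/cycle figure is per core; the calculation below assumes, speculatively, that it is.

```python
# Gesher peak sketch (my assumption: 7 DP FLOPs/cycle is per core):
# cores x FLOPs/cycle x GHz = GFLOPS.

def peak_gflops(cores, flops_per_cycle, clock_ghz):
    return cores * flops_per_cycle * clock_ghz

print(peak_gflops(4, 7, 4.0))  # → 112.0 GFLOPS for a 4-core part
print(peak_gflops(8, 7, 4.0))  # → 224.0 GFLOPS for an 8-core part
```

Under that assumption, an 8-core Gesher would roughly double the scalar double-precision peak of the largest Larrabee configuration described above, from a far smaller core count.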