AMD's Bulldozer Architecture Preview: New from the Ground Up

Slimmed Down but Double Wide!

A good way to sum up Bulldozer is “slimmed down, but double wide”. For each traditional core, AMD has instituted a dual ALU design with robust floating point and SSE units. Each core can handle two threads, much like SMT, but it has separate integer execution units, each of which processes its own thread without sharing integer execution resources.

Each unit features a single fetch and decode stage. The decode stage is made up of four units, but we do not yet know their inner workings. In the previous K7/K10.5 generations of parts, there were three complex decode units. On the Intel side, Core 2 and Nehalem have three simple decode units and a single complex one. AMD also did not cover subjects such as macro-ops and macro-op fusion. AMD has beefed up its decode stage significantly though. It simply had to, because the decode stage now feeds dual integer schedulers and a floating point scheduler, which in turn feeds 2 x 128 bit FMACs and MMX units.

Fetch, decode, floating point/SSE, and the L2 cache are the shared components. Since most workloads are integer based, AMD doubled the integer units. These 128 bit packed integer pipes are a step above what was offered in the Phenom II. In theory, there should be a sizeable per clock increase in integer and floating point apps on Bulldozer over the Phenom II. With more heavily threaded workloads, we should see dramatic improvements in performance. Each integer core features its own L1 D-cache. AMD has again not clarified how much L1 or L2 cache there is for each discrete unit, or how much L3 cache there is for the entire processor.
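To make the shared versus dedicated split easier to picture, here is a rough sketch in Python that simply tallies the components named above for a chip with a given number of units. The names (`SHARED_PER_UNIT`, `PRIVATE_PER_INTEGER_CORE`, `component_counts`) are made up for illustration, and the lists only contain what AMD has disclosed so far; this is bookkeeping, not a model of how the hardware behaves.

```python
# Components shared by both integer cores inside one Bulldozer unit,
# as listed in the article.
SHARED_PER_UNIT = ("fetch", "decode", "floating point/SSE", "L2 cache")

# Components each integer core gets to itself, per the article.
PRIVATE_PER_INTEGER_CORE = ("integer scheduler", "L1 D-cache")

# Each unit contains two integer cores.
INTEGER_CORES_PER_UNIT = 2


def component_counts(units):
    """Return how many copies of each component a chip with `units` units has."""
    if units < 1:
        raise ValueError("a chip needs at least one unit")
    # Shared components appear once per unit.
    counts = {name: units for name in SHARED_PER_UNIT}
    # Private components appear once per integer core.
    for name in PRIVATE_PER_INTEGER_CORE:
        counts[name] = units * INTEGER_CORES_PER_UNIT
    return counts


if __name__ == "__main__":
    for name, count in component_counts(4).items():
        print(f"{name}: {count}")
```

For a four unit chip this gives four copies of each shared component and eight copies of each per-core component, which is the trade AMD is making: double the integer resources while only paying once for the front end and floating point hardware.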

Branch prediction is one area that has not seen big jumps in the past decade, but due to the shared components and their greater data requirements, it is getting a major makeover. AMD did not cover details of this unit, other than to say it is new and much more robust than the unit in previous generations of chips. I am curious whether it has more in common with the overbuilt K6 unit than with the smaller, simplified unit developed for the Athlon family of products.

In most workloads, a four unit chip can natively handle eight threads, and the chip will show up with eight logical processors. But the workflow will be significantly different due to the shared components and how they burst out data for the execution units.

The floating point and SSE/SIMD capabilities of Bulldozer have also been given a boost. The Phenom and Phenom II had a single 128 bit unit. This was an upgrade from the Athlon 64, which featured 2 x 64 bit units. Bulldozer now has 2 x 128 bit units, which can be utilized as a single 256 bit unit when running AVX (Advanced Vector eXtensions) code. They can also be utilized as 2 x 128 bit units, and can do 4 x 64 bit operations when needed.
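The three modes fall out of simple width arithmetic, which the short Python sketch below works through. The constants come from the figures above; the function name `parallel_ops` is hypothetical, and the code only divides bit widths. It says nothing about how Bulldozer actually schedules or issues these operations, which AMD has not detailed.

```python
# Figures from the article: two FMAC units, each 128 bits wide.
FMAC_UNITS = 2
FMAC_WIDTH_BITS = 128


def parallel_ops(op_width_bits):
    """Return how many operations of the given width fit across both FMACs."""
    total_bits = FMAC_UNITS * FMAC_WIDTH_BITS  # 256 bits in all
    if op_width_bits <= 0 or total_bits % op_width_bits != 0:
        raise ValueError("operation width must evenly divide the total width")
    return total_bits // op_width_bits


if __name__ == "__main__":
    for width in (256, 128, 64):
        print(f"{parallel_ops(width)} x {width} bit")
```

Running it lists 1 x 256 bit, 2 x 128 bit, and 4 x 64 bit, matching the modes described above.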