Saturday, April 21, 2012

Wherefore art thou AMD?

33-year-old techies like myself, and the giants who came before (upon whose shoulders I enjoy standing), can remember a few times when AMD got to eat cake. The K6-2, Athlon, and AMD64 processors all did well... until Intel's Core 2 microarchitecture blew AMD out of the water, a volley AMD has never recovered from.

One of the issues is clock speed. I was reading through material today (kudos to David Kanter over at realworldtech) and realized there are stark differences between the memory management unit (MMU) in Intel's Core 2 and the one in AMD's Barcelona (the poor gladiator that got thrown into the ring with the Core 2 titan) :-(

Somebody must have sneaked a Scooby Snack into my Taco Bell burritos because I decided to follow a hunch...

The purpose of the MMU is to translate virtual memory addresses (the ones that a user program thinks are real) into physical addresses (the ones the operating system knows are real). The translation lookaside buffer (TLB) is a content-addressable memory that works like a hash table, mapping an input A to an output. There are far too many possible values of A to fit every mapping in the TLB, so it holds just the few that fit in hardware.

Given a bag of useful A values, determining which one corresponds to the current MMU address input is tough (i.e. power, time, and silicon hungry). You could look through them one by one, but that would be really time consuming. Instead, two techniques are used: 1) the least significant bits of A (strictly, of its page number, since the offset within a page is not translated) are used as an index into the TLB, and 2) multiple values are fetched from the indexed TLB entry (called "ways" of "set associativity") and compared against the higher bits of A. If one of the "ways" matches, you have found a proper mapping to physical memory and can proceed with the physical memory operation. If none of them match, you have a "TLB miss" and have to resort to plan B. Plan B is a walk through the page tables in main memory (slow), carried out by a dedicated hardware page walker on x86 chips like these two, or by a software exception handler on some other architectures.
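For the code-minded, here is a minimal Python sketch of the two-step lookup described above. It is a toy model, not how either chip is actually built: the class name `ToyTLB`, the 4 KiB page size, and the first-in-first-out eviction are all my own assumptions for illustration.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, so the low 12 bits are the page offset


class ToyTLB:
    """Hypothetical set-associative TLB model (illustration only)."""

    def __init__(self, num_sets, ways):
        if num_sets < 1 or num_sets & (num_sets - 1):
            raise ValueError("num_sets must be a power of two")
        self.num_sets = num_sets
        self.ways = ways
        self.index_bits = num_sets.bit_length() - 1
        # Each set holds up to `ways` (tag, physical_frame) pairs.
        self.sets = [[] for _ in range(num_sets)]

    def _split(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT            # virtual page number (the "A" value)
        index = vpn & (self.num_sets - 1)    # step 1: low bits pick a set
        tag = vpn >> self.index_bits         # remaining high bits identify the page
        return index, tag

    def lookup(self, vaddr):
        """Return the physical address, or None on a TLB miss."""
        index, tag = self._split(vaddr)
        for stored_tag, frame in self.sets[index]:   # step 2: compare every way
            if stored_tag == tag:
                offset = vaddr & ((1 << PAGE_SHIFT) - 1)
                return (frame << PAGE_SHIFT) | offset
        return None  # miss: fall back to plan B

    def insert(self, vaddr, frame):
        """Install a mapping, evicting the oldest way if the set is full."""
        index, tag = self._split(vaddr)
        entries = [e for e in self.sets[index] if e[0] != tag]
        if len(entries) >= self.ways:
            entries.pop(0)  # assumption: FIFO replacement
        entries.append((tag, frame))
        self.sets[index] = entries


tlb = ToyTLB(num_sets=4, ways=4)
tlb.insert(0x5000, frame=0x9A)
print(hex(tlb.lookup(0x5123)))  # 0x9a123
print(tlb.lookup(0x6123))       # None (TLB miss)
```

Setting `num_sets=1` turns the same model into a fully associative TLB: no bits are used for indexing, and every stored entry has to be compared on every lookup. Real hardware does those comparisons in parallel rather than in a loop, which is exactly why more ways cost more silicon and power.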

From David's excellent article (his greatest work to date IMHO) we see that Barcelona's primary TLB has a grand total of 48 ways and uses NO BITS to sort between them using step #1 above. In other words, it is fully associative: all 48 entries have to be checked on every lookup. In contrast, Intel's Core 2 has 4 ways and uses 2 bits to sort a total of 16 entries into 4 sets of 4 ways each.
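The arithmetic behind those numbers is simple enough to check in a few lines of Python. The helper name `tlb_geometry` is made up for this post; the entry and way counts are the ones quoted above.

```python
def tlb_geometry(entries, ways):
    """Return (sets, index_bits) for a TLB with the given entries and ways."""
    if entries % ways:
        raise ValueError("entries must be a multiple of ways")
    sets = entries // ways
    if sets & (sets - 1):
        raise ValueError("number of sets must be a power of two")
    index_bits = sets.bit_length() - 1  # log2(sets)
    return sets, index_bits


# Barcelona: 48 entries, 48 ways -> one big set, zero index bits
print(tlb_geometry(48, 48))  # (1, 0)

# Core 2: 16 entries, 4 ways -> 4 sets, 2 index bits
print(tlb_geometry(16, 4))   # (4, 2)
```

The number of ways is also the number of tag comparisons needed per lookup, so in this simplified view Barcelona does 48 of them for every access while Core 2 does 4.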

It is obvious that Intel's strategy is more conservative. In this case Intel's method delivered more speed using less silicon area and power. As I discovered this tidbit, I remembered that Barcelona had problems, not the least of which was lower-than-expected clock speeds. This is best exemplified by AMD's switch at that time from frequency comparisons with Intel to PR ratings. Oh brother, they also switched to giving power consumption in terms of ACP (average CPU power) rather than just TDP (thermal design power, the metric on which Barcelona was unfit to compete).

Barcelona arrived later than late to the party. Truly unfashionably late. During my research for this blog post (oy vey, we are getting long, aren't we?) I rediscovered that Barcelona's production was reduced for many months after its release due to, what else... a bug in the TLB!

Looks like our hunch was right: 48 fully associative entries in the primary TLB was the wrong design.