Along with Scott Wasson of The Tech Report and Kyle Bennett of HardOCP, we recently had some time to sit down and talk with AMD's CTO Fred Weber about his vision of the future of microprocessors. We took the opportunity to compare and contrast his vision with our discussions from this year's Spring IDF that we had written about.

The ILP/TLP Debate in AMD's Shoes

When we talked to Intel at IDF, we came away with the distinct impression that the focus on improving microprocessor performance as a whole had shifted significantly from ILP to TLP. To put it plainly: making individual cores faster was no longer the top priority; getting multiple cores to work together was the new focus.

Weber's stance on ILP vs. TLP largely agreed with what we had heard from Intel: TLP is the future, and using ILP to increase performance has reached a point of sharply diminishing returns. That said, we asked Fred where he thought ILP improvements would come from going forward, and he pointed to the following four areas:

Frequency

Reducing Memory Latency

Instruction Combining

Branch Prediction Latency

Fred's number one lever for single-core, single-thread performance was clock frequency, so we will inevitably see clock speeds continue to climb. It is quite possible that, combined with a reduction in branch prediction latency, future versions of the Athlon 64 will use a lengthened pipeline to reach higher operating frequencies. Paired with Prescott-caliber branch predictors, a somewhat deeper-pipelined K8 would gain additional frequency headroom without too much worry.

Behind clock frequency, Weber saw reducing memory latency as the other major way to increase single-core performance. In this context, reducing memory latency essentially means two things:

more levels of cache hierarchy, and

better prefetching.

More than once during our conversations with Weber, it became clear that future multi-core AMD processors will continue to keep their L1 and L2 caches separate, but that a shared L3 cache will eventually be introduced to help reduce memory latency and keep those cores fed.

To Weber's second point, the use of helper threads (compiler- or application-generated threads that go out and prefetch useful data into cache before it's requested) will also improve single core performance. Intel has been talking about helper threads since before Hyper-Threading, but there is still no word on when we can expect a real-world implementation.

The topic of instruction combining was also interesting because it is something that, to date, we have only seen used in the Pentium M (micro-op fusion). Weber couldn't elaborate on an AMD implementation of instruction combining, but we did get the distinct impression that it's something that's in the cards going forward. It looks as if elements of both AMD's and Intel's present-day architectures will shape tomorrow's designs.

In the end, Fred left us with the following: where single-core performance used to improve at a rate of roughly 40% every 12 to 18 months, it will now improve at about half that rate for the foreseeable future.


35 Comments

I'm a bit confused by the terminology in places. Doesn't ILP mean "Instruction-Level Parallelism", i.e. something that applies to distributing instructions between different execution units, and perhaps other tricks like out-of-order execution, branch prediction, etc.? But it certainly does NOT include "frequency", as seems to be implied by the first page! Unless it means that the longer pipeline will be interpreted as more parallelism (which is true). But that's not the only way to increase clock speed... a lot comes from the process technology itself.

I just realized that the link to "IDF Spring 2005 - Predicting Future CPU Architecture Trends" requires that you go to the next page, not the one the link points to; it is there that ILP/TLP is explained.

Thanks Filibuster. The article confirms the up-to-30% gain in processing power under certain multithreaded scenarios. But I am still confused as to why this is a waste of resources, especially when HT was designed for multiple-thread use.

The point of hyperthreading being a waste of resources is that it costs A LOT to put features like that into hardware, and the die space and transistors used to do HT could probably have been used in a better way to create a more consistent performance gain, or could have been left out altogether, reducing the complexity, size, and power use/heat output of the processor and putting a little more profit per chip sold at the same price into Intel's pocket. That is why it is a misuse of resources.

"not sure what you mean by "processing efficiency". all HT does is virtually separate the processor into two threads. maybe I'm missing something, but I can't figure out why everyone associates HT with performance gain. "

There are supposedly fewer mispredictions in the pipeline since two threads share the same pipes. Even though the total processing power is split between them, the sum of the two appears to be greater with HT. Increases of up to 30% in total output have been reported when running two instances of folding@home with HT.

So I am still wondering why Fred is calling it a "misuse of resources". Maybe he knows something we don't. It would be interesting to know more about this. Maybe someone at AnandTech can get a clarification from Fred?