Along with Scott Wasson of The Tech Report and Kyle Bennett of HardOCP, we recently had some time to sit down and talk with AMD's CTO Fred Weber about his vision of the future of microprocessors. We took the opportunity to compare and contrast his vision with our discussions from this year's Spring IDF that we had written about.

The ILP/TLP Debate in AMD's Shoes

When we talked to Intel at IDF, we had the distinct impression that the focus on improving microprocessor performance as a whole had shifted pretty significantly from ILP to TLP. To put it plain and simple, making individual cores faster was no longer top priority; rather, getting multiple cores to work together was the new focus.

Weber's stance on ILP vs. TLP tended to agree with what we had heard from Intel; TLP is the future and using ILP to increase performance is at a point of extremely diminished returns. That being said, we asked Fred where he thought the improvements in ILP would be going forward and he responded with the following four areas:

Frequency

Reducing Memory Latency

Instruction Combining

Branch Prediction Latency

Fred's number one increase for single core, single thread performance was clock frequency, so we will inevitably see that clock speed will go up as time goes on. It is quite possible that combined with a reduction in branch prediction latency, future versions of the Athlon 64 will use a lengthened pipeline to reach higher operating frequencies. If paired with Prescott-caliber branch predictors, a somewhat deeper pipelined K8 would provide additional frequency headroom without too much worry.

Behind clock frequency, Weber saw reducing memory latency as the other major way of increasing single core performance. Reducing memory latency in this sense basically means two things:

higher levels of cache hierarchy, and

better prefetching.

More than once during our conversations with Weber, it became clear that future multi-core AMD processors will continue to have their L1 and L2 caches separate, but a shared L3 cache will eventually be introduced to help reduce memory latency and keep those cores fed.

To Weber's second point, the use of helper threads (compiler or application generated threads that go out and work on prefetching useful data into cache before it's requested) will also improve single core performance. Intel has been talking about using helper threads since before Hyper Threading, but there is no idea of when we can expect real world implementation of helper threads at this point.

The topic of instruction combining was also interesting because it is something that we have only seen used in the Pentium M (Micro-Ops Fusion). Weber couldn't elaborate on an AMD implementation of some form of instruction combining, but we did get the distinct impression that it's something that's in the cards going forward. It looks as if elements from both AMD's and Intel's present day architectures will shape tomorrow's designs.

In the end, Fred left us with the following: if you see single core performance improving at a rate of 40% per 12 - 18 months, it will now improve at about half that rate for the foreseeable future.

#13, HT is kind of like hardware allowing context switching at instruction speed levels. Tyipcally, a thread that stalls on IO (like a hard drive) or something gets swapped out and another thread runs until the IO request completes. However, if a thread just can't use a cache well (streaming data, for example) all of those stalls due to memory loads just cause the CPU to sit and wait. These stalls are on the order of 10s of clock cycles. Other IO is on the order of 1000s of clock cycles (or more). A context switch is on the order of 100s of clock cycles. Obviously, you don't want to swap threads just because of a L2 cache miss. However, HT allows two thread contexts to be loaded so that when one thread stalls on a L2 cache miss, for example, the other thread can execute instructions with no delay. It's like shuffling cards. Basically, it allows the CPU to execute two contexts on the granularity of a clock cycle or two rather than on 100s of clock cycles.

So, as an example, the worst case for a thread is that every piece of data it wants will generate an L2 cache miss. On a non-HT processor, this means that this thread will not be swapped out until its scheduling quantum is met. But, during that time, the CPU will in effect be idle for probably 90% of the time due to all the cache misses. Since the thread won't be swapped out, your CPU will effectively be used for only 10% of the time during that quantum, then the next thread is allowed to run. With HT, both threads are loaded and those 90% of the cycles that the "bad" thread would waste can actually be used by the other thread.Reply

#11-not sure what you mean by "processing efficiency". all HT does is virtually separate the processor into two threads. maybe I'm missing something, but I can't figure out why everyone associates HT with performance gain.Reply

Could it be that possibly the reason for the slowdown in clock increases is not due to AMD/Intel R&D but rather software companies that are not keeping up.... As far back as I can remember many programs were able to utilize the new speed increases effectivly whereas now, a budget "3000" cpu is already kinda overkill for many office apps....
gaming is the only arena where the software is pushing the hardware (maybe video editing too but that market is much smaller?)

there needs to be more innovation on the software front to utilize the added hardware benefits... is a positive reinforncement routine....
If there was that push, I have no doubt that the speed increases would happen at a much better rateReply

An architecture with several cores, with one more powerful than the others, requests the programmer to tell to each thread what kind of performance it needs. While this could be accepted by console developers (that work very close to the hardware layers), you can say bye bye to easy porting to that platform.
while the performance increase can be substantial, the trade off is very specific code even at the highest levelReply

"Fred’s response to this question was thankfully straight forward; he isn’t a fan of Intel’s Hyper Threading in the sense that the entire pipeline is shared between multiple threads, in Fred’s words 'it’s a misuse of resources.'"Reply

Wish some comment was made on HT, intel only real saving grace for last couple years. Guess with DC it becomes a non-issue though at that point.

Hehe nice to see CPU world going full circle... AMD copied Intel like nobodies biz now it's the other way around. Props to AMD for innovating dispite thier punny size.. they definity should be rewarded by sales. I know I made the right choice with A64, the latency he mentions you can feel all the time, hard to "benchmark" it other than system just feels snappy compared to any other CPU I've used to including a P4C oC'ed to 3.4, A-XP OCed to 2.7, and IBM chips from apple at 2.5.Reply