The Mystery of the Missing Performance

As in past experience, we saw some very odd system behavior in testing and suspected that Cool & Quiet may have had something to do with it. To test this theory, we pulled out our Photoshop CS3 benchmark and ran it once with Cool & Quiet off and once with it on. Our results were staggeringly different.

Our goal was to test with C&Q both disabled and enabled. It so happened that our first run through the benchmarks was with the power saving feature disabled. Our numbers looked much better than in previous tests, and it seemed like everything made sense once again.

When we enabled C&Q for the second run, however, the issue seemed to have disappeared (as has randomly happened in the past as well). We did install AMD's Power Meter in order to verify that C&Q was working (it was), and it is possible that installing this software somehow fixed the issue. But since the issue has randomly come and gone in the past, we really can't suggest this as a surefire fix either.

In trying to reproduce the problem, we uninstalled the power meter, rebooted with CnQ disabled, and then re-enabled CnQ. None of this brought back the poor performance we saw, but in another odd twist, CnQ didn't really provide any power advantage either. Since we didn't measure power while the problem was apparent, it's entirely possible that the power savings of CnQ are also afflicted by whatever underlying issue is at play here.

In fact, if both performance and power savings were negatively affected by whatever is happening, we would not be surprised. AMD has informed us that our power numbers don't show as much of a savings as they would expect from CnQ (interestingly enough, Johan saw similar behavior in his latest piece). We've asked AMD to help us track down the issue, but their power guy is currently on vacation so it will be a little while.

Because the Photoshop test took so much less time without CnQ, we actually wanted to measure power usage over the test and compare the energy used (watts * seconds to give joules). We fully expected the non-C&Q mode to complete the test so much more quickly that it would use less total energy to perform the operation. Unfortunately, we were unable to verify this theory.
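For clarity, the energy comparison we had in mind is simple arithmetic. The power draws and run times below are hypothetical placeholders, since we never captured power readings while the problem was apparent:

```python
# Energy = average power draw (watts) x elapsed time (seconds), in joules.
# All figures here are hypothetical placeholders, not measured values.

def energy_joules(avg_watts: float, seconds: float) -> float:
    """Total energy consumed over a benchmark run."""
    return avg_watts * seconds

# Hypothetical: CnQ off finishes fast at a high power draw,
# CnQ on takes much longer at a lower average draw.
cnq_off = energy_joules(avg_watts=140.0, seconds=60.0)   # 8400.0 J
cnq_on  = energy_joules(avg_watts=110.0, seconds=110.0)  # 12100.0 J

# With these (made-up) numbers the faster run uses less total energy,
# which is exactly the outcome we expected but could not verify.
print(cnq_off, cnq_on)
```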

One thing is for certain, something is definitely not working as it should.

We do have a couple of theories, though nothing confirmed or that really even makes a lot of sense yet. Why not share our thoughts and musings and see what comes of it, though? It worked fairly well in helping us find the instruction latency of the GT200, right?

One of the first things we thought was that the Phenom takes longer to come out of its low power state than it should, but AMD did say that there's no reason the X2 should be able to do this any faster.

Our minds then wandered to what we saw in the AMD Power Meter. Since Windows Vista takes it upon itself to move threads between cores in fairly stupid ways, during the Photoshop test we saw what looked like threads bouncing around between cores or cycling through them in rapid succession. Whatever was actually happening, the result was that one core would ramp up to full speed (from 1GHz up to 2GHz) and then drop back down as the next core came up to speed.

We talked about how threads moving between these different cores, each needing to wake the next core up rather than running on a core already at speed, could possibly impact performance. As the Phenom is the only CPU architecture we currently have access to with individual PLLs per core (Intel's CPUs must run all cores at the same frequency), the CnQ issues could be related to that.
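As a toy model of that cost, assume every migration lands on a core parked in its low P-state, which then has to step through a few voltage/frequency changes before running at full speed. All of the numbers below are illustrative assumptions, not measured Phenom values:

```python
# Toy model: time lost when a single thread bounces across cores that
# each have to ramp from an idle P-state to full speed. The migration
# rate, step count, and settle time are made-up illustrative figures.

def time_lost_to_ramps(migrations: int, ramp_steps: int,
                       settle_us: float) -> float:
    """Microseconds spent waiting for voltage/frequency to settle,
    assuming every migration lands on a core in its low P-state."""
    return migrations * ramp_steps * settle_us

# e.g. 1000 migrations/s, 4 voltage/frequency steps, 100us settle each:
# 0.4 seconds of every second spent waiting on ramps.
print(time_lost_to_ramps(1000, 4, 100.0))
```

Even with modest per-step settle times, a high migration rate multiplies the waiting out to a very visible slowdown, which is the shape of the problem we saw.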

There has to be an AMD-specific factor that causes this problem -- and not only that, but a Phenom-specific one, because we've never seen this problem on other AMD parts.

Or have we?

AMD GPUs last year exhibited quite an interesting issue with their power management features, one that was clearly evident in specific locations while playing Crysis. The culprit was the dynamic clocking of the GPU based on graphics load. Because the hardware was able to switch quickly between modes, and due to the way Crytek's engine works, AMD GPUs were constantly speeding up and slowing down in situations where they should have been at full speed the entire time.

It is entirely possible that the CPU issue is of a similar nature. Perhaps the hardware that controls the clock speed is slowing down and then speeding up each core when it should just keep the core at full speed a short while longer. The solution to the GPU issue was to increase the amount of time the GPU had to show lowered activity before it was clocked down. This meant that an increase in activity would result in an instant speed bump, while the GPU had to remain relatively lightly used for a longer period of time (still less than a second if I recall correctly) before it was clocked back down.
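A minimal sketch of that kind of hysteresis, with made-up load thresholds and a made-up down-clock delay: clock up instantly on load, but only clock down after the load has stayed low for several consecutive samples:

```python
# Sketch of a hysteresis governor like the GPU fix described above:
# up-clock immediately on load, down-clock only after a sustained run
# of idle samples. All thresholds here are illustrative assumptions.

def govern(load_samples, up_threshold=0.75, down_threshold=0.30,
           down_delay=5):
    """Yield the clock state ('high' or 'low') for each load sample."""
    state, idle_run = "low", 0
    for load in load_samples:
        if load >= up_threshold:
            state, idle_run = "high", 0      # instant speed bump
        elif load <= down_threshold:
            idle_run += 1
            if idle_run >= down_delay:       # lightly used long enough
                state = "low"
        else:
            idle_run = 0
        yield state

# A brief dip in load no longer causes a down-clock:
print(list(govern([0.9, 0.1, 0.1, 0.9])))  # ['high', 'high', 'high', 'high']
```

The key property is that short dips in activity, like those between bursts of work in a game engine or a filter pipeline, no longer bounce the clock down and back up.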

Yes, it's the same company, but the similarities do go a bit deeper. We really don't know what the heart of the matter is, but this kind of problem certainly is not without precedent. We will have to wait for AMD to help us understand better what is happening and if there is anything that could be done about it. We do hope you've enjoyed our best guesses though, and please feel free to let us know if you've got any other plausible explanations we didn't address.

36 Comments

This article got me thinking: why would you purchase a 9350e for an HTPC (which was what I was planning) when for the same or less money, you could get a 9850 Black Edition and just set the multiplier to 10 (instead of 12.5)? You'd have a CPU that you could use later on at full speed or OC'd, but for now underclock it on an HTPC to the 9350e speed, and you would still have the bus at 4000 (not 3600) and the NB and HT speed at 2.0GHz (not 1.8GHz). One thing I would like to know is what the power usage would be if you did this (125W part), compared to a 9350e (65W).
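The clock arithmetic behind this comment can be spelled out as a quick sketch; the 200MHz reference clock is the standard Phenom base, and the NB/HT figures are taken from the comment itself:

```python
# The underclocking arithmetic from the comment above, spelled out.
# Phenom core clock = 200MHz reference clock x CPU multiplier.

REF_CLOCK_MHZ = 200

def core_clock_mhz(multiplier: float) -> float:
    return REF_CLOCK_MHZ * multiplier

print(core_clock_mhz(12.5))  # 2500.0 -- stock 9850 Black Edition
print(core_clock_mhz(10.0))  # 2000.0 -- matches the 9350e core clock

# The NB/HT clock is set independently, which is the commenter's point:
# a downclocked 9850 BE keeps its 2.0GHz NB/HT link (4000MT/s), while
# the 9350e runs it at 1.8GHz (3600MT/s).
```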

An excerpt:
Possible decrease in performance during demand-based switching
Demand-Based Switching (DBS) is the use of ACPI processor performance states (dynamic voltage and frequency scaling) in response to system workloads. Windows XP processor power management implements DBS by using the adaptive processor throttling policy. This policy dynamically and automatically adjusts the processor’s current performance state in response to system CPU use without user intervention.

When single-threaded workloads run on multiprocessor systems that include dual-core configurations, the workloads may migrate across available CPU cores. This behavior is a natural artifact of how Windows schedules work across available CPU resources. However, on systems that have processor performance states that run with the adaptive processor throttling policy, this thread migration may cause the Windows kernel power manager to incorrectly calculate the optimal target performance state for the processor. This behavior occurs because an individual processor core, logical or physical, may appear to be less busy than the whole processor package actually is. On performance benchmarks that use single-threaded workloads, you may see this artifact in decreased performance results or in a high degree of variance between successive runs of identical benchmark tests.
...

It also explains how to change the policy in Windows regarding this behavior, and while this is about XP, I would not be surprised if Vista inherited it.
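The artifact the excerpt describes can be illustrated with a small sketch: a single fully busy thread spread evenly across cores makes every individual core look mostly idle, so a per-core policy may pick too low a performance state:

```python
# Sketch of the DBS artifact: one thread at 100% CPU that migrates
# evenly across four cores makes each core appear only 25% busy, even
# though the package as a whole is running a fully CPU-bound workload.

def per_core_utilization(thread_busy_fraction: float, num_cores: int):
    """Apparent utilization of each core when one busy thread is spread
    evenly across all cores by the scheduler."""
    return [thread_busy_fraction / num_cores] * num_cores

cores = per_core_utilization(1.0, 4)
print(cores)             # [0.25, 0.25, 0.25, 0.25]
print(max(cores) < 0.5)  # True: no core looks busy enough to ramp up
```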

I see that they have a lot of issues still. They need to make sure that their board partners and chipset designs are really stable and ready for the "future"... I thought that was the whole point of AMD's Spider platform: that they weren't gonna be playing around with switching sockets on users and such.

Well, anyway... I love having a quad-core (Intel) and I think that these products are all at a nice performance level. I do believe it helps for them to have ATI right now. Because if I were going to buy a CrossFire system, which I think makes more sense now than at any other point in time, it would be tempting (though still ultimately wrong) to get a 2.5GHz AMD quad-core to have everything match.

Wow, it's a great time to be a computer shopper. Today's AMD would have trounced yesterday's Intel, but Intel is just so lean and producing better and better stuff so efficiently that it's a tall order to even compete. AMD does seem to be making its appeals to the big guys pretty effectively; it seems like when you go to brick-and-mortar stores, about half of the systems are AMD-based.

Looking into the future, AMD has what it takes to remain relevant in the market. They need to switch to a smaller process and save energy, and keep the 3 or 4 core thing going... no dual cores at all is the right thing to do.

I'm thinking that AMD could really grow into a huge beast with their AMD/ATI combo, just look at the beating NVIDIA is getting all of a sudden. The Radeon 4850 is going for $170.00??? And it's in abundant supply... wow, sucks to be lil' NVIDIA right now. AMD beat Intel when Intel was doing all kinds of illegal stuff that made it irrelevant, next time AMD beats Intel (if) it's going to be a very very big deal.

AMD has market share and name recognition, all they need now is a killer product, and I think they're getting really close to having one.

VIA is playing in a league of its own (or played, until the Intel Atom). Even with the Atom, the northbridge uses a lot of power, so VIA can still compete.
VIA is now all about integrated platforms with low performance at very low power draw (very low compared to the x86 world of today).
As for the 64-bitness of the S3 Chrome memory controller, this too helps save both cost and power.

I realize that the Windows world is still 32-bit, but we're talking about processors here. I run a pure 64-bit Linux (Debian/Lenny) with (almost) no 32-bit applications, certainly none among the ones I use frequently that need full processor speed. Why are the tests all done with 32-bit software?

Even if you can't run all the tests in 64-bit mode, surely a few benchmarks are available in 64-bit so we can get an idea of how the various processors perform in their native mode?

The first thing to notice is that AMD is launching a new model clocked 100MHz higher for the same price as the previous 9850 at 2.5GHz. The small boost in frequency should make this CPU roughly equivalent to Intel's Q6600. Unfortunately, this comes at the expense of power consumption: the TDP rises 15W compared to the previous model, for a total of 140W. Therefore, even with its Q6600, Intel holds a clear advantage in terms of power consumption, all the while being slightly less expensive (officially $224)....

I was just noting how new Anandtech articles are often a little slow to come by, but WOW! Chock full of great info, insight, and in-depth information. It's like a good book you just can't quit reading till it's done. Great job, guys!

1. Are there no CPU drivers for the Phenom? These should allow changing the C&Q configuration through Windows' energy options, which is better than fiddling in the BIOS. This works with my Athlon64 3200+ on WinXP Pro.

I'm also running AMD Power Monitor, which allows quickly switching the active energy setting from its tray icon's popup menu, so I don't have to open the energy options every time.
This is great for debugging C&Q-related problems, which can happen on single-core CPUs too.

You can set the VID and Frequency with these registers. A frequency ramp-up or ramp-down usually consists of several steps of changing voltage and frequency until the target frequency is reached. There is a short waiting period after each change until the new voltage step has stabilized.
These waiting periods seem to be contained in some more CPU registers; if these values are wrong, C&Q will obviously not work as planned.
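For reference, here is a rough sketch of how a Family 10h P-state register value might be decoded, following the field layout described in AMD's BKDG (CpuFid in bits 5:0, CpuDid in bits 8:6, CpuVid in bits 15:9). Treat the exact bit positions and the CoreCOF formula as assumptions to check against the BKDG rather than gospel:

```python
# Sketch of decoding a Family 10h (Phenom) P-state register value.
# Field layout assumed from the BKDG: CpuFid [5:0], CpuDid [8:6],
# CpuVid [15:9]; CoreCOF = 100MHz * (CpuFid + 10h) / 2^CpuDid.

def decode_pstate(msr_value: int) -> dict:
    fid = msr_value & 0x3F          # frequency ID
    did = (msr_value >> 6) & 0x7    # divisor ID
    vid = (msr_value >> 9) & 0x7F   # voltage ID
    core_mhz = 100 * (fid + 0x10) / (2 ** did)
    return {"fid": fid, "did": did, "vid": vid, "core_mhz": core_mhz}

# Example: fid=4, did=0 -> 100 * (4 + 16) / 1 = 2000MHz
print(decode_pstate((0x20 << 9) | (0 << 6) | 4))
```

Dumping and decoding the P-state registers on a working and a misbehaving board would be one way to see whether the BIOS has programmed something odd.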

Now I don't know if these waiting values are determined by the CPU or written to it by the BIOS/kernel driver during its respective initialization (I only skimmed this section in the BKDG, not enough time ATM). If they are determined by the CPU itself and C&Q goes wrong, this would indicate a CPU bug. If they are determined by the BIOS however, this would be a BIOS bug and fixable with a BIOS update.
If you find a Phenom motherboard which doesn't exhibit these problems, then the BIOS would obviously be the culprit.

Oh, and a wild-ass-guess: maybe these waiting periods are (in part) influenced by temperature, and vary with changing CPU temperatures - this would of course imply that they aren't determined once at startup, but recomputed at given intervals. This should be testable, too.

3. Even if it's just a BIOS bug, a fix won't completely solve the performance loss with PS, due to the Phenom's split power planes. As long as the kernel throws threads around the cores willy-nilly, you will get a performance loss even if C&Q works properly; frequency ramp-up always takes some time when C&Q is actually working. This specific problem cannot occur on CPUs with a single power plane (which the C2Qs still have, I think), as all cores will always have full voltage as long as a single core runs at 100%; only the frequency might need to be adjusted. Of course, the performance loss you measured is extreme and indicates a broken C&Q ramp-up speed.

BTW, this wild thread-changing you observed will always cost some performance for most code, as a core change makes the L2 contents useless. This is only 512K on the K10, but on a C2Q a change between cores 0/1 and 2/3 will invalidate up to 6MB, depending on CPU model and cache usage. This doesn't matter much with today's memory bandwidths, but I'd still regard this as a kernel bug.
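As a rough illustration of that migration cost, here is a back-of-the-envelope calculation of the time to re-stream a cold cache from memory; the sustained bandwidth figure is an assumption for illustration, not a measured value:

```python
# Back-of-the-envelope cost of losing warm L2 contents on a migration:
# the time to refill the cache from memory at an assumed sustained
# bandwidth. The 8GB/s figure is illustrative, not measured.

def refill_time_us(cache_bytes: int, bandwidth_gb_s: float) -> float:
    """Microseconds to stream cache_bytes from memory."""
    return cache_bytes / (bandwidth_gb_s * 1e9) * 1e6

print(refill_time_us(512 * 1024, 8.0))      # K10: 512KB L2, ~66us
print(refill_time_us(6 * 1024 * 1024, 8.0)) # C2Q: up to 6MB, ~786us
```

Tens to hundreds of microseconds per migration is small on its own, which matches the commenter's point that the cache effect alone cannot explain the slowdown seen in the article.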