Introduction

In our first article, we explained that dynamic power, power leakage, the memory wall and wire delay have forced CPU designers to rethink the methods that they use to achieve higher performance CPUs.

In Part 2, we investigate the advantages and disadvantages of the new market trend: multi-core CPUs. Will dual core enhance your gaming experience? Tim Sweeney, the leading developer behind the Unreal 3 engine, was kind enough to give concise answers to our questions about multi-threaded development. There is more: in the third part of this series, we will investigate what future multi-core and single core architectures will bring. We examine whether the stories about "the new era of multi-threaded multi-core CPUs" are true and whether this will really benefit the consumer.

Should you care?

Should you care whether or not we are moving to multi-core and multi-threaded CPUs? After all, over the past decades, we have consistently been able to get more performance for lower prices. However, it is far from clear that multi-cores will benefit all consumers. We will explain this statement in more detail, because what matters is whether or not they will benefit you. The last spring IDF was all about multi-core CPUs, but there was very little information on how this is going to benefit consumers. Let us take a critical look at this new direction that desktop CPUs have taken.

Multi-core, multi-expensive?

Dual cores are expensive to manufacture. Yields (the fraction of working chips on one wafer) drop as die size grows, so larger, dual core chips will always have lower yields than smaller, single core chips on the same process technology. But that is only a small problem. A bigger and more obvious problem is that you get only half the number of chips per wafer (even slightly less). So, dual cores (such as Presler) cost at least twice as much to manufacture as a single core chip - and most likely more (such as Yonah and the Pentium-D).

Dual and multi-cores might not increase the thermal density (dissipated power per mm²), but they do increase the total power. Granted, from the viewpoint of a heat sink designer, it is not much harder to cool a theoretical 206 mm² Pentium-D that dissipates 180 Watt than a 112 mm² Prescott chip that dissipates roughly 90 Watt. However, making sure that those 180 Watts do not cook all the components inside your computer is an almost impossible task for the system designer who wants to build a relatively silent PC. The result is that multi-core CPUs will run at lower clock speeds than their single core counterparts. The Pentium-D, the dual core Prescott, is limited to 130 Watt and 3.2 GHz, while the current Prescott dissipates up to 115 Watt and runs at 3.8 GHz.

And last, but not least, dual core CPUs need more bandwidth than a single core to make a difference, and they increase the "CPU perceived" latency. Cache coherency traffic and contention for the same memory bus both increase the total latency that the CPU sees and thus lower performance.
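To put rough numbers on the cost argument above, here is a minimal sketch. It assumes a 300 mm wafer, a simple Poisson yield model and a made-up defect density; none of these come from the article, and only the two die sizes (112 mm² and 206 mm²) are taken from the text. The function names are hypothetical.

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Common approximation: gross dies minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)

def poisson_yield(die_area_mm2, defects_per_mm2):
    """Assumed yield model: fraction of good dies = exp(-area * defect density)."""
    return math.exp(-die_area_mm2 * defects_per_mm2)

def relative_cost_per_good_die(die_area_mm2,
                               wafer_diameter_mm=300,   # assumption
                               defects_per_mm2=0.002):  # assumption, illustrative only
    """Cost of one good die in units of 'one wafer'."""
    good_dies = (dies_per_wafer(wafer_diameter_mm, die_area_mm2)
                 * poisson_yield(die_area_mm2, defects_per_mm2))
    return 1.0 / good_dies

single_core = relative_cost_per_good_die(112)  # Prescott-sized die
dual_core = relative_cost_per_good_die(206)    # theoretical dual core die
cost_ratio = dual_core / single_core
print(f"dual core costs about {cost_ratio:.2f}x a single core per good die")
```

With these assumed inputs the ratio comes out above 2, because the larger die loses on both counts: fewer candidates per wafer and a lower share of them working. A different defect density shifts the exact figure, not the direction.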

Multi-core, multi-performance?

The advantages of multi-core and multi-threaded CPUs far outweigh the disadvantages in the server market. Because most server applications produce a lot of threads and processes, performance scales close to linearly as more cores are added to the die. This is in sharp contrast with the superscalar CPU, where increasingly complex designs require exponentially more transistors and power yet show diminishing returns, especially in server applications where the IPC can go below 1. While dual core CPUs are more expensive to manufacture, they are far easier to design than turning a single core CPU into an even wider issue, more complex CPU. Development costs for a new CPU design are astronomically high. So, it does not surprise us at all that server CPU manufacturers have turned en masse towards multi-core CPU designs: significant gains in processing power for a fraction of the time and money invested. And the same can be said about a big part of the HPC market.

For a good example of how well server applications can scale with more CPUs, refer to our DB2 tests, which showed up to a 96% performance increase going from one CPU to two, and a boost of up to 89% when we increased the number of Opterons from two to four. Most desktop and many workstation applications are single-threaded, however. Or more accurately, they might be multithreaded to be more responsive, but there is only one thread that really needs CPU power.

Even some workstation applications that are supposed to be prime examples of multi-threaded applications are not as multi-core friendly as they appear to be. I ran a lot of Adobe Premiere benchmarks with different video formats, and I found that the second CPU offered a meagre 10% to 40% speed increase in video editing (rendering). 3DSMax shows big increases only when you use very complex scenes. When using a relatively light animation scene, the second CPU adds about 20% to 50%. One of the best scenes, the architecture scene of the Spec test, shows an 89% increase when adding a second Opteron, but two extra Opterons already show some diminishing returns - performance went up by 72%.
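One way to read these figures is through Amdahl's law, which the article does not name but which fits the pattern it describes. The sketch below inverts the law to estimate what fraction of each workload would have to be parallel to produce the quoted two-CPU speedups. The labels and function names are illustrative, and the four-CPU column is a model prediction, not a measurement.

```python
def parallel_fraction(speedup, n_cpus):
    """Invert Amdahl's law S = 1 / ((1 - p) + p / n) to solve for p."""
    return (1 - 1 / speedup) * n_cpus / (n_cpus - 1)

def amdahl_speedup(p, n_cpus):
    """Speedup predicted by Amdahl's law for parallel fraction p."""
    return 1 / ((1 - p) + p / n_cpus)

# Two-CPU speedups quoted in the text (e.g. a 96% increase is 1.96x).
observed = {
    "DB2, one to two CPUs": 1.96,
    "Premiere rendering, low end": 1.10,
    "Premiere rendering, high end": 1.40,
    "3DSMax architecture scene": 1.89,
}

for name, speedup in observed.items():
    p = parallel_fraction(speedup, 2)
    print(f"{name}: parallel fraction ~{p:.2f}, "
          f"model predicts ~{amdahl_speedup(p, 4):.2f}x on four CPUs")
```

Under this model, a workload that gains only 10% from a second CPU is less than one fifth parallel, so no number of extra cores would help it much.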

Multitasking scenarios might be another way to use the power of dual and multi-cores. However, many of the CPU heavy applications that desktop and workstation users like to run in the background - archiving, encoding - also operate on the hard disk. And despite the merits of NCQ (Native Command Queuing), high rotation speeds, and lower seek times, disk heavy tasks and especially multithreaded ones can bring a whole system to a crawl when there is too much hard disk activity. So, it is clear that there are big challenges ahead before multi-core CPUs will really bring benefits to most consumers and employees.

49 Comments

I think there are a few things that most people overlook when looking at multi-CPU/multi-core. Almost all the benchmarks that I have seen are written and tested on systems with clean installs and no other programs running (anti-virus, aim, msn, teamspeak, IRC, p2p software, firewall, decode human genome :b, etc). I would think that most people do leave many programs open, such as those above, when playing games.

With this in mind, people will find an increase in system performance when leaving multiple programs running. It won't be an increase in performance on benchmark testbeds so much as an increase in real world performance.

So basically it won't increase speed in these circumstances, but it will limit the drop in fps while running many different programs.

Article: "Be warned that Intel was already showing performance increases, which are not realistic "up to 124%"."

#5, there's another explanation as well, but it's a more rare condition. Suppose you had two processors (doesn't even have to be dual core), each with 1M L2 cache. Suppose you also had a problem that has data that is 1.5M in size and is very coarse grained (very parallelizable). One processor cannot fit all the data into L2 cache so it will have to run at main memory speeds most/all of the time. With two processors, each gets 768K, which can easily fit into its L2 cache, which enables each processor to run at L2 cache speeds. This would show up as a superlinear speedup (two cores = more than 2X as fast). This is an extreme example, but one I expect to find in published marketing propaganda.
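The cache argument above can be sketched as a toy model. It assumes, purely for illustration, that a working set which does not fit in L2 runs entirely at main memory speed and that main memory is 10 times slower per KB than L2; the cost figures and function names are hypothetical, while the 1.5M data set and 1M caches come from the comment.

```python
def per_cpu_working_set_kb(data_kb, n_cpus):
    """Perfectly divisible (coarse grained) problem: each CPU gets an equal slice."""
    return data_kb / n_cpus

def run_time(data_kb, cache_kb, n_cpus, cache_cost=1.0, memory_cost=10.0):
    """Toy cost model: time per KB depends only on whether the slice fits in L2.

    cache_cost and memory_cost are made-up relative costs per KB.
    """
    slice_kb = per_cpu_working_set_kb(data_kb, n_cpus)
    cost_per_kb = cache_cost if slice_kb <= cache_kb else memory_cost
    return slice_kb * cost_per_kb

DATA_KB = 1536    # 1.5M of data
CACHE_KB = 1024   # 1M of L2 per processor

one_cpu = run_time(DATA_KB, CACHE_KB, 1)   # slice does not fit in L2
two_cpus = run_time(DATA_KB, CACHE_KB, 2)  # 768K per CPU fits in L2
speedup = one_cpu / two_cpus
print(f"speedup with two processors: {speedup:.1f}x")
```

Any speedup above 2x here comes from the change in where the data lives, not from the second processor doing more than its share of the work.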

#13 " A though! I still think threads are rubbish, that processes and better schedulers are the way forward. "

Well, with threads you get shared memory for "free", if you've ever written processes that use shared memory, well, there you are. However, since a threaded kernel and a process based kernel are pretty much the same when a process has only one thread, there's little difference between the two for single-threaded executables and you can continue to use your multi-process model without any problems.

As with #17... like it or not, multi-core/multi-processor boxes are what's coming. You can choose to use what resources are available to you or you can stick to one process programming. Some groups will choose to use what resources are available and some won't. The marketplace will sort out the winners/losers based on which solution is better.

#18 The PPU is just another form of multiprocessing (just like GPUs are). It's just Asymmetric Multiprocessing (AMP) instead of Symmetric Multiprocessing (SMP). It's not new or anything. I do agree, though, that the PPU has a lot of potential and, just out of my own preferences, goes by the idea that adding specialized hardware (cheaply) usually is a bigger win than adding more generalized hardware. Just think of graphics cards today. Adding a relatively cheap graphics card will make your game run much better/prettier than adding another P4 or Opteron.

Basically, my thoughts are this: The gaming industry has already gone "multi-threaded" in an asymmetric way simply because of 3D video cards. They have already solved some problems by abstracting parts of their problem. This is simply adding more resources that they can take advantage of, or not, as they see fit. Having dual-core or dual processor systems doesn't prevent them from writing as they've done today. The main issue, for the short term, is that they will need to know whether or not they are on a dual core machine and write accordingly. The main reason that multithreaded games haven't really caught on as of yet is that 99% (or more) of the target audience has only one core. Spending the amount of time/effort to optimize for dual processors for less than 1% of your target market doesn't make sense. If 90% of the market had dual processors, then it would probably be worth the effort to plan to use the resources available. Since both major CPU houses are going dual core and it looks like that's the "way it's gonna be", there will be a rocky period for a while while dual core machines are rare, but they will get more common until the point where they are in the majority. At that time, it will make sense to consider single core machines as the degenerate case and, basically, make single cores the exception instead of the rule.

#20, a multicore implementation could have a shared cache, and also very fast inter-processor communications. You could write a program with small interdependent threads that wait for each other to finish and then update parts of some common data. The data used stays in the common cache, and every update is made extremely fast.
Compare this to a dual processor, which must keep its caches in synchronization. After a fraction of a millisecond (or less) of work, the processors update different portions of the common data. And there it goes: invalidation of cache lines, writing of modified cache lines to memory, the processors fighting for a single FSB (the case with Intel Pentium processors), and so on. You can see that there are some cases (even if somewhat artificial) where a proper implementation of dual core can be much faster than a multiprocessor.
The biggest advantage that multicore will have over multiprocessor would be in numerical tasks like weather prediction and other highly interdependent computation tasks.

"a multicore implementation could have shared cache and also have very fast inter processor communications... Compare this to a dual processor, that must maintain its caches in synchronization." Is this the real reason that multi-core multiprocessing is better than multi-chip multiprocessing (the traditional SMP)? A multi-core chip can have dedicated caches (per core) too, and that requires synchronization. And multi-chip SMP could also have shared cache and fast inter-chip communication. Well, you may argue that it is easier to make inter-core communication faster than inter-chip communication. But is this really the fundamental reason why multicore is better than multichip? Could someone explain why a processor manufacturer and a consumer would prefer making/buying a multicore rather than multichip processors? As far as power consumption and leakage are concerned, isn't it true that multichip is more manageable? In a paper "Planning Considerations for Multicore Processor Technology" by John Fruehe (May 2005) in dell.com/powersolutions, the author compares the effective performance level of multicore and multichip processors. (But he does not address my question.) Without giving a reason, he assumes that the core-to-core scalability is 70% (that is, the second core delivers 70% of its processing power due to overhead) whereas the estimated socket-to-socket (i.e. chip-to-chip) scalability is 80% (that is, the dual processors achieve 180% of the processing power of a single processor). That is kind of interesting. I really want to see a comparison between multi-core multiprocessing and multi-chip multiprocessing.
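The scalability figures quoted above reduce to simple arithmetic. The sketch below assumes that the first core or socket counts as 1.0 and that each additional one contributes the stated fraction; this is one reading of the comment, not necessarily the exact model in the cited paper, and the function name is hypothetical.

```python
def effective_performance(units, scalability):
    """First unit delivers 1.0; each extra unit adds `scalability` of a unit."""
    return 1.0 + (units - 1) * scalability

dual_core = effective_performance(2, 0.70)    # two cores in one socket
dual_socket = effective_performance(2, 0.80)  # two single core sockets

print(f"dual core:   {dual_core:.2f}x a single core")
print(f"dual socket: {dual_socket:.2f}x a single core")
```

On these assumed figures the dual socket system comes out slightly ahead, which is the opposite of what the shared cache argument would suggest.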

High IPC is not the form of parallelism from the article - the focus of the article was on running a process on two (or more) different cores. The idea is that high IPC benefits all programs, no matter how they are written. Multi-threading is different - the idea is to have parts of a program that execute simultaneously but with very few interrelations (you can have one thread painting the interface in a game while another thread paints the rest of the screen). The threads would have almost no interaction (except for sending commands).
High IPC is not a solution in the x86 world because the code tends to have dependencies close to each other, so you can start executing 100 instructions at a time, but 99 of them need to wait for the execution of one. You simply have those moments when all execution must wait for an instruction to end.
EPIC (Itanium) will help with that, as the high IPC could be guaranteed by the instructions - at every clock you can execute one instruction that is equivalent to several x86 instructions. So, the performance would be the clock speed multiplied by an IPC of 3 or 4, unlike the Athlon (let's say), whose performance comes from its higher clock speed multiplied by an IPC of 1 or so.
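The dependency argument can be illustrated with a small sketch. It assumes an idealized machine with unlimited issue width and one cycle per instruction, so the only limit on IPC is the longest dependency chain. The instruction lists are made up for illustration, and `ideal_ipc` is a hypothetical helper.

```python
def ideal_ipc(deps):
    """Upper bound on IPC for a straight-line instruction sequence.

    deps[i] lists the indices of earlier instructions that instruction i
    must wait for. Assumes unit latency and unlimited issue width.
    """
    finish_cycle = []
    for waits_on in deps:
        start = max((finish_cycle[j] for j in waits_on), default=0)
        finish_cycle.append(start + 1)
    return len(deps) / max(finish_cycle)

# Eight instructions where each one needs the result of the previous one.
chain = [[]] + [[i - 1] for i in range(1, 8)]
# Eight instructions with no dependencies at all.
independent = [[] for _ in range(8)]

print("dependent chain:", ideal_ipc(chain))
print("independent:", ideal_ipc(independent))
```

However wide the core is, the dependent chain cannot retire more than one instruction per cycle in this model, which is the point the comment makes about tightly dependent code.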

Wonderful article! I loved the "hardware meets software" focus of this piece. I've had many questions about the practicality of multi-threaded applications and this article answered many of them. Also, loved the interview with Sweeney.

#20:MarriedMan - Yes, I think so. This is actually an interesting question. As I understand it, I think both AMD and Intel are using pretty much the same technology in both, so that communication channels on the motherboard (in the dual CPU case) will be replaced by communication channels on the CPU die. I think AMD's approach is better only because HT etc. lends itself to dual-core much better than Intel's older technology. I guess the next generation of dual-core chips might be somewhat different though. Anyone else know anything?