The bright side of dark silicon

It's been a decade or so since the end of frequency scaling, and multicore has become ubiquitous, there being no other means to increase a chip's performance.

Some multicore systems are symmetric – all cores are identical, so you can easily move work from one core to another. Others are asymmetric – as in CPU cores and GPU cores, where it's harder to move work between different types of cores.

Which is better – symmetric or asymmetric multicore?

Why symmetric is better

Three main reasons that I see:

Better load balancing

Less work for everyone

More redundancy

Better load balancing

Asymmetric multicore makes load balancing harder, because a GPU can't easily yank a job from a queue shared with a CPU and run that job. That's because some of those jobs are simply impossible to run on a GPU. Others run so badly that it's not worth the trouble.

And those CPU codes that could run OK on GPUs would have to be compiled twice – for the CPU and the GPU – and even then you can't make things like function pointers and vtables work (though I can imagine a hardware workaround for the latter – a translation table of sorts; maybe I should patent it. Anyway, we're very far from that being our biggest problem.)

And then you need a shared queue between the CPU and the GPU – how does that work? – or you partition the work statically (each of the 4 CPUs processes 10% of the pixels, the remaining 60% of the pixels go to the GPU cores).

But static partitioning, often quite lousy even with symmetric multicore, is awful with asymmetric multicore because how do you choose the percentages? You need to know the relative strength of the cores at each task. How do you do that – figure it out dynamically the first time your program runs on a new device?
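To make the "how do you choose the percentages" problem concrete, here is a minimal sketch that splits work in proportion to throughput measured in a calibration run. The function name `static_shares` and the throughput figures are hypothetical, chosen only so that the result matches the 10%/60% split mentioned above.

```python
def static_shares(throughput):
    """Split work in proportion to each core's measured throughput.

    throughput maps a core name to items per second from a calibration run.
    Returns a dict mapping each core name to its fraction of the work.
    """
    total = sum(throughput.values())
    if total <= 0:
        raise ValueError("need at least one core with positive throughput")
    return {core: rate / total for core, rate in throughput.items()}


# Hypothetical calibration results (pixels per second); not measured data.
measured = {"cpu0": 10.0, "cpu1": 10.0, "cpu2": 10.0, "cpu3": 10.0, "gpu": 60.0}
shares = static_shares(measured)
for core, share in shares.items():
    print(f"{core}: {share:.0%} of the pixels")
```

The shares are only as good as the calibration: a different task, or a different device, gives different ratios, which is exactly why static partitioning is so fragile here.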

So this is all close to insane. What people actually do instead is task parallelism – they look at their different jobs, and they figure out which should run on each type of core, and optimize each task for the respective core.

But task parallelism never load-balances very well. Let's say you look for faces in an image on the GPU and then try to figure out whose faces these are on the CPUs. Then sometimes the GPU finds a lot of faces and sometimes just a few, taking roughly the same time to do so. But the CPU then has either a lot of work or just a little. So one of them will tend to be the bottleneck.
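A toy model of the face example, assuming the two stages run in lockstep one frame apart, so each frame costs as much as the slower stage. The function name, the timings and the face counts are all made up for illustration.

```python
def pipeline_utilization(gpu_time, cpu_time_per_face, faces_per_frame):
    """Return (gpu_busy, cpu_busy) fractions for a two-stage pipeline.

    The GPU stage takes a constant time per frame; the CPU stage takes time
    proportional to the number of faces found. Assumes lockstep pipelining:
    each frame slot lasts as long as the slower of the two stages.
    """
    gpu_total = 0.0
    cpu_total = 0.0
    wall = 0.0
    for faces in faces_per_frame:
        cpu_time = cpu_time_per_face * faces
        gpu_total += gpu_time
        cpu_total += cpu_time
        wall += max(gpu_time, cpu_time)  # the slower stage sets the pace
    return gpu_total / wall, cpu_total / wall


# Hypothetical workload: mostly one face per frame, occasionally a crowd.
gpu_busy, cpu_busy = pipeline_utilization(10.0, 5.0, [1, 1, 8, 1])
print(f"GPU busy {gpu_busy:.0%}, CPU busy {cpu_busy:.0%}")
```

In this model neither side is ever fully busy unless the per-frame times happen to match, which is the load-balancing problem in miniature.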

Less work for everyone

We actually touched on that above. If you wanted to do data parallelism, running the same task on all your cores but on different subsets of the data, one problem would be to optimize your code for each type of core. That's more work. Someone at the OS/system level would also need to help you with sharing task queues and vtables – still more work.

Generally, more types of core means more hardware design, more compilers, assemblers, linkers and debuggers, more manuals, and more integration work from bus protocols to program loaders, etc. etc. And, for programmers, not only more optimization work but more portability problems.

More redundancy

That's a bit futuristic, but I actually heard this argument from respectable people. The idea is, chip manufacturing yields will significantly drop at, say, 8nm processes. And then your chance to get a chip without a microscopic defect somewhere will become so low that throwing away every defective chip will be uneconomical.

Well, with symmetric multicore you don't have to throw away the chip. If the testing equipment identifies the core that is no longer useable and marks the chip accordingly using fuses or some such (which is easy to do), an OS can then run jobs on all cores but the bad one.

Nifty, isn't it?

With asymmetric multicore, you can't do that, because some type of work will have no core on which it can run.

Why asymmetric is inevitable

In two words – dark silicon.

"Dark silicon" is a buzzword used to describe the growing gap between how many transistors you can cram into a chip with each advancement in lithography vs how many transistors you can actually use simultaneously given your power budget – the gap between area gains and power gains.

It's been a couple of years since the "dark silicon" paper which predicted "the end of multicore scaling" – a sad follow-up to the end of frequency scaling.

The idea is, you can have 2x more cores with each lithography shrink, but your energy efficiency grows only by a factor of the square root of 2. So 4 shrinks mean 16x more cores – but within a fixed power budget, you can only actually use 4x more. So progress slows down, so to speak. These numbers aren't very precise – you have to know your specific process to make a budget for your chip – but they're actually not bad as a tool to think about this.
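The arithmetic behind this rule of thumb, as a small sketch. It uses only the two figures from the paragraph above (2x cores and sqrt(2) efficiency per shrink); the function name `shrink_budget` is mine.

```python
def shrink_budget(shrinks):
    """Return (area_gain, power_gain, lit_fraction) after a number of shrinks.

    Rule of thumb: each shrink doubles the number of cores that fit, but
    improves energy efficiency only by a factor of sqrt(2).
    """
    area_gain = 2.0 ** shrinks
    power_gain = 2.0 ** (shrinks / 2)  # sqrt(2) per shrink
    return area_gain, power_gain, power_gain / area_gain


for n in range(5):
    area, power, lit = shrink_budget(n)
    print(f"{n} shrinks: {area:.0f}x cores fit, {power:.1f}x usable, {lit:.0%} lit")
```

For 4 shrinks this gives 16x the cores but only 4x usable at once, so roughly a quarter of the chip can be lit at any time under the same power budget.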

With 16x more area but just 4x more power, can anything be done to avoid having that other 4x untapped?

It appears that the only route is specialization – spend a large fraction of the area on specialized cores which are much faster at some useful tasks than the other cores you have.

Can you then use them all in parallel? No – symmetric or asymmetric, keeping all cores busy is outside your power budget.

But, if much of the runtime is spent running code on specialized cores doing the job N times faster than the next best core, then you'll have regained much of your 4x – or even gained more than 4x.
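One way to put numbers on "much of the runtime" is an Amdahl's-law style estimate. This is my framing rather than something the argument above depends on, and the fractions and speedups below are hypothetical.

```python
def overall_speedup(accelerated_fraction, n):
    """Amdahl-style speedup when part of the runtime moves to a faster core.

    accelerated_fraction: share of the original runtime that runs on the
        specialized core (between 0 and 1).
    n: how many times faster the specialized core is at that part.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / n
    return 1.0 / remaining


# Hypothetical cases: (fraction of runtime accelerated, speedup on that part).
for fraction, n in [(0.5, 16), (0.8, 16), (0.9, 50)]:
    print(f"{fraction:.0%} of runtime at {n}x: {overall_speedup(fraction, n):.2f}x overall")
```

Under these assumptions, accelerating 80% of the runtime by 16x gives about 4x overall, while accelerating only half of it can never give more than 2x no matter how fast the specialized core is.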

Gaining more than 4x has always been possible with specialized cores, of course; dark silicon is just a compelling reason to do it, because it robs you of the much easier alternative.

What about load balancing? Oh, aren't we "lucky"! It's OK that things don't load-balance very well on these asymmetric systems – because if they did, all cores would be busy all the time. And we can't afford that – we must keep some of the silicon "dark" (not working) anyway!

And what about redundancy? I dunno – if the yield problem materializes, the increasingly asymmetric designs of today are in trouble. Or are they? If you have 4 CPUs and 4 GPU clusters, you lose 25% of the performance, worse than if you had 12 CPUs; but the asymmetric system outperforms the symmetric one by more than 25%, or so we hope.

So the bright side of dark silicon is that it forces us to develop new core architectures – because to fully reap the benefits of lithography shrinks, we can't just cram more of the same cores into a same-sized chip. Which, BTW, has been getting boring, boring, boring for a long time. CPU architecture has stabilized to a rather great extent; accelerator architecture, not nearly so.

GPUs are the tip of the iceberg, really – the most widely known and easily accessible accelerator, but there are loads of them coming in endless shapes and colors. And as time goes by and as long as transistors keep shrinking but their power efficiency lags behind, we'll need more and more kinds of accelerators.

(I have a lot of fun working on accelerator architecture, in part due to the above-mentioned factors, and I can't help wondering why it appears to be a rather marginal part of "computer architecture" which largely focuses on CPUs; I think it has to do with CPUs being a much better topic for quantitative research, but that's a subject for a separate discussion.)

And this is why the CPU will likely occupy an increasingly small share of the chip area, continuing the trend that you can see in chip photos from ChipWorks et al.

P.S.

I work on switching-limited chip designs: most of the energy is spent on switching transistors. So you don't have to power down the cores between tasks – you can keep them in an idle state and they'll consume almost no energy, because there's no switching – zeros stay zeros, and ones stay ones.

Chips which run at higher frequencies and which are not designed to operate at high temperatures (where leakage would become intolerably high – leakage grows non-linearly with temperature) are often leakage-limited. This means that you must actually power down a core or else it keeps using much of the energy it uses when doing work.

Sometimes powering down is natural, as in standby mode. Powering down midway through realtime processing is harder though, because it takes time to power things down and then to power them back up and reinitialize their pesky little bits such as cache line tags, etc.

So in a leakage-limited design, asymmetric multicore is at some point no better than symmetric multicore – if the gaps between your tasks are sufficiently short, you can't power down anything, and then your silicon is never dark, so either you make smaller chips or programs burn them.

But powering up and down isn't that slow, so a lot of workloads should be far from this sad point.

P.P.S.

I know about GreenDroid, a project by people who make the "dark silicon leads to specialization" argument quite eloquently; I don't think their specialization is the right kind – I think cores should be programmable – but that again is a subject for a separate discussion.

P.P.P.S.

Of course there's one thing you can always do with extra area which is conceptually much easier than adding new types of cores – namely, add more memory, typically L2/L3 cache. Memory is a perfect fit for the dark silicon age, because it essentially is dark silicon – its switching energy consumption is roughly proportionate to the number of bytes you access per cycle but is largely independent of the number of bytes you keep in there. And as to leakage, it's easier to minimize for memories than most other kinds of things.

Another "lucky" coincidence is that you really need caches these days because external DRAM response latency has been 100 ns for a long time while processor clock cycles tend to be 50-200x shorter, so missing all the caches really hurts.
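A quick sketch of what those figures mean in cycles. The 100 ns latency is from the paragraph above; the clock rates are simply the ones that a 50-200x ratio implies, and the function name is mine.

```python
def miss_cost_cycles(dram_latency_ns, clock_ghz):
    """Number of core clock cycles that pass during one DRAM access."""
    cycle_ns = 1.0 / clock_ghz
    return dram_latency_ns / cycle_ns


# 100 ns DRAM latency against clock rates implied by the 50-200x ratio.
for ghz in (0.5, 1.0, 2.0):
    cycles = miss_cost_cycles(100.0, ghz)
    print(f"{ghz} GHz core: about {cycles:.0f} cycles lost per miss to DRAM")
```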

So it's natural to expect memories to grow first and then the accelerator zoo; again consistent with recent chip photos where, say, ARM's caches are considerably bigger than the ARM cores themselves.

(Itanium famously spent 85% of the chip area or so on caches, but that was more of "cheating" – a way to show off performance relative to x86 when in fact the advantage wasn't there – than anything else; at least that's how Bob Colwell quoted his conversation with Andy Grove. These days however it has become one of the few ways to actually use the extra area.)

27 comments ↓

> And those CPU codes that could run OK on GPUs would have to be compiled twice – for the CPU and the GPU – and even then you can't make things like function pointers and vtables work (though I can imagine a hardware workaround for the latter – a translation table of sorts; maybe I should patent it. Anyway, we're very far from that being our biggest problem.)

I think AMD has beaten you to the punch: they now use a unified address space on their latest GPUs. I believe the GPU uses the CPU's MMU, and any page faults are handled by the CPU.

A unified address space isn't enough – you still have different code pointers if your code is compiled twice, so if you pass a function pointer from a CPU to a GPU, it won't be able to call the function without translation.

The tradeoffs between some of the problems you indicate, such as load balancing and task parallelism may be offset by the benefits of specialized silicon, though, yes? Two symmetric CPUs in perfect load balancing still won't process certain tasks as fast as a CPU/GPU with the GPU doing 100% of the work, so there's still a net gain for that task asymmetrically without expending work on parallelization (admittedly pushing some hard work to the chip designers). As long as real world tasks benefit from specialized silicon, though, I think asymmetry is to be expected.

There are parallels here with a debate in the early 90's about how best to support dynamic and object-oriented programming on the chips of the day. Do we add new instructions to accelerate method dispatch, or just add cache? The answer seems so obvious now I'm almost embarrassed to admit the time I spent working with a chip designer on the former approach (I was at Apple when it first got involved in the ARM). Of course larger caches improved things dramatically and the ARM instruction set was not extended in this way, but at the time we were not so far removed from Lisp machines and so we had a bias toward custom hardware.

Now we worry more about processing large data sets and less about method dispatch (except that dynamic language performance has at last become a hot topic). Again more cache helps, but specialization allows better task parallelism (as you point out) so the answer will certainly involve a mix of the two.

@Chipmonkey: it's true that you often gain from asymmetry due to specialization more than you lose from it due to poor load balancing; poor load balancing just makes the choice between symmetric and asymmetric harder – if you can do symmetry, that is. Dark silicon makes things easier because you simply can't do perfectly load balanced symmetric systems, so the tradeoff is gone.

@Jim: you worked on OO extensions to ARM at Apple?! Cool stuff! I always thought it doesn't help much simply because you still need to do the same amount of memory indirections and so the number of times you go through memory won't go down and that's your bottleneck to begin with; but maybe it's simplistic and it'd be really interesting to hear details of how it was supposed to work (legally disclosing this kind of thing is of course often impossible, but it won't hurt that I asked…)

Is dark silicon compatible with Microsoft Visual Basic? I realise that you can probably program the dark silicon with advanced languages like C and Jabascript, but for me it is important that it can run Visual Basic.

"Memory is a perfect fit for the dark silicon age, because it essentially is dark silicon – its switching energy consumption is roughly proportionate to the number of bytes you access per cycle but is largely independent of the number of bytes you keep in there."

The way it works with memories is, roughly, yes, distance is important and you can't have a "deep" monolithic memory because of that. So if you need to fetch 32b per cycle, and you want 4M of memory, you have a large bunch of smaller banks of memory, and you use some of the address bits to select the bank, and then fetch your bytes from that bank. The reason your power doesn't grow much is most of these banks are deactivated. One of them might produce data that then must travel a long distance and it costs; but that's offset by not activating all that other stuff at this cycle. Basically memory is cheap power-wise because, well, it doesn't compute anything, it just remembers.
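A minimal software model of that banking scheme, to show the address split. The layout (64 banks of 64 KiB, upper address bits selecting the bank, 4-byte aligned reads) is a hypothetical choice that adds up to 4 MiB; real designs pick these per chip.

```python
BANK_BITS = 6                    # hypothetical: 64 banks
OFFSET_BITS = 16                 # hypothetical: 64 KiB per bank
NUM_BANKS = 1 << BANK_BITS
BANK_SIZE = 1 << OFFSET_BITS     # 64 banks * 64 KiB = 4 MiB in total


class BankedMemory:
    """Toy model: a large memory built from many small banks."""

    def __init__(self):
        self.banks = [bytearray(BANK_SIZE) for _ in range(NUM_BANKS)]
        self.activations = [0] * NUM_BANKS  # how often each bank was woken

    def _split(self, address):
        if not 0 <= address < NUM_BANKS * BANK_SIZE:
            raise IndexError("address out of range")
        if address % 4 != 0:
            raise ValueError("this sketch only handles aligned 4-byte reads")
        bank = address >> OFFSET_BITS       # upper bits select the bank
        offset = address & (BANK_SIZE - 1)  # lower bits index within it
        return bank, offset

    def read_word(self, address):
        """Read 4 bytes; only the selected bank is activated."""
        bank, offset = self._split(address)
        self.activations[bank] += 1
        return bytes(self.banks[bank][offset:offset + 4])


mem = BankedMemory()
mem.banks[3][8:12] = b"\x01\x02\x03\x04"
word = mem.read_word((3 << OFFSET_BITS) | 8)
print(word.hex(), "banks activated:", sum(1 for a in mem.activations if a))
```

Each read wakes exactly one bank out of 64, which is the sense in which the energy per access depends on how much you fetch rather than on how much you store.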

As to what NVIDIA did in a specific chip generation – I'd need to look deeper to try to guess their rationale; all I can do here is state a few general trends which should hold over time across a large set of designs.

Any thought as to whether FPGAs could be made part of the dark silicon balance, one that could tightly integrate with the CPU and GPU cores via signal lines between them and the FPGA switching fabric, and perhaps direct accessibility of CPU/GPU registers from the FPGA?

Also, you mentioned that asynchronous is the way to go, but a talk from a few years ago http://www.youtube.com/watch?v=KfgWmQpzD74 suggested that both asynchronous and dynamic cores were useful. The dynamic cores would be regular CPUs run at different frequencies, say 200MHz, 800MHz, and 2GHz, and you would orchestrate turning these on and off to match your power budget. They couldn't all be turned on at once, but provided you had some idea of the relative cost of different calculations and the dependencies between them you could schedule the optimal mix of cores for your load and assign them appropriately. This couldn't be done perfectly, of course, but a Pareto result could be good enough.

It has been interesting to me that the schedulers across cores you now see in erlang VMs (for erlang processes) and golang (for goroutines) could well be the way to realize that orchestration some day.

I think dynamic frequency helps with energy (saving it when you can) but not power (energy spending rate); that is, it helps spend less when you aren't under heavy load, but if you're continuously under heavy load, and therefore dissipating a lot of power without being able to slow anything down, you need asymmetry or you'll burn.

Direct accessibility of registers from the FPGA – I don't think so; integration on the same die – I hope to post soon how I see it, I think it's not going to be in mass produced chips any time soon though it'd be nice if it were.

There's a separate issue from performance on the horizon, and that's availability. If you assign a task to a processor and that task becomes idle, it's certainly a lot faster to start up than if the task is paged out to memory. I think that "suspending" cores when a task indicates that it is idle may be a way of utilizing the extra cores without requiring that they be executing code continuously. This leads to a processor model with very many less-capable cores.

I guess there's value to it in scenarios with a ton of concurrency where you constantly context-switch (network processors tend to be big clusters with many, many "wimpy" cores for that reason I believe.) On end-user machines I'd be surprised if the cost of context switching dominated to that large an extent.

Your "more redundancy" argument isn't so futuristic, it's quite the case on the PS3. The Cell architecture has 8 SPUs, but the PS3's specs require only 7 SPUs, thus the yield is increased.

Imho it was a nice try, but was reverted to a more regular architecture on the PS4. So on one hand, we are encouraged to develop new architectures, and on the other hand, it seems we are not embracing new architectures…

B usually takes five times as long as A does, and C always depends on B and usually on A. A task scheduler that kept JIT-like stats could parallelize A on a 200MHz core and B on a 1GHz core, and have A's result before B's most of the time with less power used over all.

Also, perhaps the dynamic and asynchronous views aren't mutually exclusive. A slower core could have a much simpler implementation in silicon, meaning far lower power and much less space required. So as transistor density goes up a package might have 64 cores at 200MHz, 16 at 800MHz, and only 4 at 2GHz. Assuming each core frequency level represents 100% of the power budget, a highly parallelizable program might get better performance scheduled across 32 200MHz, 4 800MHz, and 1 2GHz cores rather than running all 37 tasks across the 4 2GHz cores.
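The numbers in this comment can be checked with a short sketch. It assumes, as simplifications of mine, that the cores of one level share that level's power equally and that throughput on a highly parallel job scales with clock frequency.

```python
# Core levels from the comment: (frequency in MHz, cores that fill the budget).
LEVELS = {"slow": (200, 64), "mid": (800, 16), "fast": (2000, 4)}


def mix_summary(active):
    """Return (power_fraction, aggregate_mhz) for a choice of active cores.

    active maps a level name to the number of cores switched on.
    Assumes each level's cores share its power equally and that throughput
    scales with clock frequency (only reasonable for very parallel work).
    """
    power = 0.0
    mhz = 0
    for level, count in active.items():
        freq, max_cores = LEVELS[level]
        if not 0 <= count <= max_cores:
            raise ValueError(f"invalid core count for level {level!r}")
        power += count / max_cores
        mhz += count * freq
    return power, mhz


mixed = mix_summary({"slow": 32, "mid": 4, "fast": 1})
fast_only = mix_summary({"fast": 4})
print("mixed:", mixed, "fast only:", fast_only)
```

Under these assumptions the mixed configuration uses exactly the same power budget as the four fast cores but offers 11600 aggregate MHz against 8000.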

"And those CPU codes that could run OK on GPUs would have to be compiled twice — for the CPU and the GPU — and even then you can't make things like function pointers and vtables work …"

Surely if your code relies on making calls through function pointers or vtables, it's not going to "run OK on GPUs" in the first place, is it? I say this as a total noob in the world of GPGPU, but I thought branches were bad, in which case "load this pointer and jump through it" must be far worse.

Very interesting article.
I used to integrate accelerators into custom SoC chips and the software became really nasty. Not sure what kind of specialization can save the day; I'd like to see the result.

Also, Intel released their 22nm products to the market and it seems the dark silicon issue doesn't actually matter much. I've talked with IBM and TSMC guys about the 14nm/16nm processes and it seems leakage is still the biggest headache.

…but leakage is the problem leading to dark silicon! That's why you can't lower the voltage, right? Which in turn is why you can't get power efficiency improvements beyond ~L while your area improvement is L^2.

You say Intel released 22nm products and you say dark silicon doesn't matter. You mean they got Dennardian performance improvements – that is, 1.4x the frequency and 2x the core count relative to the previous node?

If you look at newer nodes' stats, don't you see a gap between the area improvements and the power improvements? Leakage is the root cause but the gap is the result, right?

Well, it's both, right? As long as you can shut down things you aren't using anyway you're achieving a goal; once you're forced to shut down things you'd very much like to use – like half your cores whatever those cores are – then you're facing a problem.

Yes, availability is an interesting area of thought, especially from a security perspective.

As is often said, the number of "bugs" in code is proportional to the number of lines of code written, not the computing power of individual lines. Thus the higher-level the language, the more productive the code, the fewer the bugs and the faster it will be written. Taken to a logical conclusion you end up with *nix type shell scripting where each line of code in effect calls an applet.

Thus whilst "code cutter" level programmers carve out lots of production code using scripting, the applets are written by security-aware engineer-level programmers using an engineering approach.

You thus have hundreds of wimpy cores which in effect run applets. Script pipelining is done through IPC mechanisms working through "main memory" whilst applets use memory local to the core.

You have a number of specialised cores that act as hypervisors, these control the wimpy cores in a number of ways. One specific area is each wimpy core uses the equivalent of a reduced function MMU through which IPC happens. However unlike a conventional setup it is not the associated wimpy core that controls the MMU but the hypervisor.

In essence the wimpy core is "jailed" behind the MMU and the applet has no knowledge of the rest of the system. Further, the hypervisor can limit the amount of local memory an applet has to work with, and can halt the wimpy core and inspect the local memory. Thus malware has little or no space to exist and no knowledge of the system in general, nor does it have a sense of time with which to establish covert channels between wimpy cores.

Further, cores can be run in parallel, fed the same data and using the same algorithms but developed differently. Wimpy cores are then run in triplets and their outputs compared in a voting protocol. If the three outputs agree then it is unlikely there is a fault or malware in the three cores. If however there is a difference then there is either a fault or malware, which the hypervisor can look for. Importantly, malware would have to appear on all three wimpy cores simultaneously, and with a little thought you will see this is not possible for externally injected malware, thus it will get flagged up. Likewise malware at this level cannot be introduced by the "code cutters", only by the "applet engineers", and that risk can be limited if not removed by using separate teams to develop the three different implementations of each applet and appropriate formal methods.
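The voting step could look roughly like the sketch below. It is a generic majority vote over the outputs of redundant cores, not a description of any particular hypervisor; the function name `vote` and the sample values are hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Majority vote over the outputs of redundant cores.

    Returns (agreed_value, suspects). agreed_value is the output produced by
    a strict majority of cores, or None if there is no such majority.
    suspects lists the indices of cores that disagreed with the majority
    (all cores if there is no majority). Outputs must be hashable.
    """
    if not outputs:
        raise ValueError("need at least one output to vote on")
    value, hits = Counter(outputs).most_common(1)[0]
    if hits * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


# Three independently developed implementations, one of them misbehaving.
result, suspects = vote([42, 7, 42])
print("agreed result:", result, "cores to inspect:", suspects)
```

A non-empty suspect list is the cue for the supervising core to halt and inspect the listed cores; agreement among all three yields an empty list.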