"It's done in hardware so it's cheap"

This is one of these things that look very obvious to me, to the point where it seems not worth discussing. However, I've heard the idea that "hardware magically makes things cheap" from several PhDs over the years. So apparently, if you aren't into hardware, it's not obvious at all.

So why doesn't "hardware support" automatically translate to "low cost"/"efficiency"? The short answer is, hardware is an electric circuit and you can't do magic with that, there are rules. So what are the rules? We know that hardware support does help at times. When does it, and when doesn't it?

To see the limitations of hardware support, let's first look at what hardware can do to speed things up. Roughly, you can really only do two things:

Specialization – save dispatching costs in speed and energy.

Parallelization – save time, but not energy, by throwing more hardware at the job.

Let's briefly look at examples of these two speed-up methods – and then some examples where hardware support does nothing for you, because neither of the two methods helps. We'll only consider run time and energy per operation, ignoring silicon area (considering a third variable just makes it too hairy).

I'll also discuss the difference between real costs of operations and the price you pay for these operations, and argue that in the long run, costs are more stable and more important than prices.

Specialization: cheaper dispatching

If you want to extract bits 3 to 7 of a 32-bit word and then multiply them by 13 – let's say an encryption algorithm requires this – you can have an instruction doing just that. That will be faster and use less energy than, say, using bitwise AND, shift & multiplication instructions.

Why – what costs were cut out? The costs of dispatching individual operations – circuitry controlling which operation is executed, where the inputs come from and where the outputs go.
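To make the example concrete, here is a minimal Python sketch of what the general-purpose version computes. The function name is mine and purely illustrative. On a real CPU each step would be a separately dispatched instruction, while the specialized instruction would produce the same value in one go.

```python
def extract_bits_3_to_7_times_13(word: int) -> int:
    """General-purpose version: three separately dispatched operations."""
    shifted = (word & 0xFFFFFFFF) >> 3   # shift: move bit 3 down to bit 0
    field = shifted & 0x1F               # AND: keep 5 bits (bits 3..7 of the input)
    return field * 13                    # multiply by the constant 13


# Bits 3..7 all set, everything else clear.
print(extract_bits_3_to_7_times_13(0xF8))
```

Note that bits 3 to 7 make up a 5-bit field, so the multiplier only ever sees values from 0 to 31.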

Specialization can be taken to an extreme. For instance, if you want a piece of hardware doing nothing but JPEG decoding, you can bring dispatching costs close to zero by having a single "instruction" – "decode a JPEG image". Then you have no flexibility – and none of the "overhead" circuitry found in more flexible machines (memory for storing instructions, logic for decoding these instructions, multiplexers choosing the registers that inputs come from based on these instructions, etc.)

Before moving on, let's look a little closer at why we won here:

We got a speed-up because the operations were fast to begin with – so dispatching costs dominated. With specialization, we need 5 wires connected directly to bits 3 to 7, with a tiny physical delay – just the time it takes the signal to travel to a nearby multiplier-by-13. Without specialization, we'd use a shifter shifting by a configurable number of bits – 3 in our case but not always – which is a bunch of gates introducing a much larger delay. On top of that, since we'd be using several such circuits communicating through registers (let's say we're on a RISC CPU), we'd have delays due to reading and writing registers, delays due to selecting registers from a large register file, etc. With all this taken out by having a specialized instruction, no wonder we're seeing a big speed-up.

Likewise, we'll see lower energy consumption because the operations didn't require a lot of energy to begin with. Roughly, most of the energy is consumed when a signal value changes from 1 to 0 or back. When we use general-purpose instructions, most of the gate inputs & outputs and most flip-flops changing their values are those implementing the dispatching. When we use a specialized instruction, most of the switching is gone.

This means that, unsurprisingly, there's a limit to efficiency – the fundamental cost of the operations we need to do, which can't be cut.

When the operations themselves are costly enough – for instance, memory access or floating point operations – then their cost dominates the cost of dispatching. So specialized instructions that cut dispatching costs will give us little or nothing.
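A toy model shows how the gain from specialization shrinks as the operations get costlier. This is only a sketch under assumptions of mine: the function name, the cost units and the numbers are made up, and the model ignores the fact that a specialized circuit can also make the operation itself cheaper (a shift by a constant is just wires).

```python
def specialization_gain(op_cost: float, dispatch_cost: float, n_ops: int) -> float:
    """Ratio of general-purpose cost to specialized cost for a chain of n_ops operations.

    Toy model (an assumption, not a measurement): every general-purpose
    instruction pays dispatch_cost + op_cost, while one fused instruction
    pays dispatch_cost once plus the unavoidable op_cost of each operation.
    """
    general = n_ops * (dispatch_cost + op_cost)
    specialized = dispatch_cost + n_ops * op_cost
    return general / specialized


# Made-up unit costs: cheap bit-twiddling ops vs. costly memory or floating point ops.
cheap_ops = specialization_gain(op_cost=1, dispatch_cost=10, n_ops=3)
costly_ops = specialization_gain(op_cost=100, dispatch_cost=10, n_ops=3)
print(f"cheap ops gain: {cheap_ops:.2f}x, costly ops gain: {costly_ops:.2f}x")
```

With these made-up numbers the cheap operations gain a lot from fusing, while the costly ones barely notice that the dispatching went away.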

Parallelization: throwing more hardware at the job

What to do when specialization doesn't help? We can simply have N processors instead of one. For the parts that can be parallelized, we'll cut the run time by a factor of N – but spend the same amount of energy. So things got faster but not necessarily cheaper. A fixed power budget limits parallelization – as does a fixed budget of, well, money (the price of a 1000-CPU rack is still not trivial today).
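The same point as a small sketch, assuming an idealized job split into a serial part and a perfectly parallelizable part. The function name and the numbers are hypothetical. Only the run time changes with N, and the energy stays put.

```python
def parallel_run(serial_time: float, parallel_time: float,
                 energy: float, n_processors: int) -> tuple[float, float]:
    """Return (run time, energy) when the parallelizable part is spread over N processors.

    Toy model: the parallel part takes parallel_time / N, the serial part is
    unchanged, and the total work, hence the energy, stays the same.
    """
    run_time = serial_time + parallel_time / n_processors
    return run_time, energy


# Hypothetical job: 10 time units serial, 90 parallelizable, 500 energy units.
for n in (1, 2, 8):
    print(n, parallel_run(10.0, 90.0, 500.0, n))
```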

[Why have multicore chips if it saves no energy? Because a multicore chip is cheaper than many single core ones, and because, above a certain frequency, many low-frequency cores use less energy than few high-frequency ones.]

We can combine parallelization with specialization – in fact it's done very frequently. Actually a JPEG decoder mentioned above would do that – a lot of its specialized circuits would execute in parallel.

Another example is how SIMD or SIMT processors broadcast a single instruction to multiple execution units. This way, we get only a speed-up, but no energy savings at the execution unit level: instead of one floating point ALU, we now have 4, or 32, etc. We do, however, get energy savings at the dispatching level – we save on program memory and decoding logic. As always with specialization, we pay in flexibility – we can't have our ALUs do different things at the same time, as some programs might want to.

Why do we see more single-precision floating point SIMD than double-precision SIMD? Because the higher the raw cost of operations, the less we save by specialization, and SIMD is a sort of specialization. If we have to pay for double-precision ALUs, why not put each in a full-blown CPU core? That way, at least we get the most flexibility, which means more opportunities to actually use the hardware rather than keeping it idle.
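Here is a rough sketch of that argument, assuming that a SIMD instruction pays the dispatching energy once and shares it among all its lanes, while N separate cores pay it N times. The energy units and the 4x ratio between double-precision and single-precision operations are assumptions of mine, not measurements.

```python
def energy_per_op(op_energy: float, dispatch_energy: float, lanes: int) -> float:
    """Energy per operation when one dispatched instruction drives `lanes` ALUs."""
    return op_energy + dispatch_energy / lanes


def simd_saving(op_energy: float, dispatch_energy: float, lanes: int) -> float:
    """Fraction of energy saved by SIMD relative to `lanes` independent cores."""
    separate_cores = energy_per_op(op_energy, dispatch_energy, 1)
    simd = energy_per_op(op_energy, dispatch_energy, lanes)
    return 1 - simd / separate_cores


# Made-up energy units: a double-precision op is assumed to cost 4x a single-precision one.
print(f"single precision: {simd_saving(2.0, 4.0, 4):.0%} saved")
print(f"double precision: {simd_saving(8.0, 4.0, 4):.0%} saved")
```

The dispatching energy saved is the same in both cases. It is just a smaller fraction of the total when the ALUs themselves are expensive.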

(It's really more complicated than that because SIMD can actually be a more useful programming model than multiple threads or processes in some cases, but we won't dwell on that.)

What can't be done

Now that we know what can be done – and there really isn't anything else – we basically already also know what can't be done. Let's look at some examples.

Precision costs are forever

8-bit integers are fundamentally more efficient than 32-bit floating point, and no hardware support for any sort of floating point operations can change this.

For one thing, multiplier circuit size (and energy consumption) is roughly quadratic in the size of inputs. IEEE 32b floating point numbers have 23b stored mantissas (24b counting the implicit leading 1), so multiplying them means a ~9x larger circuit than an 8×8-bit multiplier with the same throughput, since (24/8)² = 9. Another cost, linear in size, is that you need more memory, flip-flops and wires to store and transfer a float than an int8.

(People are more often aware of this one because SIMD instruction sets usually have fixed-sized registers which can be used to keep, say, 4 floats or 16 uint8s. However, this makes people underestimate the overhead of floating point as 4x – when it's more like 9x if you look at multiplying mantissas, not to mention handling exponents. Even int16 is 4x more costly to multiply than int8, not 2x as the storage space difference makes one guess.)
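The arithmetic behind these ratios, as a short sketch. It assumes nothing more than the rough rules above: multiplier cost quadratic in operand width, storage cost linear. The function names are mine.

```python
def multiplier_cost_ratio(bits: int, baseline_bits: int = 8) -> float:
    """Relative multiplier size under the rough 'quadratic in input width' rule."""
    return (bits / baseline_bits) ** 2


def storage_cost_ratio(bits: int, baseline_bits: int = 8) -> float:
    """Relative storage and wiring cost, which is linear in width."""
    return bits / baseline_bits


# float32 multiplies 24-bit significands: 23 stored mantissa bits plus the implicit leading 1.
for name, mult_bits, stored_bits in [("int16", 16, 16), ("float32", 24, 32)]:
    print(name, multiplier_cost_ratio(mult_bits), storage_cost_ratio(stored_bits))
```

So the storage ratio (2x for int16, 4x for float32) understates the multiplier ratio (4x and 9x respectively), and that is before counting the exponent handling.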

We design our own chips, and occasionally people say that it'd be nice to have a chip with, say, 256 floating point ALUs. This sounds economically nonsensical – sure it's nice and it's also quite obvious, so if nobody makes such chips at a budget similar to ours, it must be impossible, so why ask?

But actually it's a rather sensible suggestion, in that you can make a chip with 256 ALUs that is more efficient than anything on the market for what you do, but not flexible enough to be marketed as a general-purpose computer. That's precisely what specialization does.

However, specialization only helps with operations which are cheap enough to begin with compared to the cost of dispatching. So this can work with low-precision ALUs, but not with high-precision ALUs. With high-precision ALUs, the raw cost of operations would exceed our power budget, even if dispatching costs were zero.

Memory indirection costs are forever

I mentioned this in my old needlessly combative write-up about "high-level CPUs". There's this idea that we can have a machine that makes "high-level languages" run fast, and that they're really only slow because we're running on "C machines" as opposed to Lisp machines/Ruby machines/etc.

Leaving aside the question of what "high-level language" means (I really don't find it obvious at all, but never mind), object-orientation and dynamic typing frequently result in indirection: pointers instead of values and pointers to pointers instead of pointers. Sometimes it's done for no apparent reason – for instance, Erlang strings that are kept as linked lists of ints. (Why do people even like linked lists as "the" data structure and head/tail recursion as "the" control structure? But I digress.)

This kind of thing can never be sped up by specialization, because memory access fundamentally takes quite a lot of time and energy, and when you do p->a, you need one such access, and when you do p->q->a, you need two, hence you'll spend twice the time. Having a single "LOAD_LOAD" instruction instead of two – LOAD followed by a LOAD – does nothing for you.
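To count the accesses explicitly, here is a small sketch using a simulated flat memory. The CountingMemory class, the field offsets and the object layout are all hypothetical, made up for this illustration. A fused "LOAD_LOAD" would still have to perform both calls to load().

```python
class CountingMemory:
    """A flat word-addressed memory that counts every load (hypothetical helper)."""

    def __init__(self, words: list[int]):
        self.words = words
        self.loads = 0

    def load(self, address: int) -> int:
        self.loads += 1
        return self.words[address]


A_OFFSET = 0  # offset of field `a` within an object (assumed layout)
Q_OFFSET = 1  # offset of pointer field `q` within an object (assumed layout)


def p_a(mem: CountingMemory, p: int) -> int:
    """p->a: one load."""
    return mem.load(p + A_OFFSET)


def p_q_a(mem: CountingMemory, p: int) -> int:
    """p->q->a: load the pointer q, then load a through it. Two dependent loads."""
    q = mem.load(p + Q_OFFSET)
    return mem.load(q + A_OFFSET)


# Object at address 0 has a=7 and q=2; object at address 2 has a=42 and q=0.
mem = CountingMemory([7, 2, 42, 0])
direct = p_a(mem, 0)
loads_direct = mem.loads
indirect = p_q_a(mem, 0)
loads_indirect = mem.loads - loads_direct
print(direct, loads_direct, indirect, loads_indirect)
```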

All you can do is parallelization – throw more hardware at the problem, N processors instead of one. You can, alternatively, combine parallelization with specialization, similarly to N-way floating point SIMD that's somewhat cheaper than having N full-blown processors. For example, you could have several load-store units and several cache banks and a multiple-issue processor. Then if you had to run p1->q1->a and somewhere near that, p2->q2->b, and the pointers point into different banks, some of the 4 LOADs would end up running in parallel, without having several processors.

But, similarly to low-precision math being cheaper whatever the merits of floating point SIMD, one memory access always costs half as much as two, whatever the merits of cache banking and multiple issue. Specifically, doubling the memory access throughput roughly doubles the energy cost. This can sometimes be better than simply using two processors, but it's a non-trivial cost and will always be.

A note about latency

We could discuss other examples but these two are among the most popular – floating point support is a favorite among math geeks, and memory indirection support is a favorite among language geeks. So we'll move on to a general conclusion – but first, we should mention the difference between latency costs and throughput costs.

In our two examples, we only discussed throughput costs. A floating point ALU with a given throughput uses more energy than an int8 ALU. Two memory banks with a given throughput use about twice the energy of a single memory bank with half the throughput. This, together with the relatively high costs of these operations compared to the costs of dispatching them, made us conclude that there's nothing we can do.

In reality, the high latency of such heavyweight operations can be the bigger problem than our inability to increase their throughput without paying a high price in energy. For example, consider the instruction sequence:

c = FIRST(a,b)
e = SECOND(c,d)

If FIRST has a low latency, then we'll quickly proceed to SECOND. If FIRST has a high latency, then SECOND will have to wait for that amount of time, even if FIRST has excellent throughput. Say, if FIRST is a LOAD, being able to issue a LOAD every cycle doesn't help if SECOND depends on the result of that LOAD and the LOAD latency is 5 cycles.
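A minimal sketch of this effect, assuming a very simple in-order machine that can issue one instruction per cycle and stalls until an instruction's inputs are ready. The scheduler below and its 1-cycle latency for SECOND are my assumptions for illustration, not a model of any particular processor.

```python
def finish_time(program, latency):
    """Cycle at which the last result is ready on a simple in-order machine.

    program: list of (dest, opcode, sources) tuples, issued at most one per cycle.
    latency: dict mapping opcode to cycles from issue to result.
    An instruction issues only once all of its source values are ready.
    """
    ready = {}          # register name -> cycle when its value becomes available
    next_issue = 0      # earliest cycle the next instruction may issue
    last_done = 0
    for dest, opcode, sources in program:
        issue = max([next_issue] + [ready.get(s, 0) for s in sources])
        ready[dest] = issue + latency[opcode]
        last_done = max(last_done, ready[dest])
        next_issue = issue + 1
    return last_done


program = [("c", "FIRST", ("a", "b")), ("e", "SECOND", ("c", "d"))]
print(finish_time(program, {"FIRST": 1, "SECOND": 1}))  # low-latency FIRST
print(finish_time(program, {"FIRST": 5, "SECOND": 1}))  # FIRST is a 5-cycle LOAD
```

The machine could issue a new instruction every cycle in both cases. In the second case, SECOND simply has nothing to work with until FIRST delivers its result.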

A large part of computer architecture is various methods for dealing with these latencies – VLIW, out-of-order, barrel processors/SIMT, etc. These are all forms of parallelization – finding something to do in parallel with the high-latency instruction. A barrel processor helps when you have many threads. An out-of-order processor helps when you have nearby independent instructions in the same thread. And so on.

Just like having N processors, all these parallelization methods don't lower dispatching costs - in fact, they raise them (more registers, higher issue bandwidth, tricky dispatching logic, etc.) The processor doesn't become more energy efficient - you get more done per unit of time but not per unit of energy. A simple processor would be stuck at the FIRST instruction, while a more clever one would find something to do – and spend the energy to do it.

So latency is a very important problem with fundamentally heavyweight operations, and machinery for hiding this latency is extremely consequential for execution speed. But fighting latency using any of the available methods is just a special case of parallelization, and in this sense not fundamentally different from simply having many cores in terms of energy consumed.

The upshot is that parallelization, whether it's having many cores or having single-core latency-hiding circuitry, can help you with execution speed – throughput per cycle – but not with energy efficiency – throughput per watt.

The latency of heavyweight stuff is important and not hopeless; its throughput is important and hopeless.

Cost vs price

"But on my GPU, floating point operations are actually as fast as int8 operations! How about that?"

Well, a bus ticket can be cheaper than the price of getting to the same place in a taxi. The bus ticket will be cheaper even if you're the only passenger, in which case the real cost of getting from A to B in a bus is surely higher than the cost of getting from A to B in a taxi. Moreover, a bus might take you there more quickly if there are lanes reserved for buses that taxis are not allowed to use.

It's basically a cost vs price thing – math and physics vs economics and marketing. The fundamentals only say that a hardware vendor always can make int8 cheaper than float – but they can have good reasons not to. It's not that they made floats as cheap as int8 – actually, they made int8 as expensive as floats in terms of real costs.

Just like you going alone in a bus designed to carry dozens of people is an inefficient use of a bus, using float ALUs to process what could be int8 numbers is an inefficient use of float ALUs. Similarly, just like transport regulations can make lanes available for buses but not cars, an instruction set can make fetching a float easy but make fetching a single byte hard (no load byte/load byte with sign extension instructions). But cars could use those lanes – and loading bytes could be made easy.

As a passenger, of course you will use the bus and not the taxi, because economics and/or marketing and/or regulations made it the cheaper option in terms of price. Perhaps it's so because the bus is cheaper overall, with all the passengers it carries during rush hours. Perhaps it's so because the bus is a part of the contract with your employer – it's a bus carrying employees towards a nearby something. And perhaps it's so because the bus is subsidized by the government. Whatever the reason, you go ahead and use the cheaper bus.

Likewise, as a programmer, if you're handed a platform where floating point is not more expensive or even cheaper than int8, it is perhaps wise to use floating point everywhere. The only things to note are, the vendor could have given you better int8 performance; and, at some point, a platform might emerge that you want to target and where int8 is much more efficient than float.

The upshot is that it's possible to lower the price of floating point relative to int8, but not the cost.

What's more "important" – prices or costs?

Prices have many nice properties that real costs don't have. For instance, all prices can be compared – just convert them all to your currency of choice. Real costs are hard to compare without prices: is 2x less time for 3x more energy better or worse?
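A tiny sketch of why that question has no answer without prices. The two price lists below are made up. The same trade of 2x less time for 3x more energy comes out better under one and worse under the other.

```python
def price(time: float, energy: float,
          price_per_time: float, price_per_energy: float) -> float:
    """Collapse two incomparable real costs into one number using prices."""
    return time * price_per_time + energy * price_per_energy


baseline = (1.0, 1.0)      # (time, energy) in arbitrary units
alternative = (0.5, 3.0)   # 2x less time for 3x more energy

# Two made-up price lists: one where time is dear, one where energy is dear.
for label, p_time, p_energy in [("time is dear", 10.0, 1.0), ("energy is dear", 1.0, 10.0)]:
    better = price(*alternative, p_time, p_energy) < price(*baseline, p_time, p_energy)
    print(label, "-> alternative is", "better" if better else "worse")
```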

In any discussion about "fundamental real costs", there tend to be hidden assumptions about prices. For example, I chose to ignore area in this discussion under the assumption that area is usually less important than power. What makes this assumption true – or false – is the prices fabs charge for silicon production, the sort of cooling solutions that are marketable today (a desktop fan could be used to cool a cell phone but you couldn't sell that phone), etc. It's really hard to separate costs from prices.

Here's a computer architect's argument to the effect of "look at prices, not costs":

While technical metrics like performance, power, and programmer effort make up for nice fuzzy debates, it is pivotal for every computer guy to understand that “Dollar” is the one metric that rules them all. The other metrics are just sub-metrics derived from the dollar: Performance matters because that’s what customers pay for; power matters because it allows OEMs to put cheaper, smaller batteries and reduce people’s electricity bills; and programmer effort matters because it reduces the cost of making software.

I have two objections: that prices are the effect, not the cause, and that prices are too volatile to commit to memory as a "fundamental".

Prices are the effect in the sense that, customers pay for performance because it matters, not "performance matters because customers pay for it". Or, more precisely – customers pay for performance because it matters to them. As a result – because customers pay for it – performance matters to vendors. Ultimately, the first cause is that performance matters, not that it sells.

The other thing about prices is that they're rather jittery. Even a price index designed for stability such as S&P 500 is jumping up and down like crazy. In a changing world, knowledge about costs has a longer shelf life than knowledge about prices.

For instance, power is considered cheap for desktops but expensive for servers and really expensive for mobile devices. In reality, desktops likely consume more power than servers, there being more desktops than servers. So the real costs are not like the prices – and prices change; the rise of mobile computing means rising prices for power-hungry architectures.

It seems to me that, taking the long view, the following makes sense:

It's best to reason in costs and project them to the relevant prices – not forget the underlying costs and "think in prices", so as to not get into habits that will become outdated when prices change.

If you see a high real cost "hidden" by contemporary prices, it's a good bet to assume that at some point in the future, prices will shift so that the real cost will rear its ugly head.

For example, any RISC architecture – ARM, MIPS, PowerPC, etc. – is fundamentally cheaper than, specifically, x86, in at least two ways: hardware costs – area & power – and the costs of developing said hardware. [At least so I believe; let's say that it's not as significant in my view as my other more basic examples, and I might be wrong and I'm only using this as an illustration.]

In the long run, this spells doom for the x86, whatever momentum it otherwise has at any point in time – software compatibility costs, Intel's manufacturing capabilities vs competitors' capabilities, etc. Mathematically or physically fundamental costs will, in the long run, trump everything else.

In the long run, there is no x86, no ARM, no Windows, no iPhone, etc. There are just ideas. We remember ideas originating in ancient Greece and Rome, but no products. Every product is eventually outsold by another product. Old software is forgotten and old fabs rot. But fundamentals are forever. An idea that is sufficiently more costly fundamentally than a competing idea can not survive.

This is why I disagree with the following quote by Bob Colwell – the chief architect of the Pentium Pro (BTW, I love the interview and intend to publish a summary of the entire 160-something page document):

…you might say that CISC only stayed viable because Intel was able to throw a lot of money and people at it, and die size, bigger chips and so on.

In that sense, RISC still was better, which is what was claimed all along. And I said you know, there's point to be made there. I agree with you that Intel had more to do to stay competitive. They were starting a race from far behind the start line. But if you can throw money at a problem then, it's not really so fundamental technologically, is it? We look for more deep things than that, so if all the RISC/CISC thing amounted to was, you had a slight advantage economically, well, that's not as profound as it seemed back in the 80s was it?

Well, here's my counter-argument and it's not technical. The technical argument would be, CISC is worse, to the point where Intel's 32nm Medfield performs about as well as ARM-based 40nm chips in a space where power matters. Which can be countered with an economic argument – so what, Intel does have a better manufacturing ability so who cares, they still compete.

But my non-technical argument is, sure, you can be extremely savvy business-wise, and perhaps, if Intel realized early on how big mobile is going to be, they'd make a good enough x86-based offering back then and then everyone would have been locked out due to software compatibility issues and they'd reign like they reign in the desktop market.

But you can't do that forever. Every company is going to lose to some company at some point or other because you only need one big mistake and you'll make it, you'll ignore a single emerging market and that will be the end. Or, someone will outperform you technically – build a better fab, etc. If an idea is only ("only"?) being dragged into the future kicking and screaming by a very business-savvy and technically excellent company, then the idea has no chance.

The idea that will win is the idea that every new product will use. New products always beat old products – always have.

And nobody, nobody at all has made a new CISC architecture in ages. Intel will lose to a company or companies making RISC CPUs because nobody makes anything else – and it has to lose to someone. Right now it seems like it's ARM but it doesn't matter how it comes out in this round. It will happen at some point or other.

And if ARM beats x86, it won't be, straightforwardly, "because RISC is better" – x86 will have lost for business reasons, and it could have gone the other way for business reasons. But the fact that it will have lost to a RISC – that will be because RISC is technically better. That's why there's no CISC competitor to lose to.

Or, if you dismiss this with the sensible "in the long run, we're all dead" – then, well, if you're alive right now and you're designing hardware, you are not making a CISC processor, are you? QED, no?

Getting back to our subject – based on the assumption that real costs matter, I believe that ugly, specialized hardware is forever. It doesn't matter how much money is poured into general-purpose computing, by whom and why. You will always have sufficiently important tasks that can be accomplished 10x or 100x more cheaply by using fundamentally cheap operations, and it will pay off for someone to make the ugly hardware and write the ugly, low-level code doing low-precision arithmetic to make it work.

And, on the other hand, the market for general-purpose hardware is always going to be huge, in particular, because there are so many things that must be done where specialization fundamentally doesn't help at all.

Conclusion

Hardware can only deliver "efficiency miracles" for operations that are fundamentally cheap to begin with. This is done by lowering dispatching costs and so increasing throughput per unit of energy. The price paid is reduced flexibility.

Some operations, such as high-precision arithmetic and memory access, are fundamentally expensive in terms of energy consumed to reach a given throughput. With these, hardware can still give you more speed through parallelization, but at an energy cost that may be prohibitive.

37 comments

You're totally confused. Memory indirection can be optimized perfectly well – every single CPU does precisely that with its virtual memory system. TLB caches are far older than regular memory caches, and if you were right then any modern operating system would be far slower than running DOS on the same hardware. TLB caches make all memory indirection related to virtual memory access essentially free. Without them computers would be how many times slower than they are now?

Lisp hardware had similar optimizations for memory indirection built into their cache systems. And modern CPUs do various indirect branch prediction etc.

I'm talking about, specifically, p->q->a vs p->a or something along those lines. What you're talking about is (1) TLB optimizations and (2) latency of dependent instructions (that's what branch prediction is about). I'm talking about the raw access to cache memory banks – not the cost of missing the cache, not the cost of address translation, not the cost of waiting for the previous instruction (though I discussed the latter). I'm talking about the raw cost in energy of local memory throughput.

Tomasz brings up an interesting point – although not free (address translation + memory coherence is the main reason why L1 caches in most processors are limited in size to the page size times the number of ways in the cache), virtual address translation is an example of hardware optimizing a chain of loads into a page table in a manner that is more efficient than what you could do in software.

TLBs cache the result of the full set of lookups into the page table.
However, TLBs work for the following reasons:

(1) There's a high probability that the entire set of loads will be reused in the future, (i.e. you're accessing the same page again), and the working set is tiny. In order to provide the illusion of being free, a TLB that runs in parallel with the L1 cache needs to be implemented in flip-flops rather than an SRAM, which limits the practical size to 16-64 entries.

(2) There's a very high read to write ratio, so the TLB can assume the page tables are effectively read-only, and drop the set of cached entries on a page table update

If we try to apply this to e.g. strings stored as singly linked lists, the cacheability and working set assumption (1) is easily violated. Similarly, if we apply this to a high-level language virtual machine like CPython where every value is a pointer to an object, both assumptions (1) and (2) are violated.

I think the take-away from this is that while there are very specific circumstances where memory indirection can be optimized with a hardware feature, the resulting hardware feature is going to come with many caveats, rendering it hard to use for typical high-level language implementations.

I don't agree on some areas. I'm an electronics engineer and have designed hardware too.

We agree that parallel means losing flexibility, but we don't agree that it does not take less power. If you use 1000 units, each running at 1000x lower frequency, to do the same work as one CPU, and power consumption goes with the square of frequency, you are using 1000 × (1/1000)² = 1/1000 of the power for the same work, i.e. 1000 times less energy.

You only need 100-120Hz for video, less for audio, because the brain is a low-frequency, massively parallel computer itself. If I do a CAD/CAE simulation for a design I don't care about 1/100 delay and so on. There are lots of applications for hardware acceleration.

Energy does not model exactly to the square of frequency, especially at low frequencies, but you can also shut down hardware modules you don't use, and all that combines with what you already stated.

You could design a real module that spends 200-300 times less power than what the CPU does. In fact you could see them in every phone, iPad or MacBook Air.

It seems to me that you don't know what you are talking about in the big picture, you see trees but can't see the forest.

@Rune Holm: Of course managing the TLB contents is nothing like managing the actual cache contents – which is why the original comment is totally irrelevant :) There's a lot to do with loads & stores in many specific cases, but not if you have "sufficiently random addresses".

@Jose: well, I mentioned your point in the part about multicore; several slow cores being more energy efficient than one fast one is the same effect as "a real module" running at a low frequency spending less energy than a high-frequency CPU.

However, there's a certain minimal frequency where you gain nothing by lowering it still more. To me, it's interesting what happens at about that range of frequencies, because what happens above it is grossly inefficient in terms of energy consumption and people only target their designs (CPUs, mostly) at such "unhealthy" frequencies for programmer/consumer convenience. And I'm not in the consumer market. BTW, 100MHz at 40nm, for instance, is way below that minimal frequency for a reasonably pipelined design.

Also, this is a second comment pointing out a mistake I didn't even make, and not in a very polite manner. Oh, the joy of the web.

@Morris: with CISC it would be "ugly, general-purpose hardware"; I sort of think ugliness is a price that should buy you something, and with specialisation it buys you efficiency but without, it could only buy compatibility – which is worth a fortune until people toss their old toys and buy new toys and then it's worth nothing.

You are wrong about Intel… They don't make CISC processors. All their designs have been RISC-model since 1995. It's just that this has been hidden under the x86 instruction set, which they emulate since the pentium pro.

Wody, the X86 instruction set is a CISC-style instruction set. Converting the instructions to RISC-style instruction set doesn't make the CPU a RISC CPU, it just means there's overhead that a true RISC CPU wouldn't have to deal with.

Excellent points, though I would throw in maybe adding a FPLD next to the CPU so you can download "specialization" once. Or the CPU cores with DSP like on my Nokia n810. I guess the first question is whether the specialization is intrinsic or temporary to the whole system. There will be time/energy costs in setup, but then it runs fast.

It also goes further. QR Codes have part of the algorithm that uses mod 3 as the operator, however if you do a bit of unrolling you can remove the divide operations and I think I only have one 8×8 multiply:

I'll also add an Amen to the object oriented overload. I don't have a problem with OOP per se – you can do it in plain C, if that paradigm is the right one to solve a problem. But there seems to have been a separate philosophy being taught that everything should be atomized, abstracted, obfuscated, so it becomes setvalueofA(getvalueofB()) and hope the compiler/linker is good enough to untangle it into the load-store.

This was also one reason that Windows CE was such a failure at the time (when it was competing with the Palm Pilot). You can't fool power usages or a battery. The Palm used very efficient routines – basically an original Macintosh processor and code updated. CE ported the Windows kernel with all the bloat and baggage (it had an interrupt jitter making it all but unusable for embedded). If something takes 10x the gates or cycles, or whatever resource, it will drain the battery 10x faster. Adding faster hardware is usually quadratic, i.e. to make it go 2x faster, it will eat 4x the power (and parallelization isn't free – the extra routing adds gates).

The only resource that is rarely tight today is memory, so it is often easier to create tables of results rather than efficient operations (e.g. Rainbow Tables for breaking encryption). That can be traded off for speed and power in some cases. But that doesn't apply to microcontrollers with only a few K.

Regarding CISC vs RISC, what do you think about memory efficiency of instruction encodings?

Most RISC designs I'm aware of use a fixed width (usually 32-bit) encoding, whereas I think x86 instructions average out to about 3.5 bytes per instruction, allowing more instructions to fit in a code cache line. On top of that, allowing instructions to access memory directly can eliminate entire load/store instructions compared to RISC.
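To put rough numbers on the density argument, here is a back-of-the-envelope sketch. The 64-byte line size is an assumption, and 3.5 bytes is just the average quoted above, not a measured figure.

```python
def instructions_per_line(line_bytes, avg_instruction_bytes):
    """Average number of instructions that fit in one cache line."""
    return line_bytes / avg_instruction_bytes


LINE_BYTES = 64  # assumed cache line size

fixed_width = instructions_per_line(LINE_BYTES, 4.0)   # 32-bit fixed encoding
variable_x86 = instructions_per_line(LINE_BYTES, 3.5)  # quoted x86 average
density_gain = variable_x86 / fixed_width               # ratio of the two
```

With these inputs the fixed-width encoding fits 16 instructions per line and the variable-length one a bit over 18, so the gain from encoding alone is on the order of 14%, before counting any load/store instructions that get folded away.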

Of course, optimizing encoding length is not CISC-specific as such (as evidenced by Thumb), but it would seem that CISC in general would enable shorter instructions overall, possibly leading to better memory utilization.

@tz: I actually think of an FPLD/FPGA as a specialized architecture (rather than a "clean slate you can do anything with") – I have a draft about this bit. Likewise, a DSP is a specialized architecture – good at a fraction of things, bad at most. Actually "specialization" is an annoyingly fuzzy term because it's not clear what "general-purpose" means; in this post I allowed myself to use the term because the context was gaining efficiency through supporting what you want in hardware and here, I can implicitly assume that you have a less efficient baseline which you consider something done "without hardware support" – and relative to that hypothetical baseline, the term makes sense. In a broader context, I think there's no such thing as "general purpose hardware" – there are just many different architectures, CPU is one family, FPGA is another, DSP a third, etc. In the broad sense though, we can call X more "general purpose" than Y if it runs more lines of code or sells more units or has more programmers targeting it or a combination; in this sense, I think clearly CPUs are the winner, and in a business sense, "general purpose" is a sensible term even though it's somewhat fuzzy technically.

@Jussi K: x86 binaries, specifically, are indeed somewhat denser than RISC binaries (and ARM binaries are a bit denser than MIPS binaries, for instance, so there are differences within the RISC family, too.)

The thing is, code size did use to be an issue in the old days, but it rarely is these days – most memory is used by data, most instruction accesses hit the cache, and in terms of energy, what matters is the width of your instruction cache memory bus and what it takes to decode the instructions once you fetch them; I believe RISC is cheaper to decode overall.

This shows though that trade-offs are nearly impossible to get right in any kind of a "timeless" fashion. If you have a hairy variable-length encoding at a time when memory is really scarce, and your other option is to have a fixed-sized encoding, presumably the reasonable thing is to look at the prices of the two – contemporary prices – and choose to conserve memory. You know that you made decoding complicated and that's a spending in terms of resources that can come back and byte you when prices shift. But it's still the right trade-off at contemporary prices.

So the only case where "looking at resources" is the sensible thing is if your choice is, waste something or not waste it, without a trade-off; this is the case with int8 vs float – but not really, not if you add, say, development effort to the mix… It's only sensible if you frame the problem as "hardware costs", which I think made sense in the context of my write-up where I try to explain what hardware can and cannot do, but not sensible in a broader case. With trade-offs, I guess all you can do is realize you made one, in terms of fundamental resources, and then watch out for price shifts…

FPGAs seem to be about 10-20x more energy efficient than GPUs, and GPUs are themselves 10x more energy efficient than CPUs.

Anyway, FPGAs do have substantial advantages over CPUs and GPUs in terms of being able to precisely fit arithmetic precision. If you need a 12-bit adder, you use resources for a 12-bit adder. Not resources for a 16-bit adder, etc.

Also, FPGAs can do bit extraction and manipulation hardwired rather than by using generic shift/and instructions, as in your "extract bit 3:7" example.
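For reference, this is what the generic shift/and route looks like in software, as a minimal sketch; the function name and the inclusive bit numbering (bit 0 is the least significant) are choices made for the example.

```python
def extract_bits(word, low, high):
    """Return bits low..high (inclusive, bit 0 = least significant) of word."""
    width = high - low + 1
    mask = (1 << width) - 1      # e.g. 0b11111 for a 5-bit field
    return (word >> low) & mask  # one shift, one AND


def extract_3_to_7_times_13(word):
    """The post's example: bits 3 to 7, then multiply by 13."""
    return extract_bits(word, 3, 7) * 13
```

On a CPU each of the shift, the AND and the multiply is a separately dispatched instruction; hardwired, the "extraction" is just which wires you connect to the multiplier.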

The largest drawback, I think, is still that FPGAs are usually programmed by hardware designers, who are diametrically opposite to software developers in terms of development culture. Their tools are low-level timing diagrams and ancient programming languages such as VHDL. Even though C->VHDL compilers and direct synthesis of C code have been developed, FPGAs are still programmed using VHDL or Verilog.

The biggest hurdle for a software developer who wants to experiment with FPGAs is still that step 1 is to design and build hardware that uses FPGAs. Even though there are several evaluation boards available, you still need the HDL tools.

It has been a long time since you last published; I was wondering whether you had given up.

Deep down your main points stand:
- one still needs to perform the operations required for the task and there is a lower bound as to how much energy you will need.
- A programmable machine using a set of types and operations is inherently less efficient than a fully specialised component which can use any type and any operation as long as it can be clock/power gated when not needed
- Dollar is all that matters in the end. If someone else can make it so that your customer can make a better buck out of the product, you will lose.

I'll be careful from here on; I am a bit worried about disclosing trade-sensitive information from my current or my previous employer.

I'd just want to point out that:
1) SMT is not as bad as you make it out to be, at least once you have gone through the hassle of getting a full OOO, multi-issue pipeline. I would not be surprised if designs based on the technology became fashionable again. Barrel processors, on the other hand, probably not. I would be even less surprised to see x86 with wider execution pipelines and more than 2-way SMT. But then, I don't work for Intel, so it is just one of my guesses on how they will fight the lighter many-core designs which seem to be entering the server market. Essentially, unused powered silicon is a full energy loss. Low-level clock gating is hard (as in small clock domains), and low-level power gating is even harder. Also it can be good for 3. Making wider pipelines and a good I-cache means that you can boost instruction throughput for specific types of parallel-friendly streams, yet having several instruction streams means that you can perform OK on high-latency workloads by running several in parallel. Scheduling threads in such a setting is an interesting issue.

2) Dynamic consumption is a lot less of a problem than one would think in modern devices: leakage is huge on the latest processes, and clock distribution is a massive energy hog on high-speed designs. At less cutting-edge nodes and at lower frequencies, it is still very relevant. But we are talking CPUs and GPUs here. Comparing them with special-purpose HW is a bit of a difficult comparison. On one side you have something running at 1-2 GHz on cutting-edge processes, on the other side something running at 100-400 MHz a couple of nodes behind, most of the time.

3) One real difference between accelerators and general processors is memory coherency. Accelerators will typically be given private memory and a minimalistic MMU just so that the device memory space can be abstracted. On the other hand, memory coherency in coherent core systems is an arduous issue (and an expensive one in terms of power).

Time and power consumption are not the only costs imposed by the use of idiot hardware.

What is the total economic cost of the existence of buffer overflows of every kind? They simply don't need to exist. Any of them. Implement hardware bounds checking for *all* array accesses, and hardware type-checking for all arithmetic ops (1970s state of the art!) and buffer overflows vanish, taking with them the lion's share of known security exploits, crashes, and other digital miseries.

x86 has had the BOUND instruction for hardware-accelerated bounds checking since 80186 in 1982. It also has had the INTO instruction for hardware-accelerated integer arithmetic overflow checks since 8086 in 1978. Not that it seemed to help.

And even for typical web languages that do provide array bounds checking and widening of integer arithmetic into infinite-sized integers, the security problems did not vanish, they just shifted into e.g. SQL injection and cross-site scripting.

I believe secure programming is a programmer and programming language problem, not a hardware problem.

@Mattias Ernelli: there are many differences in the table even within, say, the ARM family that are hard to understand (not directly related to, say, frequency); I guess knowing more about the systems and the benchmark could help.

@Ben D (glad to hear from you!): I don't think I said much about OOO or SMT – not about how "bad" they were… As to dynamic vs static consumption – it depends on your process. At 40nm LP, static consumption is almost zero at room temperature and still very small around 100C. I don't compare CPUs designed for high frequencies with accelerators designed for low power – there are CPUs designed or at least synthesised for low power. Regarding memory coherence in accelerators – sure, one of the whole slew of things accelerators don't have to worry about, and making CPUs really, really inefficient…

@Stanislav Datskovskiy – I agree with Rune Holm I guess… With the twist that it's really a hardware/software co-design problem – hardware would have the features if much of the software used them, much of the software would use the features if almost all of the hardware had them. It's a chicken and egg problem in a situation where hardware and software evolution is only very loosely coordinated.

I guess your question is, if you need 128 bits of precision, then is hardware support better than emulation, right? I don't know, really – huge multipliers in particular make no sense starting at some size, I think, better to implement using 4 multiplies using 2x smaller hardware multipliers or some such. But I never cared about this sort of thing so I wouldn't really know where to draw the line.
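The "4 multiplies using 2x smaller multipliers" idea can be sketched like this. It is a model rather than anything a real chip does verbatim: mul32 stands in for a 32x32 hardware multiplier, the operands are assumed to be unsigned 64-bit values, and all names are invented for the example.

```python
HALF_BITS = 32
HALF_MASK = (1 << HALF_BITS) - 1


def mul32(a, b):
    """Stand-in for a 32x32 -> 64-bit unsigned hardware multiplier."""
    assert 0 <= a <= HALF_MASK and 0 <= b <= HALF_MASK
    return a * b


def mul64_from_halves(x, y):
    """Unsigned 64x64 -> 128-bit product built from four 32x32 multiplies."""
    x_lo, x_hi = x & HALF_MASK, x >> HALF_BITS
    y_lo, y_hi = y & HALF_MASK, y >> HALF_BITS
    lo_lo = mul32(x_lo, y_lo)
    lo_hi = mul32(x_lo, y_hi)
    hi_lo = mul32(x_hi, y_lo)
    hi_hi = mul32(x_hi, y_hi)
    # Partial products are shifted into place and summed.
    return (hi_hi << (2 * HALF_BITS)) + ((lo_hi + hi_lo) << HALF_BITS) + lo_lo
```

The same split applies one level up for 128-bit operands; what it costs is the extra adds and shifts, plus dispatching four multiplies instead of one.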

My guess is that a nice system could use software emulation sped up by fairly trivial hardware support for shuffling bits – extracting/packing mantissas, exponents and signs, dealing with parts of large mantissas, etc. – and that such emulation with rudimentary hardware support would have about the same energy efficiency as full-blown hardware support for 128b floating point; but it's really just a guess.
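To show the kind of bit shuffling meant here, a sketch that pulls the sign, exponent and mantissa fields out of a 64-bit double (used instead of a 128-bit format only because Python has it built in). The function name is made up; the field widths are those of IEEE 754 binary64.

```python
import struct


def split_double(value):
    """Split a binary64 float into (sign, biased exponent, mantissa field)."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11-bit biased exponent
    mantissa = bits & ((1 << 52) - 1)     # 52-bit fraction, implicit 1 not included
    return sign, exponent, mantissa
```

In software these are three shift/mask pairs per operand on every emulated operation, which is why even rudimentary hardware help with the unpacking might go a long way.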

Oh, as to why there are no processors with native/nice support for quadruple precision: I think it's just because it's a tiny market – almost no software would use the feature, so why bother. As to processors targeted at supercomputers – perhaps even scientists don't care about quad precision and perhaps they do have some sort of rudimentary support that helps bring emulation to about the same energy efficiency as full-blown hardware support would have; I dunno.

Over the years I noticed that hardware beyond the level of the simplest RISC architecture tended to resemble fossilized software intended to do something faster, or cheaper with regard to some resource, or better in some way. Faster and cheaper were usually measurable. Better was much fuzzier.

A real question with floating point arithmetic: "Does it matter that you almost always get a wrong (imprecise) answer?"
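A tiny sketch of what "imprecise" means in practice, using Python's built-in doubles against exact rational arithmetic; the variable names are just for the example.

```python
from fractions import Fraction

float_sum = 0.1 + 0.2                             # binary floating point
exact_sum = Fraction(1, 10) + Fraction(2, 10)     # exact rational arithmetic

float_matches = (float_sum == 0.3)                # False: the sum is slightly off
exact_matches = (exact_sum == Fraction(3, 10))    # True

# How far the floating-point sum is from the exact value 3/10.
error = abs(Fraction(float_sum) - Fraction(3, 10))
```

The error is tiny but not zero, so whether it matters depends entirely on what is done with the result afterwards.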

Looks like it’s in free fall: “Windows PCs sales in U.S. retail stores fell a staggering 21% in the four-week period from October 21 to November 17, compared to the same period the previous year. In short, there is now falling demand for x86 processors.” The rest of that article looks at plenty more signs, but none so visceral as that one.

I do not expect the desktop computer to change all that much – unless you mean “PC as a platform” in the narrow Wintel sense and your question is about what OS is going to run on ARM desktops. Microsoft is not likely to deliver one, Apple is highly likely, and I really hope someone else steps into that void. (ChromeOS…? Eh.) But an Apple-only future would be a far bleaker outlook than a Microsoft-only one ever was, in spite of their far better products. (Partly, actually, because of that. People are delighted to have no other choice than Apple, in droves.)

I meant "PC as a platform" in the sense of having standard protocols and building your system using devices from disparate vendors that use these protocols; and in the sense of being able to run an OS that the machine didn't come preinstalled with. And even in the narrower sense – Windows is wonderful in ways that nothing else is. It's not as bad as an Apple product in terms of the choices it dares to take away from you and it offers compatibility that a Linux distribution is yet to achieve (including Android – I can't run simple games on my 2-year-old phone without upgrading the OS).

As to Apple: I never found their products to be substantially different from the competition, and I think their design is rather ugly, especially rounded rectangles everywhere. My explanation of their recent runaway success is something that you could call hypnosis, and I give them maybe 5 years, 8 years, tops to fall back into relative obscurity now that the hypnotist has passed away.

Having programmed for both x86 and PowerPC I found, to my surprise, that I prefer x86. Well, x64 anyway. Yes, it's messy, but some of that mess adds value. For instance, loading constants that use more than 16 bits is trivial on x86 but on PowerPC requires either multiple instructions or reading the constant from a 64-KB section pointed to by a reserved register. This is one instance of a general problem caused by fixed-size instructions — sometimes you don't have enough bits to express the instruction you want. In x86 land you can *always* extend the instruction set.
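A sketch of the multiple-instruction route, modelling a 32-bit register loaded in two steps with 16-bit immediates (load the upper half shifted, then OR in the lower half). The function names are invented and this is a model of the idea, not an assembler.

```python
def split_constant(value):
    """Split a 32-bit constant into (high 16 bits, low 16 bits)."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF


def load_in_two_steps(high, low):
    """Model of a two-instruction constant load with 16-bit immediates."""
    register = high << 16   # first instruction: immediate placed in the upper half
    register |= low         # second instruction: OR the lower half in
    return register
```

With a variable-length encoding the whole 32-bit immediate simply rides along inside one longer instruction.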

x86 instructions can also be far more powerful with addressing modes like eax += [const+reg1+reg2*8].
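Spelled out as a sketch, that one instruction does an address computation, a load and an add. Here memory is modelled as a plain dict from address to value, and the names are made up for the example.

```python
def effective_address(const, base, index, scale=8):
    """Address computed by the addressing mode: const + base + index * scale."""
    return const + base + index * scale


def add_from_memory(acc, memory, const, base, index, scale=8):
    """Model of `acc += [const + base + index*scale]`."""
    return acc + memory[effective_address(const, base, index, scale)]
```

On a strict load/store machine the same work would typically be a shift, one or two adds, a load and a final add, each dispatched separately.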

So those are the benefits of x86. The costs, on the other hand, only go down, as the portion of the die associated with decoding becomes a smaller fraction. Looked at that way it seems inevitable that RISC chips in a particular market will have a temporary advantage (when decode size/power matters) but if they fail to win quickly then their advantage will disappear.

Or maybe you're right and Intel will eventually lose. Certainly winning streaks are hard to maintain.

You mean you programmed in assembly and liked x86 better? That'd make sense; RISC is nicer to target a compiler at, not to hand-code for. As to the portion of the die – look at ARM-based designs with 4 high-end cores and 4 low-end cores. How big would the low-end cores be if they were x86? It's not true that larger chips always use larger cores, they could use more instead, and then chips increasingly shrink over time as power and yield constraints prevent us from fully reaping Moore's law benefits.

Winning streaks are impossible to maintain forever. Now if you had a CISC contender, then CISC would have a chance in the long run, but there isn't any.

@Aristotle: My opinion is that the idea of combining an A5 with an x64 is mad. We're not yet at the point in heterogeneous computing where that makes sense. Wait for a few more years until the GPU is a first class citizen of the processor interconnect fabric. Right now it's on the wrong end of a PCIe link and all access will need an arbitration chip or some equally ugly hack (not that that doesn't have recent examples like Optimus v1).

@Yossi, et al. (mostly et al.)
Like you I expect RISC-based designs to dominate long-term but, while nothing lasts forever, forever is a long time. The other comment on this, mostly applicable to ARM and POWER, is that RISC is generally taken as a base to build on: complexity is added back in every time something turns out to work better with a specialised instruction, e.g. SIMD.

Though almost two years after the last post, I still feel the need to react on that misleading RISC vs CISC debate.

RISC and CISC terms have been extrapolated to qualify very different notions: variable length of the instruction set, symmetry of the encoding, decoupling memory accesses from computations, general-purpose registers vs specialized ones. However, initially the only difference between RISC and CISC terms is Complex vs Simple (Reduced). Of course defining the exact boundary between Complex and Simple is a hard task.

In modern out-of-order processors, the complexity of decoding is not that important (though I must admit that x86[_64] brought instruction decoding to an unsuspected level of brainf***king). Maybe one of the major instruction features that needs to be decoded as soon as possible is whether or not the instruction is a branch, to do everything that is possible to prevent breaking the instruction fetch. In a second step, easy identification of register dependencies is really nice to have. So-called RISC architectures such as Power, ARMv8, Alpha (and others) do provide simple decoding. But ARMv7 (RISC?) makes branch identification almost harder than in x86 (CISC). In fact, the first pages of the ARMv8 specifications acknowledge, in a very interesting discussion, the weakness of ARMv7.

In the end, I think Wody and Alex are both right: Intel architectures are internally obviously RISC-like; and yes, the Instruction Set is complex. But apart from designing extremely-ultra-super-low-end cores, I don't think that the ISA overhead matters anymore anyway. If you were to design an Instruction Set from scratch you would obviously not do anything close to an x86 ISA; however, Intel is not starting from scratch…

I am very late to this discussion.
But I feel you have A FUNDAMENTAL PROBLEM in your logic.

The COST of any given operation you discuss (8/32/load/multiply/etc) is not fixed. In fact, it has an exponential range of values.

The LOAD instruction takes a picojoule if it is an L1 hit. It can take 10^6 times that if it triggers some sort of NUMA page access (or whatever).
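To see how much that range dominates the average, a sketch with a deliberately crude two-outcome model: every load is either a cheap hit or an expensive far access. The 1 pJ and 10^6 pJ figures are just the orders of magnitude quoted above, and the hit rates are hypothetical.

```python
def expected_load_energy_pj(hit_rate, hit_pj=1.0, miss_pj=1.0e6):
    """Average energy per load in picojoules under a two-outcome model."""
    return hit_rate * hit_pj + (1.0 - hit_rate) * miss_pj


all_hits = expected_load_energy_pj(1.0)       # every load is a cheap hit
mostly_hits = expected_load_energy_pj(0.999)  # one load in a thousand goes far
```

Even at a 99.9% hit rate the average comes out around a thousand times the cost of a hit, so the "cost of a load" is mostly a statement about access patterns.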

Generations of architects have brought us a finely tuned ecosystem where imperative code is compiled for a certain instruction set, running on a system with very specific trade-offs.

The amortized cost of an average operation is then somewhat predictable.
As soon as you try a "new" component, e.g. excess indirection or fuzzy arithmetic, of course the system is thrown out of whack.

But give it a few decades and there is no reason why a cache system can't make a->p1->p2 faster than accessing a[1],a[2].

Similar arguments can be made about many-core parallelism taking LESS ABSOLUTE POWER, e.g. by always selecting the core closest to the memory or whatever (or by lowering the frequency, as you mentioned).

So your entire argument is limited to the CURRENT status quo rather than an absolute limit.