However, OpenCL the language is not really tied to the GPU architecture. That is, hardware could run OpenCL programs and have an architecture very different from a GPU, resulting in a very different performance profile.

OpenCL is possibly the first programming language promising portability across accelerators: "OpenCL is for accelerators what C is for CPUs". Portability is disruptive. When hardware vendor A displaces vendor B, portable software usually helps a great deal.

Will OpenCL – "the GPGPU language" – eventually help displace GPGPU, by facilitating "GP-something-else" – "general-purpose" accelerators which aren't like GPUs?

We'll discuss this question on general grounds, and consider two specific examples of recent OpenCL accelerators: Adapteva's Parallella and ST's P2012.

Why displace GPGPU?

First of all, whether GPGPU is likely to be displaced or not – what could "GP-something-else" possibly give us that GPGPU doesn't?

There are two directions from which benefits could come – you could call them two opposite directions:

(1) Let software (ab)use more types of special-purpose accelerators. GPGPU lets you utilize (abuse?) your GPU for general-purpose stuff. It could be nice to have "GPDSP" to utilize the DSPs in your phone, "GPISP" to utilize the ISP, "GPCVP" to utilize computer vision accelerators likely to emerge in the future, etc. From GPGPU to GP-everything.

(2) Give software accelerators which are more general-purpose to begin with. GPGPU means doing your general-purpose stuff under the constraints imposed by the GPU architecture. An OpenCL accelerator lifting some of these constraints could be very welcome.

Could OpenCL help us get benefits from any of the directions (1) and (2)?

(1) is about making use of anal-retentive, efficiency-obsessed, weird, incompatible hardware. It's rather hard, for OpenCL or for any other portable, reasonably "pretty" language.

OpenCL does provide constructs more or less directly mapping to some of the "ugly" features common to many accelerators, for example:

Explicitly addressed local memory (as opposed to cache)

DMA (bulk memory transfers)

Short vector data types to make use of SIMD opcodes

Light-weight threads and barriers
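To see how directly these constructs map, here's a small OpenCL kernel sketch using all four – __local memory, an async bulk copy standing in for DMA, a float4 vector type, and a work-group barrier. The kernel itself is a made-up example, not from any of the platforms discussed:

```c
__kernel void scale(__global const float4 *in,
                    __global float4 *out,
                    __local float4 *tile)  /* explicitly addressed local memory */
{
    size_t lid = get_local_id(0);
    size_t grp = get_group_id(0);
    size_t lsz = get_local_size(0);

    /* DMA-style bulk transfer from global to local memory */
    event_t ev = async_work_group_copy(tile, in + grp * lsz, lsz, 0);
    wait_group_events(1, &ev);

    /* float4: maps to SIMD opcodes where the hardware has them */
    float4 v = tile[lid] * (float4)(2.0f, 2.0f, 2.0f, 2.0f);

    barrier(CLK_LOCAL_MEM_FENCE);  /* light-weight work-group barrier */
    out[grp * lsz + lid] = v;
}
```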

But even with GPUs, OpenCL can't target all of the GPU's resources. There's the subset of the GPU accessible to GPGPU programs – and then there are the more idiosyncratic and less flexible parts used for actual graphics processing.

With accelerators such as DSPs and ISPs, my guess is that today most of their value – acceleration ability – is in the idiosyncratic features that can't be made accessible to OpenCL programs. They could evolve, but that's a bit far-fetched and we won't dwell on it now. In their current state, OpenCL is too portable and too "pretty" to map to most of these accelerators.

What about direction (2)? (2) is about making something that's more efficient than CPUs, but as nice and flexible as possible, and more flexible than GPUs.

As a whole, (2) isn't easy, for various reasons we'll discuss. But if we look, in isolation, at OpenCL the language, then it looks like a great language for targeting "faster-than-CPUs-and-more-flexible-than-GPUs" kind of accelerator.

What could such an accelerator give us that GPUs don't?

One important feature is divergent flow. GPUs are SIMD or SIMT hardware; either way, they can't efficiently support something like this:

if(cond(i)) {
    out[i] = f(i);
}
else {
    out[i] = g(i);
}

What they'll end up doing is, essentially, compute f(i) and g(i) for all values of i, and then throw away some of the results. For deeply nested conditionals, the cost of wasted computations can make the entire exercise of porting to a GPU pointless.

We'll now have a look at two OpenCL-compatible accelerators which promise to efficiently support divergent threads – or outright independent threads doing something completely unrelated. We'll briefly compare them, and then discuss some of their common benefits as well as common obstacles to their adoption.

Adapteva's Parallella

Actually, the chip's name is Epiphany – Parallella is the recently publicized name of Adapteva's planned platform based on Epiphany; anyway.

Adapteva's architecture is a 2D grid of processors with a mesh interconnect. To scale, you can have a chip with more cores – or you can have a 2D grid of chips with some of the inter-core communication seamlessly crossing chip boundaries. Each of the (scalar) processors executes its own instruction stream – no "marching in lockstep", fittingly for divergent flow.

There are no caches; a memory address can map to your local memory, or the local memory of some other processor in the grid, or to external memory. Access latency will vary accordingly; access to local memories of close neighbors is quicker than access to far neighbors. All memory access can be done using either load/store instructions or DMA.

(Note that you can reach far neighbors – unlike some more "fundamentalist" proposals for "2D scalability" where you can only talk to immediate neighbors, period. I think that's over the top; if you want to run something other than the game of life, it's awfully handy to have long communication paths – as do most computers ranging from FPGAs to neurons, some of which have really long axons.)

Stats:

32K memory per core (unified – both instructions and data)

4 banks that can help avoid contention between loads/stores, instruction fetching and DMA traffic

2-issue cores (one floating point operation and one integer or load/store operation)

800 MHz at 28nm using a low-power process, as opposed to high-speed (my bet is that it's hard to top 800 MHz at 28nm LP – any evidence to the contrary?)

ST's P2012

The P2012 architecture is also, at the top level, a grid of processors with a mesh interconnect. One stated motivation is the intra-die variability in future process nodes: some cores will come out slower than others, and some will be unusable.

It is thus claimed that a non-uniform architecture (like the ones we have today – CPUs and a boatload of different accelerators) will become a bad idea. If a core happens to come out badly, and it's not like your other cores, you have to throw away the entire chip. Whereas if cores are all alike, you leave the bad ones unused, and you may still have enough good ones to use the chip.

Interestingly, despite this stated motivation, the P2012 is less uniform and has higher granularity than Epiphany. Firstly, there's a provision for special-purpose accelerators in the grid. Secondly, the top-level mesh connects, not individual cores, but clusters of 16 rather tightly-coupled cores (each with its own flow of control, however – again, good for divergence).

Similarly to Epiphany, data is kept in explicitly addressed local memory rather than cache, and you can access data outside the cluster using load/store instructions or DMA, but you'll pay a price depending on the distance.

However, within a cluster, data access is uniform: the 16 cores share 256K of local data memory. This can be convenient for large working sets. Instructions are private to a core – but they're kept in a cache, not a local memory, conveniently for large programs.

Each of the architectures can have many different implementations and configurations. It seems fair to compare a 28nm 64-core Epiphany chip with a 28nm 69-core P2012 chip (or at least fair as far as these things go). Each system has its own incompatible native programming interface, but both can also be programmed in OpenCL.

Here's how Epiphany compares to P2012:

Power: 1x (2W)

Core issue width: 1x (2-issue)

Local memory size: 1x (32K per core)

Frequency: 1.33x (800/600)

Core area efficiency: 1.7x (0.217/0.128)

I think it's a fine achievement for Adapteva – a 5-person company (ST has about 50,000 employees – of course not all of them work on the P2012, but still; Chuck Moore's GreenArrays is 18 people – and he's considered the ultimate minimalist, and develops a much more minimalistic product which, for instance, certainly won't run OpenCL programs).

This is not to say that these numbers are sufficient to compare the architectures. For starters, we assume that the power is the same, but we can't know without benchmarking. Energy consumption varies widely across programs – low power process brings leakage down to about zero at room temperature, so you're left with switching power which depends on what code you run, and on what data (multiplying zeros costs almost nothing compared to multiplying noise, for instance).

Then there are programming model differences, ranging from the extent of compliance of floating point to the IEEE standard to the rather different memory models. In the memory department, the ability of P2012 cores to access larger working sets should somewhat negate Epiphany's raw performance advantage on some workloads (though Epiphany cores might have lower latency when accessing their own banks). But then two different 2-issue cores will generally perform differently – you need thorough benchmarking to compare.

So what are these numbers good for? Just for a very rough, ballpark estimation of the cost of this type of core. That is, a core which is flexible enough to run its own instruction stream – but "low-end" enough to burden the programmer with local memory management, and lacking much of the other amenities of full-blown CPUs (speculative prefetching, out-of-order execution, etc.)

Our two examples both point to the same order of magnitude of performance. Let's look at a third system, KALRAY's MPPA – looking more like P2012 than Epiphany, with 16-core clusters and cores sharing memory.

At 28nm, 256 cores are reported to typically consume 5W at 400 MHz (Adapteva and ST claim to give worst-case numbers). That's 20mW per core compared to Epiphany's 25mW – but Epiphany runs at 2x the frequency. If we normalize for frequency, Epiphany comes out 1.6x more power-efficient – and that's when we compare its worst-case power to MPPA's typical power.

MPPA doesn't support OpenCL at the moment, and I found few details about the architecture; our quick glance is only good to show that these "low-end multicore" machines have the same order of magnitude of efficiency.

So will OpenCL displace GPGPU?

The accelerators of the kind we discussed above – and they're accelerators, not CPUs, because they're horrible at running large programs as opposed to hand-optimized kernels – these accelerators have some nice properties:

You get scalar threads which can diverge freely and efficiently – this is a lot of extra flexibility compared to SIMT or SIMD GPUs.

For GPGPU workloads that don't need divergence, these accelerators probably aren't much worse than GPUs. You lose some power efficiency because of reading the same instructions from many program memories, but it should be way less than a 2x loss, I'd guess.

And there's a programming model ready for these accelerators – OpenCL. They can be programmed in other C dialects, but OpenCL is a widespread standard, and it lets you use features like explicitly managed local memory and DMA in a portable way.

From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?

There are several roadblocks, however. First of all, there's no killer app for this type of thing – by definition. That is, for any killer app, almost certainly a much more efficient accelerator can be built for that domain. Generic OpenCL accelerators are good at accelerating the widest range of things, but they don't excel at accelerating anything.

There is, of course, at least one thriving platform which is, according to the common perception, "good at everything but excels at nothing" – FPGA. (I think it's more complicated than that but I'll leave it for another time.)

FPGAs are great for small to medium scale product delivery. The volume is too small to afford your own chip – but there may be too many things to accelerate which are too different from what an existing chip is good at accelerating. Flexible OpenCL accelerator chips could rival FPGAs here.

What about integrating these accelerators into high-volume chips such as application processors so they could compete with GPUs? Without a killer app, there's a real estate problem. At 100-150 mm^2, today's application processors are already rather large. And the new OpenCL accelerators aren't exactly small – they're bigger than any domain-specific accelerator.

Few chips are likely to include a large accelerator "just in case", without a killer app. Area is considered to be getting increasingly cheap. But we're far from the point where it's "virtually free", and the trend might not continue forever: GlobalFoundries' 14nm is a "low-shrink" node. Today, area is not free.

Of course, a new OpenCL accelerator does give some immediate benefit, so it isn't a purely speculative investment: you could speed up existing OpenCL applications. But for existing code which is careful to avoid divergence, the accelerator would be somewhat less efficient than a GPU, and it wouldn't do graphics nearly as well as the GPU – so it'd be a rather speculative addition indeed.

What would make one add hardware for speculative reasons? A long life cycle. If you believe that your chip will have to accelerate important stuff many years after it's designed, then you'll doubt your ability to predict exactly what this stuff is going to be, and you'll want the most general-purpose accelerator.

Conversely, if you make new chips all the time, quickly sell a load of them, and then move on to market your next design, then you're less inclined to speculate. Anything that doesn't result in a visibly better product today is not worth the cost.

So generic OpenCL accelerators have a better shot at domains with long life cycles, which seem to be a minority. And then even when you find a vendor with a focus on the long term, you have the problem of performance portability.

Let's say platform vendor A does add the new accelerator to their chip. Awesome – except you probably also want to support vendor B's chips, which don't have such accelerators. And so efficient divergence is of no use to you, because it's not portable. Unless vendor A accounts for a very large share of the market – or unless it's a dedicated device and you write a dedicated program and don't care about portability.

OpenCL programs are portable, but their performance is not portable. For instance, if you use vector data types and the target platform doesn't have SIMD, the code will be compiled to scalar instructions, and so on.

What this means in practice is that one or several OpenCL subsets will emerge, containing features that people count on to be supported well. For instance, a relatively good scenario is, there's a subset that GPU programmers use on all GPUs. A worse scenario is, there's the desktop GPU subset and the mobile GPU subset. A still worse scenario is, there's the NVIDIA subset, the AMD subset, the Imagination subset, etc.

It's an evolving type of thing that's never codified anywhere but has more power than the actual standard.

Standards tend to materialize partially. For example, the C standard permits garbage collection, but real C implementations usually don't collect garbage, and many real C programs will not run correctly on a standard-compliant, garbage-collecting implementation. Someone knowing C would predict this outcome, and would not trust the standard to change it.

So with efficient divergence, the question is: will this feature make it into a widely used OpenCL subset, even though it's not part of any such subset today? If it doesn't, widespread hardware is unlikely to support it.

Personally, I like accelerators with efficient divergence. I'll say it again:

"From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?"

From an evolutionary standpoint though, it's quite the uphill battle. The CPU + GPU combination is considered "good enough" very widely. It's not impossible to grow in the shadow of a "good enough", established competitor. x86 was good enough and ARM got big, gcc was good enough and LLVM got big, etc.

It's just hard, especially if you can't replace the competitor anywhere and you aren't a must-have. A CPU is a must-have and ARM replaces x86 where it wins. A compiler is a must-have and LLVM replaces gcc where it wins. An OpenCL accelerator with efficient divergence – or any other kind, really – is not a must-have and it will replace neither the CPU nor the GPU. So it's quite a challenge to convince someone to spend on it.

Conclusion

I doubt that general-purpose OpenCL accelerators will displace GPGPU, even though it could be a nice outcome. The accelerators probably will find their uses though. The following properties seem favorable to them (all or a subset may be present in a given domain):

Small to medium scale, similarly to FPGAs

Long life cycles encouraging speculative investment in hardware

Device-specific software that can live without performance portability

In other words, there can be "life between the CPU and the GPU", though not necessarily in the highest volume devices.

Good luck to Adapteva with their Kickstarter project – a computer with 64 independent OpenCL cores for $99.

An alternative way to handle the branching in your code snippet: you compute everything – just as in the "standard" pattern – and you have another multiply-and-add, but you have only one commit to memory. It's sometimes better even with nested conditions.

I disagree with the premise that "most of their value – acceleration ability – is in the idiosyncratic features that can't be made accessible to OpenCL programs."

The idiosyncratic features that aren't exported to OpenCL are the rasterizer (vertex interpolator) and the framebuffer compositor (z-buffer tests, blending, a few other things). These are valuable and graphics-centric but account for a small amount of silicon. (The other specialized silicon that's part of the "GPU special sauce" is the filtered texture units, which *are* exported by OpenCL — they compute arbitrary expensive function approximations cheaply by table lookup.)

Roughly speaking, the rest of the chip is a lot of vector cores. The thing about these vector cores is that we can have orders of magnitude more of them than CPUs do, because we aren't spending any silicon on branch prediction, speculative execution, or out-of-order execution. Note also that GPU languages don't support pointers. Put that all together and the resulting restrictions allow very high throughput for data-intensive calculations. (Also, we have many more threads than cores, so threads blocked on memory latencies are swapped out for threads whose loaded data is ready to go.)

The value GPUs now deliver comes from the massive throughput of massive programmable parallelism, which we get by throwing away CPU features that aren't actually necessary for the graphics-specific task of processing streams of data packets independently. What's interesting is that a surprising number of jobs can be cast into this format. (E.g. see the classic Google MapReduce paper for a few examples.) Even if this format isn't the most "efficient" way to structure the problem, the brute-force strength of the massive parallelism now available in GPUs can make it a performance win.

The "64 core" version of Parallella is supposed to cost $199, the $99 Parallella is based on the 16 core chip. In either case, the board will have a dual core ARM cpu making the total cores 18 (for $99) or 66 (for $199).

I've long wondered if someone could use the ZMS-08 from ZiiLabs as a GPGPU. OK, it is targeted at mobile devices, it isn't 64-bit and doesn't have a fancy interconnect, but it does have a programmable interface bus. Also, it has 4x 12-bit video IOs which could be used as general-purpose interfaces.

Probably a case of trying to make a square peg fit in a round hole, but they are substantially cheaper than $99.

It's worth noting a GPU limitation you've not illustrated clearly. GPUs operate on a bus and a tiered cache architecture. They are highly optimized to work on sequential streams of data (vertices) and to output to a fixed result buffer more or less sequentially (the backbuffer). GPUs are very bad at random-access data crunching and random-write data storage. That's the main reason GPUs are very bad, for instance, at n-body simulations and raytracing. Mid-end GPUs are so bad at these tasks that you can routinely write algorithms in interpreted languages that are hundreds of times slower than C and still outperform those GPUs on the CPU.

What's so promising about the new chips is that they don't look like the sequential stream processors that GPUs are. They look like they could perform much better at random read/write than GPUs, which matters a lot.

@Florian: CUDA devices are fine at rather random access – for example, parallel random access to data small enough to fit in, and explicitly allocated at, what Nvidia calls "shared memory". You get problems if you want random access to swaths of data (though I think highest-end devices come with caches for that, so they'd be tied with CPUs at some point perhaps).

In this sense, the accelerators that I mentioned are not necessarily better. You get fine random access when you hit the per-core/per-cluster explicitly managed local memory, worse latency for nearby local memories, and no data cache at all to speed up access to DRAM.

So I didn't mention this difference because I'm not sure about its significance (perhaps there's some advantage in being able to access neighbors; but for "sufficiently random access", you still need to manually plan all your accesses and in this sense a CPU is much better.)

I am not sure about your conclusion: "Long life cycles encouraging speculative investment in hardware".

If you mean generic HW, you are probably right, but if you already have the generic HW (the GPGPU) and you want to replace it taking a speculative investment is actually worse in long life cycle since it will take you more time to fix if you failed. Isn't it?

If you are the Ofer that I think you are, we better discuss this elsewhere for a variety of reasons :) One thing though – DSP had a footprint in APs for a very long time, and it was way more generic/programmable than GPUs for a very long time. The fact that GPUs are now more widely targeted means that it's not just a question of footprint but what your evolution leads to and this has to do with your base architecture. Different architectures evolve in different directions; a GPU will not evolve into a DSP and vice versa. So something with a footprint will evolve, but it can only evolve into a certain range of things, and this range determines the range of applications where it can find practical uses.

I was looking at GPU acceleration of petabyte-scale seismic data processing about 5 years ago. The incumbent approach then was to use several thousand dual-die, dual-core blades at 2.2 GHz managed via MPI.

While this is the kind of problem that multiple GPUs should handle well, even the later introduction of CUDA made it difficult for the programmer to achieve the rated execution capacity of the devices.

I see great promise in Epiphany and I suspect that Parallella (particularly the 64-core — or 4×16, as I prefer to look at it) will open the doors to a wide spectrum of applications that the power, heat, and programming model complexity of GPUs make impractical today.

I suspect — as has been the case many times — that one technology will not displace the other; rather, each will find its use within the domain of its intrinsic strength. Many computationally intensive applications are also graphically intensive. Consider, for example, the challenges involved in a complex SCADA master control station with forward-looking simulation. Optimal real-time visualization of several hundred or thousand RTU inputs is enough to tax a GPU. Even though the data is small, it comes from a wide range of sources and is continually changing; moreover, processing the delta data in real time and generating simulation scenarios does not lend itself to a highly linear instruction stream. The ability to "asynchronize" (or temporally decouple) simulation computation from visualization makes for a very robust architecture.

Likewise I see great promise in the potential for real-time optimization of diagnostic imaging in which tightly architected parallel pattern recognition can detect anomalies and change imaging granularity in response.

In work more familiar to many (h.264 encoding), both the challenges of existing MP/MC architectures and the limited practical success in applied GPU acceleration underscore the need for a simpler parallel processing coding model with a flat memory model. That need would seem one that Epiphany may be suited to satisfy.

Epiphany's memory model is not flat – in fact it's somewhat less flat than, say, CUDA's. It's nominally flat because you can use pointers to access anything using the same code, but performance gets worse the farther the memory actually is. In CUDA you have multiple execution cores working with many "equal" banks for close ("shared") memory. Epiphany's model is only flat if you don't care about performance – in which case so is CUDA's (just put everything in DRAM), and which isn't a sensible use case.

The flexibility advantage of Epiphany over CUDA or other GPU programming model is efficient divergence, not the memory model, at least as I see it; the memory model is different and you can argue for or against it, but none of the models is a straightforward flat thing.

Great writeup. My group is researching how to improve programmability on systems such as P2012, Adapteva, etc. We're calling systems that don't have caches and instead use software-managed DMA transfers to move data around 'explicitly managed systems'. Check out my site for a paper on our work.

I agree that OpenCL can significantly help with programming for accelerators. From my point of view, it helps the programmer spawn lots of kernels as well as explicitly declare the global/local accesses of each kernel. However, there is definitely room for improvement. The boilerplate code required to use OpenCL is ridiculously high and a serious impediment to productivity. There's a lot of room to improve the OpenCL API on the host side.

Probably I'm missing something since I'm kinda out of my depth, but I'm also a bit surprised not to see mention of Grand Central Dispatch (by Apple, but open-sourced and ported to FreeBSD). It dynamically recompiles (using LLVM's JIT facilities) the OpenCL kernels to make them run on whatever "processor" is available – so I guess any "processor" targetable by LLVM could be used?

"Make them run" is one thing, "make them run fast" is another… OpenCL is portable but its performance is very non-portable, unlike the performance of portable general-purpose programming languages. What I was saying was, if an OpenCL target got popular and another OpenCL target with a significantly different architecture showed up, then that latter target would get very limited leverage from being able to run existing OpenCL software, because said software would have been optimized for the former, older target and perform badly on the new one.

Well, autovectorization helps but doesn't always work, and then you have issues with things like how much local memory you have, the performance characteristics of accessing it, how thread occupancy impacts performance, how external memory is accessed and what the perf characteristics are there, etc.

If having a JIT gave OpenCL performance portability, then, say, Google wouldn't shy away from exposing OpenCL drivers in Android. The reason they don't want people to use OpenCL very much is that they fear one target will overtake the market and other targets won't be able to compete because their arch is too different; at least that's how I look at it… (Perhaps things have changed here since the last time I checked and Google now promotes OpenCL vigorously with lively competition taking place between hardware vendors… if so, my point of view is outdated.)