Adapteva's $100 Parallella Supercomputer Platform Now Shipping

A new $100 supercomputer will facilitate a wealth of compute-intensive applications: medical, automotive, and industrial control; machine vision... the list goes on!

Way back in the mists of time we used to call 2010, I was introduced to a guy called Andreas Olofsson. As I wrote in my column From RTL to GDSII in Just Six Weeks!, Andreas had left his job, formed a company called Adapteva, and -- working in his basement and living off his pension fund -- single-handedly invented a new, ultra-low-power computer architecture.

Andreas went on to design his own System-on-Chip (SoC) from the ground up. In fact, he took the first version of this device -- called the Epiphany -- all the way to working silicon and a packaged prototype.

About two years later, in October 2012, Andreas contacted me to say that he and his colleagues had launched a Kickstarter campaign with the mission to create "a personal supercomputer for only $100!" By this time, there were two versions of the Epiphany -- the Epiphany-III and the Epiphany-IV. Both of these devices contain an array of processor cores, each of which is equipped with its own local memory and a single-precision floating-point engine.

The Epiphany-III (implemented at the 65nm node) boasts an array of 16 processors, while the Epiphany-IV (implemented at the 28nm node) features an array of 64 processors.

Everything on the Epiphany is designed to offer optimum performance while consuming as little power as possible. For example, when operating at peak performance, the Epiphany-IV provides 100 Gflops of raw computing power while consuming only 2W. This means that, at 50 Gflops/W, the Epiphany-IV is 50 to 100X more energy-efficient than competing solutions.
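Those headline numbers are easy to sanity-check. Assuming a core clock of around 800 MHz and one fused multiply-add (two floating-point operations) per core per cycle -- figures that are my assumptions here, not taken from Adapteva's announcement -- the arithmetic falls out as follows:

```python
# Back-of-the-envelope check of the Epiphany-IV's quoted numbers.
# Assumed (not stated in the article): 800 MHz clock, one fused
# multiply-add (2 floating-point ops) per core per cycle.
CORES = 64
CLOCK_HZ = 800e6
FLOPS_PER_CYCLE = 2          # one single-precision FMA = 2 ops

peak_gflops = CORES * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9
power_watts = 2.0            # quoted worst-case power for the chip

print(f"Peak: {peak_gflops:.1f} Gflops")                        # ~102.4
print(f"Efficiency: {peak_gflops / power_watts:.1f} Gflops/W")  # ~51.2
```

Under those assumptions the chip lands at roughly 102 Gflops peak and just over 50 Gflops/W, which lines up with the round numbers Adapteva quotes.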

Adapteva's supercomputer platform, which is called the Parallella, is based on the combination of an Epiphany multi-core processor with a Zynq All Programmable SoC from Xilinx. In fact, there are going to be two versions of this little beauty -- one based on an Epiphany E16 (16 cores), and one equipped with an Epiphany E64 (64 cores). Even when running flat out, a Parallella equipped with an Epiphany E64 will consume as little as 5W!

The credit-card-sized Parallella supercomputer is based on the combination of a Zynq All Programmable SoC from Xilinx and an Epiphany multi-core processor from Adapteva.

Four-board Parallella stack with Ethernet and power connectors.

Laptop connected to a 42-board, 756-CPU Parallella cluster, which consumes less than 500W!

For the past few weeks over on All Programmable Planet, people have been asking "Does anyone have any news about the Parallella?" I must admit that I've been pretty excited to hear what's going on myself, because I made a $99 pledge on the Kickstarter project, and since then, I've been eagerly looking forward to seeing my Parallella "in the flesh" as it were.

I know that initial prototypes were shipped to major backers back in December of 2012, but since that time, everything seemed to fall strangely quiet. Well, I just heard that the folks at Adapteva have started shipping early "Beta" boards to Kickstarter backers of the ROLF, 64-CORE-PLUS, and DEVELOPER support levels.

My understanding is that the folks at Adapteva are still planning on making a few more "tweaks" and refinements, after which they will ship the remaining 6,300 Parallellas ordered via Kickstarter -- all of these boards are expected to ship by the end of the summer.

Ships with free, open-source Epiphany development tools, including a C compiler, a multicore debugger, an Eclipse IDE, an OpenCL SDK/compiler, and run-time libraries.

Dimensions are 3.4” x 2.1”
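Those tools target a conventional C programming model: the same kernel runs on every core, each working on its own slice of the data. Here's a sketch of that host-side partitioning step (plain illustrative Python -- this is not the Epiphany SDK's actual API, and `partition` is a hypothetical helper):

```python
# Sketch of how a data-parallel job is typically split across a 16-core
# grid like the Epiphany E16. Illustrative only, NOT the Epiphany SDK;
# the real tools expose this model through C and the runtime libraries.
CORES = 16

def partition(n_elements, n_cores):
    """Return (start, end) index ranges, one per core, covering all data."""
    base, extra = divmod(n_elements, n_cores)
    ranges, start = [], 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# Each core would run the same kernel over its own slice of the data.
data = list(range(1000))
slices = partition(len(data), CORES)
partial_sums = [sum(data[s:e]) for (s, e) in slices]
print(sum(partial_sums))  # same answer as a single-core sum: 499500
```

The point of the sketch is simply that the per-core program stays identical; only the index range changes, which is what makes this style of hardware approachable from ordinary C.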

But wait, there's more, because Adapteva is now taking pre-orders for the 16-core Parallella platform from the general public. Parallella boards will be available in different build configurations with a starting price of $99, and these "general availability" orders will ship later this fall (click here for more details).

So, now I'm awaiting the arrival of my very own personal supercomputer (the only problem is that I am not a patient man). How about you, did you sign up on the Kickstarter campaign and order one of these little beauties? If so, what do you plan to do with it when it arrives?

@jb0070, I asked about the killer app and you enumerated a list -- radar, image, audio, etc. -- which many other technologies are also targeting. To make it commercially viable in the long term, you need to sort out issues of programmer productivity, maintainability, and cost, and I can't see why this particular technology will succeed where others have flopped.

If you want to run large programs that are not data-flow, then you have other issues, and an array of coupled processors is not going to be an appropriate solution for that problem set. For the "typical" complex non-data-flow program, one would need to write the usual threaded software, which is NOT simple -- on that I think we all agree. Code that is highly branched, and/or that requires intra-process communication, has synchronization issues, and that leads to processor stalls... unless one manages to scale the various threads and execution paths to all cycle together. Not a task for the faint-hearted. Code that does a lot of task-switching is not good for arrayed processors, either.

But physics and engineering matrix-decomposition (etc.) type programs are good arrayed-processor problems. Not every tool is appropriate to every problem. Arrayed processor sets are good at high-data-rate, repetitive processing.
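A concrete instance of the kind of kernel being described here -- regular, repetitive, branch-free -- is a dense matrix-vector multiply. The sketch below (illustrative Python, not code for any particular array processor) shows why it maps well: every row runs an identical, data-independent inner loop, so rows can simply be dealt out across cores:

```python
# The kind of kernel that maps well onto an arrayed processor: a dense
# matrix-vector multiply. The inner loop is identical for every row, has
# no data-dependent branches, and streams operands in a fixed pattern,
# so every core can run it in lockstep on its own block of rows.
def matvec(A, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1, 2], [3, 4], [5, 6]]
x = [10, 1]
print(matvec(A, x))  # [12, 34, 56]
```

Contrast this with task-switching or heavily branched code: there is no fixed operand stream to exploit, which is exactly the distinction the comment is drawing.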

For task-switching code, a processor with larger register files and caches would be better suited. As for parallelizing huge, unwieldy, non-threaded code... well, there is little hope for that, save perhaps massive rewriting. In the engineering world that would probably include both Verilog/VHDL themselves and the code they generate, for instance.

The fun of engineering is knowing which tool to use for which problem... and/or making a new tool to do something that otherwise was not tractable. This board is decently fast, cheap, and capable for data-processing problems. So, for that application set, things are pretty simple, given this new solution.

For what it is worth, I designed a similar arrayed-processor chip family and studied (in some detail) which applications it provided a good solution space for, and where it was not going to be useful. The biggest hurdle is getting the software cycle-synced across all of the processors, and that issue was addressed through a software simulator. This required an assembly-code level of attention to detail, but once coded, each "program element" could be stitched in from higher levels of simulation, such as the Ptolemy program out of Berkeley. Of the various types of data-flow programs, FFTs presented the worst issues. It was NOT something anyone would ever want to use to run (say) Linux. Nor would it be worthwhile for small SPICE simulations. But work through a large SPICE sim, or other large-data-set data-flow program, and it was faster and lower power than microprocessors.

Typical microprocessors are built to allow high levels of branching in the instruction set, including "multi-processing-in-time". Parallel processing arrays are best used as data processors in a data-flow arrangement - processing data as it arrives, in real time.

Peak-performance issues come from (1) branching in the processing code and (2) I/O bottlenecks. For a "data-flow" machine, the code does (or should) NOT branch: each processor performs the same calculations in perpetuity. (The results of any given processor might be ignored, i.e., scaled to zero, but the calculations are constantly done.) If the parallel-processing machine is I/O limited, then the processors will starve, just like any other processor (i.e., bad design).
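The "scaled to zero" idea is ordinary predication: rather than branching, every element follows the same instruction path, and a 0/1 mask kills the unwanted result. A minimal sketch (illustrative Python; a real data-flow machine would do this with multiplies in hardware, not an interpreter loop):

```python
# "Results scaled to zero" in practice: predication instead of branching.
# Every element takes the same instruction path; a 0/1 mask selects which
# of the two precomputed results survives, so there are no branch stalls.
def clamp_branchy(xs, limit):
    return [limit if x > limit else x for x in xs]  # data-dependent branch

def clamp_predicated(xs, limit):
    out = []
    for x in xs:
        over = int(x > limit)                       # mask: 1 if over limit, else 0
        out.append(over * limit + (1 - over) * x)   # both paths computed
    return out

data = [3, 12, 7, 25, 1]
print(clamp_predicated(data, 10))  # [3, 10, 7, 10, 1]
```

Both versions give identical answers; the predicated one just does a constant amount of work per element, which is what keeps a lockstep array of processors cycling together.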

In other words, the peak performance mentioned would NOT be a "tiny/small fraction of that," or else the system is (1) overkill, (2) not implemented well, or (3) not suited to a parallel process. For data-rate processing systems, a well-engineered parallel array would be humming along at, or near, the peak processing rate possible. [One would likely NOT run a Ferrari on a school-bus route or a mail route (with lots of stops and turns); that Ferrari would be best suited to screaming along the Autobahn, pedal to the metal.]

Killer app? Consider vision/speech/radar/sonar/neural... etc... systems where large amounts of data are constantly arriving, being processed, and sent along the processing pathways.

=== OR ===

How about a real-time gaming system tied to a live (American) football broadcast/Madden-style game (coach's view), linked to your Kinect, where you get the QB's (or running back's) view and have to "make the play"? And at NFL or college level, real-time speeds. [Use the Kinect for player movement, and even for judging the throw itself.] Or, if not for the masses, how about making that system for the players? Baseball from the batter's view might work well, too! There is lots of tape of great pitchers -- Sandy Koufax, Randy Johnson, etc. Could you hit their stuff?

Great product! Just need to "find" the money to buy one, if not the next 6000 (or so!).

I wonder how the $100 cost figure was obtained. In a real-world setup, development time, maintenance, and support costs would all be added to arrive at the market price. Is that the case with the $100 figure?

As for performance, the figure given is peak performance; real benchmark performance could be a small fraction of that. Have any benchmarks been conducted on this platform?

Finally, what is the killer app that would make it likely to be continuously developed for state-of-the-art fabrication nodes?

I suppose you're right, Rich. And at this price, what's not to like? I wonder if there are plans to consumerize this... Seems like it would be a catchy sales pitch: Why settle for an ordinary computer when you could have a supercomputer for less?