Chipmaker gets the $750,000 it needs to build a tiny new computing platform.

A month ago, we told you about a chipmaker called Adapteva that turned to Kickstarter in a bid to build a new platform that would be the size of a Raspberry Pi and an alternative to expensive parallel computing platforms. Adapteva needed at least $750,000 to build what it is calling "Parallella"—and it has hit the goal.

Today is the Kickstarter deadline, and the project is up to more than $830,000 with a few hours to go. (UPDATE: The fundraiser hit $898,921 when time expired.) As a result, Adapteva will build 16-core boards capable of 26 gigaflops of performance, costing $99 each. The board uses RISC cores, each capable of running at 1GHz. There is also a dual-core ARM A9-based system-on-chip, with the 16-core RISC chip acting as a coprocessor to speed up tasks.

Adapteva is well short of its stretch goal of $3 million, which would have resulted in a 64-core board hitting 90 gigaflops and built using a more expensive 28-nanometer process rather than the 65-nanometer process used for the base model. The 64-core board would have cost $199.

A total of 4,965 people backed the project. The vast majority of them pledged at least $99, meaning they'll receive one of the 16-core boards, which are scheduled to ship between February and May 2013.

Adapteva is calling Parallella a "supercomputer," although the vaguely defined term is usually applied to the types of large clusters used by government labs and research organizations, rather than computers that can fit right under your monitor. Parallella boards can be clustered together to hit higher levels of performance. Alternatively, the board could be used for simpler tasks, like turning your TV into a home computer. But Adapteva's main targets for initial sales, CEO and founder Andreas Olofsson told us last month when the Kickstarter went live, are hobbyists and developers, "the guy who is working on an open source project and there's no platform they can use today that fits their needs."

There's this little thing called parallel computing, which generally uses large computing clusters to analyse data or make mass calculations, and therefore greatly benefits from cheaper hardware. This practice has many uses such as... performing research to cure cancer.

We don't really neeeeeed to cure cancer, but it would be kinda nice.

There are radically higher-performance parallel processing platforms already available off the shelf to regular developers and hobbyists. A $99 discrete GPU has dramatically higher throughput. Just the four Cortex-A9s in any quad-core phone SoC can match the claimed theoretical 26 GFLOPs.

All of this is not to say that those are ideal parallel computing platforms for any particular workload, either. But they surpass the Parallella offering on nearly every axis, they are already available off the shelf by the million, and they are well supported by many existing operating systems and toolchains.

Most concerning, though, is that the Adapteva claims almost entirely fail to address the actually hard part of a high-performance parallel architecture: the memory system. Jamming a pile of general-purpose scalar cores on a die with a grid topology is the easy part. Just die-shrinking the original MIT RAW prototype design to 65nm would give a very similar device, now a decade later. And existing designs have shown that grid topologies aren't even necessarily a good idea; RAW/Tilera argue in their favor; Intel, NVIDIA, and AMD all strongly argue against, with Xeon Phi née Larrabee's ring of rings shown to be a strong choice at the dozens-to-hundreds of cores scale both in area overhead and latency. But regardless, how do they expect a few dozen, let alone hundreds or thousands of independent cores pounding on their own little subproblems to saturate a wide DRAM interface, where peak bandwidth is only reached by a few extremely wide transactions to even fewer memory pages? I don't mean to imply that it's not possible, just that their "virtual ghz" multipliers, and total lack of discussion of the memory system or of how they expect to do better than the mass of related architectures and prior work, don't suggest that this is likely a grand solution to the challenge of building an efficient parallel architecture, or even competitive with existing commodity hardware in the same price (GeForce GT 640) and power (Tegra 3, Snapdragon S4) range.
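
To make the bandwidth worry concrete, here is a back-of-envelope sketch; every number in it is an illustrative assumption (the DRAM figure, the bytes-per-flop ratio, the 90-gigaflop target), not a published Parallella spec.

```python
# Back-of-envelope version of the DRAM-saturation concern above.
# Every number here is an illustrative assumption, not a Parallella spec.
dram_peak_gb_s = 6.4       # assumed peak for a narrow mobile-class DRAM interface
bytes_per_flop = 0.5       # assumed DRAM traffic per flop when data misses local stores
gflops_target = 90.0       # the 64-core stretch-goal figure

demand_gb_s = gflops_target * bytes_per_flop          # 45 GB/s of traffic wanted
dram_bound_gflops = dram_peak_gb_s / bytes_per_flop   # 12.8 gigaflops if DRAM-bound

print(f"traffic wanted: {demand_gb_s:.1f} GB/s vs. {dram_peak_gb_s} GB/s available")
print(f"DRAM-bound ceiling: {dram_bound_gflops:.1f} gigaflops")
# Many independent cores issuing small scattered accesses typically reach only a
# fraction of even that peak, which is the commenter's point: the on-chip grid is
# the easy part, keeping it fed is not.
```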

Yup, I love to see a project like this succeed. I don't think anyone involved in HPC is going to get confused as to the suitability of this product. The people who bought this are of the same mindset as the people who built it.

> Today is the Kickstarter deadline, and the project is up to more than $830,000 with a few hours to go.

I work in the high-tech industry in a hardware company (but I work on the software side). $830,000 may sound like a lot of money... but it is not for this kind of project. Not for design, V&V, compliance testing, manufacturing, software support...

> which are scheduled to ship between February and May 2013.

I am also willing to bet that this thing won't ship on time (if ever). Any takers?

Maybe I'm too cynical and pessimistic, but shipping a complex hardware product on time is very hard. The fact that this company cannot even fund its own R&D and had to go to Kickstarter doesn't help my confidence levels.

> There's this little thing called parallel computing, which generally uses large computing clusters to analyse data or make mass calculations, and therefore greatly benefits from cheaper hardware. This practice has many uses such as... performing research to cure cancer.
>
> We don't really neeeeeed to cure cancer, but it would be kinda nice.

Unfortunately, this only does single precision, so it is unlikely to ever see use for that type of research... It will probably see use mainly among hobbyists for that very reason.
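
For readers wondering why single precision is a deal-breaker for that kind of research, here is a minimal illustration of the error growth, run on an ordinary desktop with NumPy (the workload is a toy, not a scientific code):

```python
import numpy as np

# Accumulate a value that is not exactly representable in binary floating point.
# Single precision drifts visibly; double precision stays essentially exact.
acc32 = np.float32(0.0)
acc64 = np.float64(0.0)
for _ in range(1_000_000):
    acc32 += np.float32(0.1)
    acc64 += np.float64(0.1)

print(f"float32 total: {float(acc32):.2f}")   # noticeably off from 100000.00
print(f"float64 total: {float(acc64):.2f}")   # ~100000.00
# Long-running simulations compound exactly this kind of error, which is why
# double precision (and full IEEE 754 behaviour) matters for research codes.
```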

> Maybe I'm too cynical and pessimistic, but shipping a complex hardware product on time is very hard. The fact that this company cannot even fund its own R&D and had to go to Kickstarter doesn't help my confidence levels.

At small volumes you are looking at traditional developer pricing for such boards; think $10,000. Thus, with the Kickstarter campaign they are raising money to make a volume run, which will bring the price of the chips down to (IIRC) under $5. This brings the total cost of the board down enough to sell them for $99.

Your sentence seems more like a jab than just an opinion from a pessimistic perspective.

People invest for the cuteness, I think, rather than real parallel programming potential. It will be able to calculate pi or count primes or any other parallel proof-of-concept. Scaling it up to process a serious parallel workload would demonstrate why a real supercomputer is needed for HPC applications.

This isn't going to help open source developers much at all, not for HPC at least. Fortunately there are platforms to assist open source developers in this field that CEO Andreas Olofsson is not aware of.

Yessss... parallel computing is so awesome on 16 single-precision 500MHz cores with each core having 32KB of RAM! Certainly, I hope they are searching for the cure for cancer on some Tesla cards, you know, the ones that have 2,060 GFLOPS for double precision and 5,152 GFLOPS for single precision. If they wait for Epiphany IIIs at 1GHz, they might never find it.

They did promise the Epiphany III at 1GHz... except they have not hit any yields producing 1GHz chips yet. The Epiphany IV can barely hit 800MHz at 28nm; the Epiphany III is 65nm, 500MHz.

> There are radically higher-performance parallel processing platforms already available off the shelf to regular developers and hobbyists. A $99 discrete GPU has dramatically higher throughput. Just the four Cortex-A9s in any quad-core phone SoC can match the claimed theoretical 26 GFLOPs.

It is a bit different comparing single boards to other single devices when talking about a highly scalable product.

Guys, seriously - stop posting stories about things getting Kickstarter funding. It's not news. It doesn't mean anything. All it means is "someone wants to do something, and some people want them to do that thing." It doesn't mean this thing is any closer to getting made.

> Guys, seriously - stop posting stories about things getting Kickstarter funding. It's not news. It doesn't mean anything. All it means is "someone wants to do something, and some people want them to do that thing." It doesn't mean this thing is any closer to getting made.

Guys, please continue reporting on cool projects. It's good to hear about people who dare to dream, and pursue those dreams.

> Yup, I love to see a project like this succeed. I don't think anyone involved in HPC is going to get confused as to the suitability of this product. The people who bought this are of the same mindset as the people who built it.

> There are radically higher-performance parallel processing platforms already available off the shelf to regular developers and hobbyists. A $99 discrete GPU has dramatically higher throughput. [...]

I stand corrected... invoking as good an excuse as any: I was a little drunk while posting!

Though that post was a more emotional response, I do think anything that makes technology more accessible to people can only be a good thing, even if these devices only end up in a college lab being used to teach students parallel computing principles.

> Just the four Cortex-A9s in any quad-core phone SoC can match the claimed theoretical 26 GFLOPs.

Actually, no. 4 Cortex-A9s at 1.3 GHz (the most common configuration, Tegra 3 T30) are only capable of about 2.6 GFLOPs of theoretical double-precision FP performance per the ARM whitepaper (2 cycles for one DP FP mult or add, 1.3 GHz × 4 cores). 26 would be about right for the whole SoC if you included the GPU, but technically "just the four Cortex-A9s" can't come anywhere near 26 GFLOPs.
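
For anyone checking that arithmetic, peak throughput is just cores × clock × floating-point operations per cycle; the per-cycle rates below come from the comment above, and the Epiphany figure is back-solved from the claimed 26 GFLOPs, so treat both as assumptions rather than published numbers.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    # Theoretical peak: cores x clock (GHz) x FP operations issued per cycle.
    return cores * clock_ghz * flops_per_cycle

# Quad Cortex-A9 at 1.3 GHz, one DP multiply or add every 2 cycles (0.5 flops/cycle).
print(peak_gflops(4, 1.3, 0.5))      # ~2.6 DP gigaflops
# 16 Epiphany cores at 1 GHz would need ~1.6 flops/cycle/core to reach 26 GFLOPs.
print(peak_gflops(16, 1.0, 1.625))   # ~26.0
```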

> Just the four Cortex-A9s in any quad-core phone SoC can match the claimed theoretical 26 GFLOPs.
>
> Actually, no. 4 Cortex-A9s at 1.3 GHz (the most common configuration, Tegra 3 T30) are only capable of about 2.6 GFLOPs of theoretical double-precision FP performance per the ARM whitepaper. [...]

Last time I checked, my i5 2500k did 52 GFlops with Linpack. Sure, it is more expensive per GFlop, and consumes more power, but it shows that this type of computing power is already readily available. And that isn't even considering GPUs.

> There are radically higher-performance parallel processing platforms already available off the shelf to regular developers and hobbyists. A $99 discrete GPU has dramatically higher throughput. [...]

While you are entirely correct, I think you are missing the point.

While it's true that a GPU does indeed easily exceed the performance of the Parallella device, it also radically exceeds the *power draw* of that device.

It would be far more interesting to look at the two from an MFLOP/watt perspective, rather than simply looking at cost.

While it's not likely to set the supercomputing world on fire, I think that if they can get the cost down enough, it may have some interesting applications in embedded/portable computer-vision systems. They have published a few examples that use the Parallella to accelerate things like face recognition.

> Just the four Cortex-A9s in any quad-core phone SoC can match the claimed theoretical 26 GFLOPs.
>
> Actually, no. 4 Cortex-A9s at 1.3 GHz (the most common configuration, Tegra 3 T30) are only capable of about 2.6 GFLOPs of theoretical double-precision FP performance per the ARM whitepaper. [...]
>
> Last time I checked, my i5 2500k did 52 GFlops with Linpack. Sure, it is more expensive per GFlop, and consumes more power, but it shows that this type of computing power is already readily available. And that isn't even considering GPUs.

I didn't get the impression that this specific device is being touted for sheer processing power, or that there is any intention that folks will fold enough proteins to cure cancer on this thing. Of course bigger and better machines exist - for orders of magnitude more cost.

Having a largish (i.e., more than a quad-core ARM) cluster in such a cheap package may well enable a long tail that eventually leads to Big Things.

I see folks who wouldn't otherwise have access to supercomputers playing with this. Academics or someone hacking in their parents' basement could come up with some interesting compiler or runtime technique which makes massively parallel devices easier to program. Eventually those advances would percolate up to the real big iron.

Or universities which can now afford to have one of these devices on every desk of a Distributed Systems class. No more simulating or making do with a pthreads environment, and the next generation of engineers comes all the better for it.

I'd want one to play with for the fun of it, just as SPU programming was fun when PS3 Linux was around. For someone who codes for pleasure, it's another style of programming different from desktop or vanilla embedded, which is interesting - I'll probably pick one up for $99 when they become available, no risk...

> Most concerning, though, is that the Adapteva claims almost entirely fail to address the actually hard part of a high-performance parallel architecture: the memory system. [...] But regardless, how do they expect a few dozen, let alone hundreds or thousands of independent cores pounding on their own little subproblems to saturate a wide DRAM interface, where peak bandwidth is only reached by a few extremely wide transactions to even fewer memory pages? [...]

I'm only up to page 13 of the Epiphany architecture reference (the MIT RAW one is next on my reading list), but so far it seems that the memory model shouldn't require all the cores banging on external DRAM and competing for bandwidth. Of course one could write a system that does just that, but it wouldn't be the recommended approach. Each core has a little bit of local memory with guaranteed bandwidth, and can also directly access other cores' local store either with a dereference, or more efficiently by DMA'ing a chunk in.
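
A rough sketch of the programming style that description implies: work out of the local store and pull data in as explicit bulk transfers, instead of every core issuing its own loads to DRAM. This is illustrative Python, not the Epiphany SDK; the names, sizes, and the dma_read stand-in are all invented for the example.

```python
# Toy model of a scratchpad-plus-DMA execution style: "dram" is the big, slow,
# shared memory; each core streams tiles into a small local buffer and computes
# entirely on that buffer. Nothing here is a real Epiphany API.
TILE = 1024  # words per transfer; a real local store would be ~32 KB total

def dma_read(dram, offset, n):
    """Stand-in for a bulk DMA transfer into a core's local memory."""
    return dram[offset:offset + n]

def core_kernel(dram, start, end):
    """One core's share of the work: sum of squares over [start, end)."""
    total = 0.0
    for off in range(start, end, TILE):
        local = dma_read(dram, off, min(TILE, end - off))  # one wide transfer
        total += sum(x * x for x in local)                  # compute from local memory
    return total

dram = [0.001 * i for i in range(65536)]
chunk = len(dram) // 4  # pretend we have four cores, each owning a contiguous slice
print(sum(core_kernel(dram, c * chunk, (c + 1) * chunk) for c in range(4)))
```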

I feel this will likely become a rather standard model as core counts go up, precisely to address the limitations of external interfaces you mentioned. Sony/Toshiba/IBM had a similar inkling with the CELL, except Epiphany's memory seems more flexible, in that DMA is not the *only* method of getting data in/out of the local store.

Did Larrabee cores have their own little pool of memory, or were they always required to hit external DRAM?

> Most concerning, though, is that the Adapteva claims almost entirely fail to address the actually hard part of a high-performance parallel architecture: the memory system.

I contacted the project authors with regard to this issue before the Kickstarter funding window closed. They told me they knew what they were doing (I hope so), but without providing more details, unfortunately. Nonetheless, I'm happy they got their funding.

> Just the four Cortex-A9s in any quad-core phone SoC can match the claimed theoretical 26 GFLOPs.
>
> Actually, no. 4 Cortex-A9s at 1.3 GHz (the most common configuration, Tegra 3 T30) are only capable of about 2.6 GFLOPs of theoretical double-precision FP performance per the ARM whitepaper. [...]

Yes. However it's worth noting the Parallella cores only do single-precision, and are in fact not fully IEEE 754 compliant.

The GeForce GTX 690 does roughly 5600 Gigaflops, at about 18.74 Gigaflops per Watt (according to Wikipedia). If that 16-core Parallella system takes just 1.5 Watts to produce those 26 Gigaflops, then the GTX 690 is more efficient for the task at hand (and a hell of a lot more practical).
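
Worked out with those figures (the 1.5 W number for the Parallella board is the commenter's assumption, not a measured spec):

```python
# Perf-per-watt comparison using the numbers quoted above.
gtx690_gflops = 5600.0
gtx690_gflops_per_watt = 18.74                    # per the comment / Wikipedia figure
parallella_gflops, parallella_watts = 26.0, 1.5   # the wattage is an assumption

print(f"GTX 690:    {gtx690_gflops_per_watt:.1f} gigaflops/W "
      f"(~{gtx690_gflops / gtx690_gflops_per_watt:.0f} W board power)")
print(f"Parallella: {parallella_gflops / parallella_watts:.1f} gigaflops/W")
# ~18.7 vs ~17.3 gigaflops/W: on these assumptions the GPU is slightly more
# efficient as well as vastly faster in absolute terms.
```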

This isn't aimed at HPC but at low-power, space-constrained applications; you can't stick a GPU plus support systems into a USB-dongle-sized device (e.g., an SDR, a software-defined radio receiver), a parking radar, an audio effects pedal, or another small box.

They are using a Xilinx Zynq chip (an FPGA with two ARM A9 hard cores at up to 1GHz) to run the OS and peripherals, with their own chip doing the low-power number crunching.

Their board is basically a cut-down ZedBoard, but with their chip on it as well. Cheap way to get a Zynq board.

XMOS makes similar chips and seems to have picked up a few design wins in USB and audio peripherals. One of the guys behind XMOS (their CTO), David May, is one of the original Transputer guys.

> There's this little thing called parallel computing, which generally uses large computing clusters to analyse data or make mass calculations, and therefore greatly benefits from cheaper hardware. This practice has many uses such as... performing research to cure cancer.
>
> We don't really neeeeeed to cure cancer, but it would be kinda nice.

Holy strawman.

The "thing" in super-computing is currently GPU processing, and for good reasons, and this rasberry pi doesn't look in any way fit to supplement let alone replace that.

$99 for 26 gigaflops, or even the 90 gigaflops goal that couldn't be reached, pales in comparison to the teraflops offered by Nvidia or AMD GPUs at price points from $100 to $500.

Considering the numbers I don't see mobile hardware having any real place in supercomputing, not that it should even have to be said.

The Epiphany architecture has a number of potential advantages over a GPU:

1. More flexible programming model - each core is an independent execution unit. There may be problems that work well on Epiphany that don't work so well on a GPU. (Not to say that every problem will fall into this category - some problems will work well with wide vectors on GPUs, for which the extra instruction units on the Epiphany will be unneeded overhead.)

2. More scalable (both up and down). GPUs may have high performance and good Perf/Watt, but the high-end versions with these specs only come in discrete 200W units; there is no 2W option. Epiphany can gluelessly connect multiple chips into a larger grid. (Connecting multiple 16-core chips is probably not so interesting, but hooking up multiple 64-core chips would be - 11 of them would give 1 TFlop at 22W; the arithmetic is sketched after this comment. Though I don't know what the off-chip connections do to performance and power.)

3. Openness of the architecture and availability of programming tools. Often GPUs have the programming specs closed, and you're at the mercy of the GPU vendor as to the availability of an OpenCL compiler.

4. GPUs are for graphics processing first; computation plays a secondary role.

Also note that the path to the next milestone of performance (exaflop) is dominated heavily by power efficiency. See the Mont Blanc project (http://www.montblanc-project.eu/), where they are exploring ARM chips as elements of supercomputers (so mobile hardware could definitely have its place). The Adapteva chips are even more power efficient than ARM chips. Now, the current generation of their chips is not really suitable for large-scale computation, but future generations certainly could be.
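
The multi-chip arithmetic from point 2, spelled out (the chip-level numbers are the hoped-for 64-core figures and an assumed ~2 W per chip; interconnect overhead and the off-chip link costs the commenter mentions are ignored, so treat it as an upper bound):

```python
# Glueless scaling estimate: eleven 64-core chips at the targeted 90 gigaflops
# and an assumed ~2 W each. Off-chip link overhead is ignored.
chips = 11
gflops_per_chip = 90.0
watts_per_chip = 2.0

total_gflops = chips * gflops_per_chip   # ~990, i.e. roughly 1 TFlop
total_watts = chips * watts_per_chip     # ~22 W
print(f"{total_gflops:.0f} gigaflops at ~{total_watts:.0f} W "
      f"({total_gflops / total_watts:.0f} gigaflops/W)")
```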

This might be a bit off-topic, but do the initial seeders of a Kickstarter project become co-investors/shareholders? If a project takes off and becomes profitable, do they have "shares" that gain in value? I know the main reason folks seed is to get a product/service they're interested in. But if they also profited from it, it would seem like even more incentive for folks to put their money where their mouth is.

> This might be a bit off-topic, but do the initial seeders of a Kickstarter project become co-investors/shareholders? If a project takes off and becomes profitable, do they have "shares" that gain in value? [...]

No, it's patronage, not buying stock.

And with this project, these guys already had millions in VC funding, so this Kickstarter is really milking the cow, so to speak.

> This might be a bit off-topic, but do the initial seeders of a Kickstarter project become co-investors/shareholders? If a project takes off and becomes profitable, do they have "shares" that gain in value?

In the US it still isn't legal for a service like Kickstarter to offer actual investments. So you don't become a stockholder, nor an "investor" in the usual sense of the term. The only concrete thing you can expect to get out of backing a project is whatever donation gift is offered at your funding level; the people who post the project are obliged to fulfill those, at least. What you're doing is saying, "This is a good idea, the world needs more stuff like this. I'm willing to give these people my money to make this real."

Also, comparing this hardware to a GPU or an i7 seems to be missing the point. A $99 board seems like a good way to test how well your parallel program scales across many cores instead of 4-8, or to prototype something that can be made into a smaller consumer device in the future. Programming for a GPGPU isn't really the same thing.
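
On the scaling point: Amdahl's law is the quickest way to see why behaviour at 4-8 cores tells you little about 16 or 64. A small sketch (the serial fractions are arbitrary examples):

```python
# Amdahl's law: speedup on n cores = 1 / (s + (1 - s) / n), where s is the
# fraction of the program that stays serial. The fractions below are examples.
def speedup(serial_fraction, cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for s in (0.05, 0.10):
    row = ", ".join(f"{n} cores: {speedup(s, n):.1f}x" for n in (4, 8, 16, 64))
    print(f"serial {s:.0%} -> {row}")
```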

Because of Kickstarter, something like this becomes available to the public much sooner. Not saying that it is the greatest thing since sliced bread; that is clearly chocolate bacon. It's nice to see things like this and 3D printers coming to the public a whole lot sooner and showing the entrepreneurial spirit growing and thriving!