Epiphany-V: A 1024-core 64-bit RISC processor

I am happy to report that we have successfully taped out a 1024-core Epiphany-V RISC processor chip at 16nm. The chip has 4.5 billion transistors, 36% more than Apple’s latest quad-core A10 processor at roughly the same die size. Compared to leading HPC processors, the chip demonstrates an 80x advantage in processor density and a 3.6x advantage in memory density.

Chips will come back from TSMC in 4-5 months. We will not disclose final power and frequency numbers until silicon returns, but based on simulations we can confirm that they should be in line with the 64-core Epiphany-IV chip, adjusted for process shrink, core count, and feature changes. For more information, see the report below:

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Doubling the local SRAM from 32KB to 64KB per core is a really good thing. But I also hope bigger programs/kernels can be loaded into a few neighbours and used as a “mini cluster” of semi-local memory. It would be nice to have a separate network for that (other than the one used for off-chip communication).

Assuming 1GHz for the IO clock, 192 bytes/cycle translates to 192GB/s total external bandwidth, or about 187MB/s per core. In big systems I would expect half of the IO to be used for memory and half for interconnecting other CPUs, giving roughly 94MB/s per core on average. That isn’t very high for memory-intensive applications (like machine learning, for example).
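The arithmetic in that comment can be sketched in a few lines. Note the inputs are the commenter's assumptions (1GHz IO clock, 192 bytes per cycle across all links), not official figures:

```python
# Back-of-the-envelope bandwidth estimate, using the commenter's assumptions
# (1 GHz IO clock, 192 bytes/cycle total off-chip, 1024 cores) -- not
# official Adapteva numbers.
io_clock_hz = 1_000_000_000
bytes_per_cycle = 192
cores = 1024

total_bw_gbs = io_clock_hz * bytes_per_cycle / 1e9   # aggregate off-chip GB/s
per_core_mbs = total_bw_gbs * 1000 / cores           # MB/s per core
memory_share_mbs = per_core_mbs / 2                  # if half the IO goes to memory

print(total_bw_gbs)       # 192.0 GB/s
print(per_core_mbs)       # 187.5 MB/s
print(memory_share_mbs)   # 93.75 MB/s
```

For comparison, a contemporary GPU offers hundreds of GB/s of memory bandwidth per device, which is why the commenter flags this as a concern for memory-bound workloads.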

Congratulations! Will you continue to provide an FPGA alongside your lovely little chips?

You may have been a bit too forward-thinking with the Parallella, but it seems that the HPC community, and even big data, is catching on to the fact that some of their algorithms can be fundamentally more efficient on a slow FPGA than on a fast CPU/GPU.

I am backer #2262. When I received my cluster, I was expecting to be able to buy the 64-core variant later on. And so I have waited for over two years. Now that you’ve announced the 1024-core Epiphany V, I look on with enthusiasm. However, since I can’t buy the 64-core variant, I’m sure I won’t be able to purchase this.

How do I feel? I am on the outside of the candy store looking in. Back in 2007, Intel showed off an 80-core CPU. And I thought to myself, it would be so great to work with that. After all, I had worked with the Celerity, the first minicomputer to use a 32-bit microprocessor (two processor boards, each with two FPUs and an integer coprocessor). And so here comes the Parallella, and I thought to myself, “Yeah! This will be fun!” And … No interconnects. No 64-core, except for the initial Kickstarter. Supercomputer.io came and went.

Yes, I have Pi’s, I have the Nvidia Jetson TK1, and I have other small strange things. I’ve supported Seti@Home and BOINC. Even though over 10,000 boards have shipped (how many now?), it’s still as if there’s a chicken-and-egg problem.

Awesome, but don’t forget to add a decent memory controller on your next board.
Competing boards either have a massive amount of fixed memory or SODIMM slots alongside their CPUs.
It would be awesome to have this amount of computational resources and a place to park a ton of data.

This is a huge step forward for the Parallella project, as the smaller chip didn’t really offer much of a difference relative to Intel’s offerings, which go up to around 20 cores. 64 limited-power cores vs. 20 full-power cores was not a compelling proposition. What this chip finally offers is a resounding price/performance advantage over Intel’s rather expensive “big iron”. For example, in the press release for Intel’s E7-8890 v4 chip (which costs over $12k per chip), they talk about 8-socket boards with 196 cores that “start at $200k”. If you imagine 10 of the Parallella 1024-core chips, suddenly you have something that is very competitive: a 10,000-core system at under $20k would be a boon to research everywhere.

Really, at this stage of the evolution of massively parallel machines, the key thing is to get these systems into the hands of graduate schools around the world, so that new programming languages can be worked on. Tucker Taft’s parallel language project, called ParaSail, hasn’t gotten much traction because hardly anyone has 1000 cores to program, so I am hoping the best for this new chip. The NVIDIA massive-core systems are not easy to program, having been retrofitted from 3D graphics chips, and they present many weird quirks to the programmer. The Parallella architecture is much cleaner and tremendously simpler.

It remains to be seen how many algorithms can live within 64MB, but I suspect that the doubling of the RAM in this 5th-gen chip will do the trick for most people. Once you start manipulating images, the memory gets eaten up fast. This whole process is going to take years to play out, but it is a great step forward.
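The price/performance claim above can be made concrete with a rough cost-per-core calculation. All figures are the commenter's estimates, not vendor quotes, and the hypothetical $20k system price is illustrative only:

```python
# Rough cost-per-core comparison using the commenter's figures.
# Caveat: the core types are not comparable one-to-one -- a Xeon core is
# far more capable than a single Epiphany core.
intel_system_cost = 200_000     # 8-socket E7-8890 v4 system, "start at $200k"
intel_cores = 196               # commenter's core count for that system

epiphany_system_cost = 20_000   # hypothetical 10-chip system "under $20k"
epiphany_cores = 10 * 1024      # ten Epiphany-V chips

print(round(intel_system_cost / intel_cores))           # ~1020 dollars/core
print(round(epiphany_system_cost / epiphany_cores, 2))  # ~1.95 dollars/core
```

On these (very rough) numbers the gap is over two orders of magnitude per core, which is the point the commenter is making; real throughput per dollar would depend entirely on how well a workload maps onto the Epiphany mesh.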

I am a bit worried that the money for development came from DARPA; they are well known not to be a non-profit organization, and everything they do serves some “grey” plan. On the other side, the community was not able to provide all the money needed, so this “necessary evil” had to be accepted by Parallella. All in all, this 1024-core chip seems to be “the next big thing” in the “number crunching for everyone” field, and we may come to count the time BEFORE it and AFTER it, as we started to do in the “embedded systems” field in 2012 with the Raspberry Pi. Small size, low energy, and 16nm are good; the small amount of RAM is a question mark for many kinds of calculation, but it is a good beginning. Also, the scaling of efficiency is unknown; it depends on many factors (type of algorithm, data to be moved inside the mesh network, etc.). Only testing in the field will answer this question. What we need to know now is the price of the chip: if it comes in at $1000 it is not a democratization of massive calculation; if it comes in at $100, it is. Of course, Parallella must be rewarded for all the effort they put into the project. I hope Parallella will find the right middle ground between reward and attention to the community when they set the final price, and that with part of the money from selling the 1024-core chip they will have enough revenue to autonomously develop a Parallella-VI with 16384 cores and 1MB of RAM dedicated to every single core 🙂

We can do some speculation: if the 64-core Epiphany-IV at 28nm runs at 500MHz, the 1024-core chip at 16nm should reach 800-900MHz.
For sure it will need at least a heatsink, and in the worst scenario also a fan.
Aside from the single-chip board you will do, are any luxury versions planned, e.g. a mainboard with 4 sockets taking SOMs with the Epiphany-V, to build a system scalable from 1K to 4K (or more) cores?

If it has “extensions” for deep learning (I am not a hardware/chip design person myself), how will programmers make use of it? I have searched for ways to use the Epiphany with TensorFlow or Theano, but gotten nowhere. At least for me, developing machine learning models with C++ on bare metal is not an option.

I share bob’s concerns over DARPA funding and also daniel’s question about a new board based on this chip. If this announcement is for real, then a board based on this chip would be simply astonishing. The potential for advanced algorithm development would be phenomenal, especially if it could be made low cost and affordable. However, I really doubt that DARPA or the NSA would like to see this technology in the public domain. I suspect that any production version made available to the general public would probably be scaled down in some way. Anyway, I don’t get too excited about these announcements anymore. I still remember when they said 64-core boards would be available, but it seems only a few got those. The rest of us had to make do with 18. Sorry to be a killjoy, folks, but all this just sounds too good to be true.

mike ross, you are right. Too often I have heard “we will do” instead of “we already did”. Too often people promise “miracles” (I’m not referring to Parallella, just talking in general) that fail in reality. We must be realistic: let’s wait for the chip to arrive, wait for the board to be ready, wait for the benchmarks to be done. Only then can we say “it is a success”.

[…] McKee). This work continued in 2016 as we needed a way to validate our design decisions for the 1024-core Epiphany-V. Debugging with the simulator is an order of magnitude easier than with hardware, so you should […]

I work at the other end of the spectrum from what this chip seems to be designed for (TFLOPS of processing power). My systems are embedded, and I need them to come in at about 2 watts for the core processing functions. Within that budget, I need about 150 to 160 GFLOPS (single-precision floating point) of real processing capability. This may be at the low end of what your new chip can do. If so, I would hope that cores can be disabled to save power when they are not needed. You are calling this an SoC. Will there be any control processor, or even a soft-core processor on an FPGA, that can “run” the system, or will this chip need a companion chip to handle its interface to the outside world?

Looking forward to seeing the new chip and its performance and power consumption. What is the size of the new chip?

I really like the improvements. I was expecting Adapteva to go the full 4096 cores, shrunk to 14nm, but that wouldn’t leave much room for debugging, optimisation, and other improvements, given the substantial increase in cost.

I’d love to see how well this scales up in performance compared to the previous generation and would be happy to test it 🙂

As I understand it, this is 64KB per processor. So that’s 64KB for both code and data?

It is an interesting architecture, and we will have to warp our minds to figure out how to use this in embedded applications.

It is not going to work well for training deep learning models because of the memory bandwidth bottleneck. However, it should work okay for inference, assuming we can exploit the estimated 1 teraflops capability.

The 64KB of embedded SRAM in each node is not much of a bottleneck. Accessing remote node memory is penalized by latency commensurate with distance. So this assumes access locality, which is quite often a given for many problems.

Any chance for getting a status update? Especially on availability and estimated cost?

It has been 6+ months since the announcement of the tape-out. At the very least, when should we check back for an announcement? Perhaps there is a mailing list we could join, or an email we could sign up with, to make sure we get the announcement when it comes out?

So? From the announcement to now, almost 7 months have gone by!
No update? No info? Nothing?
Shall we reclassify this post from “good news” to “vaporware”?
If there is a delay it can be acceptable, OK, but at least tell us!

We are finalizing the resurrection of our S&L banking, financial, and investment A.I. engine and application. Our system was developed and implemented using Texas Instruments’ “Explorer” Lisp machine and accelerated by AMD’s 2900 bit-slice processor family. We are building a new engine that will run our inference engine, pattern recognition engine, etc. We’d like to begin design-in of an Epiphany-V cluster of 128/256 and need an availability date. Please advise.

Getting close to a year now, and I’m getting scared.
I was around for the initial Kickstarter, but I was a senseless teen at the time without the brains to save up for the original board.
I’ve since (rather recently) come up with a great idea for personal use with a 64-core or higher chip.
Now here’s where my fright enters the room. Usually I can find things I once wanted to purchase within a few hours, but after days of looking I’ve only been able to find the 16-core chips for sale, and at an inflated price of ~$150.
I did find your page explaining the price jump, and I understand. But after reading this, I was excited to see a 1000+ core chip, yet a year later there is no news. My hopes of getting hold of anything above 16 cores are drying out.
This was the first piece of hardware I was ever excited about, back at its announcement in 2012. It’s starting to look like I’ll once again be moving on.