System Hyper Pipelining = 16 MCU cores on a Spartan-6 LX9 FPGA

Well, I'm still trying to wrap my brain around this one… I just heard from my chum Tobias Strauch, who is the founder of EDAptix in Munich, Germany.

It seems that for the past few years Tobias has been working on a rather cool technology called System Hyper Pipelining in which he uses registers to multiply the functionality of IP cores.

In his own words, the example Tobias gives is as follows: "You can usually get only one ARM-compatible core (e.g. Amber from OpenCores) on a Spartan-6 LX9 FPGA running at 60MHz. With System Hyper Pipelining I can multiply this functionality and run 16 cores with an equivalent system performance of 250MHz!"
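
To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the figures from the quote above (60MHz, 16 cores, 250MHz). The assumption that the 250MHz aggregate is shared evenly across the 16 cores is mine, not something Tobias has confirmed, and the variable names are purely illustrative.

```python
# Figures taken from Tobias's quote
single_core_mhz = 60    # one Amber core on a Spartan-6 LX9
shp_cores = 16          # cores after System Hyper Pipelining
shp_system_mhz = 250    # quoted "equivalent system performance"

# ASSUMPTION (not from the source): the aggregate is split evenly per core
per_core_mhz = shp_system_mhz / shp_cores

# Aggregate throughput relative to the single standalone core
system_gain = shp_system_mhz / single_core_mhz

# Each SHP core's speed relative to the standalone core
per_core_fraction = per_core_mhz / single_core_mhz

print(f"Per-core equivalent clock: {per_core_mhz:.3f} MHz")
print(f"System gain over one core: {system_gain:.2f}x")
print(f"Per-core speed vs. standalone: {per_core_fraction:.0%}")
```

If the even-split assumption holds, each core runs at roughly a quarter of the standalone clock, while the system as a whole delivers about four times the throughput of the single 60MHz core.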

If you are interested in learning more about this technology, Tobias invites you to bounce over to www.cloudx.cc and take a look around. In particular, he says you should check out the technology video.

Tobias also has a demo that runs on his low-cost FPGA Arduino board, an image of which I just snagged from the left-hand side of his projects page at www.cloudx.cc/projects.html

As part of a spirited email conversation, Tobias added the following:

Maybe I should clarify the 250 MHz a little bit. The following rules exist in the demo:

Tobias is really excited about this – he feels that this System Hyper Pipelining technology is revolutionary, especially when you compare it with the fact that a single ARM only runs at 60MHz.

The problem, he says, is that this technology is so cool that it is very hard to explain and/or sell to people, which is why (a) he created his demo and (b) he asked me to spread the word (grin).

Please do bounce over to www.cloudx.cc and take a look around, and then comment below to tell us what you think.

@green_is_now These are up to 16 independent MCUs (ARMs). No “shipping memory states external” is involved. They are cycle-accurate to the original core.
This is the difference from multithreading. The 16 CPUs work independently of each other. There is also no “thread management in hardware,” as @shikantaza speculates.

Ali, you are right: deterministic EDP comes seamlessly with SHP.
I'm not sure if performance improves (it actually gets a little bit worse), but the performance per area improves, since the area shrinks. Most importantly, the power drops considerably compared with individual instantiations.
From the system perspective, you get a lot of positive secondary effects (reduced system architecture, better system performance, reduced power consumption due to better data sharing, EDP, …).
Hope the world is not going down today, because this is just so much fun.

These techniques are indeed becoming more prevalent. There can be significant advantages both in cost and performance. XMOS multicore microcontrollers use these techniques with multiple cores which share an execution unit, and a single shared high speed memory. Together with Event Driven Processing, this approach works very well for software peripherals. It can also have real benefits in designing deterministic cores where low latency, real-time responses can be guaranteed by the architecture.

@Max, I forgot to mention that you can do this on peripherals and DSPs as well. If you have 8 Ethernet cores, use one SHP-ed core instead to reduce area. Or improve the latency of DSPs with SHP. In general, you can say the bigger, the better (Cray ;-)). By the way, SHP is attractive for ASICs as well.
@shikantaza Wikipedia clearly distinguishes between multithreading and multiprocessing. SHP plays with multiprocessing and has nothing to do with multithreading, critical-path optimization, C-slow retiming, etc. SHP improves performance per area (ASICs) or performance per slice (FPGAs) of any design that is instantiated multiple times, which is not a bad thing. Having worked on it for a few years now, I have come to realize its impact on the system architecture in the multicore era (power, performance).

Uhhhh... I'm trying to wade through the gorp. I can't figure out whether your bud has managed to reinvent multithreading or reinvent critical-path optimization. It may be both, but mostly it sounds like fine-grain multithreading.
This isn't a bad thing, and it's cool to leverage IP costs in an FPGA implementation, but multithreading has been done before. Intel's chips do either 2-way or 4-way as a means of "hiding" latencies (context switches or I/O waits) to keep the CPU(s) busy. Tera Computing (renamed "Cray" after buying assets) did 512-way. (The jury is still out there...)
Multithreading is one way of squeezing performance from a fixed set of processing resources. It hides latencies, but pushes thread management into hardware. When a system runs out of thread capacity, then what? At some point the multithreading hardware becomes larger than the CPU. Which is the dog? Which is the tail?

Hi Max, yes, if you add the bus system (e.g. AMBA) to the SHP-ed design, you reduce system bottlenecks. SHP has a great impact on the system architecture, which is why I added the word “system” in SHP ;-)