Asynchronous Techniques Push Moore's Law to the Physical Limit

The Globally Asynchronous, Locally Synchronous (GALS) technique can be used to connect multiple IP cores on a single deep-submicron digital IC (e.g., ASIC, SoC, FPGA).

For the reasons we discussed earlier, today's state-of-the-art SoCs and FPGAs contain an ever-growing number of IP cores, each of which is immersed in its own independent clock domain. As the IP core population increases and the minimum transistor size shrinks, it becomes very difficult to keep all the different clock domains synchronized across the die while maintaining the considerable data bandwidth between them required to keep pace with Moore's law.

One of the most promising solutions to this interconnect issue is the Globally Asynchronous, Locally Synchronous approach, a.k.a. GALS. This solution is based on considering each IP core in a design as forming an independent "synchronous island." In this way, each core can be implemented using standard synchronous design tools and confined to a constrained region of the die. Once every synchronous component has been designed and placed, an asynchronous Network-on-Chip (NoC) is deployed to efficiently convey data packets between the different cores as illustrated below:

In future columns, we will discuss how this asynchronous interconnect infrastructure can be efficiently implemented on an SoC, focusing on Xilinx FPGA devices as a practical example. For this purpose, we will only need to make use of a very special kind of building block: the micropipeline, which is a special form of event-driven, elastic data pipeline. Don't worry. I will demonstrate just how easy this can be in a COTS FPGA when using the appropriate design techniques and tools.

By using the GALS approach, we will be able to achieve the low power consumption, rugged behavior, low EMI, and speed advantages associated with asynchronous design techniques in an FPGA while still harnessing all the productivity of its well-known synchronous design toolchain.

But before we go there, let's pause for a moment to once again consider the physical issues pushing us toward this kind of implementation. As data communication has become the most critical part of a design to be optimized, a lot of attention must be devoted to the place-and-route portion of the workflow. Xilinx provides a lot of wonderful tools for controlling the physical implementation of an HDL description, including the PlanAhead floorplanning engine along with very accurate post-place-and-route simulation models. Using this design infrastructure is not only a must when implementing asynchronous designs on an FPGA, but also when optimizing for power, area, and speed in any complex synchronous design.

Now, are you using floorplanning tools and post-place-and-route simulation models, or are you still relying on a "single-click" automated flow? Have you ever been obliged to roll up your sleeves and control your design by hand as a hardware artisan? Or are you comfortable with the "HDL -- it's just a kind of software" point of view?

I don't know why, but I love the concept of asynchronous logic -- I know, I know, synchronous logic was good enough for my father and for his father before him, so it should be good enough for me ... but still, I love the concept.

Well, this blog is not supposed to be a historical introduction to asynchronous logic, but a discussion about the REAL problems that synchronous logic is facing today. Thus, the main purpose of this text is to set the scene for introducing the Globally-Asynchronous, Locally-Synchronous approach, a.k.a. GALS, a tradeoff between hand-made async communication and synchronous islands synthesized with COTS EDA tools.

The point is that the same Manchester University team that built the Amulet is now one of the most active advocates for the GALS approach and the asynchronous NoC -- it's worth noting that Steve Furber, one of the main architects of the original ARM, is one of the leaders of this Manchester async research group... and he launched the Silistix spin-off too, a startup based on async and backed by Intel. But I'm digressing too much...

The real purpose of this blog is to start a logical line of thinking in which the different components and main ideas for building async circuits over COTS FPGAs are introduced and justified. FPGA technology is intended in this framework as an educational platform for learning and playing with REAL async circuits. The next step will be understanding Ivan Sutherland's basic micropipelines; after that, I'll introduce you to an open project full of resources that should be in the logic designer's toolbox... and finally, I promise a major and exciting update from one of the greatest gurus on async: Caltech's Alain J. Martin.

PS. Sorry if I digress too much, but I really love this topic. I've bet on async and lost so many times that now I'm completely aware the next big async thing is really near -- too many lessons learned!!

Intel has been involved with a number of asynchronous logic companies. They bought Fulcrum, which has a number of chips, including Ethernet switches, that use asynchronous logic inside. And they partner with Achronix, which makes FPGAs that utilize asynchronous techniques for internal data pipelining.

But I would add one more deal: Intel funded Silistix, a Manchester University start-up led by Steve Furber (one of the original ARM architects), which focused on the development of an asynchronous on-chip core interconnect.

You can find more info on the Silistix Limited website -- where references to Intel have been conveniently suppressed, of course ;-)

@Max: "Actually... did you know that, in the early days, asynchronous logic was the norm"

In fact, most of the computers that were built in the '50s and '60s had asynchronous logic in their inner workings, e.g. ORDVAC, BRLESC II.

As explained in this blog, the rise of Moore's law enabled by CMOS technology drove the adoption of the synchronous paradigm, but the fall of Moore's law on hitting the nanoscale is signaling the beginning of the new Async era!! -- just ask the Intel guys what is behind most of the papers they are presenting this year at ISSCC ;-)

@Max: "I don't know why, but I love the concept of asynchronous logic"

The reason for this is quite simple: this is the way Nature works!! Just a couple of examples:

BIOLOGY: asynchronous logic behaves just like a neural network, i.e. data synchronization is handled by local handshaking and feedback.

PHYSICS: The synchronous logic paradigm assumes that you can build a clock that ticks at every point in space --the registers-- at the same time... but this is just impossible, as you cannot build such a physical system due to the limitations imposed by Einstein's Relativity.
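To make the "local handshaking" idea in the BIOLOGY example a little more concrete, here is a minimal behavioural sketch in Python of a Muller C-element, the state-holding gate commonly used to build handshake control in asynchronous pipelines. The class name, method name, and input sequence are illustrative assumptions and do not come from the article or from any tool.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element (illustrative)."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        # The output follows the inputs only when they agree;
        # when they disagree, the previous output is held.
        if a == b:
            self.out = a
        return self.out


def demo():
    """Apply a short input sequence and record the output after each step."""
    c = CElement()
    trace = []
    for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
        trace.append(c.update(a, b))
    return trace


if __name__ == "__main__":
    print(demo())  # [0, 0, 1, 1, 0]
```

The output changes only once both inputs have changed, which is the "wait for both parties" behaviour that a request/acknowledge handshake relies on. No global clock appears anywhere in the model.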

And, by the way, I'm really glad to see you and me putting Async Logic into the headlines together again: this is how we got acquainted just a year ago!!

@AZskibum: "I'm curious what tools all you "I love asynchronous logic" guys are using to implement and verify your designs."

Well, last week I used standard VHDL to describe a 4000-stage-deep asynchronous micropipeline in the Xilinx Zynq device that powers my Zedboard. No strange circuitry or tools.... just PlanAhead+Xise+plain HDL description.

Asynchronous design is interesting at various scales, but I also find wave pipelining interesting (which is sort of related to timing-dependent asynchronous design). It seems neat that one can avoid a pipeline latch by timing more pulse-like signals (waves) such that they do not overlap. Sadly, wave pipelining is a victim of fast clocks, process variation, and other modern factors.

@Paul: "Sadly, wave pipelining is a victim of fast clocks, process variation, and other modern factors."

These issues are common to every design in which you must make deterministic timing assumptions. This is one of the reasons why a well-known approach to asynchronous design is considered a promising alternative for the future: delay-insensitive logic.

These circuits are "correct-by-design", as you don't need to make any extra assumptions about process variation, timing, etc. In addition, these circuits automatically adapt their performance to environmental variations, such as temperature drift or a swinging power supply voltage.

Plus, delay-insensitive design can be used on flexible substrates. An interesting example is the 8-bit MCU that Seiko developed some years ago:

Moore's Law says nothing about speed and everything about circuit density. Speed WAS a side effect, but that has waned with advanced nodes. The title should read something like "Asynchronous Techniques Push Circuit Speeds to the Physical Limit". A rather obvious statement if you ask me. Asynchronisity (is that even a word?) has nothing to do with and is completely orthogonal to circuit density.

@tpfj: "Moore's Law says nothing about speed and everything about circuit density"

Moore's law is all about jumping from one process node to the next and the overall exponential advantage. Density, speed and power consumption are all related to the transistor shrinking process.

Moore's law is meaningless beyond CMOS or derivatives, as it badly collapses when data communication replaces data processing as the most expensive process in terms of power consumption, propagation delay, occupied layout area and, of course, monetary cost.

Remember that you cannot increase the overall performance of general-purpose programs just by adding cores/logic.

Back in the nineties, CPUs were sold by highlighting their clock speed, as this was really the parameter that provided an advantage in day-to-day applications, e.g. office suites, graphics, boot-up time.

"Several measures of digital technology are improving at exponential rates related to Moore's law, including the size, cost, density and speed of components. Moore himself wrote only about the density of components (or transistors) at minimum cost."

I agree, the trend is collapsing. Surprisingly, Moore's component, i.e. circuit density, is the one that is holding out the longest. We lost speed a long time ago, we lost power fairly recently, we are now rapidly losing silicon $ cost (just look at current process node fab and related mask costs). I don't believe that asynchronous circuits will solve this either. Sure, they will scoop out all those currently wasted picoseconds on the non-critical synchronous paths as well as any delay savings through flops that will be absent from an asynchronous equivalent design. But realistically, how much of an advantage do you think this will give you against a well-balanced (i.e. many paths near critical speed) synchronous equivalent? I doubt it is double. Also, remember, a synchronous circuit gives you inherent parallelism through pipelining. How do you achieve the same parallelism asynchronously? Really what I'm saying is the laws of physics that are currently kicking our asses will unfortunately still apply to asynchronous circuits, and hence should then suffer in similar ways.

@tpfj: "Really what I'm saying is the laws of physics that are currently kicking our asses will unfortunately still apply to asynchronous circuits, and hence should then suffer in similar ways."

I completely agree with you on this point. Indeed, the really worthwhile advantages come from other side effects derived from this situation, not from a direct speed advantage.

The main message I intended to convey is that our lives have been easy because, by using synchronous design, we have not needed to pay attention to the internal geometry of the datapath for a long while. Now the paradigm has changed and the clock is starting to be a problem in itself.

The huge clock network in modern ICs consumes as much as 50% of the total power. For this reason, some of the most practical applications of async circuits are in RFID cards, in which async MCUs are heavily used.

In addition, the lack of a global clock with a fixed frequency reduces the EMI by orders of magnitude in async parts. For this reason, async is also used in low-power wireless communication modems.

I'm concerned about the GALS (Globally Asynchronous, Locally Synchronous) approach. Aren't you going to have metastability problems with multiple clock islands? Wouldn't one be better off with all islands having the same clock frequency, but with relaxed timing between islands?

@betajet I must admit that occurred to me as well. Back in the day I used to deal with strings of modems used to poll terminals. You HAD to have one clock through the whole system or you'd get errors. If anything puts data out faster than something else can accept it, you'll start losing it. You can use buffers to take up phase differences and slight clock slippage but eventually the buffer will over- or under-flow.

If your buffer is big enough and you're only sending bursts of data you can get away with differences, but we're talking big amounts of data here I think?
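As a rough illustration of the buffer argument above, the following Python sketch estimates how long a FIFO survives when the writer's clock is slightly faster than the reader's and data flows continuously. The function name, the one-word-per-clock assumption, and the example figures (512 free words, 100 MHz, 100 ppm) are all hypothetical and chosen only to show the arithmetic.

```python
import math


def seconds_until_overflow(free_words, f_nominal_hz, mismatch_ppm):
    """Time until a FIFO with `free_words` of headroom fills up.

    Assumes one word is written and one word is read per clock cycle,
    and that the writer's clock runs `mismatch_ppm` parts per million
    faster than the reader's nominal frequency.
    """
    if mismatch_ppm <= 0:
        return math.inf  # the writer is not faster, so the FIFO never fills
    surplus_words_per_s = f_nominal_hz * mismatch_ppm * 1e-6
    return free_words / surplus_words_per_s


if __name__ == "__main__":
    t = seconds_until_overflow(free_words=512, f_nominal_hz=100e6, mismatch_ppm=100)
    print(f"overflow after about {t * 1e3:.1f} ms")
```

With these assumed numbers the surplus is 10,000 words per second, so 512 words of headroom last roughly 51 ms. A bigger buffer only postpones the overflow for a continuous stream, whereas bursty traffic gives the reader time to drain it.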

This GALS idea seems to be just using a similar approach on a chip that is traditionally used on a PCB. The chips on a PCB will probably all be synchronous designs, but they certainly won't all be sharing the same clock tree!

Jack wrote: The chips on a PCB will probably all be synchronous designs, but they certainly won't all be sharing the same clock tree!

Actually, they may be sharing a single PLL-based clock generator that derives all the clocks from a single reference. In that case they are sharing the same clock tree, with the trunks implemented on the PCB and the branches inside different ICs. For example, one of my designs has a 33.333 MHz master clock with matched clock lines so that the clock arrives at the various ICs and I/O modules at the same time. Very clean, plenty of timing margin.

I'm coming at this as an ex-digital IC designer (65nm was considered sexy when I stopped). I was designing standard components, which customers could take and use as they wished. From that point of view, once signals left the IC, all bets were off: not only did I not have any control over the customer's PCB layout, I didn't know what other ICs they might be using.

Jack wrote: Interesting - did you design all the ICs as well as the PCB? ...once signals left the IC, all bets were off.

I designed the contents of the FPGAs. Actually, as an IC designer you do have some control over the PCB since you provide a "data sheet" that specifies the external timing of the chip: combinational delays, clock to output delays, and setup/hold constraints. As the chip designer, you promise the PCB designer that if the timing constraints are met the IC will function properly -- no bets about it.

Well, yes, I can specify stuff on a datasheet, but I want to make it as easy as possible for a customer to use my IC. For a start, my ICs would interface to an IC with processor cores on it, which need software. Consequently, the customer isn't going to change that at the drop of a hat because they've invested a lot of money in that software. So, if my IC is not compatible with the processor IC, my IC isn't going to get selected. If I had specified some interface which required the processor IC to be using some sort of balanced, shared clock, sales revenue would have been $0.

Was your PCB full of FPGAs really fully synchronous? 33MHz only gives you 30ns and that doesn't seem very long to get off one IC, across a PCB and on to another. Any path longer than 30ns would then become a Multicycle Path, which IMHO makes the design not fully synchronous (there will be some combination of process corner, voltage, temperature where the path delay is exactly a multiple of 30ns, which is no good unless you have some circuitry to mitigate the resulting metastability). However, I've been designing at the Matlab level for the last 6 years, so perhaps 30ns isn't so short now?

Jack wrote: 33MHz only gives you 30ns and that doesn't seem very long to get off one IC, across a PCB and on to another.

30 ns is plenty of time if you register inputs and outputs at the I/O pads and have point-to-point connections that are less than 12 inches: clock to out is 5-10 ns, propagation delay is another 5-10 ns including reflections, and setup is maybe 5ns, so plenty of time. A 33 MHz PCI bus has that same 30 ns clock period, and since it's a bus it has slower propagation than point-to-point if there are multiple cards, and requires that some wait state logic be done in the same cycle. People do 66 MHz PCI, though you may need custom logic to meet the timing.
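The budget in the comment above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name is hypothetical, the figures are the ranges quoted in the comment, and clock skew and jitter are ignored, which is a simplifying assumption of this sketch.

```python
def slack_ns(period_ns, clk_to_out_ns, prop_ns, setup_ns):
    """Timing slack for a registered chip-to-chip path.

    A positive result means the data arrives and is stable before the
    next clock edge. Clock skew and jitter are ignored in this sketch.
    """
    return period_ns - (clk_to_out_ns + prop_ns + setup_ns)


if __name__ == "__main__":
    # Upper ends of the ranges quoted above: 10 ns + 10 ns + 5 ns.
    print(slack_ns(30.0, 10.0, 10.0, 5.0))  # 5.0
    # Lower ends of the ranges quoted above: 5 ns + 5 ns + 5 ns.
    print(slack_ns(30.0, 5.0, 5.0, 5.0))    # 15.0
```

Even with the pessimistic figures the path keeps 5 ns of slack inside a 30 ns period, which is the sense in which 30 ns is "plenty of time" for registered point-to-point links.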

@jackOfManyTrades: "This GALS idea seems to be just using a similar approach on a chip that is traditionally used on a PCB"

This is a very clever intuition. In a way, today's ICs resemble a big system that has been collapsed inside the chip package. For this reason, terms such as "System-on-Chip", "System-in-Package" or "Network-on-Chip" are a common topic in state-of-the-art VLSI design -- and things promise to get even more interesting in the future ;-)

@Betajet: "Aren't you going to have metastability problems with multiple clock islands?"

Metastability is a real issue in all asynchronous logic designs, just as it is in synchronous designs dealing with asynchronous external inputs.

About the synchronous islands, let me clarify that having such design blocks doesn't always mean you are using a conventional periodic clock. The synchronous island implies that you have a local clock distribution network that acts as an "isochronic fork". That is, by limiting the clock network to a local boundary, you can ensure that a clock/trigger signal inserted at the clock network input will reach the local registers/flip-flops with a controlled skew inside the synchronous island. In this way, you can use conventional synchronous EDA tools to design, synthesize and lay out the digital logic inside the island.

But the point is that the "clock" you are injecting into such a local synchronous island may be a locally generated signal, which is thereby aligned with the asynchronous handshaking control circuitry. This is what is called a "pausable clock", and it offers an advantage in power consumption, as it only triggers when the synchronous logic block has actual work to do.

Garcia-Lasheras wrote: This is what is called a "pausable clock", and it supposes an advantage in power consumption as this is only triggering when the synchronous logic block has an actual work to do.

I've always called those "gated clocks", and I used them decades ago with TTL. It was easy to do in TTL, because you could make a clock tree from NAND gates instead of inverters and the NAND gates provided a good place to put in a clock enable. You could also use the output enable of a tri-state driver as a clock gate. You can do this as a fully synchronous design: when clocks do toggle, they do so at the same time.

The FPGAs I'm familiar with won't let you do this, because the global clocks don't have clock enables [that are visible to the designer]. So if you want to gate clocks, you have to do it through logic that adds skew and I suspect screws up your hold timing so that it's impractical.

Update: See Brian_D's comment below -- Xilinx does have this capability.

@Betajet: "I've always called those "gated clocks", and I used them decades ago with TTL"

A pausable clock is a different concept. As you point out, clock gating implies that you have a clock signal that is always running and waiting to be injected into the clock distribution network.

A pausable clock is based on locally generated clock bursts. That is, there is specific digital circuitry in charge of generating "trigger" signals only when required. In this way, you don't have a "real clock", but a kind of "pulse" generator -- I understand that the "pausable clock" term may lead to some confusion, but I didn't invent it ;-)
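The difference from a gated clock can be pictured with a small timeline model in Python. This is a behavioural sketch only, not a circuit description: the function name, the request format (arrival time plus number of cycles needed), and the 10 ns period are assumptions made for illustration.

```python
def burst_edges(requests, period_ns):
    """Return the rising-edge times of a pausable clock.

    `requests` is a list of (arrival_ns, cycles) pairs in arrival order.
    The clock starts a burst when a request arrives (or when the previous
    burst finishes, whichever is later), produces `cycles` edges spaced
    `period_ns` apart, and then stops until the next request.
    """
    edges = []
    t_free = 0  # time at which the previous burst has finished
    for arrival_ns, cycles in requests:
        start = max(arrival_ns, t_free)
        for k in range(cycles):
            edges.append(start + k * period_ns)
        t_free = start + cycles * period_ns
    return edges


if __name__ == "__main__":
    # Three cycles of work at t=0, then two more cycles at t=100 ns.
    print(burst_edges([(0, 3), (100, 2)], period_ns=10))
    # [0, 10, 20, 100, 110]
```

Between 20 ns and 100 ns the model produces no edges at all, because no oscillator is running during the pause. A gated clock would instead keep a free-running source toggling and merely block it from the clock network.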

@Zeeglen: "Are you describing an independent clock oscillator that starts up on a trigger so there is always a defined and repeatable time between trigger and first clock edge?"

Well, what I'm describing is a circuit made from logic gates and that includes some "delay" feedback loops -- similar to a ring oscillator, but there are a lot of alternatives for implementing this.

About the defined and repeatable time, this is a very good question. A very clever alternative is to have several different delay loops, so that you can choose the clock period in real time. In this way, you can change the operating speed, but also implement EMI mitigation techniques -- similar to spread spectrum techniques.

@Zeeglen: "Sounds similar to a triggered oscillator using an active digital LC delay line"

Yes, the working mechanism is very similar, but in the case of integrated asynchronous circuits, LC delay lines are not an option -- they are just too big and "expensive".

Instead, you must use the intrinsic delay of the logic gates. In order to make accurate logic delay calculations, the best option is using the logical effort theory, developed by Ivan Sutherland and Bob Sproull in the early nineties.
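As a small worked example of the logical effort model mentioned above, the delay of one stage is commonly written as d = g*h + p, in units of a process-dependent constant tau, where g is the logical effort, h the electrical effort (fanout), and p the parasitic delay. The Python sketch below applies this to a ring of identical inverters. The function names and the value tau = 5 ps are hypothetical, and p = 1 for an inverter is the usual textbook normalization rather than a measured figure.

```python
def stage_delay(g, h, p):
    """Logical-effort stage delay d = g*h + p, in units of tau."""
    return g * h + p


def ring_oscillator_period_ps(n_stages, tau_ps, g=1.0, h=1.0, p=1.0):
    """Period of a ring of n identical inverting stages.

    Each edge must travel around the ring twice per period, so the
    period is 2 * n * d * tau. The ring needs an odd number of
    inverting stages (at least 3) to oscillate.
    """
    if n_stages < 3 or n_stages % 2 == 0:
        raise ValueError("use an odd number of stages, at least 3")
    return 2 * n_stages * stage_delay(g, h, p) * tau_ps


if __name__ == "__main__":
    # Inverter driving one identical inverter: g = 1, h = 1, p = 1 -> d = 2 tau.
    print(stage_delay(1.0, 1.0, 1.0))             # 2.0
    # Five stages with an assumed tau of 5 ps.
    print(ring_oscillator_period_ps(5, tau_ps=5.0))  # 100.0
```

Adding stages or loading each stage more heavily (larger h) lengthens the period, which is one way a delay loop can be tuned using only the intrinsic delay of logic gates.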

@Betajet: "The FPGAs I'm familiar with won't let you do this, because the global clocks don't have clock enables"

I should clarify that I'm not trying to endorse the general use of async logic over COTS FPGAs -- but I did it in the past ;-). The point is that FPGAs are a really affordable platform for learning asynchronous techniques, just as they are for learning conventional synchronous logic.

I truly believe that async methodologies are going to be increasingly used in the future as a way of dealing with the issues associated with deep submicron nodes and new nanoscale technologies.

Now, let me point to a paper I wrote some years ago about the efficient implementation of asynchronous logic over COTS FPGAs. Here, I present some interesting experimental results on the good old Xilinx SpartanIII and VirtexIV. Maybe the most interesting ones are the speeds reached for both data communication [Mega Data Items per second] and pausable clock generation [MHz] -- I needed to perform some live demos in order to convince some researchers these were true:

Clock gating is also supported in most of the clock buffers in Xilinx 7 Series devices, e.g. in global clock buffers (BUFG) and horizontal clock buffers -- a regional clock distribution network -- (BUFH).

You can also select a lot of different clock inputs for each different clock distribution network, including internally generated clocks.

tpfj wrote: "Moore himself wrote only about the density of components (or transistors)".

Based on that, and the impending proliferation of 3D ICs, are we not extending Moore's law vertically, so to speak? True, the transistors you can fit in one layer are getting limited, but we're still putting more transistors on a chip by putting in more layers.

Most likely 3D chips will buy us 1-2 or maybe 3 generations of Moore's law (both in density and partially in price). After that prices won't go down (according to Zvi Or-Bach from MonolithIC 3D), and it would be hard to put more layers due to thermal limits.

@David: "True, the transistors you can fit in one layer are getting limited, but we're still putting more transistors on a chip by putting in more layers"

Yes, this is true. But this increases the issues related to clock distribution problems too!!

In state-of-the-art digital ICs, as much as 50% of the power consumption is drawn by the clock distribution circuitry in order to ensure that the signal will be "skew-free" across the whole layout. This is because we need to include lots of "H" forks and signal buffers to get a clean signal.

When we jump to an extra third dimension, this problem gets worse, as illustrated in the next image -- shared from the École Polytechnique Fédérale de Lausanne (link):

Thanks @Max & @Garcia for those explanations. Max, I did not see your 3d ic blog either - a serious problem on EET these days is that if stuff makes it to the home page it is gone within days or less. Fascinating stuff though. How do they make the Vias in the layers?

Max, what a cool collection of 3D IC-related blogs. They make for pretty nice reading before going to sleep ;-)

By the way, back in 2007, I was trying to launch a project with a big company in order to build a new asynchronous 3D FPGA architecture. The project was really interesting and very smartly planned. Unfortunately, those were bad times for the company, and there were rumours about IBM going to buy it -- finally, Oracle was the one which absorbed the company I was dealing with: Sun Microsystems :-(

@Max: "I remember a time when I thought Sun Microsystems would be around for ever ... it's like the old saying goes: "The bigger they are, the harder they fall" -- it was a sad day when they disappeared :-("

Yes, a very sad day. I really loved Sun Microsystems: its technology, both hardware and software, its openness, its advanced research... Some people think that they were too smart, committed and professional to stay alive in these weird days we are living in now.