"Under the hood" of Altera's 20nm offerings

While chatting with Misha Burich, CTO at Altera, I learned myriad tidbits of trivia and nuggets of knowledge of which I was previously unaware...

Last week, the folks at Altera unveiled several of the key innovations they have planned for their next generation of products, which will be implemented at the 20nm technology node (Click Here to see that column: Altera unveils innovations at the 20nm node).

Well, a couple of days ago, I had a chat with Misha Burich, Chief Technical Officer (CTO) at Altera. Misha was extremely open and forthcoming, and I learned myriad tidbits of trivia and nuggets of knowledge of which I was previously unaware…

SoC FPGAs

Let's start with what Altera refer to as SoC FPGAs (the industry is still trying to come to grips with the fact that the "FPGA" moniker is almost 30 years old and no longer reflects the humongous capacity and vast capabilities associated with today's high-end incarnations of these little beauties).

I'm talking about the integration of a dual-core ARM Cortex-A9 MPCore processor with the flexibility of traditional FPGA fabric. Now, to be honest, I hadn’t really given this much thought before, but if you had asked me, I would have guessed that the folks at Altera had simply purchased a pre-implemented / laid-out hard core version of the Cortex-A9 from someone (the current version is at the 28nm node, but the same would apply to the future version at the 20nm node).

In fact, I learned that the guys and gals at Altera licensed the raw RTL (which means they synthesized it and ran place-and-route and suchlike themselves) because they wanted to add functionality and flexibility that "didn't come out of the box." Now you can't change the "core" of the ARM core (if you see what I mean), because this has to retain 100% functional compatibility with other implementations of the core, but you can modify peripheral functions, such as the way in which the core communicates with the FPGA fabric via multiple AXI buses.

Again, I really hadn’t thought about this too much. Of course I knew that the dual Cortex-A9 includes on-chip cache, and that it's imperative to maintain cache coherency between the cores, but I had not given any thought to the fact that it was also necessary to maintain cache coherency between the cores and the programmable FPGA fabric. I'm still trying to wrap my brain around all of this.

More gigabits than you can swing a stick at...

Back in the early 1980s, I remember working on a circuit board for a new computer system. We were going to push the system clock to 1MHz (yes, just one megahertz), which was much higher than we had gone before. We were all jolly proud of ourselves when it worked. If you had told me that we would one day (in my lifetime) have chip-to-chip and chip-to-backplane communications operating at multiple gigabits per second, I would have laughed my socks off.

But year-by-year things get faster and faster. First we had 3.xxx Gbps, then 6.xxx Gbps, and onwards and upwards to the 28Gbps boasted by today's FPGAs implemented at the 28nm technology node. Of course this 28Gbps refers to chip-to-chip communications on the same board. Chip-to-backplane (i.e. board-to-board) communications are much more difficult, because in addition to the fact that you are dealing with a very lossy medium, these signals are much more susceptible to noise effects and reflections and suchlike.

This is why chip-to-backplane data rates trail their chip-to-chip counterparts by at least one generation. Thus, while the state-of-the-art in chip-to-chip communications is 28Gbps at the 28nm technology node, the cutting edge of chip-to-backplane communications is 14Gbps (Click Here to see the column Altera shipping 14.1 Gbps backplane-capable transceivers in 28-nm Stratix V FPGAs).

This is why, in the case of their "innovations at the 20nm node" announcement, the folks at Altera were talking about 40Gbps chip-to-chip and 28Gbps backplane transceivers. Personally, I cannot even conceive how you can make 28Gbps work for backplanes, but Misha assures me that it can be done, although it requires much more sophistication with regard to adaptive and programmable pre-emphasis and equalization technologies.

Misha also noted that, after the 40Gbps barrier is breached, the next step up will be 56Gbps (people tend to say "50Gbps", but the IEEE standard is for 56.xxxGbps). After this point, copper becomes so lossy that we will have to move to optical interconnect, which brings us to…

3D IC technology

First of all, when we talk about 3D ICs in this context, we are actually talking about active-on-passive – that is, active die mounted on a silicon interposer layer. The die are attached using micro-bumps with a 40 micron pitch, which means you can get 6,000 connections on a 10 mm square die.

The folks at Altera are leaping headfirst into this technology with gusto and abandon at the 20nm technology node. Instead of starting out mounting multiple homogeneous (identical) die on the interposer, they say that they are plunging straight into heterogeneous implementations involving disparate die, including memory, Altera's own HardCopy ASICs, third-party ASICs and ASSPs, optical modules, and… the list goes on.

With regard to the 3D IC portion of last week's announcement, one phrase did catch my eye: "At 20-nm, Altera will introduce an innovative high-speed chip-to-chip interface that integrates multiple dies together in a 3D package."

When I questioned Misha about this, he confirmed that they have developed something they call the Universal Interface Bus, which is a well-defined interface that is specifically designed to drive very short traces at very high data rates with very low power consumption. As part of this, the Universal Interface Bus sports small buffers and receivers. Misha says that this interface, which will facilitate designers connecting multiple die, is "almost like memory mapped" – also that each wire can run at over 1GHz, thereby giving a total in-package bandwidth measured in terabits per second (my mind is officially boggled).
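To get a feel for where "terabits per second" comes from, here's a quick back-of-the-envelope calculation. The per-wire rate ("over 1GHz") and the ~6,000 micro-bump connections come from the numbers above; treating every connection as a 1Gbps data wire is a simplifying assumption on my part, purely for illustration:

```python
# Back-of-the-envelope check on the in-package bandwidth claim.
# Assumption (mine, not Altera's): every micro-bump connection is a
# data wire running at a conservative 1Gbps.

wires = 6_000        # micro-bump connections on a 10 mm square die
gbps_per_wire = 1.0  # "each wire can run at over 1GHz"

total_tbps = wires * gbps_per_wire / 1_000
print(f"{total_tbps} Tbps")  # 6.0 Tbps -- comfortably "terabits per second"
```

Even with only a fraction of the connections carrying data, you are still well into terabit territory.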

Next-generation variable-precision DSP blocks

By this point, my mind was starting to get a little "wobbly" around the edges, but there was one more topic I wanted to discuss with Misha – the implementation of floating-point DSP algorithms.

Historically, FPGA designers have typically focused on integer or fixed-point numerical representations, and they've typically steered away from floating-point implementations. All this started to change in late 2011 when Altera demonstrated a new floating-point DSP flow with FPGAs (Click Here to see that article).

This can be a little hard to wrap one's brain around, so let me give you my version of the 30,000-foot view. A floating-point value consists of a mantissa and an exponent. When you add two floating-point numbers together (having first aligned the exponents), you simply add the mantissas together. When you multiply two floating-point numbers together, you multiply the mantissas and add the exponents, and so forth.
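For anyone who'd like to see these mechanics in action, here's a little Python sketch of my 30,000-foot view. This is a toy model for illustration only – real hardware works on bit fields and integer mantissas, not Python floats:

```python
from math import frexp, ldexp

def decompose(x):
    """Split a float into (mantissa, exponent) so that x = m * 2**e."""
    return frexp(x)  # frexp returns m with 0.5 <= |m| < 1.0

def fp_add(a, b):
    """Add two (m, e) pairs: align the exponents, then add the mantissas."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                    # make 'a' the operand with the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb = ldexp(mb, eb - ea)        # shift the smaller mantissa to align exponents
    return (ma + mb, ea)           # add mantissas; share the exponent

def fp_mul(a, b):
    """Multiply two (m, e) pairs: multiply the mantissas, add the exponents."""
    (ma, ea), (mb, eb) = a, b
    return (ma * mb, ea + eb)

def recompose(pair):
    m, e = pair
    return ldexp(m, e)

x, y = decompose(3.25), decompose(0.15625)
print(recompose(fp_add(x, y)))  # 3.40625
print(recompose(fp_mul(x, y)))  # 0.5078125
```

The point to notice is that the multiply path is a single multiplication plus a small exponent add – a natural fit for a hard multiplier – while the add path involves the shifting and alignment work.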

Typical DSP blocks include a multiplier and an accumulator. What Altera have done is to provide designers with a tool that allows them to specify a floating-point algorithm. This tool then takes the algorithm and partitions it and implements it such that the multiplication portion of any DSP operations is performed using the hard core multipliers in the DSP blocks, while the addition portions are implemented in programmable fabric.

As I understand it, they also have additional tricks, such as including a few extra bits which means the values don’t have to be normalized too often, while leaving the end results fully compliant with the IEEE 754 floating-point standard.
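Here's a toy illustration of the general idea as I understand it (my own sketch, not Altera's actual scheme): carry extra precision bits while accumulating, and perform the normalization/rounding only once at the end. Python's arbitrary-precision integers stand in for a wide hardware accumulator:

```python
from math import frexp, ldexp

def accumulate(values, frac_bits=80):
    """Sum floats in a wide fixed-point accumulator; round only once."""
    acc = 0
    for v in values:
        m, e = frexp(v)                        # v = m * 2**e, with 0.5 <= |m| < 1
        acc += round(ldexp(m, frac_bits + e))  # scale exactly into the accumulator
    return ldexp(acc, -frac_bits)              # single normalization/rounding step

# The naive sum rounds after every addition; the wide accumulator doesn't.
print(0.1 + 0.2 + 0.3)              # 0.6000000000000001
print(accumulate([0.1, 0.2, 0.3]))  # 0.6
```

The extra bits soak up the intermediate rounding errors, so the single final rounding can still deliver an IEEE 754-compliant result.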

So here's the interesting thing. Last week's announcement from Altera stated that the enhancements they were making to their next-generation variable-precision DSP block will deliver over 5 TFLOPS of IEEE 754 floating-point performance.

Hmmm, what enhancements could we be talking about, I wonder? Although I could not get Misha to talk about this, I'm speculating that one of these enhancements will be to implement one or more hard-core adders inside each DSP block. These adders could support (and dramatically increase the performance of) floating-point operations as discussed above. Furthermore, if implemented intelligently (and why wouldn’t they be?) these hard-core adders could be made available to designers who weren't using them for floating-point operations, thereby saving programmable logic resources, increasing performance, and reducing power (I'm assuming that unused adders could be completely powered-off, but that's a discussion for another time).

Now I have something to look forward to – awaiting future announcements and discussions to see if I managed to hit this nail on the head…
If you found this article to be of interest, visit Programmable Logic Designline where – in addition to my Max's Cool Beans blogs – you will find the latest and greatest design, technology, product, and news articles with regard to programmable logic devices of every flavor and size (FPGAs, CPLDs, CSSPs, PSoCs...).

Also, you can obtain a highlights update delivered directly to your inbox by signing up for my weekly newsletter – just Click Here to request this newsletter using the Manage Newsletters tab (if you aren't already a member you'll be asked to register, but it's free and painless so don't let that stop you [grin]).