Xilinx Starts Unveiling 16nm UltraScale MPSoC Architecture

I just got off the phone with Steve Glaser, senior vice president of corporate strategy and marketing for Xilinx. He very kindly brought me up to date on his company's forthcoming UltraScale MPSoC devices, which will be introduced at the 16nm technology node.

Actually, this can all be a tad confusing, so let's take a moment to set the scene. When Xilinx adopted the 28nm technology node, it released its lowest-power, smallest-form-factor Artix; its best price-and-performance per watt Kintex; and its high-capacity, high-bandwidth Virtex families at this node. It also released its Zynq All Programmable SoC devices at the 28nm node.

The current All Programmable SoC at the 28nm technology node, the Zynq, is hardware, software, and I/O programmable. It boasts a homogeneous dual-core ARM Cortex-A9 microcontroller subsystem (running at up to 1 GHz and including floating-point engines, on-chip cache, counters, timers, etc.), coupled with a wide range of hard-core interface functions (SPI, I2C, CAN, etc.) and a hard-core dynamic memory controller. All this is augmented by a large quantity of traditional programmable fabric and a substantial number of general-purpose input/output pins.

When we come to the 20nm technology node, only the Kintex and Virtex families are being brought forward with the UltraScale architecture. The Artix family will hold the fort at the 28nm node. When we say UltraScale, we're talking about a radical new FPGA architecture that offers massive -- ASIC-class -- I/O bandwidth, memory bandwidth, and DSP processing capabilities. In the case of the programmable fabric, we're talking about millions of logic cells all supported by ASIC-class data-flow and routing resources.

The first 20nm UltraScale devices started shipping in December, but Xilinx is already looking to the 16nm future. In addition to UltraScale FPGAs and 3D ICs, the company will field a family of 16nm multi-processor SoCs (MPSoCs). In addition to UltraScale FPGA fabric, these devices will boast a heterogeneous multicore processing capability.

The software-programmable portion of this image shows an ARM processing element augmented with real-time, graphics, video, waveform, and packet processing units. Xilinx isn't releasing too many details at this time. In particular, it is being very cagey about the identity of the ARM core(s). It will say these devices will offer a combination of new and next-generation processing and programmable engines optimized for different application tasks. Also, the devices will be scalable from 32 to 64 bits with support for virtualization. This scalability includes CPU, interconnect, peripherals, processing engines, and an address space measured in terabytes.

My understanding is that different members of the family will be targeted toward different applications, as illustrated below.

Of course, I cannot contain my excitement to know exactly which ARM processor Xilinx will announce. Will it be a dual or quad core -- or even more? What do you think?

@DrFPGA, betajet; Discussions tend to be too superficial and full of acronyms/buzz words.

Multi-core/parallel programming has not been too successful, yet the discussion is about 2 vs 4 cores. Maybe multiple single cores is more like what DrFPGA has in mind.

The FPGA design methodology has not changed because every register has to be placed and every connection routed. This is aggravated by adding extra registers for pipelining and timing closure. It is no wonder that "compilation" takes forever, since P&R is included.

When a CPU is involved there is an automatic performance limitation due to loading and storing operands. There is just too much serialization even with dual/quad core because there is a single memory.
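One common way to put a rough number on this kind of serialization is Amdahl's law. The commenter does not cite it; it is brought in here only as an illustration. The sketch below assumes that some fraction of the run time is serialized on the single shared memory and cannot be spread across cores. The function name and the 50% figure are hypothetical, chosen only to show the shape of the curve.

```python
def amdahl_speedup(serial_fraction, cores):
    """Ideal speedup from Amdahl's law.

    serial_fraction: share of run time that cannot be parallelized
                     (here assumed to be time spent waiting on shared memory).
    cores:           number of identical cores.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Assumed example figure: half of the run time is serialized.
for n in (1, 2, 4):
    print(n, "cores:", round(amdahl_speedup(0.5, n), 2))
```

Under that assumed 50% figure, going from two to four cores improves the ideal speedup only from about 1.33x to 1.6x, which is the effect the comment is pointing at.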

OK, what can be done? Let's start by looking at C source code. A debugger can single step through the source and display variable contents as well as expression evaluations.

That is because the source line number is in effect the state of a state machine.

Implementing state machines is routine in FPGA design. The number is limited only by chip resources. Expression evaluation starts with 2 operands, then feeds the result of each operator into the next operator. Dual-port memory can be used, and since it is pre-placed, only the routing between the memory and the ALU is required.
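The idea in this comment can be sketched in software. The following is a minimal model, written in Python rather than an HDL, of a program whose "line number" is the state of a state machine and whose expressions are evaluated by starting with the first operand and feeding each result into the next operator. The mini-language, the statement names (`assign`, `branch_if`, `halt`), and the functions `evaluate` and `run` are all invented for this illustration; the strict left-to-right evaluation deliberately ignores C operator precedence.

```python
import operator

# Hypothetical mini-language for this sketch only; none of these names
# come from the comment above or from any vendor tool.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tokens, variables):
    """Evaluate [operand, op, operand, op, ...] strictly left to right.

    Starts with the first operand, then feeds each result into the next
    operator. This ignores C operator precedence on purpose.
    """
    def value(token):
        # Strings name variables; anything else is a literal.
        return variables[token] if isinstance(token, str) else token

    acc = value(tokens[0])
    for i in range(1, len(tokens), 2):
        acc = OPS[tokens[i]](acc, value(tokens[i + 1]))
    return acc


def run(program, variables, max_steps=1000):
    """Step through the program; the state is simply the current line number."""
    variables = dict(variables)
    state, trace = 0, []
    for _ in range(max_steps):
        trace.append(state)
        stmt = program[state]
        if stmt[0] == "halt":
            return variables, trace
        if stmt[0] == "assign":
            _, target, tokens = stmt
            variables[target] = evaluate(tokens, variables)
            state += 1
        elif stmt[0] == "branch_if":
            _, tokens, target_line = stmt
            taken = evaluate(tokens, variables) != 0
            state = target_line if taken else state + 1
        else:
            raise ValueError(f"unknown statement: {stmt[0]}")
    raise RuntimeError("step limit reached")


# Sum the integers 1..n (assumes n >= 1); the list index is the "line number".
PROGRAM = [
    ("assign", "total", [0]),                    # line 0
    ("assign", "i", [1]),                        # line 1
    ("assign", "total", ["total", "+", "i"]),    # line 2
    ("assign", "i", ["i", "+", 1]),              # line 3
    ("branch_if", ["n", "-", "i", "+", 1], 2),   # line 4: back to 2 until i == n + 1
    ("halt",),                                   # line 5
]

final_vars, visited = run(PROGRAM, {"n": 4})
print(final_vars["total"], visited)
```

The `visited` list is the sequence of states, which is what a source-level debugger shows when single stepping. In hardware, each state would be a state-machine state and `variables` would live in the dual-port memory the comment mentions.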

DrFPGA asked: Any other thoughts on what the FPGA guys could be doing to differentiate themselves from the current processor architecture roadmaps?

IMO the problem with FPGAs is not that they need to differentiate themselves from standard CPUs, but that they're already too different in the following two ways:

1. It takes orders of magnitude longer to compile a design for an FPGA compared to the time it takes to compile a program of the same complexity for a standard CPU. The FPGA vendors IMO do not seem to think this is a problem, because they're competing with ASIC design.

2. Standard CPUs have open instruction sets so parties other than the manufacturer can design development tools for them. Thus we have FLOSS compilers that produce excellent code very fast for standard CPUs, but no way to solve problem #1.

FPGAs have much more flexibility than standard processors, so I'd like to see some architectural divergence from the traditional processor quad-core roadmaps.

Why not put in multiple dual-cores instead of going to a quad? FPGAs have always been better at distributed processing, so it makes more sense to me to have more elements more widely distributed on the die. With an ability to put processors and their associated fabric co-processors and peripherals into low-power modes when they are not needed, more separation would be better, right?

Any other thoughts on what the FPGA guys could be doing to differentiate themselves from the current processor architecture roadmaps?

I would prefer a dual core A57 instead of a quad core A53..., but I think I'm likely to be disappointed. Quad core looks more impressive on paper, but at least our applications would be easier to program on a dual core A57.

@Max: For big.LITTLE they'd need to go for an A53+A57 combination. I suspect power-saving isn't so important in typical Zynq designs, so it would seem like a significant effort and complication. It seems unlikely that Xilinx are hoping to displace ASIC SoCs in smartphones...

Given the mention of 64-bit I think we can be fairly sure that A9s won't be present.

We know it's going to be at least an A-53 (next generation, 64-bit). To equal Altera they'll have to put down quad core A-53. I'm hoping they leapfrog A-53 and jump to quad core A-57 processors.

Although >4 cores would provide some interesting possibilities (and compete with Cavium/Tilera), it seems unlikely, as the ARMv8 architecture only supports 4 cores in an SMP cluster. To get >4 cores you need multiple clusters sharing AMBA 5 CHI or AMBA 4 ACE, a risky jump from a relatively simple dual core A9 on the current generation. We can but hope though!