During the Xilinx Developer Forum in San Jose earlier this week, Xilinx showed off a server built in partnership with AMD that uses FPGA-based hardware acceleration cards to break an inference record on GoogLeNet, hitting up to 30,000 images per second in total AI inference throughput. GoogLeNet is a 22-layer-deep convolutional neural network that started as a project for the ImageNet Large Scale Visual Recognition Challenge in 2014.

Xilinx was able to achieve such high performance while maintaining low latency by using eight of its Alveo U250 acceleration add-in cards, which use FPGAs based on its 16nm UltraScale+ architecture. The cards are hosted by a dual-socket AMD server motherboard with two Epyc 7551 processors and eight channels of DDR4 memory per socket. Each of the two 32-core (64-thread) Zen architecture processors (180W) is clocked at 2 GHz (2.55 GHz all-core turbo and 3 GHz maximum turbo) and has 64 MB of L3 cache, memory controllers supporting up to 2TB of DDR4 per socket (341 GB/s of bandwidth in a two-socket configuration), and 128 PCI-Express lanes. The Xilinx Alveo U250 cards offer up to 33.3 INT8 TOPS and feature 54MB of SRAM (38TB/s) and 64GB of off-chip memory (77GB/s). Interfaces include the PCI-E 3.0 x16 connection as well as two QSFP28 (100GbE) connections. The cards are rated at 225W TDPs and cost a whopping $12,995 MSRP each. The FPGA cards alone push the system into the six-figure range (eight cards at MSRP come to $103,960) before including the Epyc server CPUs, all that system memory, and the other base components. You are not likely to see this system in your next Tesla any time soon, but it is a nice proof of concept of what future technology generations may be able to achieve at much more economical price points for AI inference tasks in everyday life (driver assistance, medical imaging, big data analytics driving market research that influences consumer pricing, etc.).
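For a rough sense of scale, the quoted per-card figures can be multiplied out for the eight-card system. The Python sketch below is back-of-the-envelope arithmetic only, not a benchmark. The per-card throughput split assumes the 30,000 images per second scales evenly across the cards, the per-watt figure counts only the cards' rated TDP, and the DDR4-2666 transfer rate used to reproduce the 341 GB/s bandwidth number is an assumption that the article does not state. All function and variable names are made up for illustration.

```python
# Back-of-the-envelope totals for the demo system, using figures quoted in the article.
NUM_CARDS = 8
CARD_INT8_TOPS = 33.3        # peak INT8 TOPS per Alveo U250
CARD_TDP_W = 225             # rated TDP per card, in watts
CARD_MSRP_USD = 12_995       # MSRP per card
SYSTEM_IMAGES_PER_SEC = 30_000


def system_totals():
    """Multiply the per-card figures out for the eight-card system."""
    card_power_w = NUM_CARDS * CARD_TDP_W
    return {
        "int8_tops": NUM_CARDS * CARD_INT8_TOPS,
        "card_power_w": card_power_w,
        "card_cost_usd": NUM_CARDS * CARD_MSRP_USD,
        # Assumes throughput divides evenly across the cards.
        "images_per_sec_per_card": SYSTEM_IMAGES_PER_SEC / NUM_CARDS,
        # Counts only card TDP; CPUs, memory, and the rest of the server are ignored.
        "images_per_sec_per_card_watt": SYSTEM_IMAGES_PER_SEC / card_power_w,
    }


def ddr4_bandwidth_gbs(sockets=2, channels_per_socket=8,
                       mega_transfers_per_sec=2666, bytes_per_transfer=8):
    """Theoretical peak DDR4 bandwidth in GB/s (DDR4-2666 is an assumption)."""
    total_channels = sockets * channels_per_socket
    return total_channels * mega_transfers_per_sec * bytes_per_transfer / 1000


if __name__ == "__main__":
    for name, value in system_totals().items():
        print(f"{name}: {value:,.2f}")
    print(f"ddr4_bandwidth_gbs: {ddr4_bandwidth_gbs():,.1f}")
```

On those assumptions, the eight cards come to $103,960 at MSRP, 1,800W of rated card power, and 3,750 images per second per card.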

Interestingly, this system may hold the current record, but it is not likely to last very long even against Xilinx’s own hardware. Specifically, Xilinx’s Versal ACAP cards (set to release in the second half of next year) are slated to hit up to 150W TDPs (in the add-in-card models) while being up to eight times faster than Xilinx’s previous FPGAs. The Versal ACAPs will use TSMC’s 7nm FinFET node and will combine scalar processing engines (ARM CPUs), adaptable hardware engines (FPGAs with a new full software stack and much faster on-the-fly dynamic reconfiguration), and AI engines (DSPs, SIMD vector cores, and dedicated fixed-function units for inference tasks) with a Network on Chip (NoC) and customizable memory hierarchy. Xilinx also has fierce competition on its hands in this huge AI/machine learning/deep neural network market from Intel/Altera and its Stratix FPGAs, AMD and NVIDIA with their GPUs and new AI-focused cores, and other specialty hardware accelerator manufacturers including Google with its TPUs. (There's also ARM's Project Trillium for mobile.) I am interested to see where the AI inference performance bar will be set by this time next year!

ACAPs are a new product segment meant to solve some of the core difficulties that Xilinx has observed with development on its current FPGA devices. FPGAs traditionally excel in the hands of developers who are oriented more toward the hardware world than the software world. However, these hardware developers are far outnumbered by software developers.

Built from the ground up with complete software programmability in mind, the ACAP concept aims to fix this through easy-to-use software tools, libraries, and runtimes, allowing hardware developers, software developers, and data scientists alike to leverage the power of application acceleration.

In general, ACAPs aim to offer performance levels similar to those of an ASIC while still maintaining the highly programmable nature of an FPGA.

Versal, the first device under this ACAP designation, has been developed by Xilinx in a time that they see as "the era of heterogeneous compute." Versal tackles this prospect of heterogeneous compute through the use of Scalar Processing Engines, Adaptable Hardware Engines, Intelligent Engines, and the integration of advanced interfaces. Versal is built on the cutting-edge 7nm FinFET technology from TSMC.

Arm DesignStart is a program which allows smaller customers to gain quick access to Arm IP. Developers can access the full Cortex-M0, Cortex-M3, and subsystem RTL designs for evaluation and integration into their products.

If a customer decides to utilize this IP in a commercialized product, they are then subject to a success-based royalty model. This is similar to the business model that we've seen 3D game engines like Unreal Engine and Unity move to, where the development tools are free, but the engine holders are paid a percentage of unit sales.

Today's announcement, made in conjunction with Xilinx, removes the royalty requirement traditionally associated with DesignStart. Developers will gain access to Cortex-M1 (a version of Cortex-M0 optimized specifically for use in FPGAs) and Cortex-M3 soft processor IP, as well as software toolchain improvements. Arm IP has been integrated into the Xilinx Vivado Design Suite, allowing for "drag and drop" integration of Arm Cortex-M processors with Xilinx FPGAs.

At a time when competition in the embedded space is stronger than ever from the likes of the RISC-V Foundation, this could be an excellent opportunity for Arm to attract new customers to their ecosystem. As high-speed data processing becomes the norm, the pairing of application-optimized FPGAs and general-purpose microprocessors should become common in the data center and beyond.

Today ARM is announcing their partnership with Xilinx to deliver design solutions for their products on TSMC’s upcoming 7nm process node. ARM has previously partnered with Xilinx on other nodes including 28, 20, and 16nm. Their partnership extends into design considerations to improve the time to market of complex parts and to rapidly synthesize new designs for cutting edge process nodes.

Xilinx is licensing the latest ARM Artisan Physical IP platform for TSMC’s 7nm. Artisan Physical IP is a set of tools that helps roll out complex designs more rapidly than previous generations of products could manage. ARM has specialized libraries and tools to help implement these designs on a variety of processes and get good results even with the shortest possible design times.

Chip design relies on two basic methodologies: custom cell design and standard cell design. Custom cell design allows for a tremendous amount of flexibility in layout and electrical characteristics, but it requires a lot of man-hours to complete even the simplest logic. Custom cell designs typically draw less power and provide higher clockspeeds than standard cell designs. Standard cells are like Legos in that the cells can be quickly laid out to create complex logic. Software called EDA (Electronic Design Automation) can quickly place and route these cells. GPUs lean heavily on standard cells and EDA software to get highly complex products out to market quickly.

These two basic methods have netted good results over the years, but during that time we have seen implementations of standard cells become more custom in how they behave. While they do not achieve full custom performance, these semi-custom efforts have delivered appreciable gains without requiring the man-hours of a fully custom design.

In this particular case ARM is achieving solid results in power and speed through automated design that improves upon standard cells, but without the downsides of a fully custom part. This provides power and speed benefits without the extra power draw of a traditional standard cell. ARM further improves upon this with the ARM Artisan Power Grid Architect (PGA), which simplifies the development of a complex power grid that services a large and complex chip.

We have seen these types of advancements in the GPU world that NVIDIA and AMD enjoy talking about. A better power grid allows the ASIC to perform at lower power envelopes due to less impedance. The GPU guys have also utilized High Density Libraries to pack in the transistors as tightly as possible to utilize less space and increase spatial efficiency. A smaller chip that requires less power is always a positive development over a larger chip of the same capabilities that requires more power. ARM looks to be doing its own version of these technologies and is applying them to TSMC’s upcoming 7nm FinFET process.

TSMC is not releasing this process to mass production until at least 2018. In 1H 2017 we will see some initial test and early production runs for a handful of partners. Full-blown production of 7nm will be in 2018. Early runs and production are increasingly being used by companies working with low power devices. We can look back at the 20/16/14 nm processes and see that they were initially used by designs that do not require a lot of power and run at moderate clockspeeds. We have seen a shift in who uses these new processes with the introduction of sub-28nm process nodes. The complexity of the design, process steps, materials, and libraries has pushed the higher performance and power hungry parts to a secondary position as the foundries attempt to get these next generation nodes up to speed. It isn’t until these low power parts have been pushed through for many months that we see adjustments and improvements in these next generation nodes to handle the higher power and clockspeed needs of products like desktop CPUs and GPUs.

ARM is certainly being much more aggressive in addressing next generation nodes and pushing their cutting edge products on them to allow for far more powerful mobile products that also exhibit improved battery life. This step with 7nm and Xilinx will provide a lot of data to ARM and its partners downstream when the time comes to implement new designs. Artisan will continue to evolve to allow partners to quickly and efficiently introduce new products on new nodes to the market at an accelerated rate as compared to years past.