During the Xilinx Developer Forum in San Jose earlier this week, Xilinx showed off a server built in partnership with AMD that uses FPGA-based hardware acceleration cards to break an inference record in GoogLeNet by hitting up to 30,000 images per second in total high-performance AI inference throughput. GoogLeNet is a 22 layer deep convolutional neural network (PDF) that was started as a project for the ImageNet Large Scale Visual Recognition Challenge in 2014.

Xilinx was able to achieve such high performance while maintaining low latency windows by using eight of its Alveo U250 acceleration add-in-cards that use FPGAs based on its 16nm UltraScale architecture. The cards are hosted by a dual socket AMD server motherboard with two Epyc 7551 processors and eight channels of DDR4 memory. The AMD-based system has two 32 core (64 threads) Zen architecture processors (180W) each clocked at 2 GHz (2.55 GHz all core turbo and 3 GHz maximum turbo) with 64 MB L3, memory controllers supporting up to 2TB per socket of DDR4 memory (341 GB/s of bandwidth in a two socket configuration), and 128 PCI-Express lanes. The Xilinx Alveo U250 cards offer up to 33.3 INT8 TOPs and feature 54MB SRAM (38TB/s) and 64GB of off-chip memory (77GB/s). Interfaces include the PCI-E 3.0 x16 connection as well as two QSFP28 (100GbE) connections. The cards are rated at 225W TDPs and cost a whopping $12,995 MSRP each. The FPGA cards alone push the system well into the six-figure range before including the Epyc server CPUs, all that system memory, and the other base components. It is not likely you will see this system in your next Tesla any time soon, but it is a nice proof of concept at what future technology generations may be able to achieve at much more economical price points and used for AI inference tasks in everyday life (driver assistance, medical imaging, big data analytics driving market research that influences consumer pricing, etc).

Interestingly, this system may hold the current record, but it is not likely to last very long even against Xilinx’s own hardware. Specifically, Xilinx’s Versal ACAP cards (set to release in the second half of next year) are slated to hit up to 150W TDPs (in the add-in-card models) while being up to eight times faster than Xilinx’s previous FPGAs. The Versal ACAPs will use TSMCs 7nm FinFET node and will combine scalar processing engines (ARM CPUs), adaptable hardware engines (FPGAs with a new full software stack and much faster on-the-fly dynamic reconfiguration), and AI engines (DSPs, SIMD vector cores, and dedicated fixed function units for inference tasks) with a Network on Chip (NoC) and customizable memory hierarchy. Xilinx also has fierce competition on its hands in this huge AI/machine learning/deep neural network market with Intel/Altera and its Stratix FPGAs, AMD and NVIDIA with their GPUs and new AI focused cores, and other specialty hardware accelerator manufacturers including Google with its TPUs. (There's also ARM's Project Trillium for mobile.) I am interested to see what the new AI inference performance bar will be set to by this time next year!

ACAPs are a new product segment to solve some of the core difficulties that Xilinx has observed with development via their current FPGA devices. FPGAs traditionally excel in the hands of developers who are more oriented in the hardware world rather than the software world. However, these hardware developers make up only a small percentage compared to the total amount of software developers.

Built from the ground up with complete software programmability in mind, the concept of an ACAP aims to fix this through easy to use software tools, libraries, and runtimes, allowing both the likes of hardware and software developers, as well as data scientists to leverage the power of application acceleration.

In general, ACAPs aim to offer similar performance levels of an ASIC, while still maintaining the highly programmable nature of an FPGA.

Versal, the first device under this ACAP designation, has been developed by Xilinx in a time that they see as the "the era of Heterogeneous compute." Versal tackles this prospect of heterogeneous compute through the use of Scalar Processing Engines, Adaptable Hardware Engines, Intelligent engines, and the integration of advanced interfaces. Versal is built on the cutting edge 7nm FinFET technology from TSMC.

Arm DesignStart is a program which allows smaller customers to gain quick access to Arm IP. Developers can access the full Cortex-M0, Cortex-M3, and subsystem RTL designs for evaluation and integration into their products.

If a customer decides to utilize this IP in a commercialized product, they are then subject to a success-based royalty model. This is a similar business model that we've seen 3D game engines like Unreal Engine and Unity move to, where the development tools are free, but the engine holders are paid a percentage of unit sales.

Today's announcement in conjunction with Xilinx, removes the royalty requirement traditionally associated with DesignStart. Developers will gain access to Arm Cortex-M1, an optimized version of Cortex-M0 specifically for usage in FPGAs, Cortex-M3 soft processor IP, as well as software toolchain improvements. Arm IP has been integrated into the Xilinx Vivado Design Suite, allowing for "drag and drop" integration of Arm Cortex-M processors and Xilinx FPGAs.

At a time when the competition in the embedded space is stronger than ever from the likes of the RISC-V foundation, this could be an excellent opportunity for Arm to attract new customers to their ecosystem. As high-speed data processing becomes the norm, the pairing of application-optimized FPGA and general purpose Microprocessors should become common in the data center and beyond.

Synopsys has just published a video on YouTube where they connect two bonded lanes over a standard Type-C cable. This was accomplished with a host USB 3.2 controller embodied by an FPGA board. The device controller is the same hardware that was configured to be seen by the pair as a USB device.

In case you're wondering, the demo happens at around 1:48. Blink and it's over.

From a practical standpoint, USB 3.2 is still some time away, and a factor of 2 speed-up is not too large considering the amount of bandwidth that USB 3.1 already provides. That said, more bandwidth is always better, especially when you’re running in industrial or other professional workloads, and especially in places where Thunderbolt has marketshare.

Intel and recently acquired Altera have launched a new FPGA product based on Intel’s 14nm Tri-Gate process featuring an ARM CPU, 5.5 million logic element FPGA, and HBM2 memory in a single package. The Stratix 10 is aimed at data center, networking, and radar/imaging customers.

The Stratix 10 is an Altera-designed FPGA (field programmable gate array) with 5.5 million logic elements and a new HyperFlex architecture that optimizes registers, pipeline, and critical pathing (feed-forward designs) to increase core performance and increase the logic density by five times that of previous products. Further, the upcoming FPGA SoC reportedly can run at twice the core performance of Stratix V or use up to 70% less power than its predecessor at the same performance level.

The increases in logic density, clockspeed, and power efficiency are a combination of the improved architecture and Intel’s 14nm FinFET (Tri-Gate) manufacturing process.

Intel rates the FPGA at 10 TFLOPS of single precision floating point DSP performance and 80 GFLOPS/watt.

Interestingly, Intel is using an ARM processor to feed data to the FPGA chip rather than its own Quark or Atom processors. Specifically, the Stratix 10 uses an ARM CPU with four Cortex A53 cores as well as four stacks of on package HBM2 memory with 1TB/s of bandwidth to feed data to the FPGA. There is also a “secure device manager” to ensure data integrity and security.

The Stratix 10 is aimed at data centers and will be used with in specialized tasks that demand high throughput and low latency. According to Intel, the processor is a good candidate for co-processors to offload and accelerate encryption/decryption, compression/de-compression, or Hadoop tasks. It can also be used to power specialized storage controllers and networking equipment.

Intel has started sampling the new chip to potential customers.

In general, FPGAs are great at highly parallelized workloads and are able to efficiently take huge amounts of inputs and process the data in parallel through custom programmed logic gates. An FPGA is essentially a program in hardware that can be rewired in the field (though depending on the chip it is not necessarily a “fast” process and it can take hours or longer to switch things up heh). These processors are used in medical and imaging devices, high frequency trading hardware, networking equipment, signal intelligence (cell towers, radar, guidance, ect), bitcoin mining (though ASICs stole the show a few years ago), and even password cracking. They can be almost anything you want which gives them an advantage over traditional CPUs and graphics cards though cost and increased coding complexity are prohibitive.

The Stratix 10 stood out as interesting to me because of its claimed 10 TFLOPS of single precision performance which is reportedly the important metric when it comes to training neural networks. In fact, Microsoft recently began deploying FPGAs across its Azure cloud computing platform and plans to build the “world’s fastest AI supercomputer. The Redmond-based company’s Project Catapult saw the company deploy Stratix V FPGAs to nearly all of its Azure datacenters and is using the programmable silicon as part of an “acceleration fabric” in its “configurable cloud” architecture that will be used initially to accelerate the company’s Bing search and AI research efforts and later by independent customers for their own applications.

It is interesting to see Microsoft going with FPGAs especially as efforts to use GPUs for GPGPU and neural network training and inferencing duties have increased so dramatically over the years (with NVIDIA being the one pushing the latter). It may well be a good call on Microsoft’s part as it could enable better performance and researchers would be able to code their AI accelerator platforms down to the gate level to really optimize things. Using higher level languages and cheaper hardware with GPUs does have a lower barrier to entry though. I suppose ti will depend on just how much Microsoft is going to charge customers to use the FPGA-powered instances.

FPGAs are in kind of a weird middle ground and while they are definitely not a new technology, they do continue to get more complex and powerful!

UPDATE (Nov 26th, 3:30pm ET): A few readers have mentioned that FPGAs take much less than hours to reprogram. I even received an email last night that claims FPGAs can be reprogrammed in "well under a second." This differs from the sources I've read when I was reading up on their OpenCL capabilities (for potential evolutions of projects) back in ~2013. That said, multiple sources, including one who claim to have personal experience with FPGAs, say that it's not the case. Also, I've never used an FPGA myself -- again, I was just researching them to see where some GPU-based projects could go.

Designing integrated circuits, as I've said a few times, is basically a game. You have a blank canvas that you can etch complexity into. The amount of “complexity” depends on your fabrication process, how big your chip is, the intended power, and so forth. Performance depends on how you use the complexity to compute actual tasks. If you know something special about your workload, you can optimize your circuit to do more with less. CPUs are designed to do basically anything, while GPUs assume similar tasks can be run together. If you will only ever run a single program, you can even bake some or all of its source code into hardware called an “application-specific integrated circuit” (ASIC), which is often used for video decoding, rasterizing geometry, and so forth.

This is an old Atom back when Intel was partnered with Altera for custom chips.

FPGAs are circuits that can be baked into a specific application, but can also be reprogrammed later. Changing tasks requires a significant amount of time (sometimes hours) but it is easier than reconfiguring an ASIC, which involves removing it from your system, throwing it in the trash, and printing a new one. FPGAs are not quite as efficient as a dedicated ASIC, but it's about as close as you can get without translating the actual source code directly into a circuit.

Intel, after purchasing FPGA manufacturer, Altera, will integrate their technology into Xeons in Q1 2016. This will be useful to offload specific tasks that dominate a server's total workload. According to PC World, they will be integrated as a two-chip package, where both the CPU and FPGA can access the same cache. I'm not sure what form of heterogeneous memory architecture that Intel is using, but this would be a great example of a part that could benefit from in-place acceleration. You could imagine a simple function being baked into the FPGA to, I don't know, process large videos in very specific ways without expensive copies.

Again, this is not a consumer product, and may never be. Reprogramming an FPGA can take hours, and I can't think of too many situations where consumers will trade off hours of time to switch tasks with high performance. Then again, it just takes one person to think of a great application for it to take off.

Intel has just revealed what The Register is aptly referring to as the FrankenChip, a hybrid Xeon E5 and FPGA chip. This will allow large companies to access the power of a Xeon and be able to offload some work onto an FPGA they can program and optimize themselves. The low power FPGA is actually on the chip, as opposed to Microsoft's recent implementation which saw FPGA's added to PCIe slots. Intel's solution does not use up a slot and also offers direct access to the Xeon cache hierarchy and system memory via QPI which will allow for increased performance. Another low power shot has been fired at ARM's attempts to grow their share of the server market but we shall see if the inherent complexity of programming an FPGA to work with an x86 is more or less attractive than switching to ARM.

"Intel has expanded its chip customization business to help it take on the hazy threat posed by some of the world's biggest clouds adopting low-power ARM processors."

(Update 10/17/2013, 6:13 PM) Apparently I messed up inputing this into the website last night. To compare FPGAs with current hardware, the Altera Stratix 10 is rated at more than 10 TeraFLOPs compared to the Tesla K20X at ~4 TeraFLOPs or the GeForce Titan at ~4.5 TeraFLOPs. All figures are single precision. (end of update)

Field Programmable Gate Arrays (FPGAs) are not general purpose processors; they are not designed to perform any random instruction at any random time. If you have a specific set of instructions that you want performed efficiently, you can spend a couple of hours compiling your function(s) to an FPGA which will then be the hardware embodiment of your code.

This is similar to an Application-Specific Integrated Circuit (ASIC) except that, for an ASIC, it is the factory who bakes your application into the hardware. Many (actually, to my knowledge, almost every) FPGAs can even be reprogrammed if you can spare those few hours to configure it again.

Altera is a manufacturer of FPGAs. They are one of the few companies who were allowed access to Intel's 14nm fabrication facilities. Rahul Garg of Anandtech recently published a story which discussed compiling OpenCL kernels to FPGAs using Altera's compiler.

Now this is pretty interesting.

The design of OpenCL splits work between "host" and "kernel". The host application is written in some arbitrary language and follows typical programming techniques. Occasionally, the application will run across a large batch of instructions. A particle simulation, for instance, will require position information to be computed. Rather than having the host code loop through every particle and perform some complex calculation, what happens to each particle could be "a kernel" which the host adds to the queue of some accelerator hardware. Normally, this is a GPU with its thousands of cores chunked into groups of usually 32 or 64 (vendor-specific).

An FPGA, on the other hand, can lock itself to the specific set of instructions. It can decide to, within a few hours, configure some arbitrary number of compute paths and just churn through each kernel call until it is finished. The compiler knows exactly the application it will need to perform while the host code runs on the CPU.

This is obviously designed for enterprise applications, at least as far into the future as we can see. Current models are apparently priced in the thousands of dollars but, as the article points out, has the potential to out-perform a 200W GPU at just a tenth of the power. This could be very interesting for companies, perhaps a film production house, who wants to install accelerator cards for sub-d surfaces or ray tracing but would like to develop the software in-house and occasionally update their code after business hours.

Regardless of the potential market, a FPGA-based add-in card simply makes sense for OpenCL and its architecture.