Advantages of the Xilinx Virtex-5 FPGA

Overview

NI LabVIEW FPGA hardware targets use FPGA technology that gives engineers and scientists the ability to create custom, reconfigurable systems with onboard processing. The Xilinx Virtex-5 FPGAs offer a number of performance enhancements over previous generations of FPGAs. Improvements over legacy Virtex-II FPGAs include six-input look-up tables (LUTs), a diagonally symmetric interconnect pattern, and special DSP48E slices for complex math. Additionally, all chips are manufactured using a 65 nm lithographic process. This whitepaper offers an in-depth examination of these improvements.

1. Background

Table 1 shows the major specifications of the Virtex-II and Virtex-5 chips. The number of gates has traditionally been a way to compare FPGA chips to ASIC technology, but it does not truly describe the amount of useful logic inside an FPGA. This is one of the reasons why Xilinx did not specify the number of gates for the new Virtex-5 family.

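| Specification  | Virtex-II 1000 | Virtex-II 3000 | Virtex-5 LX30 | Virtex-5 LX50 | Virtex-5 LX85 | Virtex-5 LX110 |
|----------------|----------------|----------------|---------------|---------------|---------------|----------------|
| System gates   | 1 million      | 3 million      | ----          | ----          | ----          | ----           |
| Slices         | 5,120          | 14,336         | 4,800         | 7,200         | 12,960        | 17,280         |
| Flip-flops     | 10,240         | 28,672         | 19,200        | 28,800        | 51,840        | 69,120         |
| LUTs           | 10,240         | 28,672         | 19,200        | 28,800        | 51,840        | 69,120         |
| Multipliers    | 40             | 96             | 32            | 48            | 48            | 64             |
| Block RAM (kb) | 720            | 1,728          | 1,152         | 1,728         | 3,456         | 4,608          |

Table 1: Virtex-II and Virtex-5 Specifications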

The number of each component is useful for direct comparisons within each family; however, many of the component architectures have been redesigned for the Virtex-5, making comparisons between families difficult. For example, the Virtex-5 LX85 has fewer slices than the Virtex-II 3000, but the Virtex-5 delivers greater performance. It is therefore more useful to examine the specific features of the Virtex-5 to understand its benefits.

Common components such as flip-flops, LUTs, block RAM, and multiplexers make up the basic logic structures on a Virtex FPGA. A collection of these basic structures is referred to as a slice or a configurable logic block (CLB). The definitions of a CLB and a slice are specific to each device family. For instance, a CLB on the Virtex-II is four slices, and each slice contains two 4-input LUTs, two flip-flops, wide-function multiplexers, and carry logic. On the Virtex-5, a CLB is two slices, and each slice contains four 6-input LUTs, four flip-flops, wide-function multiplexers, and carry logic. On the Virtex-5, these base slices are called SLICEL. Some slices have built-in distributed RAM and 32-bit shift registers. Slices with these additions are called SLICEM.
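The slice definitions above explain why slice counts alone are misleading across families. The following minimal Python sketch multiplies out the slice counts from Table 1 using the per-slice LUT counts given in the text. The function name `lut_capacity` is hypothetical, and treating total truth-table bits as a capacity measure is an illustrative assumption, not a Xilinx or NI metric.

```python
def lut_capacity(slices, luts_per_slice, lut_inputs):
    """Return (total LUTs, total truth-table bits) for a device."""
    luts = slices * luts_per_slice
    # An n-input LUT stores 2**n truth-table entries.
    bits = luts * 2 ** lut_inputs
    return luts, bits

# Slice counts from Table 1; LUTs per slice and LUT widths from the text.
virtex2_1000 = lut_capacity(5_120, luts_per_slice=2, lut_inputs=4)
virtex5_lx30 = lut_capacity(4_800, luts_per_slice=4, lut_inputs=6)

print(virtex2_1000)  # (10240, 163840)
print(virtex5_lx30)  # (19200, 1228800)
```

Although the Virtex-5 LX30 has fewer slices than the Virtex-II 1000, it holds nearly twice as many LUTs, and each of those LUTs has a larger truth table.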

Two of the most exciting benefits of the Virtex-5 are differences in the fundamental architecture of the chip components. For the new design, Xilinx increased the capacity of the standard four-input LUT to be a full six-input LUT. Xilinx also designed an improved diagonally symmetric interconnect pattern.

Six-input LUT

With increasingly complex systems, applications requiring wider data paths are more common. Traditional four-input LUTs have become extremely limiting and require many levels of logic to implement complex code. Xilinx has extended the LUT to a six-input LUT for more capacity. The traditional four-input LUT has a truth table capacity for 16 different combinations. The new six-input LUT increases the truth table to 64 different combinations. For example, consider the implementation of a simple comparison between two 16-bit numbers, as shown in Figure 1.

Figure 1: Simple Greater Than Comparison in NI LabVIEW Software

A LUT compares each bit to determine the result, so each 16-bit number requires 16 LUT inputs, and comparing two such numbers requires 32. As shown in Figure 2, if this operation were performed using a four-input LUT architecture, it would require 11 LUTs and three logic levels. With a six-input LUT architecture, the same function uses only seven LUTs and two logic levels.

Figure 2: 16-Bit Comparison Example
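The LUT and logic-level counts quoted for the 16-bit comparison can be reproduced with a simple counting model: the 32 input bits (two 16-bit operands) are funneled through successive layers of LUTs until a single output remains. The sketch below is only an input-counting model under that assumption, not a description of how a synthesis tool actually maps a comparator, and the function name `lut_tree` is hypothetical.

```python
import math

def lut_tree(num_inputs, lut_width):
    """Count LUTs and logic levels needed to reduce num_inputs signals
    to one output using layers of lut_width-input LUTs."""
    luts = 0
    levels = 0
    signals = num_inputs
    while signals > 1:
        stage = math.ceil(signals / lut_width)  # LUTs in this layer
        luts += stage
        levels += 1
        signals = stage  # each LUT produces one signal for the next layer
    return luts, levels

print(lut_tree(32, 4))  # (11, 3)
print(lut_tree(32, 6))  # (7, 2)
```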

Diagonally Symmetric Interconnect

The denser Virtex-5 logic would be limited with the previous interconnect design. There is a relationship between the amount of logic and the amount of connectivity required by the logic block. This is represented by Rent’s rule:

$$T = t \cdot g^{p}$$

where t and p are constants (the Rent coefficient and the Rent exponent), T is the number of terminals (connections at the boundary of a logic block), and g is the number of internal components (logic). Using the appropriate constants, Xilinx calculated that the Virtex-5 requires 50 percent more interconnects than the previous design, the Virtex-4. To achieve this, Xilinx implemented a radically new, diagonally symmetric interconnect pattern, shown in Figure 3.

Figure 3: Interconnect Pattern for Virtex-4 and Virtex-5
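To make the scaling in Rent's rule concrete, here is a minimal Python sketch. The function names and the numeric values (t = 4, p = 0.6, and the block sizes) are illustrative assumptions only; they are not the constants Xilinx used for its calculation.

```python
def rent_terminals(t, g, p):
    """Rent's rule: number of terminals T for a block with g components."""
    return t * g ** p

def terminal_growth(g_old, g_new, p):
    """Factor by which T grows when the logic grows from g_old to g_new.
    The coefficient t cancels out of the ratio."""
    return (g_new / g_old) ** p

# Hypothetical constants, chosen only for illustration.
t, p = 4.0, 0.6
print(rent_terminals(t, 8, p))    # terminals for a block of 8 components
print(terminal_growth(8, 16, p))  # growth in terminals when the logic doubles
```

Because the exponent p is typically below 1, the required connectivity grows more slowly than the logic itself, but it still grows, which is why denser logic blocks call for a richer interconnect.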

This design also improves speed by making more locations reachable with fewer hops. A hop is made when a connection between two CLBs, either adjacent or not, uses one trace, or routing structure. If two different routing structures are needed to make a connection between two CLBs, this is considered two hops. Figure 3 is color coded to show how many logic blocks are reachable within one, two, and three hops from a central CLB. With process advancements, interconnect timing delay can account for more than 50 percent of the critical path delay. Because the distance between blocks is smaller on the new design, and so many more logic blocks are reachable within the two- and three-hop range, the time it takes for signals to arrive at their destinations is greatly reduced. Figure 4 compares the routing delay of the Virtex-5 versus the Virtex-4 design.

Figure 4: Routing Delay Comparison for Virtex-4 and Virtex-5 FPGAs

Furthermore, every time a signal needs to travel between blocks, it goes out of that block and across the metallization layer to the next block, which requires power. The reduced distance not only means increased speed but also lower power consumption.

3. Digital Signal Processing and the DSP48E Slice

Digital signal processing (DSP) is the study, processing, and analysis of digital signals or digitized analog signals. Common applications include sensor array processing, statistical signal processing, and signal processing for digital imaging, communication, and biomedical applications. Traditionally, this work required a dedicated digital signal processor. By taking advantage of hardware parallelism, FPGAs exceed the computing power of digital signal processors (DSPs) by breaking the paradigm of sequential execution and accomplishing more per clock cycle.

Specialty components, such as multipliers, greatly improve FPGA utilization and efficiency for DSP applications. Without these components, the seemingly simple task of multiplying two numbers can become extremely resource-intensive. The Virtex-II had an 18 x 18-bit multiplier, but with the release of the Virtex-4, Xilinx added a specialized logic block called the DSP48 slice. These blocks, designed specifically for DSP data and signal analysis operations, include built-in multiplier and adder circuitry. The multiplier on the original DSP48 slice was 18 x 18 bits. Xilinx extended the multiplier to 25 x 18 bits on the Virtex-5 and named the new block the DSP48E slice. The DSP48E slice supports more than 40 dynamically controlled operating modes, including multiplier, multiplier-accumulator, multiplier-adder, subtracter, three-input adder, barrel shifter, wide bus multiplexers, wide counters, and comparators. It makes all of these functions available without consuming normal logic resources.

The DSP48E slice is also optimized for adder chain implementations, a powerful capability that enables very efficient, high-performance filters. For information regarding filter design using NI LabVIEW and the LabVIEW FPGA Module, see “Designing Filters Using the NI LabVIEW Digital Filter Design Toolkit” in the related links section.

Dedicated routing resources on the inputs and outputs of each DSP48E slice permit any number of slices to be cascaded within a column. This dedicated routing ensures that every DSP48E slice in the chain runs at full speed without consuming any of the fabric routing or logic resources required by other FPGAs.
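The adder-chain idea can be illustrated in software with a finite impulse response (FIR) filter, where each tap performs one multiply-add and passes its running sum to the next tap, much as cascaded slices would. This is a behavioral Python sketch only: the function name `fir_adder_chain` is hypothetical, and the loop executes sequentially, whereas on an FPGA every stage would operate in parallel.

```python
def fir_adder_chain(samples, coefficients):
    """FIR filter written as a chain of multiply-add stages.
    Each stage adds (delayed sample * coefficient) to the sum
    handed over by the previous stage."""
    delay_line = [0] * len(coefficients)
    outputs = []
    for sample in samples:
        # Shift the new sample into the delay line.
        delay_line = [sample] + delay_line[:-1]
        partial_sum = 0
        for tap_sample, coeff in zip(delay_line, coefficients):
            partial_sum = partial_sum + tap_sample * coeff  # one multiply-add stage
        outputs.append(partial_sum)
    return outputs

# An impulse input returns the coefficients themselves.
print(fir_adder_chain([1, 0, 0, 0], [1, 2, 3]))  # [1, 2, 3, 0]
```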

One way to compare the processing power of a DSP slice is the number of multiply-accumulate (MAC) operations you can perform per second, usually expressed as billions of multiply-accumulate operations per second (Giga MACs or GMACs). Figure 5 shows a multiply-accumulate operation graphically in LabVIEW.

Figure 5: Graphical Representation of Multiply-Accumulate Function

In reality, this graphical code does not fully utilize the DSP slice in Virtex-5 FPGAs. You need a single-cycle timed loop with an HDL node to provide the compiler with code that uses the DSP slice, as shown in Figure 6. The actual code used in this example is available on the LabVIEW FPGA IPNet and as an NI Developer Zone example program.

Figure 6: Actual Code Used to Perform Multiply Accumulate Function

NI performed MAC benchmarks for the Virtex-II and Virtex-5 using the HDL node example above. Each was compiled to perform 96 parallel multiply accumulates. The Virtex-II was only able to achieve a rate of 40 MHz while the Virtex-5 was able to perform this test at 100 MHz. The benchmark results translate to 3.84 GMACs for the Virtex-II and 9.6 GMACs for the Virtex-5.
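The GMACs figures follow directly from the number of parallel multiply-accumulates and the achieved clock rate. A minimal Python check of that arithmetic is shown below; the function name `gmacs` is hypothetical.

```python
def gmacs(parallel_macs, clock_hz):
    """Billions of multiply-accumulate operations per second,
    assuming every parallel MAC completes one operation per clock cycle."""
    return parallel_macs * clock_hz / 1e9

print(gmacs(96, 40e6))   # Virtex-II benchmark: 3.84
print(gmacs(96, 100e6))  # Virtex-5 benchmark: 9.6
```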

4. 65 nm Process and Improved Power Efficiency

The power and efficiency of an FPGA chip closely correspond to the size of the lithographic node used in processing. Previous FPGA designs used a 90 nm process. With the introduction of the Virtex-5, Xilinx was the first to design an FPGA chip using a 65 nm lithographic node. This is also the transistor gate oxide thickness, as shown in Figure 7. Do not confuse this transistor gate with the “millions of gates” specification used to indicate the relative size of an FPGA chip prior to the Virtex-4.

Figure 7: Typical Transistor Showing Gate Oxide Thickness

Using a smaller transistor gate, Xilinx reduced the power consumption on the chip because less voltage is needed to drive the gate on the transistor. Core logic uses only 1 V on the Virtex-5, as opposed to 1.2 V on the Virtex-4 and 1.5 V on the Virtex-II.
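The effect of the lower core voltage can be estimated with the common first-order CMOS approximation that dynamic power scales with the square of the supply voltage (P proportional to C x V^2 x f). That approximation is an assumption introduced here for illustration and is not taken from this whitepaper; it also holds capacitance and clock frequency fixed, which a real process change does not. The function name `dynamic_power_ratio` is hypothetical.

```python
def dynamic_power_ratio(v_new, v_old):
    """Ratio of dynamic power at v_new to dynamic power at v_old,
    assuming P ~ C * V**2 * f with C and f unchanged."""
    return (v_new / v_old) ** 2

# Core voltages from the text: 1 V (Virtex-5), 1.2 V (Virtex-4), 1.5 V (Virtex-II).
print(round(dynamic_power_ratio(1.0, 1.2), 3))  # 0.694
print(round(dynamic_power_ratio(1.0, 1.5), 3))  # 0.444
```

Under this simplified model, the voltage reduction alone would account for roughly 31 percent less dynamic power than the Virtex-4 at the same capacitance and frequency.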

There are some disadvantages to the smaller transistor gate oxide thickness. The primary disadvantage is an increase in leakage current or static discharge. Leakage current occurs when the transistor gate is small enough that electrons can tunnel across the barrier, creating a power drain. Xilinx realized there are certain transistors, typically used for configuration, that tend to remain in the on or off position. Even if a transistor remains off, a voltage is still required to keep this state, and the transistor still experiences leakage current. Other transistors are always active and switch states constantly. For these transistors, the power saved by the smaller gate oxide is significant, and leakage current is less of a concern. Xilinx has divided the logic into these two groups: configuration logic and core logic. To mitigate leakage current in configuration logic, the transistor gate oxide is made slightly thicker. In core logic, where leakage current is not a major issue, the transistor gate oxide is still 65 nm. There is also a third type of transistor, used for I/O, that requires 3.3 V on the gate and has a much thicker, 250 nm gate oxide. The three gates are illustrated in Figure 8.

Figure 8: Illustration of the Three Types of Transistor Gate Thicknesses

Preventing leakage current in this fashion greatly improves both dynamic power dissipation and the ratio of static to active power consumption. Dynamic power dissipation on the Virtex-5 is reduced by 35 percent compared to the Virtex-4. Furthermore, the ratio of static to active power consumption is 9.2 percent for a standard 90 nm process; on the new 65 nm process, this ratio improves to only 6.7 percent. The 65 nm process is clearly a step up in design.

5. Advanced Applications

The Virtex-5 has many other useful improvements for even more powerful applications. However, many of these applications require advanced knowledge and are available only using direct HDL programming. With the LabVIEW FPGA Module, you have access to these functions through the use of the HDL function node. For instance, code compiled in LabVIEW FPGA can achieve loop rates of up to 200 MHz. The Virtex-5 is rated at clock speeds of 550 MHz. This is a theoretical maximum that depends on code. Using the HDL node in LabVIEW FPGA, you can program loops to run above the 200 MHz rate offered by graphical LabVIEW FPGA code.

Another advanced feature is Xilinx’s implementation of the true six-input LUT using two 5-input LUTs, as illustrated in Figure 9.

Figure 9: Dual 5-input LUT

Using HDL, you can also configure the six-input LUT as two 5-input LUTs that share their inputs.
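The relationship between one six-input LUT and two 5-input LUTs can be modeled in a few lines of Python. This is a simplified behavioral sketch under the assumption that the 64-entry truth table is split into two 32-entry halves and the sixth input selects between them; the function names are hypothetical and the code is not Xilinx HDL.

```python
def lut_lookup(truth_table, inputs):
    """Return the truth-table bit selected by the input bits (inputs[0] is the LSB)."""
    index = sum(bit << pos for pos, bit in enumerate(inputs))
    return (truth_table >> index) & 1

def lut6_from_two_lut5(table64, inputs):
    """Six-input function: two 5-input LUTs plus a 2:1 multiplexer on the sixth input."""
    lower = table64 & 0xFFFFFFFF  # truth table of the first 5-input LUT
    upper = table64 >> 32         # truth table of the second 5-input LUT
    low_out = lut_lookup(lower, inputs[:5])
    high_out = lut_lookup(upper, inputs[:5])
    return high_out if inputs[5] else low_out

def dual_lut5(table_a, table_b, inputs5):
    """Two independent five-input functions that share the same five inputs."""
    return lut_lookup(table_a, inputs5), lut_lookup(table_b, inputs5)

# Six-input AND: only the all-ones input combination (entry 63) is 1.
and6 = 1 << 63
print(lut6_from_two_lut5(and6, [1, 1, 1, 1, 1, 1]))  # 1
print(lut6_from_two_lut5(and6, [1, 1, 1, 1, 1, 0]))  # 0
```

In the dual-output mode, both halves are used at once: `dual_lut5` returns two results from the same five inputs, which is why the shared inputs matter.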

Beyond the LX family of chips, Xilinx offers three other families of Virtex-5 FPGAs. Xilinx optimizes these families for different applications by changing the relative amount of certain logic blocks. For instance, an FPGA optimized for DSP applications has many more DSP48E slices than a general-purpose FPGA. Table 3 outlines the different families and their advantages.