fpgas – HPCwire
Since 1987 - Covering the Fastest Computers in the World and the People Who Run Them

Intel’s FPGAs Target Datacenters, Networking
https://www.hpcwire.com/2016/10/06/intels-fpgas-target-datacenters-networking/
October 6, 2016

Targeting cloud computing, datacenters, network infrastructure and Internet of Things sensors, chipmaker Intel Corp. said it is sampling the latest version of its field programmable gate arrays (FPGAs).

Intel said this week its Stratix 10 FPGAs are the first chips based on its 14-nanometer tri-gate process technology used to implement its HyperFlex architecture. That framework is geared toward advanced computing and data-intensive applications in datacenters. The additional focus on radar and imaging systems underscores the chipmaker’s drive to support connected IoT devices.

The chipmaker is positioning the Stratix 10 as enabling cloud service providers and datacenter operators to boost computing and storage as they hustle to keep up with the computational demands of storing and processing big data. Along with increased power efficiency (as much as 70 percent lower power, the company claimed), the new FPGA is said to reduce latency as computing is moved closer to data.

Among the performance improvements claimed by Intel are a two-fold increase in core performance and a five-fold increase in chip density compared with the previous generation. The Stratix 10 also delivers up to 10 teraflops of single-precision floating point performance and up to 10 terabits per second of memory bandwidth.

Interestingly, the new FPGA embeds a quad-core 64-bit Cortex A53 processor from chip intellectual property vendor ARM Ltd., which was acquired by Japanese technology conglomerate SoftBank Group in July. SoftBank said it was targeting IoT applications with the ARM acquisition.

Along with “performance-per-watt” efficiencies in datacenters, Intel is positioning its latest FPGA as a “multifunction accelerator” for widening and accelerating network bandwidth. “The need for more bandwidth and lower latency in our networks, the need for flexibility of our datacenters to react to new and changing workloads, and the need to manage performance per watt are all key value drivers” for the new FPGA, Dan McNamara, corporate vice president and general manager of Intel’s Programmable Solutions Group, noted in a statement.

McNamara said the chipmaker is currently sampling the Stratix 10 with unnamed customers.

Intel zeroed in on the FPGA market with its 2015 acquisition of market leader Altera Corp. Along with combining Altera’s FPGAs with Intel Xeon processors, the deal was designed to leverage Intel’s advanced process technology and Altera’s design prowess. The result is a system-on-chip that uses Intel’s multi-die technology to integrate a monolithic FPGA fabric into a single package supporting multiple networking protocols.

The decision to use the ARM processor in the Altera system-on-chip was driven by the need for “application-class processing that’s very power efficient and very high performance,” according to Patrick Dorsey, the Altera unit’s senior director of product marketing. The ARM processor “allows us to extend the virtualization capability of the [system-on-chip] from the processor into the hardware fabric of the FPGA itself.”

In the datacenter, Intel is touting its next-generation FPGA as an accelerator as well as a connectivity and storage controller. The major selling point is that FPGAs can perform standard datacenter functions using about one-tenth the power.

Microsoft Eyes AI Supercomputer on Azure
https://www.hpcwire.com/2016/10/03/microsoft-eyes-ai-supercomputer-azure/
October 3, 2016

Microsoft is jumping on the artificial intelligence bandwagon with the formation of a new research group that will seek to make the technology more accessible via its Azure cloud while helping to deliver new capabilities across applications, services and infrastructure.

The infrastructure portion of the effort focuses on combining processing engines like GPUs and FPGAs with improved network connectivity as a way to boost the performance of AI workloads running on Microsoft’s Azure cloud.

Microsoft said last week Harry Shum, a 20-year company veteran who worked on the Bing search and Cortana intelligent personal assistant projects, would head the AI initiative. More than 5,000 computer scientists and engineers work for Microsoft’s AI and Research Group.

Microsoft’s AI initiative seeks to “democratize” AI technology through a focus on agents, applications, services and infrastructure. Infrastructure goals include building what the company asserts would be the world’s most powerful “AI supercomputer” integrated with its Azure Cloud that would help extend access to more users.

The software giant and cloud competitor said the AI supercomputer would combine FPGA and GPU silicon with the Azure cloud as Microsoft researchers look “post-Moore’s Law” and tackle the issue of Moore’s Law running out of steam.

The emerging cloud platform would use an “FPGA fabric” tied to GPU processing to speed applications like machine translation and Bing search queries. “Azure is using this technology for accelerated networking and for a virtual machine that can drive 25 gigabits per second throughput at a 10X reduction in latency,” the company claimed. “Every time you do a Bing query, you’re touching this fabric to get better results.”

That approach would help drive emerging AI services that Microsoft said would require new approaches, such as combining an FPGA fabric with Azure, a cloud architecture that would allow the AI platform to “talk to the network directly.”

Meanwhile, AI researchers are adding GPUs to the cloud mix as a way to achieve scale and boost processing performance.

The company added that customers are currently running CPU-based virtual machines on the cloud architecture to scale production workloads. The addition of FPGAs is credited with boosting network performance in the cloud and improving throughput for many workloads.

The AI initiative led by Shum, executive vice president of the Microsoft AI and Research Group, reflects efforts to expand the 25-year-old Microsoft Research unit to develop disruptive technologies. In the case of infrastructure, the effort looks to combine emerging processing architectures centered on GPUs and FPGAs to create more use cases for the Azure Cloud, which remains a distant second in the public cloud race to market leader Amazon Web Services.

Along with the cloud infrastructure push, Microsoft’s AI effort also includes a group focusing on furthering Bing search and Cortana development along with robotics and what the company referred to as “ambient computing.” Meanwhile, AI services efforts will focus on cognitive capabilities ranging from vision and speech to machine analytics while making those services more widely available to developers.

EU Projects Unite on Heterogeneous ARM-based Exascale Prototype
https://www.hpcwire.com/2016/02/24/eu-projects-unite-exascale-prototype/
February 24, 2016

A trio of partner projects based in Europe – Exanest, Exanode and Ecoscale – are working in close collaboration to develop the building blocks for an exascale architecture prototype that will, as they describe, put the power of ten million computers into a single supercomputer. The effort is unique in seeking to advance the ARM64 + FPGA architecture as a foundational “general-purpose” exascale platform.

Funded for three years as part of Europe’s Horizon2020 program, the partners are coordinating their efforts with the goal of building an early “straw man” prototype late this year that will consist of more than one thousand energy-efficient ARM cores, reconfigurable logic, plus advanced storage, memory, cooling and packaging technologies.

Exanest is the project partner that is focused on the system level, including interconnection, storage, packaging and cooling. And as the name implies, Exanode is responsible for the compute node and the memory of that compute node. Ecoscale focuses on employing and managing reconfigurable logic as accelerators within the system.

Exanest

Manolis Katevenis, the project coordinator for Exanest and head of computer architecture at FORTH-ICS in Greece, explains that Exanest has set an early target of 2016 to build this “relatively large” first prototype, composed of at least one thousand ARM cores.

He says, “We are starting early with a prototype based on existing technology because we want system software to be developed and applications to start being ported and tuned. For the remainder of the two years, there will be ongoing software development, plus research on interconnects, storage and cooling technologies. We also believe that there will be new interesting compute nodes coming out from our partner projects and we will use such nodes.”

In discussing target workloads, Katevenis emphasizes flexibility and breadth, echoing the sentiments we are hearing from across the HPC community. The goal for this platform is to be able to support a range of applications, both on the traditional compute and physics side and the data-intensive side. A look at the Exanest partner list hints at the kind of high-performance applications that will be supported: astrophysics, nuclear physics, simulation-based engineering, and even in-memory databases with partner MonetDB Solutions. Allinea will be providing the ARMv8 profiling and debugging tools.

Although the projects are still in the specification phase, they will be making selections with the aim of overcoming the specific challenges related to exascale. Areas of focus include compact packaging, permanent storage, interconnection, resilience and application behavior. Some of the design decisions were revealed in this poster from Exanest that shows a diagram of the daughterboard and blade design. Note that Xilinx is a key partner.

To achieve a complete prototype capable of running real-world benchmarks and applications by 2018, the primary partners are collaborating with a number of other academic groups and industry partners using co-design principles to develop the hardware and software elements. This is a classic public-private arrangement where academic and industrial partners join forces and industrial partners benefit by being able to reuse the technology that is developed.

On the technology side, packaging and cooling is a key focus for Exanest, which will rely on Iceotope, the immersive cooling vendor, to design an innovative cooling environment. The first prototype will employ Iceotope technology and there is the expectation that technology with even higher power density will be developed as the project progresses.

One of the primary criteria for the project partners is low-energy consumption for the main processor. They have chosen 64-bit ARM processors as their main compute engine. Katevenis affirms that having a processor that consumes dramatically less power allows many more cores to be packaged in the same physical volume and within the same total power consumption budget. “One way we will achieve scale is this low-power consumption,” says the project lead, “but another is by having accelerators to provide floating point performance boost to appropriate applications.”

As for topology, the Exanest team is discussing the family of networks that includes fat trees and the Dragonfly topology. They will be linking blades through optical fibers that they can plug and unplug, allowing them to experiment with more than one topology. Exanest will also be using FPGAs to build the interconnection network so they can experiment with novel protocols.

Exanode

Denis Dutoit, the project coordinator for Exanode, tells HPCwire the goal of that project is to build a node-level prototype with technologies that exhibit exascale potential. The three building blocks are heterogeneous compute elements (ARM-v8 low-power processors plus various accelerators, namely FPGAs although ASICs and GPGPUs may also be explored); 3D interposer integration for compute density; and, continuing the efforts of the EUROSERVER project, an advanced memory scheme for low-latency, high-bandwidth memory access, scalable to exabyte levels.

Dutoit, who is the strategic marketing manager for the architecture, IC design and embedded software division at CEA-Leti, notes that this is a technology-driven project at the start, but on top of this prototype there will be a complete software stack for HPC capability. Evaluation will first be done at the node level, explains Dutoit, using emulated hardware and representative HPC applications. After that, Exanest will reuse these compute nodes and integrate them into its complete machine for full testing and evaluation with real applications.

There will be a formal effort to productize the resulting technology through a partnership with Kaleao, a UK company that focuses on energy-efficient, compact hyperconverged platforms.

Ecoscale

Iakovos Mavroidis, project coordinator for Ecoscale, says that while there are three main projects, he sees it as one big project with Ecoscale dedicated to reconfigurable computing.

A member of the Computer Architecture and VLSI Systems (CARV) Laboratory of FORTH-ICS and a member of the Telecommunication Systems Institute, Mavroidis notes that the main problem being addressed is how to improve today’s HPC servers. Simple scaling without improving technologies is infeasible due to utility costs and power consumption limitations. Ecoscale is tackling these challenges by proposing a scale-out hybrid MPI+OpenCL programming environment and a runtime system, along with a hardware architecture tailored to the needs of HPC applications. The programming model and runtime system follow a hierarchical approach where the system is partitioned into multiple autonomous workers (i.e., compute nodes).
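To make that hybrid worker model concrete, the sketch below pairs MPI ranks with OpenCL offload on whatever device each node exposes. It is a generic illustration of the MPI+OpenCL pattern Ecoscale describes, not code from the project; the kernel, buffer sizes and final reduction are placeholder choices, and error checking is omitted.

```c
/* Hybrid MPI + OpenCL worker sketch (illustrative only).
 * Each MPI rank ("worker") offloads a trivial kernel to the OpenCL
 * device it sees locally, then partial results are combined with
 * MPI_Reduce. Error checking is omitted for brevity. */
#include <mpi.h>
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void scale(__global float *x) {"
    "  int i = get_global_id(0); x[i] = x[i] * 2.0f; }";

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each worker owns a slice of the problem. */
    enum { N = 1024 };
    float data[N];
    for (int i = 0; i < N; i++) data[i] = (float)rank;

    /* Standard OpenCL host-side boilerplate. */
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    /* Combine per-worker partial sums across the machine. */
    double local = 0.0, total = 0.0;
    for (int i = 0; i < N; i++) local += data[i];
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum over %d workers = %f\n", nranks, total);

    clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    MPI_Finalize();
    return 0;
}
```

In Ecoscale’s vision the OpenCL device behind such a call could be local or remote reconfigurable logic, with the runtime deciding placement, as the project coordinator describes next.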

“The main focus of Ecoscale is to support shared partitioned reconfigurable resources, accessed by these compute nodes,” says Mavroidis. “The intention is to have a global notion of the reconfigurable resources so that each compute node can access remote reconfigurable resources not only its own local resources. The logic can also be shared by several compute nodes working in parallel.” To accomplish this, workers are interconnected in a tree-like structure in order to form larger Partitioned Global Address Space (PGAS) partitions, which are further hierarchically interconnected via an MPI protocol.

“The virtualization will happen automatically in hardware and it has to be done because reconfigurable resources are very limited unless remote access is enabled,” states Mavroidis. “The aim is to provide a user-friendly way for the programmer to use all the reconfigurable logic in the system. This requires a very high-speed low-latency interconnection topology and this is what Exanest will provide.”

Mavroidis explains there must be means for the programmer to access the system, and at a higher level the run-time system has to be redefined to understand the needs of the application so it can reconfigure the machine. He believes that in order to fully implement this, there will need to be innovation in all layers of the stack, and the programming model itself will also need to be redefined. The partners are aiming to support most of the existing and common HPC libraries in order to make this architecture available to most existing applications.

The main focus of Ecoscale is to automate out the complexity of FPGA programming. Anyone who has watched FPGAs struggle to get a foothold in HPC knows this is not an easy task, but the need for low-power performance is driving interest and innovation. “The programmer should not have to be aware that the machine uses reconfigurable computing, but rather be able to write the program using high-level programming model such as MPI or Standard C,” states Mavroidis.

On a related note, Exanest project partner BeeGFS has just announced that the BeeGFS parallel file system is now available as open source from www.beegfs.com. “Although BeeGFS can already run out of the box on ARM systems today, this project [Exanest] will give us the opportunity to make sure that we can deliver the maximum performance on this architecture as well,” shares Bernd Lietzow, BeeGFS head for Exanest.

Shining a Light on SKA’s Massive Data Processing Requirements
https://www.hpcwire.com/2015/06/04/shining-a-light-on-skas-massive-data-processing-requirements/
June 4, 2015

One of the many highlights of the fourth annual Asia Student Supercomputer Challenge (ASC15) was the MIC optimization test, which this year required students to optimize a gridding algorithm used in the world’s largest international astronomy effort, the Square Kilometre Array (SKA) project.

Gridding is one of the most time-consuming steps in radio telescope data processing. To reconstruct a sky image from the data collected by the radio telescope, scientists need to take the irregular sampled data and map it onto a standardized 2-D mesh. The process of adding sampled data from the telescopes to a grid is called gridding. After this step, the grid can be Fourier transformed to create a sky image.
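In code, gridding boils down to a scatter-add: each irregular sample is smeared onto the grid cells around it through a small convolution kernel, which is why memory traffic rather than arithmetic dominates. Below is a minimal sketch, using real-valued weights, a square support window and no bounds checking; production radio-astronomy gridders work on complex visibilities with oversampled kernels.

```c
/* Minimal gridding sketch: scatter irregular samples onto a 2-D grid
 * using a small convolution kernel. Real gridders use complex
 * visibilities and oversampled kernels; this is illustrative only. */
#include <stddef.h>

#define GRID    4096   /* grid is GRID x GRID cells     */
#define SUPPORT    7   /* kernel support (7 x 7 window) */

void gridding(size_t nsamples,
              const double *u, const double *v,     /* sample coordinates */
              const double *value,                  /* sample values      */
              const double kern[SUPPORT][SUPPORT],  /* convolution kernel */
              double grid[GRID][GRID])
{
    for (size_t s = 0; s < nsamples; s++) {
        /* Nearest grid cell for this sample. */
        int gu = (int)(u[s] + 0.5);
        int gv = (int)(v[s] + 0.5);

        /* Smear the sample over the kernel's support region.
         * Many reads and writes for very little arithmetic:
         * this is the memory-bound hot loop discussed below. */
        for (int j = 0; j < SUPPORT; j++)
            for (int i = 0; i < SUPPORT; i++)
                grid[gv + j - SUPPORT/2][gu + i - SUPPORT/2]
                    += kern[j][i] * value[s];
    }
}
```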

To say that radio astronomy is pushing the limits of data processing is an understatement. Consider that the data produced by SKA per second is expected to exceed 12 TB, and nearly 50 percent of this astronomy data needs to be processed through gridding. In a 2012 paper, Netherlands Institute for Radio Astronomy (ASTRON) researcher John W. Romein placed SKA phase one image processing requirements in the petaflops range; the full-scale project will cross into exaflops territory.

Unlike the other five ASC15 test applications (LINPACK, NAMD, WRF-CHEM, Palabos and the surprise app HPCC), which run on Inspur-provided racks with a maximum power consumption limit of 3,000 watts, the gridding app is run on a separate platform provided by the committee consisting of one login node and 16 computing nodes. The 16 nodes are outfitted with two CPUs (Intel Xeon E5-2670 v3, 12-core, 2.30 GHz, 64 GB memory) and one MIC card (Intel Xeon Phi 7110P, 61 cores, 1.1 GHz, 8 GB memory) connected over InfiniBand FDR.

The gridding portion of the ASC15 challenge is worth 20 points out of a possible 100 and the team with the fastest run time is awarded the e-Prize award, which comes with $4,380 in prize money. During the awards ceremony held Friday, May 22, the winner of this challenge was declared to be Sun Yat-sen University. This was not Sun Yat-sen’s first time being honored in an ASC competition. Last year, the team rewrote the LINPACK record by achieving a peak performance of 9.272 teraflops within the 3,000 watts power budget.

Sun Yat-sen University was victorious in this effort, but they were not alone in their ability to impress the judges, a panel of HPC experts that included ASTRON researcher Chris Broekema, compute platform lead for the SKA Science Data Processor (essentially the HPC arm of SKA). Broekema shared with HPCwire that while the solutions the students came up with were not entirely new ideas, the quality of the teams’ work exceeded his expectations.

The 16 teams who competed in the ASC15 finals were allowed to research the application in advance, but the way they tackled the problem showed creativity and an understanding of the main issues involved in optimizing this I/O-bound algorithm. In fact, they managed to get fairly close to the state of the art in just a couple of weeks, according to Broekema.

While the various teams employed different optimization techniques, Broekema said that the best results came from teams that completely reordered the way the data was handled and altered the structure of the different loops. This led to a result that was essentially one step short of the most successful optimization developed by the SKA community.

One of the primary challenges of this algorithm relates to memory accesses, something that was correctly identified by most of the teams. Gridding involves many memory reads and writes for very little compute. The current state-of-the-art in addressing this imbalance is to sacrifice compute for reduced memory accesses. Implementing this solution takes a while, and requires a complete rethink of the way you go through your data.

“Even though it’s a bit more expensive in terms of compute, the fact that it’s far more efficient in going through memory makes it a far more efficient implementation of gridding altogether,” Broekema explained.

According to the ASC15 committee, the application selected for the MIC optimization test should be “practical, challenging and interesting.” Asked why this application was a good fit for the contest, Broekema responded that the shortness of the code snippet engendered a much more detailed analysis of what’s happening in the actual code, compared to the other applications, which, being established and somewhat bulky code bases, can be very difficult for students to fully penetrate. While the snippet allowed for a more meaningful challenge in some ways, Broekema is already thinking about ways to fine-tune the test code to further enrich the student experience. He wants to make it more like real-world implementations so students can get a feel for how it is used in practice.

MIC optimization is one of many projects that Broekema and his colleagues are working on. Several of the SKA processing workloads, including the gridding algorithm, have been optimized for GPUs, he said, but it can work for other platforms as well, including MIC, FPGAs and ASICs. Each of these necessitates a different approach to data handling. A number of benchmarking efforts have already been completed and others are underway as the SKA ramps up to its 2017 Phase 1 launch.

Broekema’s next point drove home just how integral platform evaluation is to the greater SKA effort. “One of the undertakings of the SKA community in general is looking at the various platforms that are currently available and the various algorithms important to the work to see how they map on those platforms,” he said. “This isn’t confined to the Science Data Processor [the high-performance computing component of the SKA].”

“Before data is sent to the Science Data Processor, which does the gridding, Fourier transforming, etc., there’s the central signal processor, essentially the correlator, which involves a very large amount of fairly simple algorithms – correlation, filtering, and also Fourier Transforms, probably on fixed integer size data – and those may well be done in FPGAs or ASICS, although it’s also possible to use accelerators like GPUs or Phi. So there’s a range of algorithms, correlators, Fourier transforms, gridding, convolutions, filters, etc., that are analyzed for different kinds of platforms, to see what is the best combination of platform and implementation.”

Asked whether FPGAs/ASICs wouldn’t be the best choice in terms of highest performance and performance per watt, Broekema said they are still very hard to program, which increases the risk of a late implementation. It’s also his opinion that the performance gap between GPUs and FPGAs is narrowing fairly quickly. It used to be several factors of discrepancy, but now it’s just a couple of dozen percent, he reported, and implementation times of months (with GPUs) rather than years (with FPGAs) are a great advantage as well.

After a slight pause, however, Broekema began laying out the factors that could turn the tide toward FPGAs, starting with Intel’s purchase of Altera on Monday. The February announcement that Altera FPGAs will be manufactured on Intel’s 14 nm tri-gate transistor technology was cited as another reason to believe that FPGAs will continue to maintain their energy-efficiency edge over GPUs. And the fact that the reconfigurable chips can now be programmed using OpenCL promises to ease one of their main weaknesses. Just how much having this OpenCL support changes the FPGA programming paradigm is something that the SKA HPC group will be exploring with a new pilot project.

In summary, Broekema characterized the boundary between different kinds of programmable accelerators as fuzzy, which is why they are taking a look at all of them. “FPGAs are getting easier to integrate,” he stated. “There’s the Xeon Phi, which has the advantage of being easier to program and looking more like a regular Xeon, but they are a little late to the party and performance is not optimal at the moment. We did benchmarks on DSPs as well, and found them to be even more difficult to program than FPGAs.”

For all this benchmarking, GPUs are currently the preferred accelerator within the SKA community and the one deployed in production environments.

While the research into different platforms is being carried out by and for the benefit of the radio astronomy community in preparation for the immense SKA radio telescope, the value does not end there. “There’s an obvious parallel with medical imaging,” Broekema told HPCwire. “The data from large MRI machines, they do fairly similar work; then there’s the multimedia sector, streaming video has very similar data rates,” he said.

More significant is the potential for shared lessons going forward as HPC and even general computing become ever more data-laden. Radio astronomy knows all about these extremely I/O bound algorithms, where the data rates far exceed the compute element. The skewed ratio between I/O and compute is set to skew even further in the future, according to Broekema, and not just in radio astronomy.

“The problems that we face now are probably indicative of the problems that everyone is going to face in the next few years,” he commented. “So in that sense, I believe that the problems that we solve are useful for pretty much the entire HPC community and possibly even computer science in general.”

The ASTRON scientist recalled an example of this synergistic cross-HPC pollination from several years ago. The systems software team at Argonne National Lab built an operating system extension intended for high-performance computing on their Blue Gene systems, and the radio astronomy community co-opted it with great success for the Blue Gene that performed the data processing for LOFAR.

“Many of the optimizations that we come up with are equally valuable and equally useful for other HPC and other computer science applications,” he stated.

SRC Debuts Defense-Hardened FPGA Server
https://www.hpcwire.com/2015/05/28/src-debuts-defense-hardened-fpga-server/
May 28, 2015

The SRC Saturn 1 server debuted today, built by reconfigurable computing specialists SRC Computers and positioned as an x86 alternative for high-volume workloads in hyperscale computing and big data analytics.

While we have seen several FPGA-based server solutions come to market in the past year, the Saturn 1 is distinguished by its pedigree and insulation from the usual complexities of FPGA programming. SRC was founded in 1996 by supercomputing innovator Seymour R. Cray and Intel investor and former director James Guzy. Cray was focused on how to get the most horsepower out of a very small number of processors and how to harness the power of multiple processors, while Guzy et al. at Intel were building a general purpose processor designed for the personal computer. Recognizing that there are limitations to using a general purpose processor for very specific tasks, the duo came up with the platform that would become the Saturn 1.

For 12 years, the FPGA-based server and the CARTE development platform have been used for a variety of defense and intelligence solutions, but SRC thinks the time is right to enter a wider market.

Dave Eaton, vice president of sales and marketing at SRC Computers, explained to HPCwire that many of the requirements developed for defense carry over to high-performance enterprise apps, for example the push for more performance per processor, a minimized footprint and reduced power.

“The server market is starting to ask a lot of the same questions,” says Eaton.

Moore’s law-driven performance boosts based on faster clock rates petered out in the mid-2000s. Multicore technology adds several cores to one die to increase throughput, but it doesn’t address the inherent problem: microprocessor performance has hit a wall. Unlike the x86 microprocessor, which uses 90 percent of its resources to store and shuttle data to the remaining 10 percent, the SRC design sheds unnecessary overhead by using only the logic required by the workload. All instructions are executed at the same time – in just one clock cycle. SRC reports that the server is up to 500x faster than traditional x86 designs.

SRC has put over $100 million into designing this new platform, which has been hardened by over a decade of government use. The firmware is version 8 and CARTE is on version 12.

The Saturn 1 server cartridge is available from SRC Computers as a standalone server and through HP as part of its Moonshot chassis. There are two Stratix-IV FPGAs on one card: the top one, the user FPGA, is completely available for the customer’s application, while the system FPGA below it carries software written by SRC that ties all the components together into a unified system. There are also two memory banks, a small four-core Intel processor, and an edge connector with multiple Ethernet ports.

The server cartridges can be dropped into an HP Moonshot chassis, which has a passive backplane and is processor agnostic. Each server runs at 45 watts, and with 42 Saturn 1 servers in a Moonshot chassis, the power load is about 2,000 watts (42 × 45 W is roughly 1,900 W). A rack outfitted with nine chassis draws about 20,000 watts.

The server is equipped with Stratix-IV FPGAs, but the platform is actually processor agnostic. Earlier iterations employed parts from Lucent, Altera and Xilinx before returning to Altera for the current version. The driving factor for FPGA selection is access to pure horsepower and avoidance of unnecessary industry-specific add-ons.

The deterministic nature of the platform means that programs executed on the Saturn 1 will run with the exact same performance every single time. Other benefits compared with traditional x86 processors include orders of magnitude performance increases, reduction in processing power consumption, and smaller footprint.

Another benefit that SRC emphasizes is ease of use. The time and effort required to program FPGAs has kept them from enjoying wider adoption. Instead of being relegated to a low-level language like Verilog, SRC customers can use the custom development environment, CARTE, to write programs in a familiar high-level language, such as C, Python or Ruby, so the server can be deployed just like a regular x86 server. SRC reports that companies have ported their applications to the Saturn server in just three days.

As for HPC-specific use cases, Eaton says the Saturn 1 is ideal for addressing hot spots – that specific place in a program or algorithm that burns up 90-95 percent of CPU cycles. He shares the example of a financial institution that prices options using a Black-Scholes algorithm souped up with additional CPU-intensive constants and adjustments. Since moving over to Saturn 1, shops like this one are seeing 20-30x speedups over what they were getting on the most performant x86 machines. This also allows them to add even more features to the algorithms without slowing down the processing, says Eaton.
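For reference, the plain closed-form Black-Scholes call price is only a handful of floating point operations per option; the CPU burn comes when such a kernel is evaluated across millions of scenarios with extra adjustments layered on. A textbook C version (the generic formula, not the institution’s customized variant) looks like this:

```c
/* Textbook Black-Scholes price for a European call option.
 * S: spot price, K: strike, r: risk-free rate,
 * sigma: volatility, T: time to expiry in years. */
#include <math.h>

static double norm_cdf(double x)              /* standard normal CDF */
{
    return 0.5 * (1.0 + erf(x / sqrt(2.0)));
}

double black_scholes_call(double S, double K, double r,
                          double sigma, double T)
{
    double d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T)
                / (sigma * sqrt(T));
    double d2 = d1 - sigma * sqrt(T);
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2);
}
```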

For those worried about the lack of flexibility of the FPGA model, it may be helpful to know that processors can be completely reprogrammed in a little less than a second. This means servers can be optimized on the fly to match peak workloads. For example, using the Saturn 1, a company like Google would be able to configure the processors to handle search requests during peak daylight hours and then at night when searches are less frequent, the server environment could be optimized for Web crawling.

“You’ve heard about the search for the software-defined server?” inquires Eaton. “It’s one thing to have a software-defined server where I can change memory configuration or change how much hard drive space is there, now we’re talking about defining the actual processor on the server itself with software.”

The SRC server is shipping immediately; however, very large orders of several hundred systems may take six to eight weeks to fill. Servers cost $19,950 each, and volume pricing is available. Licenses for the CARTE development environment are $14,000 per seat, and the company offers a three-day workshop for learning CARTE that is priced according to the number of participants. Support for both the software and hardware comes through SRC if desired, or, if the HP Moonshot chassis version is purchased, HP can provide support based on customer needs.

FPGA-Accelerated Search Heads for Mainstream
https://www.hpcwire.com/2015/03/10/fpga-accelerated-search-heads-for-mainstream/
March 10, 2015

With data volumes now outpacing Moore’s Law, there is a move to look beyond conventional hardware and software tools. Accelerators like GPUs and the Intel MIC architecture have extended performance goals for many HPC-class workloads. Although field-programmable gate arrays (FPGAs) have not seen the same level of adoption for traditional HPC workloads, a subset of big data applications has proved a good fit for the processor type.

The area where FPGAs have shined brightest is in search-based applications. Most notably, Microsoft is deploying FPGA-accelerated nodes for its Bing service and Asian Web giant Baidu has implemented a similar approach for pattern and image recognition.

Now we are seeing the debut of an FPGA-based appliance that is accelerating select analytics workloads and giving Hadoop and Spark clusters a run for their money. Ryft, a company that has been serving the government sector for over a decade, is jumping into the commercial analytics space with Ryft ONE, a 1U analytics appliance that combines Xilinx FPGAs, twin parallel server backplanes and up to 48 TB of SSD-based storage with custom software primitives that hide the underlying hardware complexity. According to company benchmarks, the Ryft ONE platform can analyze data at speeds up to and exceeding 10 gigabytes per second, some 100 to 200 times faster than the fastest 4-core CPU/15 GB RAM servers.

Bill Dentinger, vice president of products for Ryft, explains that they are targeting customer applications that require real-time insights from both streaming and historical data sets, in many cases simultaneously. After ten years of working on complex data sets with the government, Ryft drew from that experience to create a purpose-built solution that looks like a general-purpose Linux server but acts like a high-performance computer.

Pat McGarry, vice president of engineering, says Ryft ONE was designed to overcome the speed and performance bottlenecks of traditional x86 clusters. Each 1U box has eight drive bays, each of which can house six off-the-shelf solid state drives, for a total of 48 hot-swappable drives providing up to 48 TB of SSD storage. The SSDs are RAIDed together in hardware. Data is ingested through dual 10GE cards.

Behind the front-end storage piece is Ryft’s secret sauce, a set of fundamental logic built into the Ryft Analytics Cortex (RAC) aimed at solving specific problems. McGarry explains that instead of using sequential processors such as x86, ARM and DSPs, they implemented systolic arrays in an FPGA fabric, a massively parallel bitwise computing architecture. The move to this non-von Neumann model is where Ryft differentiates itself.

While FPGAs lack the ease of programmability and familiarity of x86 Linux clusters, the use of hard-coded primitives gets around these limitations by abstracting away the internal complexities, using a standard C-language API to invoke a function. Ryft is launching with three primitives – search, fuzzy search, and term frequency operations (the equivalent of word count) – with plans to expand its library of prebuilt algorithm components in accordance with customer and market demand.
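The article does not spell out Ryft’s actual API, so the following is a purely hypothetical sketch of what “invoking a primitive through a standard C-language API” looks like in practice; every function and parameter name here is invented for illustration, not taken from Ryft.

```c
/* Hypothetical sketch of calling a prebuilt FPGA primitive from C.
 * None of these names come from Ryft's real API; they only illustrate
 * the "call a hard-coded primitive through a C function" model. */
#include <stdio.h>

/* Assumed vendor-style entry point (hypothetical). */
int fpga_fuzzy_search(const char *input_file,
                      const char *pattern,
                      int         max_edit_distance,
                      const char *results_file);

int main(void)
{
    /* Find near-matches of a name across a large log file,
     * tolerating up to two character edits. */
    int hits = fpga_fuzzy_search("transactions.log", "Jon Smith",
                                 2, "matches.out");
    if (hits < 0) {
        fprintf(stderr, "primitive failed\n");
        return 1;
    }
    printf("%d fuzzy matches written to matches.out\n", hits);
    return 0;
}
```

The point of such a model is that the host program never touches the FPGA fabric directly; the fixed logic behind the call is what delivers the speedup.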

The initial custom software primitives are designed to tackle specific analytics workloads in areas where there is a requirement to find and match strings of text at lightning speed. Machine learning, fraud detection and gene sequencing are areas where the company expects the platform to have the most impact initially.

When it comes to benchmarking results, McGarry reveals that a 1U Ryft box significantly outperforms quad-core x86 servers.

Furthermore, by replacing hundreds of vanilla servers, the appliance reduces maintenance and operational costs by up to 70 percent.

If customers need more than one box, a sharding approach can be used to scale the system out, but the company says most of its customers are not maxing out storage. They are seeing data sizes in the low terabytes, with very few above 20 terabytes outside of certain government needs.

Hosted and on-premise versions of Ryft One will be available in early Q2. The first-year cost of $200,000 includes an $80,000 integration fee. Thereafter, the price drops to $120,000 a year.

Programmability Matters
https://www.hpcwire.com/2014/06/30/programmability-matters/
June 30, 2014

While discussions of HPC architectures have long centered on performance gains, that is not the only measure of success, according to Petteri Laakso of Vector Fabrics. Spurred by ever-proliferating core counts, programmability is taking on new prominence. Vector Fabrics is a Netherlands-based company that specializes in multicore software parallelization tools, so programmability is high on their list of priorities.

In a recent blog post, the first article in a three-part series, Laakso contends that in the current paradigm, writing the software is the programmer’s problem, not the silicon maker’s. He thinks this is an approach that is losing ground.

“The question is,” writes Laakso, “does peak performance and performance/power ratios alone determine the success of an architecture, or does programmability impact the initial adoption or success of a silicon architecture?”

He turns to the field of HPC as a test case of sorts and sets out the following hypothesis:

“If programmability does not impact the success of a silicon architecture, we should see the best-performing architectures win.”

To examine the issue in more detail, the Vector Fabrics team looked at the following software programmable accelerator technologies: CUDA GPGPUs, OpenCL GPGPUs, FPGAs, and Xeon Phis. For sample data, they turned to the TOP500 list of world’s fastest supercomputers. They looked at which systems used these accelerators, and then mapped the adoption rates from the debut of each technology.

“The difference is remarkable, since the performance figures are not very different between ATI/AMD and NVIDIA GP-GPUs,” writes Laakso.

“One clear difference between ATI/AMD and NVIDIA can be found in their investment in tooling and the programming paradigm. NVIDIA spent a considerable amount in developing the CUDA programming paradigm and accompanying tools. AMD’s investment in OpenCL and development tooling has been much more limited and leaning more towards the community to provide the improvements.”

“Neither of GP-GPUs’ programming paradigms can be called simple due to architectural limitations of GP-GPUs. But NVIDIA’s CUDA programming environment is much more developed than OpenCL’s. Looking at the relative adoption rates of the products, it’s hard to ignore the sentiment that the lack of good tooling has really hurt the chances of AMD GP-GPUs and OpenCL, regardless of its benefits over CUDA of openness and portability.”

So where does Intel’s accelerator play, the Xeon Phi, fit in?

Laakso: “When comparing to recent GP-GPUs, Xeon Phi offers comparable if slightly inferior performance and power characteristics. The main selling point of Xeon Phi is that you can use the same programming paradigm and tooling as you are using for normal node programming. While the reality does not carry quite as far as the marketing claims go, you can execute your existing applications on the Xeon Phi using MPI or OpenMP. You don’t have to port your code to an accelerator-specific programming paradigm.”

As for FPGAs, there are no FPGA systems on the TOP500, a data point that Laakso maintains further strengthens his conclusion that “[as] coding gets easier, adoption in the TOP500 is faster.”

The blog covers a lot of ground and makes a lot of claims. It also provides a nice counterpoint to our analysis of accelerator trends on the TOP500 list. What do you think?

A Shot of Java to Send Accelerators Mainstream
https://www.hpcwire.com/2013/09/05/a_shot_of_java_to_mainstream_accelerators/
September 5, 2013

When it comes to mainstream adoption of the use of GPUs and other accelerators, one of the primary barriers lies in programmability. While the vendor communities around accelerators have pushed to flatten the learning curve, the fact remains that it takes special effort on the part of ordinary developers to undertake the educational process.

The HPC space has proven, at the most massive scale, that GPUs and accelerators can lead to significant performance improvements, and these are certainly not unattractive to businesses outside the traditional high performance computing purview. So the question becomes, what might “sweeten the deal” for mainstream developers when it comes to diving into programming for acceleration?

According to one researcher, Max Grossman of Rice University, there is a steep learning curve and it takes some time to get up to speed, but there are some notable projects that are extending the technology’s reach. The interview below details some of these challenges, and what’s being done, especially on the OpenCL/Java front, by this young researcher and his team, not to mention others who want to bring advanced tools to a higher level.

Grossman says that even though it’s possible to use OpenCL to enable portable execution of SIMD kernels across a number of platforms (CPUs, manycore GPUs, FPGAs, etc.), using OpenCL from Java is a perilous path, and one that is anything but a simplification. For instance, it will still be necessary to dig in deep to manage data transfers, write kernels in the OpenCL kernel language, and so on.

To tackle these issues, they collaborated on some unique compile-time and run-time techniques to speed Java-based programs via automatic generation of OpenCL as the base. As the team describes it, the approach, which they call HJ-OpenCL, includes: automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct (forall) available in the Habanero-Java (HJ) language; leveraging HJ’s array view language construct to efficiently support rectangular, multi-dimensional arrays on OpenCL devices; and implementing HJ’s phaser (next) construct for all-to-all barrier synchronization in automatically generated OpenCL kernels.

As the team summarizes:

“We use a set of ten Java benchmarks to evaluate our approach, and observe performance improvements due to both native OpenCL execution and parallelism. On an AMD APU, our results show speedups of up to 36.7× relative to sequential Java when executing on the host 4-core CPU, and of up to 55.0x on the integrated GPU. For a system with an Intel Xeon CPU and a discrete NVIDIA Fermi GPU, the speedups relative to sequential Java are 35.7× for the 12-core CPU and 324.0× for the GPU. Further, we find that different applications perform optimally in JVM execution, in OpenCL CPU execution, and in OpenCL GPU execution. The language features, compiler extensions, and runtime extensions included in this work enable portability, rapid prototyping, and transparent execution of JVM applications across all OpenCL platforms.”

In addition to these and other approaches, Grossman says there are some simpler things that the vendor community can do to boost experimentation with accelerators, including pushing more hardware out to wider sets of developers.

The Week in HPC Research
https://www.hpcwire.com/2013/03/07/the_week_in_hpc_research-8/
March 7, 2013

The top research stories of the week have been hand-selected from prominent journals and leading conference proceedings. Here’s another diverse set of items, including novel methods of data race detection; a comparison of predictive laws; a review of FPGAs’ promise; GPU virtualization using PCI Direct pass-through; and an analysis of the Amazon Web Services High I/O platform.

Scalable Data Race Detection

A team of researchers from Berkeley Lab and the University of California Berkeley are investigating cutting-edge programming languages for HPC. These are languages that promote hybrid parallelism and shared memory abstractions using a global address space. It’s a programming style that is especially prone to data races that are difficult to detect, and prior work in the field has demonstrated 10X-100X slowdowns for non-scientific programs.
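For readers who have not met the problem, the canonical data race is an unsynchronized read-modify-write of shared state. The plain-C (not UPC) example below shows the pattern such detectors must catch, here with ordinary threads rather than PGAS references.

```c
/* Classic data race: two threads increment a shared counter with no
 * synchronization. The final value is timing-dependent and usually
 * less than 2,000,000. Race detectors flag exactly this pattern. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                /* shared, unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                      /* racy read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}
```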

In a recent paper, the computer scientists present what they say is “the first complete implementation of data race detection at scale for UPC programs.” UPC stands for Unified Parallel C, an extension of the C programming language developed by the HPC community for large-scale parallel machines. The implementation used by the Berkeley-based team tracks local and global memory references in the program. It employs two methods for reducing overhead 1) hierarchical function and instruction level sampling; and 2) exploiting the runtime persistence of aliasing and locality specific to Partitioned Global Address Space applications.

Experiments show that the best results are attained when both techniques are used in tandem. “When applying the optimizations in conjunction our tool finds all previously known data races in our benchmark programs with at most 50% overhead,” the researchers state. “Furthermore, while previous results illustrate the benefits of function level sampling, our experiences show that this technique does not work for scientific programs: instruction sampling or a hybrid approach is required.”

A fascinating new study applies the scientific method to some of our most popular predictive models. A research team from MIT and the Santa Fe Institute compared several different approaches for predicting technological improvement – including Moore’s Law and Wright’s Law – to known cases of technological progress using past performance data from different industries.

Moore’s Law, theorized by Intel co-founder Gordon Moore in 1965, predicts that a chip’s transistor count will double every 18 months. In more general terms, it suggests that technologies advance exponentially with time. Wright’s Law was first formulated by Theodore Wright in 1936. Also called the Rule of Experience, it holds that progress increases with experience. Other alternative models were proposed by Goddard, Sinclair et al., and Nordhaus.

The study, which employed hindcasting, used a statistical model to rank the performance of the postulated laws. The comparison data came from a database on the cost and production of 62 different technologies. The expansive knowledge-base enabled researchers to test six different prediction principles against real-world data.

The results revealed that the law with the greatest accuracy was Wright’s Law, but Moore’s Law was a very close second. In fact, the laws themselves are more similar than previously realized.
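In their commonly used functional forms (stated generically here rather than quoted from the paper), that similarity is easy to see:

```latex
% Moore's law: unit cost y falls exponentially with calendar time t
y(t) = y_0 \, e^{-m t}

% Wright's law: unit cost falls as a power of cumulative production x
y(x) = y_0 \left( \frac{x}{x_0} \right)^{-w}

% If production itself grows exponentially, x(t) = x_0 e^{g t},
% Wright's law reduces to
y(t) = y_0 \, e^{-w g t}
% which is the same exponential form as Moore's law with m = w g,
% making the two laws indistinguishable in the data.
```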

“We discover a previously unobserved regularity that production tends to increase exponentially,” write the authors. “A combination of an exponential decrease in cost and an exponential increase in production would make Moore’s law and Wright’s law indistinguishable…. We show for the first time that these regularities are observed in data to such a degree that the performance of these two laws is nearly the same.”

“Our results show that technological progress is forecastable, with the square root of the logarithmic error growing linearly with the forecasting horizon at a typical rate of 2.5% per year,” they conclude.

The team includes Bela Nagy of the Santa Fe Institute, J. Doyne Farmer of the University of Oxford and the Santa Fe Institute, Quan Bui of St. John’s College in Santa Fe, NM, and Jessika E. Trancik of the Santa Fe Institute and MIT. Their findings are published in the online open-access journal PLOS ONE.

FPGAs (field programmable gate arrays) have been around for many years and show real potential for advancing HPC, but their popularity has been restricted because they are difficult to work with. This is the assertion of a group of researchers from the T.J. Watson Research Center. They argue that FPGAs won’t become mainstream until their various programmability challenges are addressed.

In a paper published last month in ACM Queue, the research team observes that there exists a spectrum of architectures, with general-purpose processors at one end and ASICs (application-specific integrated circuits) on the other. Architectures like PLDs (programmable logic devices), they argue, have that best-of-both-worlds potential in that they are closer to the hardware and can be reprogrammed. The most prominent PLD is in fact an FPGA.

The authors write:

FPGAs were long considered low-volume, low-density ASIC replacements. Following Moore’s law, however, FPGAs are getting denser and faster. Modern-day FPGAs can have up to 2 million logic cells, 68 Mbits of BRAM, more than 3,000 DSP slices, and up to 96 transceivers for implementing multigigabit communication channels. The latest FPGA families from Xilinx and Altera are more like an SoC (system-on-chip), mixing dual-core ARM processors with programmable logic on the same fabric. Coupled with higher device density and performance, FPGAs are quickly replacing ASICs and ASSPs (application-specific standard products) for implementing fixed function logic. Analysts expect the programmable IC (integrated circuit) market to reach the $10 billion mark by 2016.

The researchers note that “despite the advantages offered by FPGAs and their rapid growth, use of FPGA technology is restricted to a narrow segment of hardware programmers. The larger community of software programmers has stayed away from this technology, largely because of the challenges experienced by beginners trying to learn and use FPGAs.”

The rest of this excellent paper addresses the various challenges in detail and brings attention to the lack of support for device drivers, programming languages, and tools. The authors drive home the point that the community will only be able to leverage the benefits of FPGAs if the programming aspects are improved.

The technical computing space has seen several trends develop over the past decade, among them server virtualization, cloud computing and GPU computing. It’s clear that GPGPU computing has a role to play in HPC systems. Can these trends be combined? A research team from Chonbuk National University in South Korea has written a paper in the periodical Applied Mechanics and Materials proposing exactly this. They investigate a method of GPU virtualization that exploits the GPU in a virtualized cloud computing environment.

The researchers claim their approach is different from previous work, which mostly reimplemented GPU programming APIs and virtual device drivers. Past research focused on sharing the GPU among virtual machines, which increased virtualization overhead. The paper describes an alternate method: the use of PCI direct pass-through.

“In our approach, bypassing virtual machine monitor layer with negligible overhead, the mechanism can achieve similar computation performance to bare-metal system and is transparent to the GPU programming APIs,” the authors write.

The HPC community is still exploring the potential of the cloud paradigm to discern the most suitable use cases. The pay-per-use basis of compute and storage resources is an attractive draw for researchers, but so is the illusion of limitless resources to tackle large-scale scientific workloads.

In the most recent edition of the Journal of Grid Computing, computer scientists from the Department of Electronics and Systems at the University of A Coruña in Spain evaluate the I/O storage subsystem on the Amazon EC2 platform, specifically the High I/O instance type, to determine its suitability for I/O-intensive applications. The High I/O instance type, released in July 2012, is backed by SSD and also provides high levels of CPU, memory and network performance.

The study looked at the low-level cloud storage devices available in Amazon EC2, ephemeral disks and Elastic Block Store (EBS) volumes, on both local and distributed file systems. It also assessed several I/O interfaces commonly employed by scientific workloads, notably POSIX, MPI-IO and HDF5. The scalability of a representative parallel I/O code was also analyzed in terms of performance and cost.

As the results show, cloud storage devices have different performance characteristics and usage constraints. “Our comprehensive evaluation can help scientists to increase significantly (up to several times) the performance of I/O-intensive applications in Amazon EC2 cloud,” the researchers state. “An example of optimal configuration that can maximize I/O performance in this cloud is the use of a RAID 0 of 2 ephemeral disks, TCP with 9,000 bytes MTU, NFS async and MPI-IO on the High I/O instance type, which provides ephemeral disks backed by Solid State Drive (SSD) technology.”

Latest FPGAs Show Big Gains in Floating Point Performance (April 16, 2012, https://www.hpcwire.com/2012/04/16/latest_fpgas_show_big_gains_in_floating_point_performance/)

Thanks to shrinking semiconductor process geometries, the newest FPGAs have more usable transistors than ever before and are now capable of considerable floating point (FP) performance. That makes them candidates for more generalized use in high performance computing. This article describes the FP capabilities of Xilinx’s new Virtex-7 FPGA and how it stacks up against a generic 16-core CPU.

This is the fourth in a series of HPCwire articles comparing the theoretical floating point performance of Field Programmable Gate Arrays (FPGA) to microprocessors. As shown in the last article, the performance gap continues to expand between these two classes of devices. Comparing theoretical peaks for 64-bit floating point arithmetic, the current generation of Xilinx’s Virtex-7 FPGAs is about 4.2 times faster than a 16-core microprocessor. This is up from a factor of 2.9X as reported in 2010.

This article also includes some new empirical validation of the theoretical calculation by implementing a simple single-stream double-square function on two FPGAs using AutoESL, a C/C++ synthesis tool. That tool was able to implement a design within 2 percent of the theoretical predicted performance. The calculations were also supplemented by the hardware description language (HDL) implementation of a matrix multiplication (DGEMM) on one of the Virtex-7 FPGA devices.

Background

High performance computing applications have hit the practical limits of clock speeds for microprocessors. To increase the performance of a computing device, parallelism must be exploited so that more operations can be performed per clock cycle. For instance, multiple computing cores are being placed within the same microprocessor device. This keeps the programming model simple, since the same set of instructions can be spread across the multiple cores. The drawback is that a lot of circuitry is replicated without necessarily adding performance. The graphics processing unit (GPU) addresses this issue by providing more functional units sharing the same control logic.

FPGAs push this idea of parallelism to the limit via dynamic reconfiguration of the entire device, allowing the user to place only the functions and controls that are needed for the calculation. The downside of this approach is that the complexity of the design must be handled by the programmer or hidden in FPGA design tools or pre-packaged libraries.

This design freedom on FPGAs also makes it difficult to gauge what the devices are capable of for 64-bit floating point performance. For this reason, the first HPCwire article in this series described a method to estimate the peak floating point performance of FPGAs. The concept was simple: figure out all the ways floating point function units can be placed on a device and multiply that count by the clock frequency from the data sheets. This method was further refined in a whitepaper published by Altera.

Since the FPGA is a blank sheet of transistors, some portion needs to be reserved for interfaces such as memory controllers. In addition, FPGA design tools cannot make 100 percent use of the FPGA device, so some of the device area must be set aside to account for this constraint. Lastly, not all the data paths between the floating point operators will meet timing when a device is packed close to its resource limits, so the data sheet clock frequency needs to be derated. Since different engineers might want to derate FPGA devices in different ways, the articles have been presenting both a “peak” floating point performance number that simply packs the FPGA with function units and a derated “predicted” performance.

Soft floating point operators allow programmers to implement adders and multipliers in multiple ways and in any ratio needed. In contrast, the microprocessor has a fixed number of floating point function units, so the ratio of adders to multipliers is fixed. If a calculation only needs to perform additions, half of its functional units (i.e., the multipliers) will sit idle. This leads to an ambiguity regarding a device’s “peak” performance: is it for an even ratio of adders to multipliers, or for any ratio?

For this reason, the FPGA performance has been evaluated for both scenarios: an even ratio for direct comparison with microprocessors, and any ratio for a look at the optimal performance combination. The floating point operators supplied for these devices come in 64-bit, 32-bit, and 24-bit versions. While it is very rare for a researcher in HPC to use 24-bit logic, these results show another dimension of the flexibility of FPGAs: if a calculation can make use of 24 bits, there is additional performance to be gained.

Calculating Peak Performance

The peak performance calculation of a Virtex-7 FPGA starts with collecting its available resources as reported from the data sheet, ds180. For example, the V7-2000T contains 1.2 million Look-up Tables (LUT), 2.4 million Flip-Flops (FF) and 2160 Digital Signal Processing (DSP) slices.

Next, the resource requirements for building functional units such as logic adders, full adders, logic multipliers, medium multipliers, full multipliers, and max multipliers are collected from the LogiCORE IP Floating point Operator v6.0 data sheet, ds816. Some operators use more DSPs to run faster and use less logic.

With this data, it is just a matter of picking a configuration, adding up the LUTs, FFs, and DSPs needed, and seeing whether they fit on the device of interest. A program was written to systematically try every possible combination of the six types of floating point operators, multiply each combination by the appropriate clock frequency to calculate gigaflops, and record the best result for each device. For the 64-bit floating point operators, the program was able to do a fully exhaustive search of every combination of operators. Because the 32-bit and especially the 24-bit operators are quite a bit smaller, many more will fit on a given device, and hence the search space becomes very large. For these precisions, a “step” function was used to regularly skip some configurations and perform a semi-exhaustive search. This makes the performance predictions for the 32-bit and 24-bit cases more conservative.
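To make the method concrete, here is a minimal sketch of that kind of search in C++. The resource costs, device budget, and clock below are illustrative placeholders rather than the actual ds180/ds816 figures, and only two of the six operator types are modeled; the real program enumerated all six.

    #include <algorithm>
    #include <cstdio>

    // Illustrative per-operator resource costs (placeholders, not ds816 values).
    struct Op { int luts, ffs, dsps; };
    const Op ADDER = {700, 700, 0};     // logic adder: no DSPs, more LUTs/FFs
    const Op MULT  = {300, 450, 10};    // "max" multiplier: DSP-heavy, less logic

    int main() {
        // Illustrative V7-2000T-class budget and clock (placeholders).
        const int LUTS = 1200000, FFS = 2400000, DSPS = 2160;
        const double CLK_GHZ = 0.4;

        double best = 0; int bestA = 0, bestM = 0;
        // Sweep the adder count and fill the remaining budget with multipliers.
        for (int a = 0; a * ADDER.luts <= LUTS; ++a) {
            int byLut = (LUTS - a * ADDER.luts) / MULT.luts;
            int byFf  = (FFS  - a * ADDER.ffs)  / MULT.ffs;
            int byDsp = DSPS / MULT.dsps;
            int m = std::min({byLut, byFf, byDsp});
            if (m < 0) continue;                      // configuration does not fit
            double gflops = (a + m) * CLK_GHZ;        // one FLOP per operator per cycle
            if (gflops > best) { best = gflops; bestA = a; bestM = m; }
        }
        std::printf("best: %.1f GFLOPS (%d adders, %d multipliers)\n",
                    best, bestA, bestM);
        return 0;
    }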

Using this method, the best possible 64-bit floating point peak performance was calculated to be 670.99 gigaflops on the V7-2000T using 1469 logic adders and 196 max multipliers running at a 403 MHz clock. Further constraining the configuration to only look at adder/multipliers configurations with a one-to-one ratio drops the performance of the V7-2000T to 345.35 gigaflops. That configuration used 543 logic adders, 2 full multipliers, 237 medium multipliers and 304 logic multipliers running at a 318 MHz clock.

The floating point performance for the reference microprocessor is calculated by multiplying the number of floating point function units on each core by the number of cores and by the clock frequency. For instance, the calculation for a 16-core device would be four 64-bit floating point operations per clock times 16 cores times 2.5 GHz, which comes to a theoretical peak of 160 gigaflops. Although clock frequency typically drops as the number of cores per microprocessor goes up, this article series has been using a normalized 2.5 GHz clock frequency for all microprocessor flavors to keep the comparisons straightforward.
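Written out, that reference peak is simply:

    \[ 4\ \tfrac{\text{FLOPs}}{\text{cycle} \cdot \text{core}} \times 16\ \text{cores} \times 2.5\ \text{GHz} = 160\ \text{GFLOPS} \]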

Calculating Predicted Performance

To calculate a more realistic “predicted” performance, some logic needs to be set aside for an interface and for routing the design. Xilinx recommended removing 20,000 LUTs and FFs for an interface and further reducing the remaining logic by another 15 percent for routing. This is one of the reasons why the gap between FPGAs and microprocessors has been growing: as FPGAs get bigger, a smaller percentage of their resources is needed for the interface logic. The clock frequency is also reduced by 15 percent to simulate the longer data paths in the design not meeting timing.
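In code form, that derating amounts to a little bookkeeping; the helper below is a sketch of the method as described above, fed with data-sheet-style numbers used purely for illustration, not an official Xilinx flow.

    #include <cstdio>

    // Derate raw data-sheet resources and clock as described above: subtract the
    // interface logic, then scale the remaining logic and the clock by 85 percent.
    struct Budget { double luts, ffs, clk_mhz; };

    Budget derate(Budget raw) {
        Budget d;
        d.luts    = (raw.luts - 20000.0) * 0.85;   // remove interface, allow for routing
        d.ffs     = (raw.ffs  - 20000.0) * 0.85;
        d.clk_mhz = raw.clk_mhz * 0.85;            // longer paths assumed to miss timing
        return d;
    }

    int main() {
        Budget v7 = {1200000, 2400000, 403};       // illustrative V7-2000T-style inputs
        Budget p  = derate(v7);
        std::printf("usable LUTs %.0f, FFs %.0f, clock %.0f MHz\n",
                    p.luts, p.ffs, p.clk_mhz);
        return 0;
    }

Re-running the operator search of the previous section over a budget derated this way is what yields the “predicted” figures quoted below.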

Applying those modifications, the predicted 64-bit performance of the highest peak V7-2000T configuration drops about 28 percent, from 670.99 to 484.02 gigaflops. It is interesting that the best predicted 64-bit configuration is still very similar to the peak performance configuration, using the same 196 max multipliers but dropping the number of logic adders from 1469 to 1217.

The best one-to-one adder/multiplier ratio predicted 64-bit performance also drops about 25 percent, from 345.35 to 258.95 gigaflops. Again, the configuration looks very similar, with the number of logic adders reduced due to the reduction in logic slices. This configuration is 479 logic adders, 3 full multipliers, 236 medium multipliers, and 240 logic multipliers running at a 270 MHz clock. For the microprocessor, the predicted performance is calculated by derating its theoretical peak to 85 percent.

While not practical for most HPC applications, the flexibility of having 24-bit floating point operators could yield over 1.6 teraflops on the V7-2000T.

One other aspect of the floating point performance that has yet to be explored fully is performing fixed point arithmetic within the FPGA and floating the results at the end of the calculations. At the lowest level, any floating point calculation involves a series of binary operations. Using floating point operators, the results are rounded after every calculation. This rounding takes up logic that could be used for more operators.

What if, instead of rounding after each operation, the result was allowed to grow in bit-width and only floated at the very end? This would yield a more exact answer, since there is no intermediate rounding, and would use less logic.
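The article does not give an implementation, but the idea has a simple software analogy: accumulate exact fixed-point products in a wide register and convert to floating point once at the end. The sketch below assumes the inputs are 32-bit fixed-point values with frac fractional bits; on an FPGA the accumulator width would grow in the fabric rather than being a fixed 128-bit register.

    #include <cstdint>
    #include <cmath>

    // Dot product with no intermediate rounding: products are accumulated exactly
    // in a wide integer and only converted ("floated") once at the end.
    // Note: __int128 is a GCC/Clang extension standing in for a grow-as-needed
    // accumulator.
    double fixed_point_dot(const int32_t *a, const int32_t *b, int n, int frac) {
        __int128 acc = 0;
        for (int i = 0; i < n; ++i)
            acc += (int64_t)a[i] * (int64_t)b[i];          // exact 64-bit products
        return (double)acc / std::ldexp(1.0, 2 * frac);    // scale back once
    }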

Validating Predicted Performance Using a Comparable Design Implemented in AutoESL

To check the validity of the calculated predicted performance, AutoESL was used to implement a simple design on two FPGA devices. AutoESL allows a programmer to write a high-level description of the design in a standard programming language, which is then automatically synthesized into HDL. The HDL can then be implemented as a design on an FPGA.

Using this tool, a double-square function was implemented on the X690T and X980T devices. The double-square function is a single-stream function that ties an arbitrary number of adders and multipliers together in one pipeline. An initial value is split and passed as the two inputs to an adder; the output from the adder is then split and fed as the two inputs to a multiplier. The pipeline can be arbitrarily long and made up of an arbitrary number of adders and multipliers. With AutoESL, many combinations of the number and type of operators were tried to maximize performance for the target device.
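The kernel itself is tiny; in C-like form, the pipeline looks roughly like the sketch below (a structural illustration only, not the actual AutoESL source or pragmas used in the study).

    // One "double-square" stage: the input feeds both operands of an adder and the
    // sum then feeds both operands of a multiplier. Chaining stages produces a long
    // single-stream pipeline; an HLS tool would unroll and pipeline this loop.
    double double_square(double x, int stages) {
        for (int i = 0; i < stages; ++i) {
            double s = x + x;   // adder: both inputs are the same value
            x = s * s;          // multiplier: both inputs are the sum
        }
        return x;
    }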

This experiment created double-square implementations for the X690T and X980T devices that were within about 2 percent of the predicted 64-bit floating point performance, validating the calculated predictions. For the X690T, AutoESL got timing closure on a 64-bit design using 390 full adders and 180 full multipliers running at 387 MHz for 220.59 gigaflops; the best predicted 64-bit performance was 224.03 gigaflops using 327 logic adders and 327 max multipliers running at a 342 MHz clock. For the X980T device, AutoESL achieved 282.5 gigaflops, whereas the program calculated a 64-bit predicted performance of 289.45 gigaflops.

For these two data points, the AutoESL designs show that the predicted performance can be achieved on simple algorithms and functions. The double-square function, though a simple algorithm, was sufficient to illustrate the validity of the upper limit of the predicted performance on a device.

Comparing Predicted Performance against the Results of a Typical HPC Algorithm

To demonstrate the performance limits of an FPGA on a complex design, a DGEMM algorithm was implemented on the X690T. DGEMM (“Double precision General Matrix Multiply”) is a standard routine from BLAS (“Basic Linear Algebra Subprograms”) and is commonly used for benchmarking HPC machines. The matrix multiply, a workhorse function for many scientific applications, reaps tremendous performance gains when accelerated in hardware within an HPC environment, which makes the algorithm apropos for this demonstration.

The FPGA fabric’s inherent parallelism allows the matrix multiply algorithm to be implemented using a systolic array of MACs (“Multiply-ACcumulate” units), designed so that each MAC can calculate a continuous stream of dot products simultaneously. After analyzing the device specifications and going through a series of dry runs with smaller arrays, it was determined that a 12×12 array clocked at 500 MHz could be attained with reasonable effort.

Several techniques had to be employed to maintain systolic operation (and hence, maximum performance) of the array throughout the algorithm’s execution, such as maximizing DDR3 efficiency, employing an innovative scheme for handling heavily-pipelined accumulators, using embedded RAM blocks as cache, and adopting a data re-use strategy while uploading matrix data from memory.
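Conceptually, each processing element in the array is just a multiply-accumulate unit streaming one dot product at a time. The C++ below is a purely illustrative model of that behavior, with the systolic data movement, DDR3 staging, and BRAM caching omitted.

    // Model of one MAC processing element: per "clock" it consumes one element of
    // a row of A and one element of a column of B and accumulates their product.
    struct Mac {
        double acc = 0.0;
        void step(double a, double b) { acc += a * b; }
    };

    // C(i,j) is produced by the PE at array position (i,j); in hardware all n*n
    // PEs (e.g., 12x12) run in parallel rather than in these nested loops.
    void block_multiply(const double *A, const double *B, double *C, int n) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                Mac pe;
                for (int k = 0; k < n; ++k)
                    pe.step(A[i * n + k], B[k * n + j]);
                C[i * n + j] = pe.acc;
            }
    }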

After carrying out an efficient floor planning strategy (a must for an architecture of this complexity to meet 500 MHz), timing closure was achieved. The overall performance (number of MACs × 2 × frequency) measured out to 144 gigaflops, which works out to about 64 percent of the predicted limit of 224.03 gigaflops on the X690T.

There are opportunities for pushing this performance even higher. For instance, it is feasible that another row and column could be added while still achieving 500 MHz, resulting in a performance of 169 gigaflops, or 75 percent of the theoretical limit. Approaching it from a different angle, it is possible to condense the array even further to create a 15×15 array, albeit at the sacrifice of clock frequency. In such a scenario, a 15×15 array clocked at 400 MHz would reach 180 gigaflops, or 80 percent of the predicted performance limit.
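Spelling out the MACs × 2 × frequency arithmetic for the three array sizes discussed:

    \[ 12 \times 12 \times 2 \times 0.5\ \text{GHz} = 144\ \text{GFLOPS} \]
    \[ 13 \times 13 \times 2 \times 0.5\ \text{GHz} = 169\ \text{GFLOPS} \]
    \[ 15 \times 15 \times 2 \times 0.4\ \text{GHz} = 180\ \text{GFLOPS} \]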

Expanding the Niche of FPGAs in HPC

The HPC landscape is moving toward heterogeneous computing, using multiple threads internally on each computing device and tightly coupling thousands of devices together into large systems. Both manycore microprocessors and GPUs fit well into this architecture.

FPGAs, too, can play well in this environment. They have the computing performance needed to complement microprocessors, they have more flexibility to maximize the use of their transistors, and they have the advantage of running at a lower clock frequency, which lowers their power requirements. Today FPGAs are used in some bioinformatics and financial applications. As researchers and companies improve the programmability of FPGAs with tools like AutoESL and pre-programmed libraries, the HPC community will find more uses for these accelerators.