GPGPU – HPCwire
Since 1987 - Covering the Fastest Computers in the World and the People Who Run Them

GPUs Advance Deep Learning (September 18, 2014)

Over the last decade, GPU-acceleration techniques have infiltrated the high-end of supercomputing, but increased adoption of GPUs is occurring in other compute-driven disciplines too, like deep learning, one of the fastest growing segments of the machine learning field.

The trend toward accelerated computing reached a new high at last week’s ImageNet Large Scale Visual Recognition Challenge. The event, which involves the evaluation of algorithms for object detection and image classification at large scale, was the subject of a recent NVIDIA blog entry.

“When the number of users of your product flips from zero to nearly 100%, you don’t need a Ph.D. to realize a trend has formed,” states NVIDIA’s Stephen Jones.

“At last week’s event, over 95% of the teams tapped GPUs for their ground-breaking submissions. This compares with just 10% two years ago (and 0% three years ago), underscoring how accelerated computing has become fundamental for this fast-growing field,” he continued.

GPU-accelerated approaches also helped drive error rates down from nearly 30 percent in 2010 to less than 10 percent in this latest contest.

Winning teams, revealed at the European Conference on Computer Vision (ECCV), included the National University of Singapore, the University of Oxford, Google, the Center for Research on Intelligent Perception and Computing, and Adobe/University of Illinois at Urbana–Champaign.

The ECCV event also served as a launch pad for a new CUDA-based programming library, called cuDNN, that helps developers harness GPU acceleration. UC Berkeley researchers have integrated cuDNN into the popular deep learning framework Caffe.

In the video below, Evan Shelhamer, a PhD student researcher at UC Berkeley, explains how NVIDIA’s new deep learning software improves the performance of Caffe.

“In Caffe, we can actually recognize the contents of a single image in only 2.5 milliseconds, which allows us to process over 40 million images a day on a single device…at massive, even Internet, scales,” says Shelhamer.

“With the new cuDNN library developed by NVIDIA, it’s further accelerated the key routines of deep models, so that now we can actually infer the contents of an image in just over a millisecond.

“These models can learn to do all sorts of visual tasks, even recognize the style of a photo or painting, so that it can see an image and know that it’s a vintage photo or a romantic scene or that it’s a painting done in an impressionist style.”

NVIDIA Boasts ‘Compelling HPC Solution’ (August 20, 2014)

Today marks the official release of the NVIDIA CUDA Toolkit version 6.5, which had previously been only available in its pre-release form. In a company blog post, NVIDIA’s Chief Technologist for GPU Computing Software Mark Harris covers the toolkit’s new features and improvements, including “support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more.”

The release is part of a greater ecosystem that includes CUDA on ARM, released last year, and the Jetson TK1 developer board, released in March. At last week’s Hot Chips conference, NVIDIA revealed more information about the upcoming Tegra K1 “Project Denver” 64-bit ARM CPU architecture.

Harris continues: “The heritage of ARM64 is in low-power, scale-out data centers and microservers, while GPUs are built for ultra-fast compute performance. When we combine the two, we have a compelling solution for HPC.”

Harris paints the marriage of ARM64 and GPGPUs as a best of both worlds scenario with ARM64 providing power efficiency, system configurability, and a large, open ecosystem, and the GPUs facilitating high-throughput, power-efficient compute performance, and a robust HPC ecosystem that includes hundreds of CUDA-accelerated applications. As with other CPU-GPU hybrid systems, the ARM64 CPUs can offload the compute-intensive tasks to the GPUs. In this way, “CUDA and GPUs make ARM64 competitive in HPC from day one,” concludes Harris.

The first figure in Harris’s post depicts the performance of three CUDA-accelerated applications on ARM64+GPU systems as being on par with x86+GPU systems. The bigger competitive threat for NVIDIA will come from Intel’s Xeon CPU-MIC architecture.

Currently available CUDA+ARM64 development platforms include Cirrascale’s RM1905D HPC Development Platform and the E4 ARKA EK003. These are equipped with Applied Micro X-Gene 8-core 2.4GHz ARM64 CPUs, Tesla K20 GPUs, and CUDA 6.5. Eurotech plans to release a similarly outfitted system soon, which it says will enable a peak performance of 1 petaflops in one square meter.

The remainder of the blog is dedicated to the ways that CUDA 6.5 improves performance and productivity. Highlights include the ability to specify cuFFT device callbacks; improved support for CUDA Fortran tools; and a new CUDA occupancy calculator along with occupancy-based launch configuration APIs. The latest CUDA release also includes support for Microsoft Visual Studio 2013 for Windows.
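
To make the new occupancy APIs concrete, here is a minimal sketch of occupancy-based launch configuration using the cudaOccupancyMaxPotentialBlockSize call introduced with CUDA 6.5; the kernel and problem size are illustrative placeholders rather than code from NVIDIA’s post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: scales an array in place.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));

    // New in CUDA 6.5: ask the runtime for a block size that maximizes
    // occupancy for this kernel on the current device.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;
    scale<<<gridSize, blockSize>>>(d_x, n, 2.0f);
    cudaDeviceSynchronize();

    printf("suggested block size: %d (minimum grid size %d)\n", blockSize, minGridSize);
    cudaFree(d_x);
    return 0;
}
```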

Adapting Algorithms to Modern Hybrid Architectures (August 13, 2014)

Technology, like other facets of life, commonly experiences cycles of rapid change followed by periods of relative stability. Computing has entered a stage of increased architectural diversity, as evidenced by the rise of accelerators, coprocessors, and other alternatives, like ARM processors. An international team of researchers explores how these various supercomputing architectures perform on parallelized turbulent flow problems.

In their paper “Direct Numerical Simulation of Turbulent Flows with Parallel Algorithms for Various Computing Architectures,” the authors describe the process of creating efficient parallel algorithms for large-scale simulations of turbulent flows and comparing their performance on AMD, NVIDIA and Intel Xeon Phi parts. They also introduce a new series of direct numerical simulations of incompressible turbulent flows with heat transfer performed with the newly-developed algorithms.

To optimize performance, algorithms need to be customized for each system type.

“The first type, the basic one, requires highly scalable parallel algorithms that can run on thousands of cores,” the authors state. “It also needs efficient shared-memory parallelization with large number of threads to engage modern multi-core nodes: two 12-core Intel Xeon CPUs with Hyper Threading (HT) can execute 48 parallel threads on a dual-CPU node. In addition it needs efficient vectorization since AVX extension operates with vectors of 4 doubles. The second type requires adaptation of algorithms to the streaming processing which is a simplified form of parallel processing related with SIMD (single instruction multiple data) model. This can be a challenge itself. The third type requires much more deep multi-threaded parallelism and vectorization than the first type.”

There is also a fourth type, ARM-based architectures, which, like other hybrid types, demands considerable attention to memory-access optimization and to load balancing between the CPU and accelerators. However, the main focus of this paper is on GPGPUs from NVIDIA and AMD and on the Intel Xeon Phi coprocessor.

The team takes a multilevel approach that combines different parallel models. They explain: “MPI is used on the first level within the distributed memory model to couple computing nodes of a supercomputer. On the second level OpenMP is used to engage multi-core CPUs and/or Intel Xeon Phi accelerators. The third level exploits the computing potential of massively-parallel accelerators.”
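
A minimal sketch of that three-level layering might look like the code below; this is illustrative only and not taken from the paper, with MPI coupling ranks across nodes, OpenMP engaging the host cores, and a placeholder CUDA kernel standing in for the accelerator-side work.

```cuda
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <vector>

// Placeholder device kernel standing in for the accelerator-side update.
__global__ void update(float *u, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] += 1.0f;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                  // level 1: distributed memory (nodes)
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    std::vector<float> field(n, 0.0f);

    #pragma omp parallel for                 // level 2: multi-core CPU threads
    for (int i = 0; i < n; ++i) field[i] = static_cast<float>(rank);

    float *d_field = nullptr;                // level 3: accelerator offload
    cudaMalloc(&d_field, n * sizeof(float));
    cudaMemcpy(d_field, field.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    update<<<(n + 255) / 256, 256>>>(d_field, n);
    cudaMemcpy(field.data(), d_field, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_field);

    // Halo exchange between neighboring ranks (MPI_Sendrecv) would go here.
    MPI_Finalize();
    return 0;
}
```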

OpenMP and OpenCL-based extensions were developed to exploit the computing potential of modern hybrid machines. In adapting the computational algorithms to different accelerator architectures, the group came across some interesting findings regarding performance.

Figure 3: Comparison of performance on a mesh with 472114 cells (flow around a sphere) for different devices using a 1st order finite-volume scheme for unstructured meshes

Figure 4: Comparison of performance on a mesh with 679339 cells (flow around a sphere) for different devices using a 2nd order polynomial-based finite-volume scheme for unstructured meshes

Looking at Figures 3 and 4 (above), the team stated: “it can be noted that for the 1st order scheme (Figure 3) NVIDIA GTX TITAN outperforms AMD 7970 while for the 2nd order polynomial-based scheme which requires much more resources (registers and shared memory usage) AMD one significantly outperforms NVIDIA one. This indicates the insufficiency of register and local memory of NVIDIA architecture that prevents from achieving high occupancy of the device and reduces efficiency.”

Also in Figure 4, it can be seen that the Intel Xeon Phi architecture is less performant than the various GPUs. Although this could be due to the OpenCL implementation, an OpenMP implementation exhibited similar behavior, providing only a 10-20 percent speedup over an 8-core Intel Xeon E5-2690 CPU.

“So the common statement that Intel Xeon Phi is much easier to use than GPU because it can handle the same CPU code is an illusion,” they conclude. “The computing power of this kind of accelerator is much more difficult to get.”

Structured and unstructured mesh algorithms modified for significantly multithreaded OpenMP parallelization demonstrated high internal speedups: up to 200 times faster on Intel Xeon Phi compared to a sequential execution on the same accelerator. However, net performance was not much higher than an 8-core CPU. Surprised by this result, the team speculates it could be related to insufficient memory latency hiding mechanisms that are based on 4-thread hyper threading. A GPU, they note, can have tens of threads switching for latency hiding. They add that poor cache performance could also be a contributing factor.

The paper serves as another reminder that system architectures must be assessed in the context of specific workloads. For the OpenCL kernels of the algorithm on unstructured meshes, “the different GPUs considered substantially outperform Intel Xeon Phi accelerator,” the team concludes, adding, “the AMD GPU tends to be more efficient than NVIDIA on heavy computing kernels.”

Building Parallel Code with Hybrid Fortran (July 31, 2014)

Over at the Typhoon Computing blog, Michel Müller addresses a topic that is top of mind to many HPC programmers: porting code to accelerators.

Fortran programmers porting their code to GPGPUs (general purpose graphics processing units) have a new tool at their disposal, called Hybrid Fortran. Müller shows how this open source framework can enhance portability without sacrificing performance and maintainability.

From the blog (editor’s note: the site was down at the time of publication):

Say, you are on the onset of programming a HPC application. No problem, right? You know how the underlying machine works in terms of memory architecture and ALUs. (Or not? Well, that’s no problem either, the compilers have become so good I’m hearing, they will surely figure it out). You know what numeric approximation will be used to map your problem most efficiently. You know all about Roofline performance modelling, such that you can verify whether your algorithm performs on the hardware the way you’ve expected. You know what you’re supposed to do when you encounter data parallelism. So – let’s sit down and do it!

But wait!

You’re hearing about your organisation ordering a new cluster. In order to get closer to Exascale, this cluster will sport these fancy new accelerators. So all new HPC software projects should evaluate, if and how they can make use of coprocessors. You start reading yourself into the accelerator landscape. OpenCL, CUDA, OpenACC, OpenMP, ArrayFire, Tesla, Intel MIC, Parallela… Your head starts getting dizzy from all this stuff – all these hardware and software tools have lots of overlap, but also significant differences. Especially, they’re very different from x86 CPU architecture. Why is that?

It essentially comes down to the fact that in 2005, the free lunch was over.

By “free lunch,” Müller is of course referring to the end of effortless single-core performance gains. When processor clock rates topped out around 2005, chipmakers began cramming multiple cores onto a chip, and the multicore era was born. This puts the burden on the programmer to harness that parallelism. But as long as you have to do all that multithreaded implementation work, why not get the most out of it, asks Müller, or as he puts it: “Why care about six or eight threads if we can have thousands?”

From here Müller goes through a step by step process of the other potential roadblocks, such as applications that are limited by memory bandwidth, the slow PCI Express bus, and the temptation to let scientists use the old (non-accelerated) version of your code on existing CPU-only machines.

This is all leading up to the ultimate dilemma: what if increased portability comes at the expense of performance and maintainability?

For codes written in Fortran, there is hope in the form of Hybrid Fortran, an open source, directive-based framework. The code’s GitHub page explains it as “a way for you to keep writing your Fortran code like you’re used to – only now with GPGPU support.”

With this machine-driven solution, a Python-based preprocessor takes care of the necessary transformations at compile-time, so there is no runtime overhead. It parses annotations together with your Fortran code structure, declarations, accessors and procedure calls, and then writes separate versions of your code – one for CPU with OpenMP parallelization and one for GPU with CUDA Fortran.

The programmer only needs to add two things:

(1) Where is the code to be parallelized? (Can be specified for CPU and GPU separately.)
(2) What symbols need to be transformed in different dimensions?
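
Hybrid Fortran’s own directive syntax is not reproduced here, but the underlying idea of writing the computational core once and emitting both a multithreaded CPU variant and a GPU variant can be sketched with a rough CUDA C++ analogy; the code below is hypothetical and is not output generated by the framework.

```cuda
#include <omp.h>
#include <cuda_runtime.h>

// The numerical core is written once and marked callable from host or device.
__host__ __device__ inline float stencil(float left, float mid, float right)
{
    return 0.25f * left + 0.5f * mid + 0.25f * right;
}

// GPU variant: one thread per grid point.
__global__ void gpu_sweep(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) out[i] = stencil(in[i - 1], in[i], in[i + 1]);
}

// CPU variant: OpenMP threads over the same loop body.
void cpu_sweep(const float *in, float *out, int n)
{
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i) out[i] = stencil(in[i - 1], in[i], in[i + 1]);
}
```

Hybrid Fortran automates this kind of duplication for annotated Fortran sources, generating the OpenMP and CUDA Fortran versions at compile time.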

Müller charts the performance differences of this approach below:

[1] If available, comparing to reference C version, otherwise comparing to Hybrid Fortran CPU implementation. Kepler K20x has been used as GPU, Westmere Xeon X5670 has been used as CPU (TSUBAME 2.5). All results measured in double precision. The CPU cores have been limited to one socket using thread affinity ‘compact’ with 12 logical threads. For CPU, Intel compilers ifort / icc with ‘-fast’ setting have been used. For GPU, PGI compiler with ‘-fast’ setting and CUDA compute capability 3.x has been used. All GPU results include the memory copy time from host to device.

Müller didn’t just stumble upon this solution, he is the primary developer of the codebase. At the Tokyo Institute of Technology, Müller ported the Physical Core of Japan’s national next generation weather prediction model to GPGPU. He ran into many of the problems he presents in this blog, and solving these issues led to the development of Hybrid Fortran. Müller currently works at the Tokyo Institute of Technology, where he is planning to port the internationally used open source weather model WRF to Hybrid Fortran.

ASC14 Marks Seventh Win for GPUs (May 29, 2014)

The past decade has seen a sharp rise in heterogeneous computing: processing or coprocessing using more than one processor type. One of the most prominent examples of heterogeneous elements in HPC is the GPU computing ecosystem that has been fostered by NVIDIA and AMD. General-purpose GPU (GPGPU) adoption has become widespread in HPC, and student supercomputing competitions are no exception.

For the last seven international supercomputing challenges – SC in the United States, ISC in Germany and ASC in China – the winning contestants have relied on “hybrid” CPU-GPU machines with NVIDIA parts. The most recent team to do so is from Shanghai Jiao Tong University (SJTU). The team took the top spot in the largest student supercomputer challenge, ASC14, held last month at Sun Yat-Sen University in Guangzhou, China.

Using a self-built cluster equipped with eight NVIDIA K20 GPU accelerators, SJTU earned the highest combined scores across a series of six tests, including an elastic wave modeling application, 3D-EW; a quantum chemistry application, Quantum ESPRESSO; and other real-world scientific codes.

Although SJTU performed best overall, China’s Sun Yat-sen University team set a new record using 216 processor cores and eight NVIDIA K40 GPUs. The team’s cluster achieved 9.27 teraflops as measured by the HPC industry standard Linpack performance benchmark, besting the previous record of 8.45 teraflops, set by Huazhong University of Science and Technology at ISC13.

According to Dr. Ye Weicai, advisor of the Sun Yat-Sen University team, the participants focused on deep, fine-grained optimization of the LINPACK run so as to best exploit heterogeneous acceleration technology and improve floating-point throughput. Credit was also given to the HPC management software Cluster Engine for helping the contestants optimize performance and control power consumption simultaneously.

As detailed in a recent blog entry, NVIDIA’s Simon See, who is also an adjunct professor at SJTU, reached out to James Lin, team advisor and vice director of the Center for HPC, to get his thoughts on the contest and the role of GPUs. Preparation was critical, notes Lin. He relates how the team practiced running code on SJTU’s “π” supercomputer. With 100 NVIDIA Tesla K20 GPUs, graphics coprocessing comprises half of the system’s computational power. The team also reviewed the source codes used in the competition and identified the best optimization methods.

When asked about the most challenging aspect of the competition, Lin hits on one of the main issues in HPC, the separation of computer science and domain science.

“All of my students are from the computer science department, so they knew very little about the background of scientific applications, like Quantum ESPRESSO, before the contest,” Lin says. “Fortunately, some of the π users are experienced with these applications, so they were able to help. In the end, we received the top score for three of the five applications.”

HPC Boosts Medical Physics (September 5, 2013)

When it comes to employing physics in medicine, there are two major fields in terms of their relevance in clinical practice: medical imaging and radiation therapy. A recent paper from an Argentinian research duo addresses how these domains can benefit from high-performance computing techniques.

Medical imaging and radiation therapy both rely heavily on computational resources. Ideally, computational work can be performed in real-time or near-real-time to benefit patient outcome as much as possible, the researchers note.

While execution times have dropped significantly with the advent of faster CPUs, wait times are still problematic. In tomographic image reconstruction, internal dosimetry calculation and radiotherapy planning, accelerating these processes is enormously important, “not only for the patient – whose quality of life improvement is the ultimate goal – but also for optimizing professional work in a busy hospital environment.”

Over the last several years, the rise of multicore and GPU-based computing has boosted many technical computing domains, including the field of medical physics. The research paper explores the ways that medical physics has benefited from advances in HPC and specifically GPU computing.

The authors describe two typical lines of research in medical image processing, image segmentation and registration, that are good candidates for parallel computing on GPU cores. Image segmentation, which falls under general image processing, involves the identification and further classification of different constituents or textures depicted in a given dataset. In the case of biomedical images, this discovery process is crucial to both diagnosis and therapy. The authors found that implementing an image segmentation algorithm on GPU delivered impressive results, a 15x speedup in comparison to the optimized code running on a CPU-only setup.
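
The paper does not reproduce the authors’ code, but the reason segmentation parallelizes so well is that, in the simplest formulations, each voxel can be classified independently. The toy CUDA kernel below illustrates that one-thread-per-voxel pattern with a hypothetical thresholding pass, far simpler than the algorithm the researchers actually implemented.

```cuda
#include <cuda_runtime.h>

// Minimal sketch: label each voxel by comparing its intensity to a cutoff.
// This stands in for a real segmentation algorithm only to show the mapping
// of one GPU thread to one voxel.
__global__ void threshold_segment(const float *intensity, unsigned char *label,
                                  int nvoxels, float cutoff)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nvoxels) label[i] = (intensity[i] > cutoff) ? 1 : 0;
}

// Launch with one thread per voxel, e.g.:
// threshold_segment<<<(nvoxels + 255) / 256, 256>>>(d_img, d_labels, nvoxels, 0.5f);
```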

The second medical imaging process, known as registration, involves bringing two or more datasets into spatiotemporal alignment. There are many reasons this is done, including diagnostic power enhancement after comparing different modalities, disease follow-up, and assistance in radiotherapy planning. It’s a complex process, and the algorithm designed by the researchers requires 30-40 minutes of CPU time to register two 512x512x50 voxel datasets. Because the algorithm uses a hierarchical subdivision scheme, the authors are confident that it will benefit from acceleration using parallel computing.

Radiotherapy is the second main area examined in the paper. “In Radiation Therapy, the calculation of the dose delivered by ionizing radiation and the use of optimization algorithms on advanced methods of treatment, are the main areas where GPU programming has its greatest impact,” write the authors. There are different ways of computing this dose. There is a 2D solution, known as the pencil beam algorithm, and a 3D algorithm known as convolution/superposition. The authors note that other research groups have developed reformulated pencil beam and convolution/superposition algorithms for GPU-based processing, with speedups of 200-400x.

At the authors’ home institution, Fundación Escuela de Medicina Nuclear de Mendoza, they are working to refine these techniques using the accelerative power of the GPU when it’s feasible to do so. It’s worth noting that even when an algorithm, e.g. Monte Carlo, is ideal for parallel computation, the complexity of the method can limit the acceleration potential.

The clinical value of this work is the development of a treatment plan that strikes the best compromise between the radiation dose delivered to the tumor and the dose received by the healthy organs around it.

The Modern GPU: A Graphic History (August 21, 2013)

What do the Atari 2600 and Tianhe-1A have in common? It may be difficult to imagine, but both systems are examples of the use of cutting-edge graphics processors for their times. This demonstrates the fascinating evolution of the GPU, which today is one of the most critical hardware components of supercomputer architectures.

TechSpot’s Graham Singer recently put together a compelling series on the history of the GPU, stretching from the earliest 3D work in the 1950s through today’s GPGPU market. Singer broke his history into four distinct installments.

Singer’s first installment looked at the early days of 3D consumer graphics, a period that lasted from 1976 to 1995. Although 3D graphic systems were being built as early as 1951, when MIT built the Whirlwind flight simulator for the Navy, the graphic 3D systems that developers created for the burgeoning consumer computer market in the mid-1970s formed the foundation for today’s GPU, Singer writes.

The “Pixie” video chip that RCA built in 1976 was capable of outputting a video signal at a resolution of 62×128. 1977 saw the release of the Atari 2600 game system, which included the Television Interface Adapter (TIA) 1A. Motorola followed suit a year later with the MC6845 video address generator, which became the basis for the Monochrome Display Adapter (MDA) and Color Graphics Adapter (CGA) cards that IBM used in its first PC in 1981.

Enhanced Graphics Adapter (EGA) compatible chipsets developed by Chips and Technologies started to provide some competition to the MDA and CGA cards in 1985. The same year, three Hong Kong immigrants formed Array Technology Inc. The company, which soon changed its name to ATI Technologies Inc., would lead the market for years with its Wonder line of graphics boards and chips.

In 1992, SGI released OpenGL, an open API for 2D and 3D graphics. As OpenGL gained traction in the workstation market, Microsoft attempted to corner the emerging gaming market with its proprietary Direct3D API. Many other proprietary APIs were introduced, such as Matrox Simple Interface, Creative Graphics Library, C Interface (ATI), and others, but they would eventually fall by the wayside.

Meanwhile, the early 1990s was a period of great volatility in the graphics market, with many companies being founded, and then being acquired or going out of business. Among the winners founded during this time was NVIDIA.

The second epoch in Singer’s series lasts from 1995 to 1999, and is characterized by the utter domination of the market by 3dfx’s Voodoo graphics card, which launched in November 1996 and soon came to account for about 85 percent of the market. Cards that could only render 2D were made obsolete nearly overnight, Singer writes.

3dfx went public in 1997, but the launch of its budget-minded Voodoo Rush board was a flop. And in a bid to boost profits, the company decided to market and sell graphics boards itself, which further helped competitors, including Rendition, ATI, and Nvidia.

Nvidia laid the groundwork for future success with the 1997 launch of the RIVA 128 (Real-time Interactive Video and Animation accelerator), which featured Direct3D compatibility and topped several performance benchmarks. By the end of 1997, Nvidia had nearly 25 percent of the graphics market. SGI sued Nvidia in 1998, but the company emerged stronger from the 1999 settlement, in which SGI gave Nvidia access to its professional graphics portfolio. This amounted to a “virtual giveaway of IP” that hastened SGI’s bankruptcy, Singer writes.

The battle between ATI and Nvidia marks Singer’s third era of the GPU’s history, which lasted from 2000 to 2006. During this period, 3dfx became increasingly irrelevant, as its cards, such as the Voodoo 4 4500, could not keep up with the graphics performance offered by Nvidia’s GeForce 2 GTS and ATI’s Radeon DDR.

Nvidia and ATI would go head to head, delivering graphics cards with features that are now commonplace, such as the capability to perform specular shading, volumetric explosion, refraction, waves, vertex blending, shadow volumes, bump mapping and elevation mapping.

The era of the general-purpose GPU began in 2007, which kicks off the fourth period of Singer’s GPU history. Both Nvidia and ATI (since acquired by AMD) had been cramming ever-more capabilities into their graphics cards, and the practice of using these cards for HPC workloads became common.

But the two companies would take different tracks to GPGPU, with Nvidia releasing its CUDA development environment, and AMD using OpenCL. Nvidia gained considerable market- and mindshare in the HPC market with the launch of the Tesla, the first dedicated GPGPU.

Harlan Targets Complexity for GPGPU Programming (July 11, 2013)

HPC programmers who are tired of managing low-level details when using OpenCL or CUDA to write general purpose applications for GPUs (GPGPU) may be interested in Harlan, a new declarative programming language designed to mask the complexity and eliminate errors common in GPGPU application development.

GPUs are increasingly being used to provide a boost in computing power in HPC systems. Attaching NVIDIA Kepler or Intel Xeon Phi co-processing cards to a traditional CPU architecture can provide a big increase in the performance of parallel workloads. However, programming these accelerators can be difficult, as it requires different tools and a different skill set than traditional x86 development.

The idea with Harlan is to keep developers focused on the high-level HPC programming challenge at hand, instead of getting bogged down with the nitty gritty details of GPU development and optimization.

Eric Holk, a Ph.D. candidate at Indiana University, is the driving force behind the Harlan project. Harlan is a domain-specific language that uses a declarative approach to coordinating computation and data movement between a CPU and GPU, according to a paper that Holk and his colleagues presented at the September 2011 International Conference on Parallel Computing.

Harlan’s syntax is based on the language Scheme, and the compiler targets Khronos Group’s OpenCL, a GPU computing framework that competes with NVIDIA’s Compute Unified Device Architecture (CUDA). The language was designed to provide a “straightforward mechanism for expressing the semantics the user wants” for areas such as data layout, memory movement, threading, and computation coordination. In effect, it lets developers declare the “what,” and leaves the “how” up to the language, the researchers say in their paper.

The benefits of this approach will be even higher for hybrid applications that utilize a combination of GPUs and CPUs, since they introduce even more complexity for the developer, who has to take into account additional levels of memory hierarchy and computational granularity, the researchers say.

“Not only does a declarative language obviate the need for the programmer to write low-level error-prone boiler-plate code, by raising the abstraction of specifying GPU computation it also allows the compiler to optimize data movement and overlap between CPU and GPU computation,” Holk and his colleagues write in the paper, titled “Declarative Parallel Computing for GPUs.”
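
For a sense of the boilerplate being referred to, even a trivial hand-written GPU computation must spell out allocation, transfers, and launch geometry explicitly. The generic CUDA C++ sketch below (illustrative only; Harlan itself targets OpenCL) shows the kind of low-level detail a declarative language is meant to take off the programmer’s hands.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Trivial kernel: square every element of an array.
__global__ void square(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= x[i];
}

int main()
{
    const int n = 4096;
    std::vector<float> h(n, 2.0f);

    // Explicit device allocation, host-to-device copy, launch configuration,
    // device-to-host copy, and cleanup: the error-prone plumbing that a
    // declarative language aims to generate automatically.
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    square<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```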

In addition to Harlan, Holk and his colleagues are developing Kanor, another declarative language for specifying communication in distributed memory clusters. Kanor is unusual, Holk writes, in that it can automatically handle the low-level details when appropriate, but gives the programmer the option to step in and hand-code the communications when necessary. This provides a “balance between declarativeness and performance predictability and tunability.”

Harlan will provide a productivity boost, but don’t expect it to transform your average coder into a super coder. “It is important to emphasize at this point that we are not proposing a ‘silver bullet’ or ‘magic compiler’ that will somehow make GPGPU or hybrid cluster programming easy,” Holk and his colleagues write.

“Rather, we are seeking to abstract away many of the low-level details that make GPU/cluster programming difficult, while still giving the programmer enough control over data arrangement and computation coordination to write high-performance programs,” they add.

Harlan will run on Mac OS and Linux. The Harlan project is hosted at GitHub, and has five contributors.

Penguin Pushes Envelope on Compute Density (March 21, 2013)

In the midst of the GPU Technology Conference this week, Penguin Computing served up a new high-power system, heavy on the GPU/coprocessor side, to meet the needs of HPC customers with heavy processing needs. The Relion 2808GT offers 8x double-width GPGPUs or MICs, and dual Xeon E5-2600 series CPUs with up to eight cores per processor. The company says it is now the server with the highest compute density on the market.

Penguin CEO Charles Wuischpard told HPCwire that the server was designed based on feedback from one particular high-end customer, but that it will fill the needs of many companies that need to do very fast, data-intensive computing. “More and more the opportunity seems to be to have more of a coprocessing component, whether that’s an NVIDIA GPU or an Intel Phi or even AMD APU,” he says.

The ratio of CPU to GPU that his customers want varies. “Some customers want one-to-one, some want two-to-one, and then there’s that fringe out there – we see it in oil and gas mostly – that wants a lot of GPUs, very densely packed into a couple of processors,” he says. “We’ve always struggled with trying to find a design that can support that level of density and do it in a performant way. We’re really pleased with this.”

The new Relion fits the needs of that GPU-hungry fringe. It can hold eight GPGPUs or other coprocessors in two rack units. Its dual-socket platform is based on Intel’s Xeon E5-2600 CPU family. If it’s loaded up with eight NVIDIA K20 GPUs, it can get 28 teraflops of single precision floating point performance. It has 16 DIMM sockets for up to 512 GB of 1600 MHz DDR3 RAM ECC memory. It also has an on-board dual 10GbE BASE-T controller and optional support for two 10GbE SFP+ ports.

Penguin CTO Phillip Pokorny says that some organizations already have software programs that run well completely inside a GPU and scale easily as the number of GPUs increase. Those customers want as many of the graphics chips as they can get in a small space. “The key things for us were finding a form factor that cooled effectively and had appropriate power. Those two challenges, cooling and power, are the ones that we run into most often,” he noted. The Relion 2808GT also features a dual 1600W high-efficiency power supply.

Pokorny adds, however, that the server can be configured in many different ways, depending on the customer needs. An advantage of the Xeon E5-2600 CPU is that it offers PCIe Gen 3, integrated on the processor die. With that, the server can deliver full bandwidth to every GPU socket. For compute jobs that require a lot of communication, Penguin can put additional PCIe switch chips on the risers. For applications that are dominated by computation on the GPU, the switch chips can be eliminated in order to add more nodes. “We’re like the old Burger King saw,” he quips. In other words: Have it your way.

Large GPU-centric applications, of course, require a lot of memory and storage to hold both the raw data and the results of the computations. Pokorny notes that it’s now very cost-efficient to add a lot of memory to support both CPUs and GPUs, and Penguin is able to easily double the RAM. It is offering many different memory configurations for the Relion 2808GT and a wide variety of hard drive options (including spinning discs with eight spindles for up to 1.6 terabytes).

Although the spec sheet for the new Relion 2808GT remains politically correct by not specifying what GPUs or coprocessors might best suit the device, it’s not a coincidence that the server was displayed at NVIDIA’s GTC13 conference. The need for speed, memory and energy efficiency makes this server a very good candidate for NVIDIA’s latest and future generations of GPGPUs.

At that conference, NVIDIA CEO Jen-Hsun Huang emphasized the speed of his latest processors, their dense memory, and fast I/O between the GPU and the DRAM. He even included in his keynote talk a surprise acknowledgment that two generations from now, the “Volta” GPU will offer stacked DRAM. Judging by hypothetical images Huang produced at the conference, several stacked memory chips can be placed very near the GPU on the same substrate, increasing both memory density and I/O speed to the processor. Huang said that Volta will be able to move data between them at 1TB/s.

As for plans to implement such future NVIDIA designs, Penguin’s response was vague. “R2808GT has been designed to accommodate the highest density of NVIDIA Tesla K10 and K20 generation GPUs, including future cards designed for similar physical [characteristics] and power envelope,” noted a company rep.

Penguin is seeing demand for the GPU-intense version of the server from a lot of different types of companies. Oil and gas businesses can use it for seismic studies, for example. Bioinformatics companies have a need to analyze huge volumes of images generated by DNA scanners that take photographic images of the DNA. Semiconductor companies use it to generate silicon mask designs. Pokorny says there is also strong demand from “government organizations,” but they don’t tell him what they need it for.

Wuischpard says the machine is a great example of Penguin’s ability to very quickly create and release customizable servers using the latest technology. He says he mentioned to one Intel executive recently that Penguin is always concerned about getting pounded by big players such as HP or Dell, but the Intel exec’s response was that the big companies are all so distracted by the tablet market that the datacenter business is being nibbled away from them by the more nimble, local OEMs like Penguin. This server, says Wuischpard, “is one aspect of that level of nimbleness.”

The Week in HPC Research (February 14, 2013)

The top research stories of the week have been hand-selected from major science centers, prominent journals and leading conference proceedings. Here’s another diverse set of items, including whole brain simulation; a look at High Performance Linpack; the coming GPGPU cloud paradigm; heterogeneous GPU programming; and a comparison of accelerator-based servers.

Brain Simulation Project

The Human Brain Project, one of the most ambitious projects of its kind, has just been awarded half-a-billion euros over a 10-year timeframe. The European Commission funded the innovative program as part of its Future and Emerging Technologies (FET) flagship program. Led by Henry Markram, a neuroscientist at the Swiss Federal Institute of Technology in Lausanne, the project aims to reconstruct the brain piece-by-piece, using cutting-edge supercomputing resources.

In neuroscience and neuroinformatics, the brain simulation will collect and integrate experimental data, identifying and filling gaps in our knowledge. In medicine, the project’s results will facilitate better diagnosis, combined with disease and drug simulation. In computing, new techniques of interactive supercomputing, driven by the needs of brain simulation, will impact a range of industries, while devices and systems modelled after the brain will overcome fundamental limits on the energy efficiency, reliability and programmability of current technologies, clearing the road for systems with brain-like intelligence.

The “Human Brain Project” is on track to become the world’s largest experimental facility for developing the most detailed model of the brain. The research will increase our understanding of how the human brain works, which has countless implications for technology and medicine, from personalized medical treatments to artificial intelligence breakthroughs.

Researchers are divided over the news. Detractors say that modeling the brain’s 86 billion neurons is an impossible endeavor at our current stage of computational development, and that making it really interesting would mean capturing the brain’s actual creative potential and intelligence; otherwise it will just be a big computer.

High Performance Linpack

The authors of the next paper, a group of computer scientists from the Raja Ramanna Centre for Advanced Technology in Indore, India, acknowledge that scientific endeavors increasingly rely on parallel programming techniques running on High Performance Computing Clusters (HPCC).

When it comes to measuring cluster performance, there are multiple factors to take into account. “Memory, interconnect bandwidth, number of cores per processor/ node and job complexity are the major parameters which affect and govern the peak computing power delivered by HPCC,” they write.

The paper describes the researchers’ experiments with High Performance Linpack (HPL). They use the benchmark to analyze the effect of job distribution among single processors versus distributed processors. They’re also investigating the effect of the system interconnect on job performance. The work centers on an InfiniBand-connected HPC cluster.

The GPGPU Cloud Paradigm

The increasing prevalence of hybrid HPC systems that use coprocessors like GPUs to improve performance has implications for the HPC cloud. In a new research paper [PDF], a team of computer scientists from the College of Computer Science and Technology at Jilin University in Changchun, China, explores the idea of the GPGPU cloud as a paradigm for general purpose computing. Their work appears in the February 2013 issue of the Tsinghua Science and Technology Journal.

The authors start with the premise that the “Kepler General Purpose GPU (GPGPU) architecture was developed to directly support GPU virtualization and make GPGPU cloud computing more broadly applicable by providing general purpose computing capability in the form of on-demand virtual resources.”

To test their theories, they developed a baseline GPGPU cloud system outfitted with Kepler GPUs. The system is comprised of a cloud layer, a server layer, and a GPGPU layer, and the paper further describes “the hardware features, task features, scheduling mechanism, and execution mechanism of each layer.” The work aims to uncover hardware potential while also improving task performance. In identifying the advantages to general-purpose computing on a GPGPU cloud, the authors show themselves to be on the forefront of an emerging paradigm.

Heterogeneous GPU Programming

A group of scientists from the University of Minnesota and the University of Colorado Boulder have contributed to a recently published book, GPU Solutions to Multi-scale Problems in Science and Engineering. Their chapter, titled High Throughput Heterogeneous Computing and Interactive Visualization on a Desktop Supercomputer, examines some of the computational improvements that have resulted from the GPU accelerator movement. Their test system, a “desktop supercomputer,” was constructed for less than $2,500 using commodity parts, including a Tesla C1060 card and a GeForce GTX 295 card. The GPU cluster runs on Linux, and employs CUDA, MPI and other software as needed.

The authors make some interesting observations, including the following:

MPI is used not only for distributing and/or transferring the computing loads among the GPU devices, but also for controlling the process of visualization. Several applications of heterogeneous computing have been successfully run on this desktop. Calculation of long-ranged forces in the n-body problem with fast multi-pole method can consume more than 85 % of the cycles and generate 480 GFLOPS of throughput. Mixed programming of CUDA-based C and Matlab has facilitated interactive visualization during simulations.

They explain that what sets their work apart from other published research is their use of multiple GPU devices on one desktop, employed by multiple users for various types of applications at the same time. They state that they have extended GPU acceleration from the single program multiple data paradigm to the multiple program multiple data paradigm, and claim “test runs have shown that running multiple applications on one GPU device or running one application across multiple GPU devices can be done as conveniently as on traditional CPUs.”
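
Neither the fast multipole method nor the authors’ code is reproduced here, but the all-pairs n-body baseline that such methods accelerate maps naturally onto a GPU, with each thread accumulating the force on one body. The heavily simplified CUDA sketch below is illustrative only and is not the authors’ implementation.

```cuda
#include <cuda_runtime.h>

struct Body { float x, y, z, mass; };

// Direct O(N^2) gravitational acceleration: one thread per target body.
// (The chapter's authors use the fast multipole method to cut this cost;
// this kernel only shows the underlying data-parallel structure.)
__global__ void accel_direct(const Body *bodies, float3 *acc, int n, float soft2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float ax = 0.0f, ay = 0.0f, az = 0.0f;
    for (int j = 0; j < n; ++j) {
        float dx = bodies[j].x - bodies[i].x;
        float dy = bodies[j].y - bodies[i].y;
        float dz = bodies[j].z - bodies[i].z;
        float r2 = dx * dx + dy * dy + dz * dz + soft2;  // softening avoids r = 0
        float inv_r = rsqrtf(r2);
        float s = bodies[j].mass * inv_r * inv_r * inv_r;
        ax += s * dx;
        ay += s * dy;
        az += s * dz;
    }
    acc[i] = make_float3(ax, ay, az);
}
```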

Comparing Accelerator-Based Servers

Johnsson traces the evolution of mass-market, specialized processors, including the Cell Broadband Engine (CBE) and graphics processors. She notes that GPUs, in particular, have received significant attention. The addition of hardware support for double-precision floating-point arithmetic, introduced three years ago, was key to this significant uptick in adoption, as was the recent support for Error Correcting Code.

To analyze the feasibility of deploying accelerated clusters, PRACE (the Partnership for Advanced Computing in Europe) performed a study, investigating three types of accelerators, the CBE, GPUs and ClearSpeed. The study assessed several metrics, including performance, efficiency, power efficiency for double-precision arithmetic and programmer productivity.

NERSC HPC Achievement Awards

The Department of Energy’s National Energy Research Scientific Computing Center (NERSC) unveiled the winners of its inaugural High Performance Computing (HPC) Achievement Awards. The announcement was made at the annual NERSC User Group meeting at the Lawrence Berkeley National Laboratory (Berkeley Lab).

The awardees, all of them NERSC users, were selected for their innovative use of HPC resources to help solve major computational or humanitarian challenges. Two early career awards were also presented.

NERSC Director Sudip Dosanjh stated that “High performance computing is changing how science is being done, and facilitating breakthroughs that would have been impossible a decade ago. The 2013 NERSC Achievement Award winners highlight some of the ways this trend is expanding our fundamental understanding of science, and how we can use this knowledge to benefit humanity.”