8 There is a hardware platform for each end user:
   - Hundreds of researchers:  large-scale clusters (more than a million dollars).
   - Thousands of researchers: a cluster of Tesla servers (between ... and ... dollars).
   - Millions of researchers:  a Tesla graphics card (less than 5,000 dollars).

9 II. The GPU evolution

10 The graphics card within the domestic hardware marketplace (regular PCs). GPUs sold per quarter:
   - 114 million [Q4 2010]
   - ... million [Q3 2011]
   - 124 million [Q4 2011]
   The marketplace keeps growing, despite the global crisis. Compared to the 93.5 million CPUs sold [Q4 2011], there are 1.5 GPUs out there for each CPU, and this factor has grown relentlessly over the last decade (it was barely 1.15x in 2001).

11 In barely five years, CUDA programming has grown to become ubiquitous:
   - More than 500 research papers published each year.
   - More than 500 universities teaching CUDA programming.
   - More than 350 million GPUs programmed with CUDA.
   - More than ... active programmers.
   - More than a million compiler and toolkit downloads.

12 The three generations of processor design Before

13 ... and how they are connected to programming trends

14 We also have OpenCL, which extends GPU programming to non-NVIDIA platforms

15 III. Programming

16 III.1. Libraries

17 A brief example. A Google search is a must before starting an implementation

18 The developer ecosystem enables application growth

19 III.2. Switching among hardware platforms

20 Compiling for other target platforms

21 Ocelot. A dynamic compilation environment for PTX code on heterogeneous systems, which allows extensive analysis of the PTX code and its migration to other platforms. The latest version (2.1, as of April 2012) supports:
   - GPUs from multiple vendors.
   - x86-64 CPUs from AMD/Intel.

22 Swan. A source-to-source translator from CUDA to OpenCL:
   - Provides a common API which abstracts the runtime support of CUDA and OpenCL.
   - Preserves the convenience of launching CUDA kernels (<<<grid, block>>>), generating C source code for the entry-point kernel functions...
   ... but the conversion process is not automatic and requires human intervention.
   Useful for:
   - Evaluating OpenCL performance for an already existing CUDA code.
   - Reducing the dependency on nvcc when compiling host code.
   - Supporting multiple CUDA compute capabilities in a single binary.
   - Serving as a runtime library to manage OpenCL kernels in new developments.

23 PGI CUDA x86 compiler. Major differences with the previous tools:
   - It is not a source-code translator; it works at runtime.
   - In 2012, it will allow building a unified binary, which will simplify software distribution.
   Main advantages:
   - Speed: The compiled code can run on an x86 platform even without a GPU. This enables the compiler to vectorize code for SSE instructions (128 bits) or the more recent AVX (256 bits).
   - Transparency: Even applications which use GPU-native resources like texture units will behave identically on CPU and GPU.

24 III.3. Accessing CUDA from other languages

25 Some possibilities. CUDA can be incorporated into any language that provides a mechanism for calling C/C++. To simplify the process, we can use general-purpose interface generators:
   - SWIG (Simplified Wrapper and Interface Generator) is the most renowned approach in this respect. Actively supported, widely used and already successful with: AllegroCL, C#, CFFI, CHICKEN, CLISP, D, Go, Guile, Java, Lua, MzScheme/Racket, OCaml, Octave, Perl, PHP, Python, R, Ruby, Tcl/Tk.
   A MATLAB interface is also available:
   - On a single GPU: use Jacket, a numerical computing platform.
   - On multiple GPUs: use the MathWorks Parallel Computing Toolbox.

26 III.4. OpenACC

27 The OpenACC initiative

28 OpenACC is an alternative to the computer scientists' CUDA for average programmers. The idea: introduce a parallel programming standard for accelerators based on directives (like OpenMP), which:
   - Are inserted into C, C++ or Fortran programs to direct the compiler to parallelize certain code sections.
   - Provide a common code base: multi-platform and multi-vendor.
   - Enhance portability across other accelerators and multicore CPUs.
   - Bring an ideal way to preserve investment in legacy applications by enabling an easy migration path to accelerated computing.
   - Relax the programming effort (and the expected performance).
   First supercomputing customers:
   - United States: Oak Ridge National Lab.
   - Europe: Swiss National Supercomputing Centre.
