Transcription

Slide 2: Parallel Reduction
- Common and important data parallel primitive
- Easy to implement in CUDA
  - Harder to get it right
- Serves as a great optimization example
  - We'll walk step by step through 7 different versions
  - Demonstrates several important optimization strategies

Slide 3: Parallel Reduction
- Tree-based approach used within each thread block
- Need to be able to use multiple thread blocks
  - To process very large arrays
  - To keep all multiprocessors on the GPU busy
  - Each thread block reduces a portion of the array
- But how do we communicate partial results between thread blocks?
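The per-block tree reduction described above might look roughly like the following. This is a minimal sketch, not the code from the slides: the names `blockReduceSum` and `blockPartialSums` are made up for illustration, the element type is `int` so that sums are exact, the block size is assumed to be a power of two, the input is assumed non-empty, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Tree-based sum within one thread block. Each block writes a single
// partial sum to out[blockIdx.x]. Assumes blockDim.x is a power of two.
__global__ void blockReduceSum(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element; out-of-range threads contribute 0.
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Halve the number of active threads at every step of the tree.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns one partial sum per thread block.
// Assumes a non-empty input; error checking is omitted.
std::vector<int> blockPartialSums(const std::vector<int> &h_in,
                                  unsigned int threads = 256)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    unsigned int blocks = (n + threads - 1) / threads;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockReduceSum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);

    std::vector<int> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The result is still one value per block, which is exactly the open question on the slide: the partial sums have to be combined somehow.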

Slide 4: Problem: Global Synchronization
- If we could synchronize across all thread blocks, we could easily reduce very large arrays, right?
  - Global sync after each block produces its result
  - Once all blocks reach sync, continue recursively
- But CUDA has no global synchronization. Why?
  - Expensive to build in hardware for GPUs with high processor count
  - Would force the programmer to run fewer blocks (no more than # multiprocessors * # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency
- Solution: decompose into multiple kernels
  - Kernel launch serves as a global synchronization point
  - Kernel launch has negligible HW overhead, low SW overhead
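One way to picture the multi-kernel decomposition is a host loop that relaunches the same reduction kernel on its own output until a single value remains. The sketch below is an illustration under assumptions, not the slides' code: `reducePass` and `reduceMultiKernel` are hypothetical names, the block size is assumed to be a power of two, and error checking is left out. Launches issued to the same stream run in order, so each launch only starts after the previous pass has finished writing its partial sums.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// One reduction pass: each block sums up to blockDim.x inputs into out[blockIdx.x].
__global__ void reducePass(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host driver: repeats reducePass until one value is left.
// Each kernel launch acts as the global synchronization point.
int reduceMultiKernel(const std::vector<int> &h_in, unsigned int threads = 256)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    if (n == 0) return 0;
    unsigned int blocks = (n + threads - 1) / threads;

    int *d_a = nullptr, *d_b = nullptr;           // ping-pong buffers
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, blocks * sizeof(int));
    cudaMemcpy(d_a, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    while (true) {
        blocks = (n + threads - 1) / threads;
        reducePass<<<blocks, threads, threads * sizeof(int)>>>(d_a, d_b, n);
        if (blocks == 1) break;                   // d_b[0] holds the total
        n = blocks;                               // partial sums become the next input
        std::swap(d_a, d_b);
    }

    int result = 0;
    cudaMemcpy(&result, d_b, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

With 100,000 inputs and 256 threads per block this takes three launches: 391 blocks, then 2 blocks, then 1.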

Slide 20: Instruction Bottleneck
- At 17 GB/s, we're far from bandwidth bound
  - And we know reduction has low arithmetic intensity
- Therefore a likely bottleneck is instruction overhead
  - Ancillary instructions that are not loads, stores, or arithmetic for the core computation
  - In other words: address arithmetic and loop overhead
- Strategy: unroll loops

Slide 21: Unrolling the Last Warp
- As reduction proceeds, # active threads decreases
  - When s <= 32, we have only one warp left
- Instructions are SIMD synchronous within a warp
- That means when s <= 32:
  - We don't need to __syncthreads()
  - We don't need "if (tid < s)" because it doesn't save any work
- Let's unroll the last 6 iterations of the inner loop
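A sketch of the unrolled last warp follows. It deliberately departs from the slide in one respect: the slide relies on implicit lockstep execution within a warp, which held on the G80-era hardware it targets, whereas GPUs with independent thread scheduling (Volta and later) do not guarantee it. The version below therefore inserts `__syncwarp()` (available since CUDA 9) between the read and write of each step. The names `warpReduce`, `reduceUnrolledWarp` and `sumUnrolledWarp` are illustrative, the block size is assumed to be a power of two of at least 64, and the per-block partial sums are simply added on the host to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

// Last 6 iterations (s = 32, 16, 8, 4, 2, 1), unrolled for the first warp.
// No __syncthreads() and no "if (tid < s)"; __syncwarp() separates each
// read from the following write so the code does not depend on lockstep.
__device__ void warpReduce(int *sdata, unsigned int tid)
{
    int v = sdata[tid];
    v += sdata[tid + 32]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid + 16]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid +  8]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid +  4]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid +  2]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid +  1]; __syncwarp(); sdata[tid] = v;
}

// Assumes blockDim.x is a power of two and at least 64.
__global__ void reduceUnrolledWarp(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // The loop now stops before the last warp.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper; adds the per-block partial sums on the CPU.
int sumUnrolledWarp(const std::vector<int> &h_in, unsigned int threads = 128)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    if (n == 0) return 0;
    unsigned int blocks = (n + threads - 1) / threads;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reduceUnrolledWarp<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

Threads 16 to 31 of the warp do some redundant additions in the later steps. As the slide notes, masking them out would not save any work, and only `sdata[0]` is read at the end.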

Slide 24: Complete Unrolling
- If we knew the number of iterations at compile time, we could completely unroll the reduction
  - Luckily, the block size is limited by the GPU to 512 threads
  - Also, we are sticking to power-of-2 block sizes
- So we can easily unroll for a fixed block size
  - But we need to be generic: how can we unroll for block sizes that we don't know at compile time?
- Templates to the rescue!
  - CUDA supports C++ template parameters on device and host functions
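The template idea can be sketched as below. The block size becomes a compile-time parameter, so every `if (blockSize >= ...)` test is resolved by the compiler and the unused steps disappear from the generated code. A `switch` on the host picks the instantiation at run time. This is an assumption-laden illustration rather than the original kernel: the names `reduceFixed` and `sumFixed` are invented, only block sizes 64, 128, 256 and 512 are handled, `__syncwarp()` is added for safety on newer GPUs, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <stdexcept>
#include <vector>

// Final warp, unrolled; __syncwarp() avoids relying on implicit lockstep.
__device__ void warpReduce(int *sdata, unsigned int tid)
{
    int v = sdata[tid];
    v += sdata[tid + 32]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid + 16]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid +  8]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid +  4]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid +  2]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid +  1]; __syncwarp(); sdata[tid] = v;
}

// blockSize is known at compile time, so the tests on it cost nothing at
// run time. Must be launched with exactly blockSize threads per block.
template <unsigned int blockSize>
__global__ void reduceFixed(const int *in, int *out, unsigned int n)
{
    __shared__ int sdata[blockSize];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockSize + tid;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: dispatches a run-time block size to the
// matching instantiation, then adds the partial sums on the CPU.
int sumFixed(const std::vector<int> &h_in, unsigned int threads)
{
    if (threads != 64 && threads != 128 && threads != 256 && threads != 512)
        throw std::invalid_argument("threads must be 64, 128, 256 or 512");
    unsigned int n = static_cast<unsigned int>(h_in.size());
    if (n == 0) return 0;
    unsigned int blocks = (n + threads - 1) / threads;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    switch (threads) {
    case 512: reduceFixed<512><<<blocks, 512>>>(d_in, d_out, n); break;
    case 256: reduceFixed<256><<<blocks, 256>>>(d_in, d_out, n); break;
    case 128: reduceFixed<128><<<blocks, 128>>>(d_in, d_out, n); break;
    case  64: reduceFixed< 64><<<blocks,  64>>>(d_in, d_out, n); break;
    }
    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

The cost of this design is one compiled kernel per supported block size, which is why the set of sizes is kept small.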

Slide 31: Algorithm Cascading
- Combine sequential and parallel reduction
  - Each thread loads and sums multiple elements into shared memory
  - Tree-based reduction in shared memory
- Brent's theorem says each thread should sum O(log n) elements
  - i.e. [...] or 2048 elements per block vs. 256
- In my experience, beneficial to push it even further
  - Possibly better latency hiding with more work per thread
  - More threads per block reduces levels in tree of recursive kernel invocations
  - High kernel launch overhead in last levels with few blocks
- On G80, best perf with [...] blocks of 128 threads, [...] elements per thread
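Algorithm cascading can be illustrated with a grid-stride loop: a fixed, small grid is launched, each thread first sums many elements sequentially into a register, and only then does the block run the shared-memory tree. The sketch below is one possible reading of the slide, not its actual kernel. The names `reduceCascaded` and `sumCascaded` and the default of 64 blocks are assumptions (128 threads per block follows the slide), the block size must be a power of two, and the few per-block results are added on the host.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

// Sequential phase: each thread strides through the whole array and
// accumulates privately. Parallel phase: tree reduction in shared memory.
// Assumes blockDim.x is a power of two.
__global__ void reduceCascaded(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;   // threads in the whole grid

    int sum = 0;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper. The grid size is fixed, so the amount of work
// per thread grows with n instead of the number of blocks.
int sumCascaded(const std::vector<int> &h_in,
                unsigned int blocks = 64, unsigned int threads = 128)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    if (n == 0) return 0;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reduceCascaded<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);

    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

For an input of 2^20 elements with the defaults, each of the 8192 threads sums 128 elements before the tree phase starts, and only 64 partial sums remain afterwards.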
