3 Modern GPU Hardware
GPUs have many parallel execution units and higher transistor counts, while CPUs have few execution units and higher clock speeds. A GPU is, for the most part, deterministic in its operation (though this is quickly changing). GPUs have much deeper pipelines (several thousand stages, versus tens of stages for CPUs). GPUs have significantly faster and more advanced memory interfaces, as they need to move much more data than CPUs.

23 Quick Glimpse At Programming Models
An application can include multiple kernels. Threads of the same block run on the same SM, so the threads of a block can cooperate and share memory. A block on an SM is divided into warps of 32 threads each; the warp is the fundamental unit of dispatch in an SM. Blocks in a grid can coordinate through globally shared memory. Each grid executes one kernel.
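To make the hierarchy concrete, here is a minimal CUDA sketch (kernel and buffer names are illustrative, not from the slides): one grid running one kernel, blocks of 256 threads cooperating through per-block shared memory, each block dispatched to its SM as eight 32-thread warps.

```cuda
// Illustrative kernel: each block cooperates through __shared__ memory,
// which is visible only to that block's threads (all resident on one SM).
__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];              // per-block shared memory
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    buf[tid] = in[gid];
    __syncthreads();                        // barrier across the block's threads

    // Tree reduction within the block (blockDim.x assumed to be 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0]; // one result per block
}

int main() {
    const int N = 1 << 20, THREADS = 256, BLOCKS = N / THREADS;
    float *in, *out;
    cudaMalloc(&in, N * sizeof(float));
    cudaMalloc(&out, BLOCKS * sizeof(float));
    cudaMemset(in, 0, N * sizeof(float));
    // A grid of BLOCKS blocks, 256 threads each, executes one kernel;
    // each 256-thread block runs on one SM as eight 32-thread warps.
    blockSum<<<BLOCKS, THREADS>>>(in, out);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Note how nothing in the source expresses warps explicitly: the 32-thread grouping is purely the hardware's dispatch granularity.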

24 Scheduling In Fermi
At any point in time, the entire Fermi device is dedicated to a single application. Switching from one application to another takes about 25 microseconds. Fermi can simultaneously execute multiple kernels of the same application; two warps from different blocks (or even different kernels) can be issued and executed simultaneously.
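Concurrent kernel execution is exposed to the programmer through CUDA streams. A minimal sketch, with placeholder kernel bodies: two independent kernels of the same application are issued into separate streams, so the hardware may overlap them when either kernel leaves SMs idle.

```cuda
// Placeholder kernels standing in for two independent pieces of work.
__global__ void kernelA(float *x) { x[threadIdx.x] *= 2.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] += 1.0f; }

int main() {
    float *x, *y;
    cudaMalloc(&x, 32 * sizeof(float));
    cudaMalloc(&y, 32 * sizeof(float));
    cudaMemset(x, 0, 32 * sizeof(float));
    cudaMemset(y, 0, 32 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernelA<<<1, 32, 0, s1>>>(x);   // issued to stream s1
    kernelB<<<1, 32, 0, s2>>>(y);   // issued to stream s2; may run concurrently
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(x); cudaFree(y);
    return 0;
}
```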

25 Scheduling In Fermi
Fermi uses a two-level, distributed thread scheduler. At the chip level, a global work distribution engine schedules thread blocks to the various SMs. At the SM level, each warp scheduler distributes warps of 32 threads to its execution units.
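The warp decomposition the SM-level scheduler works with is visible directly in the thread index. A small illustrative kernel (names are mine, intended for a single-block launch) computes each thread's warp and lane:

```cuda
// Sketch: how a block's threads map onto 32-thread warps, the unit the
// SM's warp schedulers dispatch to the execution units.
__global__ void warpInfo(int *firstThreadOfWarp) {
    int tid  = threadIdx.x;
    int warp = tid / warpSize;   // warp index within the block (warpSize == 32)
    int lane = tid % warpSize;   // lane index within that warp
    if (lane == 0)               // one thread per warp records its first tid
        firstThreadOfWarp[warp] = tid;
}
```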

26 (Figure only. Source: Nvidia)

27 An SM in Fermi
Each SM has 32 cores, special function units (SFU = Special Function Unit), and 64 KB of SRAM split between cache and local memory. Each core can perform one single-precision fused multiply-add (FMA) operation in each clock period and one double-precision FMA in two clock periods.
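The per-core throughput figures refer to fused multiply-add, which CUDA exposes through the fmaf/fma intrinsics (a*b+c computed with a single rounding). An illustrative pair of kernels (names are mine):

```cuda
// Sketch: the FMA operation whose throughput the slide quotes, in both
// single precision (one per core per clock) and double precision
// (one per core per two clocks).
__global__ void fmaSingle(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = fmaf(a[i], b[i], c[i]);   // single-precision fused multiply-add
}

__global__ void fmaDouble(const double *a, const double *b, double *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = fma(a[i], b[i], c[i]);    // double-precision fused multiply-add
}
```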

28 The Memory Hierarchy
All addresses in the GPU are allocated from a continuous 40-bit (one-terabyte) address space. Global, shared, and local addresses are defined as ranges within this address space and can be accessed by common load/store instructions. The load/store instructions support 64-bit addresses to allow for future growth.
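One practical consequence of the single address space, sketched below with illustrative names: the same pointer-taking device function, compiled to ordinary load/store instructions, can dereference global and shared addresses alike.

```cuda
// Sketch: global, shared, and local memory live in one address space,
// so an ordinary pointer parameter works for any of them.
__device__ float readAnywhere(const float *p) {
    return *p;   // the same load works regardless of where p points
}

__global__ void unifiedDemo(const float *global_in, float *out) {
    __shared__ float tile[32];
    tile[threadIdx.x] = global_in[threadIdx.x];
    __syncthreads();
    // The same device function dereferences a global and a shared pointer.
    out[threadIdx.x] = readAnywhere(&global_in[threadIdx.x])
                     + readAnywhere(&tile[threadIdx.x]);
}
```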

29 The Memory Hierarchy
Each SM has local memory, with the ability to use some of it as a first-level (L1) cache for global memory references. This local memory is 64 KB in size and can be split 16 KB/48 KB or 48 KB/16 KB between L1 cache and shared memory. Because the access latency to this memory is completely predictable, algorithms can be written to interleave loads, calculations, and stores with maximum efficiency.
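The 16 KB/48 KB split is selectable per kernel through the CUDA runtime API. A minimal sketch (the kernel here is a placeholder):

```cuda
// Placeholder kernel standing in for one that tiles heavily in shared memory.
__global__ void sharedHeavyKernel(float *data) { /* placeholder body */ }

int main() {
    // Prefer 48 KB shared memory / 16 KB L1 for a shared-memory-heavy kernel;
    // cudaFuncCachePreferL1 would choose the opposite 48 KB L1 / 16 KB split.
    cudaFuncSetCacheConfig(sharedHeavyKernel, cudaFuncCachePreferShared);
    sharedHeavyKernel<<<1, 32>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}
```

The choice is a workload question: kernels that stage data explicitly benefit from more shared memory, while kernels with irregular global-memory reuse benefit from the larger L1.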

30 The Memory Hierarchy
The Fermi GPU is also equipped with an L2 cache, which covers GPU-local DRAM as well as system memory. The L2 cache subsystem also implements a set of memory read-modify-write operations that are atomic.
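These atomic read-modify-write operations are exposed in CUDA as intrinsics such as atomicAdd. A small illustrative example (names are mine), a global histogram whose concurrent updates are safe because each increment is a single atomic read-modify-write:

```cuda
// Sketch: a histogram over byte values, built with atomic updates to
// global memory; the read-modify-write is resolved atomically by the
// memory subsystem rather than by software locking.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // atomic RMW on global memory
}
```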


33 Conclusions
Looking at these hardware features, can you see how to write more efficient programs for GPUs? Start forming groups of up to 5 students for the project. There are two papers to read (links on the web page).
