PGI Accelerator Vector Addition

This sample shows a minimal conversion of our vector addition CPU code to a PGI Accelerator directives version. Consider this a PGI Accelerator ‘Hello World.’ Modifications from the CPU version will be highlighted and briefly discussed. Please direct any questions or comments to help@nccs.gov.

This tutorial covers PGI Accelerator directives. If you are interested in PGI OpenACC support, please see: OpenACC Vector Addition

The code inside the acc region is computed on the GPU. The region begins with the #pragma acc region directive and is enclosed in curly brackets. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region, as deemed necessary by the compiler.

Compiling vecAdd.c

We add the target accelerator flag to specify that we want to compile for NVIDIA accelerators.
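On a Cray system with the PGI programming environment loaded, the compile line might look like this sketch; the cc wrapper name is an assumption about the site setup, while -ta=nvidia is the PGI target accelerator flag:

```shell
$ cc -ta=nvidia vecAdd.c -o vecAdd.out
```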

Changes to vecAdd.f90

!$acc region
  do i=1,n
    c(i) = a(i) + b(i)
  enddo
!$acc end region

The code inside the acc region is computed on the GPU. The region begins with the !$acc region directive and ends with the !$acc end region directive. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

Compiling vecAdd.f90

We add the target accelerator flag to specify that we want to compile for NVIDIA accelerators.
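As with the C version, the compile line might look like the following sketch; the ftn wrapper name is an assumption about the site setup, while -ta=nvidia is the PGI target accelerator flag:

```shell
$ ftn -ta=nvidia vecAdd.f90 -o vecAdd.out
```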

Running vecAdd.f90

$ aprun ./vecAdd.out
final result: 1.000000

Additional Information

Much information is hidden from the programmer, so let's add the -Minfo compiler flag to see what the compiler is doing. With the -Minfo flag we will see memory-transfer and thread-placement information.
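Rebuilding with the extra flag might look like this sketch (compiler wrapper name assumed, as before):

```shell
$ cc -ta=nvidia -Minfo vecAdd.c -o vecAdd.out
```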

We see that at line 33, the start of our acc region, elements 0 to 99999 of the vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, but it does need to be copied out, and we see that the compiler has handled it correctly.

Next the compiler tells us it has generated binaries for both compute capability 1.0 and compute capability 1.3 devices. The binary with the highest compute capability less than or equal to the GPU it is being run on will be used, allowing the executable to be portable yet highly tuned.

35, Loop is parallelizable
Accelerator kernel generated

Starting at line 35, the line containing the for/do loop statement, the compiler reports that it has found the loop parallelizable and generated a GPU kernel. Let's break down the provided information.

35, #pragma acc for parallel, vector(256)

In CUDA terminology, this translates to a kernel with a block size of 256; that is, each logical thread block will contain 256 threads.
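As an illustrative comparison only (not the compiler's actual generated code), a hand-written CUDA kernel with the same blocking might look like:

```cuda
// Hypothetical hand-written equivalent of the generated kernel
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Global index of this thread across all blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Launched with 256 threads per block, matching vector(256):
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```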