PGI 2018 Key Features

Accelerate Your HPC Applications with Tesla V100 GPUs

PGI OpenACC and CUDA Fortran now support CUDA 10.0 running on Tesla
Volta GPUs. Tesla V100 offers more memory bandwidth, more streaming
multiprocessors, next generation NVLink and new microarchitectural
features that add up to better performance and programmability. For
OpenACC and CUDA Fortran programmers, Tesla V100 offers improved
hardware support and performance for CUDA Unified Memory features on
both x86-64 and OpenPOWER processor-based systems. With PGI 2018, you
get the best of both worlds: world-class CPU performance plus
comprehensive GPU support.

PGI in the Cloud

PGI Community Edition compilers for Linux/x86-64 are now available as a container image on the NVIDIA GPU Cloud (NGC) and as an Amazon Machine Image (AMI) on the Amazon Web Services (AWS) Marketplace. These images provide OpenACC-enabled Fortran, C and C++ compilers supporting the latest multicore CPUs and NVIDIA GPUs including the Volta V100 family. NGC users can pull the PGI container to develop HPC applications on Alibaba Cloud, AWS, Google Cloud Platform, the Oracle Cloud Infrastructure or on local workstations and HPC systems. AWS users can run the PGI AMI on a variety of AWS-supported platforms. PGI in the Cloud is ideal for users who want to build, test, benchmark and run their own applications in the cloud using the latest NVIDIA GPUs, and for development and deployment of cloud-based parallel programming education and training.

PGI Auto-compare for OpenACC

Results can diverge between programs running on a CPU versus a GPU due to programming errors, precision of numerical intrinsics, or variations in compiler optimizations. Available since PGI 18.7, the OpenACC Auto-compare option causes OpenACC compute regions to run redundantly on both the CPU and GPU. When data is copied from the GPU back to the CPU at data region boundaries, GPU results are compared with those computed on the CPU. Auto-compare works on both structured and unstructured data regions, with difference reports controlled by environment variables. With OpenACC Auto-compare you can quickly pinpoint where results start to diverge and adapt your program or compiler options as needed. Read more about the auto-compare feature on the PGI Compiler Assisted Software Testing overview page.
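As a rough illustration of where Auto-compare does its checking, here is a minimal sketch of an OpenACC data region in C++. The function name scaled_exp and its arguments are hypothetical, chosen only for this example. We believe the feature is enabled with the -ta=tesla:autocompare compiler option and tuned through the PGI_COMPARE environment variable, but treat both spellings as assumptions and check the PGI documentation for your release.

```cpp
#include <cmath>
#include <vector>

// Hypothetical kernel for illustration only: y[i] = exp(x[i]) * scale.
// Math intrinsics such as exp() are a typical place where CPU and GPU
// results can differ in the last few bits.
void scaled_exp(const std::vector<double>& x, std::vector<double>& y,
                double scale) {
    const int n = static_cast<int>(x.size());
    const double* xp = x.data();
    double* yp = y.data();

    // yp is copied back to the host at the end of this data region. That
    // boundary is where Auto-compare would check GPU values against the
    // values computed redundantly on the CPU.
    #pragma acc data copyin(xp[0:n]) copyout(yp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            yp[i] = std::exp(xp[i]) * scale;
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the loop sequentially, so the same source can also serve as a plain CPU reference.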

OpenACC Deep Copy Directives

PGI 18.7 and later versions include an implementation of the draft
OpenACC 3.0 true deep copy directives in Fortran, C and C++. Many modern
HPC applications make extensive use of deeply nested aggregate data
structures: Fortran derived types, C++ classes and C structs. With true
deep copy directives
you can specify a subset of members to move between host and device
memory within the declaration of an aggregate, including support
for named policies that allow distinct sets of members to be copied
at different points in a program. Once the deep copy pattern is defined,
a single data clause (copy(a)) can be used to copy the selected members
of the aggregate, including dynamically allocated members, some of which
can themselves be aggregate structures with dynamically allocated members.
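The sketch below shows roughly what a deep copy pattern declared inside an aggregate might look like in C++. The shape directive spelling follows our reading of the draft proposal and is an assumption, as are the Particles type and the advance function, which are hypothetical names used only here. Consult the PGI documentation for the exact syntax and any compiler option the feature requires.

```cpp
#include <vector>

// Hypothetical aggregate with two dynamically sized members.
struct Particles {
    int n;        // number of elements in each array
    double* pos;  // positions
    double* vel;  // velocities
    // Assumed draft syntax: declares the extent of each pointer member so
    // that a plain copy(p) can move the arrays along with the struct.
    #pragma acc shape(pos[0:n], vel[0:n])
};

// Advances positions by vel * dt. With the shape declared above, the single
// data clause copy(p) is intended to cover p.pos and p.vel as well.
void advance(Particles& p, double dt) {
    #pragma acc parallel loop copy(p)
    for (int i = 0; i < p.n; ++i) {
        p.pos[i] += p.vel[i] * dt;
    }
}
```

The point of the design is that the copy pattern lives with the type declaration, so call sites stay as simple as copy(p) no matter how many members the aggregate has.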


LLVM/x86-64 Performance

PGI 2018 compilers for Linux/x86-64 platforms include an optional
LLVM-based code generator that delivers performance improvements of
up to 15% on many HPC applications. OpenACC and CUDA Fortran are fully
supported with the LLVM-based code generator, and it enables support
for OpenMP 4.5 features on the latest multicore x86-64 CPUs. It can be
invoked with a simple compiler command-line option, using compiler path
settings, or using the environment modules commands included in PGI
installations. The LLVM-based code generator will become the default
on x86-64 targets in a future PGI release. Get started using it now
to see improved multicore CPU performance, take advantage of the latest
OpenMP features, and simplify migration to future PGI releases.

Support for the Latest CPUs

Multicore CPU performance remains one of the key strengths of the PGI
compilers, which now support the latest generation of HPC CPUs including
Intel Skylake, IBM POWER9 and AMD EPYC. PGI Fortran 2003, C11 and C++14
compilers deliver state-of-the-art SIMD vectorization and benefit from
newly optimized single and double precision numerical intrinsic functions
on Linux/x86-64, Linux/OpenPOWER and macOS. See the benchmarks section
for PGI 2018 performance results on a variety of HPC industry standard
benchmarks.

Full OpenACC 2.6

All PGI compilers now support the latest OpenACC features on both Tesla
GPUs and multicore CPUs. New OpenACC 2.6 features include manual deep copy
directives, the serial compute construct, if_present clause in the
host_data construct, no_create data clause, attach/detach clauses,
acc_get_property API routines and improved support for Fortran optional
arguments. Other OpenACC features added or enhanced include cache directive
refinements and support for named constant arrays in Fortran modules.
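To make two of the OpenACC 2.6 additions more concrete, here is a minimal sketch combining manual deep copy with the serial construct. The Field type and the first_plus_last function are hypothetical names invented for this example. It is one possible arrangement, not a recommended pattern.

```cpp
#include <vector>

// Hypothetical aggregate holding a pointer to dynamically sized data.
struct Field {
    int n;
    double* data;
};

double first_plus_last(Field& f) {
    double result = 0.0;

    // Manual deep copy: copy the struct first, then the array it points to.
    // Under OpenACC 2.6, copying f.data[0:f.n] while f is already present
    // also attaches the device array to the device copy of f.
    #pragma acc enter data copyin(f)
    #pragma acc enter data copyin(f.data[0:f.n])

    // The serial construct runs the region on the device with one thread.
    #pragma acc serial present(f) copy(result)
    {
        result = f.data[0] + f.data[f.n - 1];
    }

    // Release the member array before the struct that points to it.
    #pragma acc exit data delete(f.data[0:f.n])
    #pragma acc exit data delete(f)
    return result;
}
```

The ordering matters: the parent struct goes to the device before its members and is removed after them.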

Dramatically Lower Development Effort

OpenACC for CUDA Unified Memory

PGI compilers leverage Pascal and Volta GPU hardware features, NVLink
and CUDA Unified Memory to simplify OpenACC programming on GPU-accelerated
x86-64 and OpenPOWER processor-based servers. When OpenACC allocatable
data is placed in CUDA Unified Memory, no explicit data movement or data
directives are needed. This simplifies GPU acceleration of applications
that make extensive use of allocatable data, and allows you to focus on
parallelization and scalability of your algorithms. See the OpenACC and
CUDA Unified Memory PGInsider post for details.
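A minimal sketch of what this can look like in practice: the loop below operates on heap-allocated data and carries no data directives or data clauses at all. It assumes the program is built with managed memory enabled (with PGI 2018 we understand this to be the -ta=tesla:managed option). The make_ramp and scale functions are hypothetical names used only for illustration.

```cpp
// Hypothetical helper: allocates n doubles on the heap and fills them with
// 0, 1, 2, ... When managed memory is enabled, this allocation is assumed
// to be placed in CUDA Unified Memory.
double* make_ramp(int n) {
    double* a = new double[n];
    for (int i = 0; i < n; ++i) {
        a[i] = static_cast<double>(i);
    }
    return a;
}

// Scales the array in place. No copy, copyin or copyout clauses appear:
// with Unified Memory, pages migrate between host and device on demand.
void scale(double* a, int n, double factor) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        a[i] *= factor;
    }
}
```

Without managed memory, a loop like this would normally need a data clause such as copy(a[0:n]) to tell the compiler how much data to move.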

OpenMP 4.5 for Multicore CPUs

Previously available only with PGI compilers for Linux/OpenPOWER, OpenMP 4.5
syntax and features are now supported in the PGI 2018 Fortran,
C and C++ compilers on Linux/x86-64. You can now use PGI to compile
OpenMP 4.5 programs for parallel execution across all the cores of a
multicore CPU or server. TARGET regions are implemented with default
support for the multicore host as the target, and PARALLEL and DISTRIBUTE
loops are parallelized across all OpenMP threads.
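As an illustration, the following sketch uses a combined OpenMP 4.5 target construct for a dot product. The dot function is a hypothetical example. When the multicore host is the target, as described above, the map clauses involve no real transfer and the loop iterations are spread across the host threads.

```cpp
#include <vector>

// Hypothetical example: dot product written with OpenMP 4.5 target syntax.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    const int n = static_cast<int>(a.size());
    const double* ap = a.data();
    const double* bp = b.data();
    double sum = 0.0;

    // map() describes data movement for a real offload device; with the
    // host as the target these clauses are effectively no-ops.
    #pragma omp target teams distribute parallel for \
        map(to: ap[0:n], bp[0:n]) map(tofrom: sum) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += ap[i] * bp[i];
    }
    return sum;
}
```

Writing the loop with target syntax keeps the source ready for compilers that offload the same region to an accelerator.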

New C++17 Features

Release 2018 of the PGI C++ compiler introduces partial support for
the C++17 standard when compiling with --c++17 or -std=c++17.
Supported C++17 core language features are available on all supported macOS
versions and on Linux systems with GCC 7 or newer. New C++ language features
include compile-time conditional statements (constexpr if), structured bindings,
selection statements with initializers, fold expressions, inline variables,
constexpr lambdas, and lambda capture of *this by value.
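The short sketch below touches each of the listed language features once. All names in it (kScale, sum_all, describe, lookup_scaled, Counter) are hypothetical and exist only for this example. It should compile with any compiler in C++17 mode.

```cpp
#include <map>
#include <string>
#include <type_traits>

inline constexpr int kScale = 2;  // inline variable

// Fold expression: adds up all arguments in the pack.
template <typename... Ts>
constexpr auto sum_all(Ts... xs) {
    return (xs + ... + 0);
}

// Compile-time conditional: only the matching branch is instantiated.
template <typename T>
std::string describe(T) {
    if constexpr (std::is_integral_v<T>) {
        return "integer";
    } else {
        return "other";
    }
}

int lookup_scaled(const std::map<std::string, int>& m, const std::string& key) {
    // Selection statement with initializer: 'it' is scoped to the if.
    if (auto it = m.find(key); it != m.end()) {
        const auto& [name, value] = *it;  // structured binding
        (void)name;                       // only the value is used here
        constexpr auto twice = [](int x) { return x * kScale; };  // constexpr lambda
        return twice(value);
    }
    return -1;
}

struct Counter {
    int count = 0;
    // Captures a copy of *this, so later changes to the original object
    // do not affect the returned lambda.
    auto snapshot() const {
        return [*this] { return count; };
    }
};
```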

AVX-512 Support

Intel AVX-512 CPU instructions available on the latest generation
Skylake CPUs enable twice the number of floating point operations
per instruction compared to the previous generation AVX2 SIMD instructions. At 512
bits wide, AVX-512 doubles both the register width and the total
number of registers, and can help improve the performance of HPC
applications.
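The arithmetic behind the doubling is simple, and the sketch below spells it out alongside a loop of the kind a vectorizing compiler can map onto SIMD instructions. The lanes and daxpy functions are hypothetical names for this example. Whether a given loop is actually vectorized depends on the compiler, its options and the target CPU.

```cpp
#include <cstddef>

// Number of elements that fit in one SIMD register.
constexpr std::size_t lanes(std::size_t register_bits, std::size_t element_bits) {
    return register_bits / element_bits;
}

static_assert(lanes(256, 64) == 4, "AVX2: 4 doubles per 256-bit register");
static_assert(lanes(512, 64) == 8, "AVX-512: 8 doubles per 512-bit register");

// A simple loop with independent iterations. When vectorized for AVX-512,
// each instruction can operate on 8 doubles instead of 4 with AVX2.
void daxpy(std::size_t n, double a, const double* x, double* y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```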

PGI Unified Binary Performance

PGI Unified Binary for Tesla and Multicore

Use OpenACC to build applications for both GPU acceleration and parallel
execution across all the cores of a multicore server. When you run the
application on a GPU-enabled system, the OpenACC regions will offload
and execute on the GPU. When the same application executable is run on a
system without GPUs installed, the OpenACC regions will be executed in
parallel across all CPU cores in the system. If you develop commercial
or production applications, now you can accelerate your code with OpenACC
and deploy a single binary usable on any system, with or without GPUs.

Use C++14 Lambdas with Capture in OpenACC Regions

C++ lambda expressions provide a convenient way to define anonymous
function objects at the location where they are invoked or passed as
arguments. The auto type specifier can be applied to lambda parameters
to create a polymorphic lambda-expression. With PGI compilers you can use
lambdas in OpenACC compute regions in your C++ programs. Using lambdas
with OpenACC is useful for a variety of reasons. One example is to drive
code generation customized to different programming models or platforms.
C++14 has opened up doors for more and more lambda use cases, especially
for polymorphic lambdas, and all of those capabilities are now usable in
your OpenACC programs.
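Here is a minimal sketch of a polymorphic lambda with captures being called inside an OpenACC compute region. The transform_in_place and scale_and_shift functions are hypothetical helpers written for this example. It assumes the compiler can inline the lambda call into the device code.

```cpp
#include <vector>

// Hypothetical helper: applies op to every element inside an OpenACC
// parallel loop. Op can be any callable, including a lambda.
template <typename T, typename Op>
void transform_in_place(T* data, int n, Op op) {
    #pragma acc parallel loop copy(data[0:n])
    for (int i = 0; i < n; ++i) {
        data[i] = op(data[i]);
    }
}

void scale_and_shift(std::vector<float>& v, float scale, float shift) {
    // Polymorphic (generic) lambda: the parameter type is deduced via auto.
    // scale and shift are captured by value, so the copies travel with the
    // closure object into the compute region.
    auto affine = [scale, shift](auto x) { return x * scale + shift; };
    transform_in_place(v.data(), static_cast<int>(v.size()), affine);
}
```

Passing the operation as a lambda keeps the loop and its directive in one place while callers supply the per-element work.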

Enhanced Profiling Features

New CPU Detail View shows a breakdown of the time spent on the CPU for
each thread. Three call tree options allow you to profile based on caller,
callee or by file and line number. View time for all threads together or
individually, quickly sort events by min or max time, and more. Other new
features include an option to adjust program counter sampling frequency,
and an enhanced display of the NVLink topology that shows the NVLink version.

See the What's New section in the PGI Release Notes for complete details.