This advanced course is aimed at scientists and developers who want to understand performance-critical hardware features of modern CPUs (such as SIMD, ILP, caches, and out-of-order execution) and utilize these features in their code.

Contents:

Modern HPC hardware offers many advanced features that are not easily accessible, yet contribute significantly to the overall intra-node performance. However, many compute-bound HPC applications have grown over time to simply use more cores and were never designed to exploit these features.

To make things worse, modern compilers cannot generate fully vectorized code automatically unless the data structures and dependencies are very simple. As a consequence, such applications reach only a small fraction of the available peak performance. Hence, scientists have the additional responsibility to design generic data layouts and data access patterns that give the compiler a fighting chance to generate code using most of the available hardware features. These data layouts and access patterns are vital for obtaining performance from vectorization (SIMD).
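To illustrate what a vectorization-friendly data layout can look like, here is a minimal sketch contrasting an array-of-structures (AoS) layout with a structure-of-arrays (SoA) layout. The type and function names (`ParticleAoS`, `ParticlesSoA`, `shift_x_aos`, `shift_x_soa`) are hypothetical and chosen for illustration only; they are not taken from the course material.

```cpp
#include <cstddef>
#include <vector>

// Array of structures (AoS): all fields of one particle sit next to each
// other, so consecutive x values are four floats apart in memory.
struct ParticleAoS {
    float x, y, z, q;
};

// Structure of arrays (SoA): each field is stored in its own contiguous
// array, so consecutive x values are adjacent in memory.
struct ParticlesSoA {
    std::vector<float> x, y, z, q;
};

// Adds dx to every x coordinate; the access to x is strided.
void shift_x_aos(std::vector<ParticleAoS>& p, float dx) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        p[i].x += dx;
    }
}

// Same operation on the SoA layout; the access to x is unit-stride,
// which is the pattern compilers can usually map directly to SIMD
// loads and stores.
void shift_x_soa(ParticlesSoA& p, float dx) {
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += dx;
    }
}
```

Both functions compute the same result. Whether and how well each loop is vectorized depends on the compiler, its flags, and the target architecture, so the compiler's vectorization report is the place to check.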

Generic algorithms like FFTs or basic linear algebra can be accelerated by using third-party libraries and tools that are specially tuned and optimized for a multitude of hardware configurations. But what happens if your problem does not fall into this category and third-party libraries are not available? The training course sheds some light on how to achieve on-core performance in such cases.

We provide insights into today's CPU microarchitecture and apply this knowledge in the hands-on sessions. As example applications, we use a plain vector reduction and a simple Coulomb solver. We start from basic implementations and advance to optimized versions that use techniques such as vectorization, unrolling and cache tiling to increase performance. The course also contains training on the use of open-source tools to measure and understand the achieved performance results.
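As a rough idea of what the step from a basic to an unrolled implementation can look like, here is a minimal sketch of a vector reduction. It is an illustration under our own assumptions, not the code used in the hands-on sessions; the function names `reduce_plain` and `reduce_unrolled4` and the unroll factor of four are hypothetical choices.

```cpp
#include <cstddef>
#include <vector>

// Basic version: every addition depends on the previous one, so the
// loop forms a single dependency chain through `sum`.
double reduce_plain(const std::vector<double>& v) {
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        sum += v[i];
    }
    return sum;
}

// Unrolled version with four independent partial sums. The four
// accumulators do not depend on each other, which exposes
// instruction-level parallelism and gives the compiler a pattern
// it can map to SIMD registers.
double reduce_unrolled4(const std::vector<double>& v) {
    const std::size_t n = v.size();
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    // Remainder loop for sizes that are not a multiple of four.
    for (; i < n; ++i) {
        sum += v[i];
    }
    return sum;
}
```

Note that the unrolled version changes the order of the floating-point additions, so its result can differ from the plain version in the last bits. This reordering is also the reason a compiler will not perform the transformation on its own unless it is explicitly allowed to relax floating-point semantics.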

This course is for you if you ever asked yourself one of the following questions:

What is the performance of my code and how fast could it actually be?

Why is my performance so bad?

Does my code use SIMD?

Why does my code not use SIMD and why does the compiler not help me?

Is my data structure optimal for this architecture?

Do I need to redo everything for the next machine?

Why is it so complicated? I thought science was the hard part.

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step. Even if you do not speak C++, it will be possible to follow along and understand the underlying concepts.

In Part II of the course, you will learn how to utilize these features in a performance-portable way on multiple cores of a node. Furthermore, we will show how to use abstraction layers to separate the hardware-specific optimizations from the algorithm.

Prerequisites:

Linux (ssh), command-line tools (grep, less), knowledge of Fortran, C or C++
Experience with your own code that exhibits performance or scaling bottlenecks
Optional:
Git: examples are provided in a git repository
Editors: vim or emacs to work on remote machines

Please register with Andreas Beckmann (a.beckmann@fz-juelich.de).
If you do not belong to the staff of Forschungszentrum Jülich, we need the following data for registration:
given name, family name, date of birth, nationality, complete home address, email address