Modern HPC hardware offers many advanced, not easily accessible features that contribute significantly to overall intra-node performance. However, many compute-bound HPC applications have historically grown by simply using more cores and were not designed to exploit these features.

To make things worse, modern compilers cannot generate fully vectorized code automatically unless the data structures and dependencies are very simple. As a consequence, such applications use only a small fraction of the available peak performance. As scientists, we therefore have the added responsibility to design generic data layouts and data access patterns that give the compiler a fighting chance to generate code utilizing most of the available hardware features. Such data layouts and access patterns are vital to extracting performance from vectorization/SIMDization. Generic algorithms like FFTs or basic linear algebra can be accelerated with third-party libraries and tools especially tuned and optimized for a multitude of different hardware configurations.
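As a rough illustration of why the data layout matters (this sketch is our own and not taken from the course material), compare an array-of-structs (AoS) layout with a struct-of-arrays (SoA) layout. In the SoA variant each field is contiguous and accessed with unit stride, which compilers typically vectorize far more readily than the strided AoS accesses:

```cpp
#include <cstddef>
#include <vector>

struct ParticleAoS { double x, y, z; };   // interleaved fields (AoS)

struct ParticlesSoA {                     // one contiguous array per field (SoA)
    std::vector<double> x, y, z;
};

// AoS: each field is accessed with stride 3, which often defeats
// auto-vectorization or forces expensive gather operations.
void scale_aos(std::vector<ParticleAoS>& p, double s) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        p[i].x *= s; p[i].y *= s; p[i].z *= s;
    }
}

// SoA: simple unit-stride loops the compiler can turn into SIMD code.
void scale_soa(ParticlesSoA& p, double s) {
    for (std::size_t i = 0; i < p.x.size(); ++i) p.x[i] *= s;
    for (std::size_t i = 0; i < p.y.size(); ++i) p.y[i] *= s;
    for (std::size_t i = 0; i < p.z.size(); ++i) p.z[i] *= s;
}
```

Both functions compute the same result; only the memory layout, and with it the compiler's ability to vectorize, differs.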

But what happens if your problem does not fall into this category and third-party libraries are not available? This training course will shed some light on how the goal of utilizing on-core performance, and ultimately performance portability, can be achieved.

In the first part of the training course we give insights into today's CPU microarchitecture and apply this knowledge in the hands-on sessions. As a demonstrator we use a simple Coulomb solver and improve the code step by step: starting from a basic implementation, we advance to an optimized version that uses hardware features such as vectorization to increase performance.
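The course's actual starting code is not reproduced here; as an illustrative sketch (names and structure are our own), a basic O(N²) direct-summation kernel for the Coulomb potential might look like this:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive direct summation of the Coulomb potential at every particle,
//   phi_i = sum_{j != i} q_j / |r_i - r_j|,
// in units where the Coulomb constant is 1. This is the kind of
// baseline one would then optimize step by step.
std::vector<double> coulomb_potential(const std::vector<double>& x,
                                      const std::vector<double>& y,
                                      const std::vector<double>& z,
                                      const std::vector<double>& q) {
    const std::size_t n = x.size();
    std::vector<double> phi(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;          // skip self-interaction
            const double dx = x[i] - x[j];
            const double dy = y[i] - y[j];
            const double dz = z[i] - z[j];
            phi[i] += q[j] / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return phi;
}
```

Even this small kernel already raises the course's central questions: does the inner loop vectorize, and is the data layout helping or hindering the compiler?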

The exercises also include training on the use of open-source tools to measure and understand the achieved performance. Such optimizations, however, depend heavily on the targeted hardware and should not be part of the algorithmic layer of the code.

In the second part we present a detailed description of possible abstraction layers that hide such hardware specifics and thereby preserve readability and maintainability. We also discuss the overhead costs of the introduced abstraction and show compile-time SIMD configurations and the corresponding performance results on different platforms.
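To make the idea of such an abstraction layer concrete, here is a minimal sketch of a compile-time SIMD pack (our own illustration, not the course's actual layer): the vector width is a template parameter, so the algorithmic code is written once and only the width is retargeted per platform.

```cpp
#include <array>
#include <cstddef>

// A fixed-width value pack. The short fixed-trip-count loops are
// trivially unrollable and vectorizable by the compiler; porting to
// a new machine only means choosing a different W in one place.
template <typename T, std::size_t W>
struct Pack {
    std::array<T, W> v{};

    Pack& operator+=(const Pack& o) {
        for (std::size_t i = 0; i < W; ++i) v[i] += o.v[i];
        return *this;
    }
    Pack& operator*=(const Pack& o) {
        for (std::size_t i = 0; i < W; ++i) v[i] *= o.v[i];
        return *this;
    }
    T hsum() const {                       // horizontal reduction
        T s{};
        for (std::size_t i = 0; i < W; ++i) s += v[i];
        return s;
    }
};

// Hardware-specific choice made once at compile time, e.g. 4 doubles
// for 256-bit SIMD or 8 for 512-bit (width here is an assumption).
using native_pack = Pack<double, 4>;
```

A production layer would map such operations onto intrinsics or a library such as `std::experimental::simd`, but the principle is the same: the hardware choice lives behind the type, not inside the algorithm.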

If you have ever asked yourself one of the following questions, this course is for you.

What is the performance of my code and how fast could it actually be?

Why is my performance so bad?

Does my code use SIMD?

Why does my code not use SIMD and why does the compiler not help me?

Is my data-structure optimal for this architecture?

Do I need to redo everything for the next machine?

Why is this so complicated? I thought the science was the hard part.

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step. Even if you do not speak C++, you will be able to follow along and understand the underlying concepts.