Scientists and developers who want to understand performance-critical hardware features of modern CPUs, such as SIMD, ILP, caches or out-of-order execution, and to use these features in their applications in a performance-portable way. (Advanced course)

Contents:

Generic algorithms like FFTs or basic linear algebra can be accelerated by using third-party libraries and tools that are tuned and optimized for a multitude of hardware configurations. But what happens if your problem does not fall into this category and third-party libraries are not available?

In Part I of this course we provided insight into today's CPU microarchitecture. As example applications we used a plain vector reduction and a simple Coulomb solver. We started from basic implementations and advanced to optimized versions using hardware features such as vectorization, unrolling and cache tiling to increase on-core performance. Part II sheds some light on achieving portable intra-node performance.
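To give a flavour of the Part I starting point, the following is a minimal sketch of an unrolled vector reduction. It is not taken from the course material: the function name reduce_unrolled and the unroll factor of four are assumptions chosen for illustration. The idea is that four independent accumulators break the single dependency chain of a naive loop, so an out-of-order core can overlap the additions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the course): sums n doubles using four
// independent accumulators so that consecutive additions do not depend
// on each other.
double reduce_unrolled(const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);

    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    const std::size_t n4 = n - n % 4;  // largest multiple of 4 not exceeding n

    // Main loop: four independent dependency chains.
    for (std::size_t i = 0; i < n4; i += 4) {
        acc0 += data[i];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }

    // Combine the partial sums, then handle the remaining 0..3 elements.
    double sum = (acc0 + acc1) + (acc2 + acc3);
    for (std::size_t i = n4; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

Note that the unrolled version adds the elements in a different order than a plain loop, so floating-point results may differ in the last bits.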

Continuing with the example applications from Part I, we use threading with C++11 std::thread to exploit multi-core parallelism and SMT (Simultaneous Multi-Threading). In this context, we discuss the fork-join model, tasking approaches and typical synchronization mechanisms.
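The fork-join model mentioned above can be sketched with C++11 std::thread as follows. This is an illustrative example under stated assumptions, not the course's implementation: the name parallel_sum, the static chunking and the per-thread partial results are all choices made here. Each worker is forked with its own slice of the input, writes exactly one partial sum, and the main thread joins all workers before combining the results, so no further synchronization is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (not from the course): fork-join reduction over
// `data` using `num_threads` worker threads.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);

    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    // Static partitioning: each thread gets one contiguous chunk.
    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    // Fork: start one worker per chunk.
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            // Each worker writes only its own slot, once, at the end.
            partial[t] = std::accumulate(data.data() + begin,
                                         data.data() + end, 0.0);
        });
    }

    // Join: wait for all workers, then combine the partial sums serially.
    for (std::thread& w : workers) {
        w.join();
    }
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

With GCC or Clang on Linux this would typically be compiled with -std=c++11 -pthread. Creating threads for every call is expensive, which is one motivation for the tasking approaches discussed in the course.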

To understand the parallel performance of memory-bound algorithms we take a closer look at the memory hierarchy and the parallel memory bandwidth. We consider data locality in the context of shared caches and NUMA (Non-Uniform Memory Access).

In this course we present several abstraction concepts to hide the hardware-specific optimizations. This improves readability and maintainability. We also discuss the overhead costs of the introduced abstractions and show compile-time SIMD configurations as well as corresponding performance results on different platforms.
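One way such an abstraction with a compile-time SIMD configuration might look is sketched below. Everything here is an assumption made for illustration: the SIMD_WIDTH macro, the Pack type and the reduce functions are hypothetical names, and the pack is a plain fixed-size array that relies on the compiler's auto-vectorizer instead of platform intrinsics. The point is only that the algorithm is written once against the pack interface, while the width is selected at compile time per platform.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compile-time switch, e.g. set via -DSIMD_WIDTH=8.
#ifndef SIMD_WIDTH
#define SIMD_WIDTH 4
#endif

// Hypothetical pack type: W lanes of T stored in a plain array.
// A tuned version could specialize this with platform intrinsics.
template <typename T, std::size_t W>
struct Pack {
    T lane[W];

    static Pack zero() {
        Pack r = {};  // value-initializes all lanes to zero
        return r;
    }
    static Pack load(const T* p) {
        Pack r;
        for (std::size_t i = 0; i < W; ++i) r.lane[i] = p[i];
        return r;
    }
    Pack& operator+=(const Pack& other) {
        for (std::size_t i = 0; i < W; ++i) lane[i] += other.lane[i];
        return *this;
    }
    T horizontal_sum() const {
        T s = T();
        for (std::size_t i = 0; i < W; ++i) s += lane[i];
        return s;
    }
};

// Width-generic reduction written once against the Pack interface.
template <std::size_t W>
double reduce(const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    Pack<double, W> acc = Pack<double, W>::zero();
    const std::size_t nw = n - n % W;
    for (std::size_t i = 0; i < nw; i += W) {
        acc += Pack<double, W>::load(data + i);
    }
    double sum = acc.horizontal_sum();
    for (std::size_t i = nw; i < n; ++i) sum += data[i];  // remainder
    return sum;
}

// Entry point using the width configured at compile time.
inline double reduce_configured(const double* data, std::size_t n) {
    return reduce<SIMD_WIDTH>(data, n);
}
```

Because the width is a template parameter, the inner loops have a fixed trip count and can be fully unrolled by the compiler. Whether the abstraction is free of overhead still has to be measured on each platform.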

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step.

Prerequisites:

Participation in the Part I course or deep knowledge of the covered topics;
Linux (ssh), command-line tools (grep, less), knowledge of Fortran, C or C++ and a threading framework (std::thread, pthreads, ...);
Experience with your own code exhibiting performance/scaling bottlenecks;
optional:
Git: examples are provided in a git repository
Editors: vim or emacs to work on remote machines

Please register with Andreas Beckmann (a.beckmann@fz-juelich.de) by 20 October 2020.
If you do not belong to the staff of Forschungszentrum Jülich, we need the following data for registration:
Given name, surname, date of birth, nationality, complete home address, email address

JSC Events - Measures Regarding the Coronavirus Pandemic

Due to the preventive measures at Forschungszentrum Jülich regarding the spread of the Coronavirus, JSC courses were cancelled or postponed. For the time being and with the rapidly changing situation in mind, JSC cannot foresee whether courses and events in the coming months can take place as scheduled as face-to-face events. Seminars will preferably be streamed as video conferences; courses might be postponed or partly given as webinars. We still take registrations for the upcoming courses. All participants who have registered for courses so far will be notified by e-mail after the regular registration deadline whether and how the courses will be held.