This course will introduce you to the multiple forms of parallelism found in modern Intel architecture processors and teach you the programming frameworks for handling this parallelism in applications. You will get access to a cluster of modern manycore processors (Intel Xeon Phi architecture) for experiments with graded programming exercises.
This course applies to various HPC and datacenter workloads and frameworks, including artificial intelligence (AI). You will learn how to handle data parallelism with vector instructions, task parallelism in shared memory with threads, parallelism in distributed memory with message passing, and memory architecture parallelism with optimized data containers. This knowledge will help you to accelerate computational applications by orders of magnitude, all the while keeping your code portable and future-proof.
Prerequisite: programming in C/C++ or Fortran in the Linux environment and Linux shell proficiency (navigation, file copying, editing files in text-based editors, compilation).

Instructors

Andrey Vladimirov

Video transcript

How do you approach programming with this variety of processors in such a way that your program doesn't become obsolete in the next few years? Modern code is a set of practices that allows you to design your computational applications so that they are highly optimized for modern processors, portable across the different varieties of modern processors, and future-proof, meaning ready to exploit the upcoming generations of processors. To satisfy these requirements, modern code should be cognizant of the underlying computer architecture and, at the same time, compliant with standards. Usually, code modernization comes up in the context of taking an older application, which we might call a legacy application, and optimizing it for performance. To optimize a legacy application for performance, you must teach it to use multiple cores, vectorization, and other features that were not important at the time the application was developed. But at the same time, you don't want to specialize your application so much that it runs very well on one processor but cannot scale to future processors. To avoid specialization, you have to rely on a high-level framework such as OpenMP. OpenMP, for example, can give you access to threading, vectorization, and offload by means of directives in the code. So performance optimization and code modernization go hand in hand. And it is important to understand the scope of this discussion. When you are solving a real-world problem in your field, you may be making decisions that are specific to your field, such as decisions on describing your problem with an analytical model and on discretizing it. Decisions related to the choice of numerical algorithm may be the subject of computer science. Code modernization becomes important when you are taking a particular numerical algorithm and implementing it, as close as possible to the computer architecture, as C, C++, or Fortran code.
So we will talk about implementation-level performance optimization. Such performance optimization may be structured into five areas: scalar tuning, vectorization, threading, memory access, and communication. These optimization areas correspond to the fundamental building blocks of modern processors: the pipeline, the vector processing units, the cores, the caches and memory, and, in the case of cluster applications, the fabric. In this course, we will present targeted exercises for each of these optimization areas, and you will see how to enable these features in legacy applications without making them too specialized for a particular platform. The typical result of code modernization is shown in this plot. What happened here is we took an educational exercise, a direct N-body simulation, and initially implemented it as a legacy application, with legacy practices. Then, through a series of performance optimizations, we arrived at what could be called modern code. At each optimization step, we benchmarked the application on three platforms: a high-end Xeon, a 1st-generation Xeon Phi coprocessor, and a 2nd-generation Xeon Phi processor. For every optimization step, we used the exact same code on all platforms. What you can observe from these numbers is, first, that this legacy code did not perform well on the specialized platforms; it performed, in fact, worse than on a Xeon. The second observation is that with performance optimization, performance grew on all platforms. At the final stage of optimization, we observe that Xeon Phi performs better than a Xeon, and the 2nd-generation Xeon Phi performs better than the 1st generation, which demonstrates the portability and future readiness of this code. This course will present the fundamental methods used in each of these modernization steps, and I will outline the course roadmap in the next video.