Tag: Loop Parallelization

During SC14, Michael Klemm from Intel and I teamed up to give an OpenMP 4.0 overview talk at the OpenMP booth. Our goal was to touch on all important aspects, from thread binding and tasking to accelerator support, and to entertain our audience in doing so. Although not all jokes translate from German to English as we intended, I think the resulting video is a fun 25-minute rundown of OpenMP 4.0 and worth sharing here:

Since 2001, the IT Center (formerly: Center for Computing and Communication) of RWTH Aachen University has offered a one-week HPC workshop on Parallel Programming every spring. This course is not restricted to scientists and engineers from our university; in fact, about 30% of the attendees each time are external. This year we were very happy about a record attendance of up to 85 people for the OpenMP lectures on Wednesday. As usual we publish all course materials online, but this year we also created screencasts of all presentations. That means you see the slides and the live demos and you hear the presenter talk. This blog post contains links to both the screencasts and the other course material, sorted by topic.

OpenMP

We have three talks as an introduction to OpenMP from Wednesday and two talks on selected topics from Thursday, which were vectorization and tools.

Introduction to OpenMP Programming (part 1), by Christian Terboven:

Getting OpenMP up to Speed, by Ruud van der Pas:

Introduction to OpenMP Programming (part 2), by Christian Terboven:

Vectorization with OpenMP, by Dirk Schmidl:

Tools for OpenMP Programming, by Dirk Schmidl:

MPI

We have two talks as an introduction to MPI and one on using the Vampir toolchain, all from Tuesday.

Introduction to MPI Programming (part 1), by Hristo Iliev:

Introduction to MPI Programming (part 2), by Hristo Iliev:

Introduction to VampirTrace and Vampir, by Hristo Iliev:

Intel Xeon Phi

We put a special focus on presenting this architecture and we have one overview talk and one talk on using OpenMP 4.0 constructs for this architecture.

Quoting from openmp.org: OpenMP, the de-facto standard for parallel programming on shared memory systems, continues to extend its reach beyond pure HPC to include embedded systems, real time systems, and accelerators. Release Candidate 1 of the OpenMP 4.0 API specifications currently under development is now available for public discussion. This update includes thread affinity, initial support for Fortran 2003, SIMD constructs to vectorize both serial and parallelized loops, user-defined reductions, and sequentially consistent atomics. The OpenMP ARB plans to integrate the Technical Report on directives for attached accelerators, as well as more new features, in a final Release Candidate 2, to appear sometime in the first Quarter of 2013, followed by the finalized full 4.0 API specifications soon thereafter.
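To give an idea of what the user-defined reductions mentioned in the quote look like, here is a minimal sketch in C++. It assumes the `declare reduction` syntax of the final OpenMP 4.0 specification, which may differ in detail from Release Candidate 1; the `MinMax` type and the functions `combine` and `find_min_max` are made up for this example.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical helper type for this sketch: running minimum and maximum.
struct MinMax {
    double lo = std::numeric_limits<double>::infinity();
    double hi = -std::numeric_limits<double>::infinity();
};

// Merge two partial results into one.
MinMax combine(MinMax a, const MinMax& b) {
    a.lo = std::min(a.lo, b.lo);
    a.hi = std::max(a.hi, b.hi);
    return a;
}

// Teach OpenMP a reduction named "minmax" for the type MinMax.
// omp_out, omp_in and omp_priv are identifiers defined by the specification.
#pragma omp declare reduction(minmax : MinMax : omp_out = combine(omp_out, omp_in)) \
    initializer(omp_priv = MinMax())

// Each thread works on its own private MinMax; the private copies are
// merged with combine() at the end of the loop.
MinMax find_min_max(const std::vector<double>& v) {
    const double* data = v.data();
    const long n = static_cast<long>(v.size());
    MinMax result;
    #pragma omp parallel for reduction(minmax : result)
    for (long i = 0; i < n; ++i) {
        result.lo = std::min(result.lo, data[i]);
        result.hi = std::max(result.hi, data[i]);
    }
    return result;
}
```

If the compiler is run without OpenMP enabled, the directives are ignored and the loop simply runs serially with the same result.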

Expect big news on OpenMP 4.0 at next week’s SC12. The OpenMP Language Committee – responsible for developing the standard – always planned to release the next version of the standard as a draft for public comment in time for SC12. We worked very hard over the last few weeks to stay on schedule. And we will do the following:

Release OpenMP 4.0 RC1 as a draft for public review. This document will be in pretty good shape and will represent the foundation of OpenMP 4.0. It will contain several new features, to be discussed and explained during SC12 at our booth and/or the OpenMP BoF. Among these new features are the SIMD construct, to vectorize both serial and parallelized loops, taskgroups (no task dependencies yet), thread binding via places (I have talked a lot about this already), array sectioning, basic support for Fortran 2003, and some other minor corrections and improvements.

Publish a Technical Report on OpenMP for Accelerators, more specifically on “Directives for Attached Accelerators”. This was always planned to be the major addition for OpenMP 4.0. However, integrating support for accelerators with the rest of OpenMP is a hard task and a lot of work, and it is not 100% done yet. There were many discussions on how to deal with this situation: do as outlined here, wait for just a few more weeks, come up with a completely new schedule and wait until we are completely done, … Almost all technical aspects have been discussed and answered, but the wording is not yet complete, and support for NVIDIA-like GPUs might not be optimal. However, I personally think the proposal is really good, and the big opportunity in making the current state of work public is that the HPC community can take a look at it, think about it, comment on it, and possibly improve it. It is already online: http://openmp.org/wp/openmp-specifications/.
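As a rough illustration of the SIMD construct from the first item, here is a small C++ sketch of the two cases it is meant to cover: a serial loop that is vectorized, and a loop that is first distributed across threads and then vectorized. It assumes the directive spelling of the final OpenMP 4.0 specification (`simd` and `parallel for simd`), which may not match RC1 exactly; the function names are made up for this example.

```cpp
#include <algorithm>
#include <vector>

// y[i] = a * x[i] + y[i], executed by one thread; the simd directive
// asks the compiler to vectorize the loop.
void axpy_simd(float a, const std::vector<float>& x, std::vector<float>& y) {
    const long n = static_cast<long>(std::min(x.size(), y.size()));
    const float* xp = x.data();
    float* yp = y.data();
    #pragma omp simd
    for (long i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}

// Same computation, but the iterations are first split among the threads
// of a team and each thread's chunk is then vectorized.
void axpy_parallel_simd(float a, const std::vector<float>& x, std::vector<float>& y) {
    const long n = static_cast<long>(std::min(x.size(), y.size()));
    const float* xp = x.data();
    float* yp = y.data();
    #pragma omp parallel for simd
    for (long i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Both functions compute the same result; which one pays off depends on the loop length and on whether the call already happens inside a parallel region.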

Hoping for constructive feedback and taking the additional time to work on the OpenMP for Accelerators extension, the current plan is to come up with a second draft for public comment (RC2) in January 2013 and then finalize the standard quickly after that, meaning within a few weeks. This plan is still ambitious, but I think it is a good one.

If you want to learn more, come to the OpenMP booth, and come to the BoF on Tuesday afternoon, 17:30h, which unfortunately I will not be able to attend myself. Listen to what the people there show you and let us know what you like and what you dislike.

Whenever Prof. Christian Bischof, the head of our institute, is on duty to give the Introduction to Programming (de) lecture for first-year Computer Science students, he is keen on giving the students a glimpse of parallel programming. As in 2006, this task was assigned to me as guest lecturer. While coping with parallelism in its various aspects consumes most of my work day, these students have just started to learn Java as their first programming language. As in 2006, I worried about how to motivate the students, what level of detail would be reasonable, and what tools and techniques to present within a timeframe of just 1.5 hours. In the following paragraphs I briefly explain what I did, and why. The slides used in the lecture are available online: Introduction to Parallel Programming; and my student Christoph Rackwitz made a screencast of the lecture available here (although the slides are in English, I am speaking German).

Programming Language: As stated above, the target audience is first-year Computer Science students attending the Introduction to Programming course. The programming language taught in the course is Java. In a previous lecture we once tried to present the examples and exercises in C, assuming that C is very similar to Java, but the students did not like that very much. Although they were able to follow the lecture and were mostly successful in the exercises, C just felt kind of foreign to most of them. Furthermore, C is not widely used in other courses later on in the studies, except for System Programming. The problem with Java is, however, that it is not commonly used in technical computing and that the native approach to parallel programming in Java is oriented more towards building concurrent (business) applications than towards reasoning about parallel algorithms. Despite this issue we decided to use Java in the Introduction to Parallel Programming lecture in order to keep the students comfortable, and to not mess around with the example and exercise environment already provided for them. The overall goal of the lecture was to give the students an idea of the fundamental change towards parallelism, to explain the basic concepts, and to motivate them to develop an interest in this topic. We thought this is independent of the programming language.

Parallelization Model: There is Shared-Memory parallelization and there is Message-Passing for Clusters. It would be great to teach both, and of course we do that in advanced courses, but I do not think it is reasonable to cover both in an introductory session. In order to motivate the growing need for parallel programming at all, the trend towards Multi-Core and Many-Core is an obvious foundation. Given that, and the requirement to allow the students to work on examples and exercises on their systems at home, we decided to discuss multicore architectures and present one model for Shared-Memory parallel programming in detail, and just provide an overview of what a Cluster is. Furthermore, we hoped that the ability to speed up the example programs by experiencing parallelism on their very own desktops or laptops would add some motivation. This feels more real than logging in to a remote system in our cluster. In addition, providing instructions to set up a Shared-Memory parallelization tool on a student’s laptop was expected to be simpler than for a Message-Passing environment (this turned out to be true).

Parallelization Paradigm: Given our choice to cover Shared-Memory parallelization, and the requirement to use Java and to provide a suitable environment to work on examples and exercises, we basically had three choices: (i) Java-Threads, (ii) OpenMP for Java, and (iii) Parallel Java (PJ). Maybe we could have looked at some other, more obscure paradigms as well, but I do not think they would have contributed any new aspects. In essence, Java-Threads are similar to Posix-Threads and Win32-Threads and are well-suited for building server-type programs, but not good for parallelizing algorithms or for use in introductory courses. Using this model, you first have to talk about setting up threads and implementing synchronization before you can start to think parallel 😉. I like OpenMP a lot for this purpose, but there is no official standard of OpenMP for Java. We looked at two implementations:

JOMP, by the Edinburgh Parallel Computing Centre (EPCC). To our knowledge, this was the first implementation of OpenMP for Java. It comes as a preprocessor and is easy to use. But development stopped long ago, and it does not work well with Java 1.4 and later.

JaMP, by the University of Erlangen. This implementation is based on the Eclipse compiler and even extends Java for OpenMP to provide more constructs than the original standard, while still not providing full support for OpenMP 2.5. Anyhow, it worked fine with Java 1.6, was easy to install and distribute among the students and thus we used it in the lecture.

Parallel Java (short: PJ), by Alan Kaminsky at the Rochester Institute of Technology, also provides means for Shared-Memory parallelization, but in principle it is oriented towards Message-Passing. Since it provides a very nice and simplified MPI-style API, we would have used it if we included Cluster programming, but sticking to Shared-Memory parallelization we went for JaMP.

Content: What should be covered in just 1.5 hours? Well, of course we need some motivation at the beginning as to why parallel programming will become more and more important in the future. We also explained why the industry is shifting towards multicore architectures, and what implications this will or may have. As explained above, the largest part of the lecture was spent on OpenMP for Java along with some examples. We started with a brief introduction on how to use JaMP and what OpenMP programs look like, then covered Worksharing and Data Scoping with several examples. I think running into a Data Race is a very important experience every parallel programmer should have had 🙂, as is learning about reductions. That was about it for the OpenMP part. The last minutes of the lecture were spent on clusters and their principal ideas, followed by a Summary.
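To make the three topics Worksharing, Data Scoping and reductions a bit more concrete, here is a minimal sketch. Note that it is not taken from the lecture: the lecture used JaMP with Java, while this sketch uses standard OpenMP directives in C++, and the function name `approximate_pi` is made up for this example.

```cpp
// Approximate pi by integrating 4 / (1 + x^2) over [0, 1] with the
// midpoint rule.
double approximate_pi(long steps) {
    const double width = 1.0 / static_cast<double>(steps);
    double x = 0.0;    // scratch variable: every thread needs its own copy
    double sum = 0.0;  // accumulated result of all threads

    // Worksharing: "parallel for" distributes the iterations among threads.
    // Data Scoping: private(x) gives each thread its own x.
    // Reduction: each thread adds into a private sum, and the private sums
    // are combined at the end. Without the reduction clause, all threads
    // would update the shared sum concurrently, which is a Data Race.
    #pragma omp parallel for private(x) reduction(+ : sum)
    for (long i = 0; i < steps; ++i) {
        x = (static_cast<double>(i) + 0.5) * width;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * width;
}
```

Calling `approximate_pi(1000000)` gives a value close to pi, independent of the number of threads up to floating-point rounding.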

Given the constraints and our reasoning outlined above, we ended up using Java as the programming language and JaMP as the paradigm to teach Shared-Memory parallelization, just mentioning that there are Clusters as well. Although the official course evaluation is not done yet, we got pretty positive feedback regarding the lecture itself, and the exercises were well received. What bothers me is the fact that there is no real OpenMP for Java. The Erlangen team provided a good implementation along with a compiler that served our examples and exercises, but it does not provide full OpenMP 2.5 support, not to speak of OpenMP 3.0. Having a full OpenMP for Java implementation at hand would be a very valuable tool for teaching parallel programming to first-year students, since Java is the language of choice not only at RWTH Aachen University.

Do you have other opinions, experiences, or ideas? I am always in for a discussion.

Not yet carved in stone, but the current plan of the OpenMP Language Committee (LC) is to publish a draft OpenMP 3.1 standard for public comment by IWOMP 2010 and to have the OpenMP 3.1 specification finished for SC 2010 – given that the Architecture Review Board (ARB) accepts the new version. Bronis R. de Supinski (LLNL) has taken on the duty of acting as the chair of the LC and has since introduced some process changes. Besides weekly telephone conference calls, there are three face-to-face meetings per year, and attendance is required for voting rights. The first face-to-face meeting was held on June 1st and 2nd in Dresden, attached to IWOMP 2009; the second one was on September 22nd and 23rd in Chicago. This blog post is intended to report on this last meeting and to present an overview of what is going on with OpenMP right now, obviously from my personal point of view.

In the course of resuming work on OpenMP after the 3.0 specification was published, the LC voted on the priority of (small) extensions and clarifications for 3.1 as well as new topics for 4.0. We ended up with 12 major topics and 5 subcommittees, as outlined in Bronis’ talk during IWOMP 2009, which are still in use as identifiers of the different topics people are working on.

1: Development of an OpenMP Error Model. This is the feature the LC people think OpenMP is missing most desperately, but despite that it has not received much effort yet. A subcommittee has been formed, led by Tim Mattson (Intel) and Michael Wong (IBM), and currently there are three proposals on the table for discussion: (i) an extension of the API routines and some constructs to return error codes, or the introduction of a global error indication variable, (ii) an exception-based mechanism to catch errors, and (iii) a callback-based mechanism that allows reacting to errors based on their severity and origin. The absence of an error model is clearly a reason for not using OpenMP in applications with certain requirements on reliability, but introducing the wrong error model could easily spoil OpenMP for that audience. It seems that most LC people do not like error codes too much (I don’t either), and using exceptions is not suitable for C and Fortran, so the third approach seems most promising: it allows a program to react to errors depending on their severity and still allows the compiler to ignore OpenMP if it is not enabled. In fact, this mechanism was already proposed back in 2006 by Alex Duran (BSC) and friends. Since nothing has been decided yet, I guess the error model is targeted for OpenMP 4.0.

2: Interoperability and Composability. This subcommittee is led by me (RWTH) and Bronis R. de Supinski (LLNL) and is looking for ways of allowing OpenMP to coexist with other threading packages, maybe even with other OpenMP runtime environments in the same application. We are also looking into how to allow the creation of parallel software components that can safely be plugged together, which I consider prominently missing in virtually all threading paradigms. This is a very broad topic and I would not name an OpenMP version by which it will be solved, but with a little bit of luck we can make some progress even for version 3.1. We have some ideas on the table of how to specify some basic aspects of OpenMP interacting with the native threading packages (POSIX-Threads on Linux/Unix, Win32-Threads on Windows), driven by application observations and known deficiencies in current OpenMP implementations. We might also attack the problem of orphaned reductions. I am less certain that we will solve the issue of allowing or detecting nested Worksharing constructs.

3: Incorporating Tools Support into the OpenMP Specification. This was already on the feature wishlist for OpenMP 3.0, but there is hardly any activity regarding this topic. Most vendors provide their own tools to analyze the performance (or correctness) of OpenMP programs by making their own runtime talk to their specific tool, but this situation is far from optimal for research / academia tools. As early as 2004 there were some proposals (e.g. POMP by Bernd Mohr and friends), but they did not make it into the specification or into actual implementations.

4: Associating Computation or Memory across Workshares. Today, the world of OpenMP is flat (memory), so this topic is mostly about supporting cc-NUMA architectures in OpenMP. There are two subcommittees working on this issue. The first is led by Dieter an Mey (RWTH), and the goal is to standardize common practices (used in today’s applications) of dealing with cc-NUMA optimizations. If nothing gets in the way, OpenMP 3.1 will allow the user to bind threads to cores by either specifying an explicit mapping, or by telling the runtime a strategy (like compact vs. scatter). Of course there are more ideas (and features needed), like influencing the memory allocation scheme, using page migration if supported by the operating system, or interacting with resource management systems (batch queuing systems), but these are very hard to specify in a portable and extensible fashion. The other subcommittee is led by Barbara Chapman (UH) and deals with thread team control. Using the Worksharing constructs in OpenMP, it is very hard to dedicate a special task (e.g. I/O) to just one thread of the Parallel Region. There are applications asking for that, but I don’t see a proposal that the LC would agree on for 3.1. Nevertheless, they presented some interesting ideas at the last F2F based on HPCS language capabilities, which hopefully have the potential to influence OpenMP 4.0.

5: Accelerators, GPUs and More. Of course we have to follow the trend / hype 😉. But since no one knows for sure in which direction the hardware is evolving, there are many different ideas on how to deal with this. Off the top of my head I can enumerate that PGI has some directives loosely based on OpenMP Worksharing (plus they have CUDA for Fortran), IBM has OpenMP for Cell with several ideas on extensions, BSC has a proposal that is in principle based on their *SS concept, and CAPS Entreprise has the HMPP constructs + compiler. In summary: No clear direction yet, nothing for OpenMP in the scope of 3.1.

6: Transactional Memory and Thread Level Speculation. Some people thought that OpenMP might need something for Transactional Memory. To the best of my knowledge, no one from the LC has done any work in this regard.

7: Refinements to the OpenMP Tasking Model. There are two things that most people agree Tasks are missing: Dependencies and Reductions. With respect to the former, there were three proposals on the table from Grant Haab (Intel), Federico Massaioli (Caspur) and Alex Duran (BSC), and the BSC proposal looks most promising because it avoids deadlocks. It employs existing program variables to define the dependencies between tasks, i.e. the result of one task’s computation can be the input of another task. With a good portion of luck, Task Dependencies could actually make it into OpenMP 3.1, I think. With respect to the latter, namely Task Reductions, there has been only little progress so far.

8: Extending OpenMP to C++0x and Fortran 2003. Since the C++0x standard dropped Concepts, the work that Michael Wong (IBM) and I (RWTH) had done so far became obsolete. To the best of my knowledge, no progress has been made on investigating the opportunities or issues that could arise with Fortran 2003.

9: Extending OpenMP to Additional Languages. Well, there are Java and C#, and at least for Java there are some implementations of OpenMP available (incomplete, though). Anyhow, there was never any real attempt to write a formal specification of OpenMP for Java, nor for C#, and I don’t think there is one now.

10: Clarifications to the Existing Specifications. The LC already approved several minor corrections (e.g. mistakes in the examples, improvements in the wording, and the like) that will make their way into OpenMP 3.1. Nothing spectacular, but this is something that has to be done.

11: Miscellaneous Extensions. I might be wrong, but I think that User-defined Reductions (UDR) belong to this topic. Yes, there is a chance that UDRs will make it into OpenMP 3.1! This will bring obvious things like min and max for C and C++, but we are aiming higher: The goal is to enable the programmer to write any type of reduction operation for any type in the base language (including non-PODs), and this is achieved by introducing an OpenMP declare statement to define a reduction operation that can be specified in a reduction clause. There are two problems that are under discussion right now: (i) C++ templates and (ii) pointers / arrays. The first can be addressed by an extension of the current proposal, and I got the feeling that most LC people like the new approach, but the second is a bit more complex. If you want to reduce an array that is described by a pointer, you need to know how many elements the array consists of and thus how much space to allocate for the thread-private copy. There has been some discussion on this, but no strong agreement on how to solve this issue in general, as it also arises with the private, firstprivate, … clauses. We only agreed that we need a one-size-fits-all solution. With a good portion of luck we can solve this issue; otherwise we hopefully get UDRs with some limitations in OpenMP 3.1 and the full functionality in a later version of the specification.

12: Additional Task / Threads Synchronization Mechanisms. Again I might be wrong, but I think that the Atomic Extension proposal by Grant Haab (Intel) belongs in here. This is a feature you will also find in threading-aware languages (such as C++0x), but the current base languages of OpenMP are not of that kind. This will almost certainly make it into OpenMP 3.1 and will allow for a portable way to write atomic updates that capture a value, as well as atomic writes. This is already supported by most machines, and using an atomic operation can be so much more efficient than using a Critical Region.
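To illustrate the last item, here is a small C++ sketch of an atomic update that captures a value and of an atomic write. It assumes the `atomic capture` and `atomic write` spelling that was eventually published in OpenMP 3.1, which is not necessarily identical to the proposal as discussed at the meeting; the type `ClaimResult` and the function `claim_slots` are made up for this example.

```cpp
#include <vector>

// Hypothetical result type for this sketch.
struct ClaimResult {
    std::vector<int> owner;  // owner[s] = loop iteration that claimed slot s
    int last_written;        // an iteration number stored by some thread
};

// Every loop iteration claims a unique slot by atomically reading and
// incrementing a shared counter.
ClaimResult claim_slots(int n) {
    ClaimResult r;
    r.owner.assign(n, -1);
    int* owner = r.owner.data();
    int next = 0;   // shared counter
    int last = -1;  // shared variable written by all threads

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        int mine = 0;
        // Atomic update that also captures the old value of the counter.
        #pragma omp atomic capture
        mine = next++;

        owner[mine] = i;  // no conflict: every value of mine is unique

        // Atomic write: a concurrent store without tearing.
        #pragma omp atomic write
        last = i;
    }
    r.last_written = last;
    return r;
}
```

The same effect could be achieved by wrapping `mine = next++;` in a Critical Region, but the atomic form can map directly to a hardware instruction where one is available.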

If you are interested in more details, you are invited to stop by the OpenMP booth at SC 2009 in Portland and ask the nice guy on booth duty some good questions🙂.