Month: November 2008

This blog post has been hanging around in my drafts folder for quite some time now. I have been talking about the changes a lot during the last couple of days, and I finally found a few minutes to finish it and publish my notes here. From a C++ programmer's point of view, OpenMP 3.0 comes with a few improvements, and since I was involved in getting those into the specification, this post explains what we achieved and why we did not (yet) go further with some features…

1: The for-Worksharing construct has been enhanced to support RandomAccessIterators, both signed and unsigned integers, and even C-style pointers.

RandomAccessIterators allow for constant-time increment, decrement, advance and distance computation, as they basically encapsulate pointer arithmetic. The 3.0 specification allows the use of loop variables of RandomAccessIterator type for loops used with for-Worksharing, as long as only the following relational operators are used inside the loop expression: <, <=, >, >=.
While it is certainly nice to have this in, many programmers (including me) will miss the != operator. The reason for its exclusion is that not all language committee members could be convinced that the number of loop iterations can be computed beforehand if this operator is allowed (or that the overflow behavior would be equivalent to the integer case). By the time we decided to stop adding and extending features, we did not have answers for all questions and theoretical counter-examples, so != did not make it; nevertheless, we hope to get it into the next specification.

The 2.5 specification only allowed signed integer loop variables for loops used with for-Worksharing. This was bad, as it was incompatible with size_t, which is used, for example, to query the number of elements in an STL container. The 3.0 specification allows both signed and unsigned integer variables, which is a rather small but nice improvement.

C-style pointer loops are allowed as well with the 3.0 specification. The same restrictions apply to the operator list as in the C++ RandomAccessIterator case.

2: It is now possible to threadprivatize static class member variables.

It is important to note that only *static* class member variables can be made threadprivate. There have been certain use cases for this, for example the implementation of a Singleton pattern and thread-specific allocators, and the changes to the specification were minimal. This was probably just overlooked in the 2.5 specification process.
There were some requests to allow the privatization of class member variables in general, but this cannot be done via the threadprivate directive, since the address of a non-static class member variable is not known at compile time.

3: We specified the lifetime and initialization of non-POD data types used in privatization clauses.

The lifetime and initialization of non-POD data types was somewhat unclear in the 2.5 specification, and because of that the behavior differed between compilers. It was important to get this right, so we made some updates to the semantics of private variables of non-POD data types. A few things are important to note:

The order in which constructor and destructor calls for different threads happen is undefined. This is because we do not (want to) define the order in which threads are started, and you should never make any assumptions about that.

The default constructor and destructor have to be accessible. If, for example, the default constructor is private, the program is non-conforming if such an object occurs in a private clause.

We stated some things explicitly, e.g. that private objects have to be destructed at the end of a Parallel Region. This should force implementations to become consistent. Of course, you should be aware that implementations are allowed to introduce additional objects of automatic storage duration if they "like" to; this is granted by the C++ standard. The following gives a brief overview of what happens with C++ non-POD data types for the different privatization clauses:

private: There has to be an accessible, unambiguous default constructor which is called for each object, and the object is destructed at the end of the Parallel Region via the accessible, unambiguous destructor.

firstprivate: The private instances are copy-constructed, with the original list item as the argument for the copy constructor call. Of course, such a data type is required to have an accessible, unambiguous copy constructor.

lastprivate: The value is written back to the original list item using the accessible, unambiguous copy assignment operator. A suitable default constructor has to be available as well, unless the data type also appears in a firstprivate clause, in which case a suitable copy constructor is needed instead.

threadprivate: Here we have to differentiate three kinds of initialization: (i) no initialization, in which case the default constructor is called; (ii) direct initialization, in which case the constructor accepting the argument is called; (iii) copy initialization, in which case the copy constructor is called. In any case, the objects have to be constructed before the first reference and destructed after the last reference, before the program terminates.

threadprivate+copyin and threadprivate+copyprivate: Regarding initialization, the rules for threadprivate apply to the first encountered Parallel Region; at any following Parallel Region, the copy assignment operator is invoked.

This is just a brief summary of the changes we made; I hope it is of interest to at least one person other than me🙂. From my point of view, there is still one "simple" thing missing: allowing non-POD data types in reductions. By the time we decided to stop adding and extending features, we had not found a consensus on how the initialization for the reduction should occur. This is important, because with operator overloading you can basically implement user-defined reductions. We really hope to have that in the next specification update!

When I was asked to answer the question of how to kill OpenMP by 2011 during the OpenMP BoF panel discussion at SC08, I decided against listing the most prominent issues and challenges OpenMP is facing. It turned out that the first two speakers – Tim Mattson from Intel and Bronis de Supinski from LLNL – did exactly that, very well. Instead, my claim is that OpenMP is doing quite well today and we "just" have to continue riding the multi-core momentum by outfitting OpenMP with a few more features. Our group is pretty involved in the OpenMP community, and my feeling is that since around early 2008 OpenMP has been gaining momentum; I tried to present this in an entertaining way. This is a brief textual summary of my panel contribution (please do not take everything too seriously).

RWTH Aachen University is a member of the OpenMP ARB (Architecture Review Board), as OpenMP is very important for many of our applications: all large codes (in terms of compute cycle consumption) are hybrid today, and in order to serve some complex applications for which no MPI parallelization exists (so far), we offer the largest SPARC- and x86-based SMP systems one can buy. Obviously, we would be very sad if OpenMP disappeared, but to find an answer to the question of what a university could do to kill OpenMP by 2011, it took just a few domestic beers and a good chat with friends at one of the nice pubs in Austin, TX: teach goto-based spaghetti-style programming, since branching into and out of Parallel Regions is not allowed by OpenMP, and as such this programming style is inherently incompatible with OpenMP.

By the next day this idea had lost some of its fascination🙂, so I went off to evaluate OpenMP's current momentum. In 2007, we were invited to write a chapter for David Bader's book on Petascale Computing. Just recently, we did a keyword search in it (with some manual postprocessing):

Petascale Computing: Algorithms and Applications, by David Bader.

Keyword                    Hits
MPI                         612
OpenMP                      150
Thread                      109
Posix-Threads                 2
UPC                          30
C++                          87
Fortran                      69
Chapel                       49
HPF                          11
X10, Fortress, Titanium    < 10

This reveals at least the following interesting aspects:

MPI is clearly assessed to be the most important programming paradigm for Petascale systems, but OpenMP is also well-recognized. Our own chapter on how to exploit SMP building blocks accounted for only 28 of the 150 hits on OpenMP.

The term Thread is often used in conjunction with OpenMP, but other threading models are hardly touched at all.

C/C++ and Fortran are the programming languages expected to be used for programming current and future Petascale systems.

There was one chapter on Chapel, which is why it got a comparably high number of hits; otherwise, the "new" parallel programming paradigms are not (yet?) considered to be significant.

In order to take an even closer look at the recognition of OpenMP we asked our friend Google:

Google Trends: OpenMP versus Native Threading.

One can clearly see that the interest in OpenMP is increasing, as opposed to Posix-Threads and Win32-Threads. At the end of 2007 there is a peak, when OpenMP 3.0 was announced and a draft standard was released for public comment. Since Q3/2008 we have compilers supporting OpenMP 3.0, which accounts for increasing interest again. Given this momentum, it is hard if not impossible for us, representing a university and the community, to kill OpenMP – which is actually quite nice.

But going back to finding an answer to the question posed to us, we found a suitable assassin: the trend of making Shared-Memory systems more and more complex in terms of architecture. For example, all current x86-based systems (as announced this week at SC08) are cc-NUMA systems once you have more than one socket, and maybe we will eventually see NUCA (= non-uniform cache architecture) systems as well. So the hardware vendors actually have a chance to kill OpenMP by designing systems that are hard to exploit efficiently with multithreading. Thus, the only way to really kill OpenMP by 2011 is to leave it as it is and not equip it with means to aid the programmer in squeezing performance out of such systems with an increasing depth of the memory hierarchy. In terms of OpenMP, the world is still flat:

OpenMP 3.0: The World is still flat, no support for cc-NUMA (yet)!

OpenMP is hardware agnostic, it has no notion of data locality.

The Affinity problem: How to maintain or improve the nearness of threads and their most frequently used data.

Or:

Where to run threads?

Where to place data?

Yesterday evening I arrived in Austin, TX. I had to spend about six hours at the Chicago airport, and it was not as bad as I anticipated: since this was my third visit to this airport, I knew where to find power outlets and the like (even when the official ones are taken). By the way, if you don't know it yet, take a look at this wiki from Jeff Sandquist listing power outlets at several airports.

The next week is tightly packed with a couple of HPC events, and I will try to summarize interesting notes picked up at these events throughout the next couple of days.

For about two years I have been running a blog on Parallel Programming over at Live Spaces. Unlike several friends and colleagues, I liked the service, but with a growing number of readers the following three issues really bugged me:

Only people logged in with a Windows Live ID are allowed to comment. Judging from the emails I received, this stopped several people from commenting, but it did not help with my next complaint (presumably this behavior is intended).

Most of the comments I get are spam. While that is bad enough already, there is no function to approve comments individually, so I had to remove the spam manually, which is pretty tedious (and even error-prone).

The advertising sometimes was inappropriate. I don't mind advertising on a free service, and I don't mind banners of dating services in general. But every now and then, when I checked that a blog post was formatted correctly, I found some inappropriate banners.

For these reasons, I went through a brief evaluation phase and have just now decided to switch to WordPress. Using Wei Wei's Live Space Mover, I imported the last five blog posts (to have some content to be found by search engines) to this blog. I will leave the Live Space alive, but will stop blogging there and disable the comment option.

I hope the people interested in my experiments and findings will follow.

Because of conflicting dates and my intent to keep some events that had been scheduled for quite some time, I decided against attending Microsoft's PDC 2008 – which probably was the wrong decision😦. The mainstream media is talking a lot about Windows Azure and Windows 7 – interesting news items for sure, but not closely related to my "business", except maybe for the fact that Windows Server 2008 R2 will be capable of handling more than 64 cores.

In my opinion, PDC 2008 brought us great news regarding the feature set of Visual Studio 2010 and Microsoft's activities on multi-core (aka Shared-Memory parallel) programming, and I hope to see and learn more about this during SC08 in about two weeks from now. Meanwhile, if you are, like me, interested in this stuff, you can do the following:

Grab the Visual Studio 2010 and .NET Framework 4.0 CTP virtual machine image from this download site and play with it. From my experience with the Visual Studio 2008 beta program, I know that although my laptop is dual-core and has 2 GB of memory, running the VM on that hardware is not a lot of fun. I am quite happy that just recently I virtually "found" a two-socket quad-core (Clovertown) machine that is not suited for production, as it is an Intel Software Development test machine with a pre-series chipset and pre-series CPUs. I instantly relieved that machine of its stupid software and system testing tasks🙂.

Watch the PDC 2008 videos that are available online. Channel 9 has a section covering PDC 2008. That is nice if you are in front of a PC and connected to the Internet, but downloading all the interesting videos from that site is a pain. Greg Duncan has put together a site containing the download links to the videos (in various formats) and the PowerPoint slides here. Grep the content for “parallel” or “concur” or “Studio”!

For a while now I have been involved in several teaching activities on parallel programming, and in my humble opinion this also includes talking about parallel computer architectures. As I am usually responsible for Shared-Memory parallel programming with OpenMP, TBB and the like, examples and exercises include learning about and tuning for the recent multi-core architectures we are using, namely Opteron-based and Xeon-based multi-socket systems. Well, understanding the perils of Shared-Memory parallel programming is not easy, and my impression is that several students are challenged when asked to transfer the usual obstacles of parallel programming (e.g. load imbalance) to the context of different systems (e.g. UMA versus cc-NUMA). So this blog post has two goals: examine and tune a sparse Matrix-Vector-Multiplication (SMXV) kernel on several architectures, (1) putting my oral explanations into text as a brief reference and (2) showing that one can do all the analysis and tuning work on Windows as well.

From school you probably know how to do a Matrix-Vector-Multiplication for dense matrices. In the field of high-performance technical computing, you typically have to deal with sparse linear algebra (unless you run a LINPACK🙂 benchmark). In my example, the matrix is stored in CRS format and has the following structure:

Matrix Structure Plot: DROPS.

The CRS format stores just the nonzero elements of the matrix in three vectors: the val-vector contains the values of all nonzero elements; the col-vector has the same dimension as the val-vector and contains the column index of each nonzero element; the row-vector has one more entry than the matrix has rows and points, for each matrix row, to the index (in val and col) of the row's first nonzero element. While there are several different formats for storing sparse matrices, the CRS format is well-suited for matrices without special properties and allows for an efficient implementation of the Matrix-Vector-Multiplication kernel. The intuitive approach for a parallel SMXV kernel may look as shown below. Let Aval, Acol and Arow be the vector-based implementations of val, col and row:

How good is this parallelization for the matrix shown above? Let's take a look at a two-socket quad-core Intel Xeon E5450-based system (3.0 GHz). Below, I plot the performance in MFLOP/s for one to eight threads, using just the plain Debug configuration of Visual Studio 2008 with OpenMP enabled:

Performance plot of a parallel SMXV: Intuitive Parallelization.

The speedup for two threads (about 1.7) is not too bad, but the best speedup of just 2.1 is achieved with eight threads. It does not pay off significantly to use more than four threads. This is because the Frontside Bus has an insuperable limit of about eight GB/s in total, and using dedicated memory bandwidth benchmarks (e.g. STREAM) one can see that this limit can already be reached with four threads (sometimes even with just two). Since we are working with a sparse matrix, most accesses are quasi-random, and neither the hardware prefetcher nor compiler-inserted prefetch instructions can help us anymore.

In many cases, thread binding can be of some help to improve the performance. The result of thread binding is also shown as Debug w/ “scatter” binding – using this approach the threads are distributed over the machine as far away from each other as possible. For example with two threads, each thread is running on a separate socket. This strategy has the advantage of using the maximal possible cache size, but does not improve the performance significantly for this application (or: Windows is already doing a similarly good job with respect to thread binding). Nevertheless, I will use the scattered thread binding strategy in all following measurements. Now, what can we do? Let’s try compiler optimization:

Performance plot of a parallel SMXV: Compiler Optimization.

Switching to the Release configuration does not require any work from the user, but results in a pretty nice performance improvement. I usually enabled architecture-specific optimization as well (e.g. SSE-support is enabled in the ReleaseOpt configuration), but that does not result in any further performance improvement for this memory-bound application / benchmark. Anyway, as the compiler has optimized our code for example with respect to cache utilization, this also increases the performance when using more than one thread!

In sequential execution (aka with one thread only) we get about 570 MFLOP/s. This is only a small fraction of the peak performance one core could theoretically deliver (1 core * 3 GHz * 4 FLOP/cycle = 12 GFLOP/s), but this is what you have to live with given the gap between CPU speed and memory speed. In order to improve the sequential performance, we would have to examine the matrix access pattern and re-arrange / optimize it with respect to the given cache hierarchy. But for now, I would rather think about the parallelization again: when you look at the matrix structure plot above, you will find that the density of nonzero elements decreases with increasing row number. Our parallelization did not respect this, so we should expect a load imbalance limiting our parallelization. I used the Intel Thread Profiler (available on Windows as well as on Linux) to visualize this:

Intel Thread Profiler: Load Imbalance with SMXV.

The default for-loop scheduling in OpenMP is static (well, on all implementations I know), thus the iteration space is divided into as many chunks as there are threads, all of approximately equal size. So the first thread (T1 in the image above) gets the part of the matrix containing the denser rows, and thus has more work to do than the other threads. Note: the reason why the Thread Profiler claims that threads two to four have "Barrier" overhead instead of "Imbalance" overhead is my benchmark kernel, which looks slightly different from the code snippet above; let's ignore that differentiation here.

So, what can we do about it? Right, OpenMP allows for pretty easy and efficient ways of influencing the for-loop scheduling strategy. We just have to extend the parallel for directive of the code snippet above with a schedule clause:

With guided scheduling, the initial chunks have an implementation-specific size which is decreased exponentially down to the chunksize specified, or 1 in our case. For the matrix with a structure as shown above, this results in a good load balance. So this is the performance we get including all optimization we discussed so far:

Performance plot of a parallel SMXV: Load Balancing.

We started with a non-optimized serial code delivering about 350 MFLOP/s and finished with a parallel code delivering about 1000 MFLOP/s! This is still far away from linear scaling, but it is what you see in reality with complex (aka memory-bound) applications. Regarding these results, please note the following:

We did not apply any dataset-specific optimization. That means if the matrix structure changes (which it does over the course of a program run in the application I took this benchmark from), we will still do well and not run into a new load imbalance. This is clearly an advantage of OpenMP over manual threading!

We did not apply any architecture-specific optimization. This code will deliver reasonable performance on most machines. But we did not yet take a look at cc-NUMA machines (e.g. AMD Opteron-based or Intel Nehalem-based systems); this will be done in part 2. On a cc-NUMA system, there is a lot of performance to gain or to lose, depending on whether you do everything right or make a mistake.

Was anything in here OS-specific? No, it wasn’t. I did the experiments on Windows, but could have done everything on Linux in exactly the same way. More on this in the next post as well…