OpenMP® Forum

Discussion on the OpenMP specification run by the OpenMP ARB. OpenMP and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board in the United States and other countries. All rights reserved.

For an assignment I am asked to implement an algorithm using OpenMP. I am working on OSX 10.8.3 and have installed g++-4.7 from Homebrew, and everything compiles and works fine, except for the speed. The serial version of the algorithm runs in 0.98 seconds; the OpenMP version runs in 4.8 seconds. Executing the same algorithm on my laptop, but in Parallels on Ubuntu 12.04 (also compiled with g++-4.7), both run in 1.55 seconds, which is closer to what I expected. On a remote server (also running Ubuntu) the sequential code runs in 1.39 seconds and the OpenMP code in 0.78 seconds, which is more like what I was expecting.

So I think it is not my algorithm that is improperly programmed. Adding num_threads(2) to the pragma improves the result slightly, down to 1.67 seconds, but this is still slower than what I see in my Parallels session with Ubuntu, and also (marginally) slower than the sequential code. I would expect at least the same speedup as in my Parallels Ubuntu, since it runs on the same machine, just with a different OS. Is there something not correct on OSX?

I am not forcing any number of threads, but omp_get_num_threads() returns 8. On Ubuntu in Parallels it returns 4 (because it only has access to 4 cores through the VM settings). I have an Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz (4 cores, up to 8 hardware threads). I will try to post a minimal example of my assignment; the execution times are very consistent, and I am using gettimeofday to measure time.

In the results below, the first number is the serial execution time and the second is the OpenMP version.

On Ubuntu in Parallels I get: 0.214129 0.448152

Interestingly, if I enable the pragma on line 99, both OSes get a nice speedup. On OSX the time then becomes: 0.226577 0.158224

And on Ubuntu the time becomes: 0.209734 0.176547

Which is basically the same. I don't quite understand why this is. I am supposed to run the algorithm a number of times, but not in parallel, to measure how long it takes. I can understand that there is perhaps a speedup from 0.44 to 0.17, since it might start a new execution of the algorithm before the previous one is done and therefore 'cheat' a little, but a speedup from 40.81 seconds to 0.15 seconds? Where does that come from?

The reason your code gets very poor speedup is that the parallelism is too fine grained, and the overhead of setting up the parallel regions and synchronising threads at the end of them outweighs any benefit of splitting the computation over multiple threads. Most likely this is much worse on 8 threads than on 4 because there will be other processes (e.g. from the OS, or other applications which are running) competing for the CPUs, so your 8 threads are time-sharing instead of all running continuously. This means that at the end of every parallel region the code has to wait until all 8 threads have been scheduled on the CPU, which can typically take tens of milliseconds. I expect that if you run on OSX with 4 threads you will see similar performance to that on Ubuntu.

Enabling the pragma on line 99 is definitely cheating! In this case the other parallel constructs will be ignored, since nested parallelism is disabled by default. The code now has a bug, because multiple threads are reading and writing all of the shared arrays, and the results will be incorrect. Performance is much better because you are only synchronising the threads once instead of tens of thousands of times, but you still do not see good speedup: this is most likely because of contention for cache lines in the shared arrays.

hansg91 wrote: Thank you for your fast reply; it cleared up some things.

You're very welcome!

How much slowdown you observe may depend on the OS's scheduling policy: it would not surprise me if this is different between OSX and Ubuntu. It may also depend on what the OpenMP runtime does with threads between parallel regions (spin/yield/sleep), which again might vary between OSs. Do the settings of OMP_WAIT_POLICY and OMP_PROC_BIND make any difference?

Why would they have chosen these scheduling policies for OSX, then? Or is my example just a bad example? The environment variables you mentioned did not make a noticeable difference. I tried lowering the number of threads through the environment variable OMP_NUM_THREADS instead of the pragma (which does the same thing, of course, but is easier to manipulate). These are my results:

OMP_NUM_THREADS   serial     OpenMP
1                 0.023248   0.041343
2                 0.023420   0.338772
3                 0.023594   0.434246
4                 0.023598   0.571594

And it goes on like this. The only slowdown I would have expected is the case where it is forced to one thread, causing a little overhead. I wouldn't expect a slowdown of 24x for 4 threads. It frustrates me a little that I can't seem to get the expected results on OSX.

Best regards, Hans

edit: I was also looking into the nowait clause, but I am not sure if I am using it well. I tried doing:

hansg91 wrote: Why would they have chosen these scheduling policies for OSX, then?

OS scheduling policies are chosen for general purpose workloads and are rarely optimised for multi-threaded codes, let alone code that does crazy things like trying to synchronise its threads every few microseconds!

You can't suppress the barrier at the end of a parallel region with nowait (and in any case the barriers are needed for correctness of your program). An omp for directive without an enclosing parallel region is essentially ignored.