OpenMP® Forum

Discussion on the OpenMP specification run by the OpenMP ARB. OpenMP and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board in the United States and other countries. All rights reserved.

Can someone please tell me what I am doing wrong with the following? Please take a look at the snippet I'm using in my code. The single-threaded version runs fine, but I want to distribute the 16 function calls to 4 cores, each executing 4 calls of the function, in parallel of course. After doing this, it even takes longer to execute than the single-threaded version without any pragmas. I don't want to distribute iterations; I've done that before. I want to distribute the computeBlack76() function calls to different threads. I have to do 16 calculations in one loop, so it would be optimal to have 4 threads running 4 calculations each.

This is the version that is not working (I found a 2-threaded version of QuickSort implemented in the very same way):

It seems to me that my CPU is executing the very same 16 calculations 4 times, on the four different threads.

How do you know this?

HTH.

I suppose this is happening because I'm monitoring CPU activity with the "top" command in Linux, and I can see CPU usage go up to 400% - which means 4 cores - yet the overall execution time is much longer than in the single-threaded case.

Is there any way to tell the compiler to execute those calculations in parallel? What am I doing wrong?

In this regard, running the last pass - where I do 10,000,000 passes - it executes in 37 seconds on the Core i7 CPU using the above pragmas, but without them it executes in about 17 seconds. By using OMP on 4 cores, I would expect to get about 1/4 of the original execution time (or something similar, but in any case the program should be quicker than before).

Can someone point out the problem here?

I have used the QuickSort example I found as a base for my calculations. It's somewhat similar, but I have 16 calculations to divide among 4 threads, rather than just 2 calculations to divide between 2 threads:

A small remark: I'm using a 64-bit Linux server running CentOS 6.2, with an Intel Core i7 and 50 GB of RAM. I have tried compiling the code and taking measurements with both the Intel and G++ compilers. My goal is to have this compiled and running with the Intel C++ compiler, and still get a better result than without OMP.

I suppose this is happening because I'm monitoring CPU activity with the "top" command in Linux, and I can see CPU usage go up to 400% - which means 4 cores - yet the overall execution time is much longer than in the single-threaded case.

The top command is just telling you that all four of the processors are being used; the overall runtime is not necessarily spent on useful computation, only on computation. Example: with spin locks, every core is busy, but just waiting for something to happen.

Is there any way to tell the compiler to execute those calculations in parallel? What am I doing wrong?

Well, that is what the sections construct is used for... but now I think you are suggesting something different from your previous post... Anyway, there should be no problem such as a section being executed more than once, since the spec defines:

Each structured block is executed once by one of the threads in the team in the context of its implicit task.

However, there could be a scheduling problem (from a performance point of view), since the spec defines:

The method of scheduling the structured blocks among the threads in the team is implementation defined.

i.e. there could be an extreme case in which only one thread executes every section. This does not seem to be the case, since top reports 400% CPU usage, but anything else could be happening. My first suggestions would be:

1) Use the function omp_get_wtime() in order to measure execution time.
2) Set OMP_WAIT_POLICY to passive.
3) If every call to computeBlack76() takes about the same execution time, then group them in 4 sections.
4) Using a for with a case inside seems to be unnatural, but maybe helps for checking performance measurements.
5) Tasks seem to be another natural way of distributing execution among threads.

I also inserted a barrier primitive here, as I want all 4 threads to finish before starting the next iteration.

3) If every call to computeBlack76() takes about the same execution time, then group them in 4 sections.

Well, this would be the ideal case here - each of the 4 sections would require roughly the same execution time. Using the code above, I would have to wait until the longest calculation is over, and that would be the maximum time required for the program. Even if one section takes slightly longer to compute than the others, I would expect to get at least half the execution time here.

4) Using a for with a case inside seems to be unnatural, but maybe helps for checking performance measurements.

Can you be a little bit more specific? I understand neither the benefits nor the purpose of using a case statement here, as I want to parallelize, and not have different cases in each iteration that execute separately, one by one.

2) Set OMP_WAIT_POLICY to passive

I'm not sure whether it's turned on or not. I used the shell to set this environment variable:

Before doing this, I echoed the shell variable and it had no value; I don't think it was even defined. So it is now passive if I display its contents. BTW, I understand that #pragma omp barrier should have the same effect, should it not?

I'm doing tests and step-by-step execution, but I still cannot figure out which thread executes which part of the code. I'm using Visual Studio 2010 with the Intel Parallel Studio extension... I've enabled OpenMP for the project; I can even see it being used. I have a breakpoint at the first section, and I can see in the Visual Studio debugger that I have 1 master thread and 3 worker threads. When I hit F10, it stays on the same line in the code and jumps to a different thread in the Threads window.

Some simple suggestions:

1) If you want to see the thread executing each section, just use (and print the result of) the function omp_get_thread_num().
2) Having barriers is not the same as setting OMP_WAIT_POLICY to passive. The env. var. "controls" or "suggests" the way in which barriers are implemented. Please see the spec for full explanations.

zeusz4u wrote: In this regard, running the last pass - where I do 10,000,000 passes - it executes in 37 seconds on the Core i7 CPU using the above pragmas, but without them it executes in about 17 seconds. By using OMP on 4 cores, I would expect to get about 1/4 of the original execution time (or something similar, but in any case the program should be quicker than before).

Can someone point out the problem here?

I think the clue is in these timings: each pass takes 1.7 microseconds when executed sequentially, and this is too small to offset the overheads of the parallel region (the parallel execution time suggests this overhead is of the order of 3 microseconds on 4 cores, which is around what I would expect).

So the problem is that there is not enough computation in the 16 function calls to offset the costs of parallelisation. Is there any scope in the application for doing more than these 16 calls in parallel?