OpenMP® Forum

Discussion on the OpenMP specification run by the OpenMP ARB. OpenMP and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board in the United States and other countries. All rights reserved.

My OpenMP loop code sample shown below reaches a performance plateau on large shared-memory machines: it scales well to about 32 cores and then stops improving when more cores are used. The plateau is possibly an artifact of the memory access pattern, but I would like to know whether anyone has an opinion on a more efficient coding scheme with OpenMP.

Couple of questions for you: What are the typical values of ie and je, and what is the execution time for the loop on 1 thread? Are the immediately preceding accesses to the arrays used in this loop scheduled to threads in the same way?

magicfoot wrote: The values of ie and je lie in the range 1000 to 100000.

There is no timing data for the single loop but I can derive that. There are three of these loops in the program, all with different variables, and these loops use 98% of the total execution time.

That seems large enough that the overhead of the parallel region (typically in the range of tens to hundreds of microseconds) is likely to be negligible.

MarkB wrote: Are you considering memory affinity or cache coherence issues? Is there some way to stabilise that with OpenMP?

On a multi-socket machine it can be important to get the distribution of data in main memory right. This means that the first access to large arrays (typically initialisation) should be made inside a parallel region, using the same loop schedule as the later compute loops. Your code might be getting some cache reuse (at least in L3), so making sure the same thread accesses the same data items in different parallel loops might help.
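To make the first-touch advice concrete, here is a minimal sketch in C. It assumes the operating system places each memory page on the NUMA node of the thread that first writes to it (the usual default on Linux, but an OS policy rather than something OpenMP guarantees). The functions `alloc_first_touch` and `update`, and the arrays `a` and `b`, are hypothetical stand-ins and not the loop from the original post; only `ie` and `je` are taken from the thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate an ie-by-je array (row-major) and initialise it in parallel.
   malloc itself does not touch the pages, so the parallel loop below is
   the first access. With schedule(static), thread t writes a fixed block
   of rows, and under a first-touch policy those pages land near thread t. */
double *alloc_first_touch(long ie, long je, double value)
{
    assert(ie > 0 && je > 0);
    double *a = malloc((size_t)ie * (size_t)je * sizeof *a);
    if (a == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < ie; ++i)
        for (long j = 0; j < je; ++j)
            a[i * je + j] = value;

    return a;
}

/* Hypothetical compute loop (an assumption, not the poster's code).
   It uses the same loop bounds and the same schedule(static) as the
   initialisation, so each thread works on the rows it first touched. */
void update(double *a, const double *b, long ie, long je, double c)
{
    assert(ie > 0 && je > 0);

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < ie; ++i)
        for (long j = 0; j < je; ++j)
            a[i * je + j] += c * b[i * je + j];
}
```

This only helps if threads stay where they are between loops. Setting the standard environment variables `OMP_PROC_BIND=true` and `OMP_PLACES=cores` before running is one way to ask the runtime to pin threads; whether that improves this particular code would need to be measured.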

The loop you posted looks very bandwidth-intensive, so you may simply be running into the limits of the hardware bandwidth scalability.
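One way to check the bandwidth hypothesis is to time a simple streaming kernel at several thread counts and see whether it plateaus at roughly the same core count as the real loop. The sketch below is a STREAM-triad-style kernel, not the poster's loop; `triad_bandwidth_gbs` and `wall_seconds` are hypothetical helper names, and the byte count assumes two reads and one write per element while ignoring write-allocate traffic, so the result is only a rough estimate.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds using C11 timespec_get. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Run a[i] = b[i] + s * c[i] once and return the estimated memory
   bandwidth in GB/s. Returns 0.0 if the elapsed time is too small
   to measure. */
double triad_bandwidth_gbs(double *a, const double *b, const double *c,
                           long n, double s)
{
    assert(n > 0);

    double t0 = wall_seconds();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
    double elapsed = wall_seconds() - t0;

    if (elapsed <= 0.0)
        return 0.0;

    /* Two arrays read and one written, 8 bytes per element. */
    double bytes = 3.0 * sizeof(double) * (double)n;
    return bytes / elapsed / 1e9;
}
```

For a meaningful number the arrays should be much larger than the last-level cache and initialised in parallel first. Run the program with different `OMP_NUM_THREADS` values: if the reported bandwidth stops growing at about the same thread count where the application stops scaling, that supports the view that the hardware memory bandwidth is the limit.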