OpenMP® Forum


I am using the GCC compiler on an Intel i7 quad-core processor. Without any optimization flag I get around 95% efficiency on 4 cores: the serial code runs in 4 h and the parallel code in 1 h 3 min (on 4 cores). But when I use the -O3 flag, the serial code takes 2 h 30 min and the parallel code about 55 min. So although I am getting faster results, the parallel efficiency decreases.

So my questions are:
(1) For benchmarking, should one use any optimizer flag? If the answer is yes, which optimizer flag should I use with OpenMP?
(2) I have heard the term “false sharing”, but don't know much about it. Is false sharing the problem here? I need to share a very large number of arrays.

Looking at your code, I don't understand why i is not annotated as private.

(1) For benchmarking, should one use any optimizer flag? If the answer is yes, which optimizer flag should I use with OpenMP?

I would say YES. Users will compile with optimisation, so you should do the same. Some optimisations, such as loop unrolling (and many others), are well known, and the compiler can apply them while you keep your code clean. I think optimisation levels are not specific to OpenMP (I am not sure about this), but it is well known that -O2 is sometimes faster than -O3. You can also try -Ofast, but you have to test each of them to find out which one is best for your code.
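One way to compare the levels is simply to build the same source at each one and time the binaries; a minimal sketch, where the file name `prog.c` is hypothetical:

```shell
# Build the same OpenMP program at three optimisation levels
# (prog.c is a placeholder for your source file).
gcc -O2    -fopenmp -o prog_O2    prog.c
gcc -O3    -fopenmp -o prog_O3    prog.c
gcc -Ofast -fopenmp -o prog_Ofast prog.c   # note: -Ofast relaxes strict IEEE floating-point semantics

# Time each binary and keep the fastest one that still gives correct results.
for p in prog_O2 prog_O3 prog_Ofast; do
    /usr/bin/time -p ./$p
done
```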

About efficiency: sometimes the CPU is not the bottleneck for your program. Your memory bandwidth may be slower than the computation, which may explain why efficiency decreases. Looking at your code, this can happen because your data are not contiguous in memory. If COEFF is built from arrays of pointers, writing COEFF[5][i][j][k] chases four pointers through memory, whereas if the data sit in one contiguous block you can write COEFF[(((5 * COEFF.shape[1] + i) * COEFF.shape[2]) + j) * COEFF.shape[3] + k]. In that case the data are contiguous in memory, so the accesses should be faster (and you may be able to use SSE instructions).

(2) I have heard the term “false sharing”, but don't know much about it. Is false sharing the problem here? I need to share a very large number of arrays.

I don't know what “false sharing” is, so I may be saying something wrong, but I think that sharing data avoids copies, so having all variables shared should not cause any performance issue.

I think the main bottleneck in this code is likely to be memory bandwidth. Turning optimisation on (which is clearly the right thing to do, because it reduces the wall-clock time) will reduce the number of instructions executed, but cannot really do anything about the number of loads/stores required. The memory system becomes saturated by 4 threads all demanding data at the same time.

Reordering the COEFF array from COEFF[7][nx][ny][nz] to COEFF[nx][ny][nz][7] might improve the cache locality a bit: this might be what Pierrick is trying to say, but I'm not sure!

False sharing occurs when multiple threads access addresses that lie on the same cache line (and at least one of the threads is writing the data). This does not look like a problem in your code, as the data accessed by different threads are well separated in memory.

pierrick wrote: Looking at your code, I don't understand why i is not annotated as private.