OpenMP® Forum

Discussion on the OpenMP specification, run by the OpenMP ARB.

My problem is with the big_array update statement: big_array(i,j,k)=m.

If I comment out this statement, the parallel performance is just what I expected: T_t=42s for the serial code, T_t=41s in parallel with 2 threads, and T_t=31s in parallel with 4 threads, given nlayer=10.

But if I keep this array update statement, the parallel code becomes very slow: T_t=413s. I have tried putting !$OMP CRITICAL or !$OMP FLUSH in front of this statement, but the problem remains.

This seems strange to me, because the array update is outside the big, time-consuming loop, and each thread should have a different (i,j,k), so multiple threads should never need to access the same memory location of this array.
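For reference, the loop structure under discussion is roughly the following. This is a hypothetical reconstruction from the variable names in this thread; the actual bounds, the geometry test (point_in_cell here is a placeholder), and the worksharing clauses are assumptions:

```fortran
!$OMP PARALLEL DO PRIVATE(j, k, l, m)   ! assumed worksharing; schedule not shown
do i = 1, nlayer
   do j = 1, nrow
      do k = 1, ncolumn
         m = 0
         ! time-consuming inner loop over the ~10-million-point database
         do l = 1, nn
            ! hypothetical test: does coord(:,l) fall inside cell (i,j,k)?
            if (point_in_cell(coord(:,l), i, j, k)) m = code_num(l)
         end do
         big_array(i,j,k) = m   ! the assignment that triggers the slowdown
      end do
   end do
end do
!$OMP END PARALLEL DO
```

Note that although each thread writes a different element of big_array, neighbouring elements can still share a cache line, so the write pattern matters even without a data race.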

How many threads were you running to get the 413s time? If you have the big_array assignment in, what is the sequential time, and what is the parallel time on one thread? What are the values of nrow and ncolumn?

jiwa wrote:If I comment out this statement, the parallel performance is just what I expected: T_t=42s for the serial code, T_t=41s in parallel with 2 threads, and T_t=31s in parallel with 4 threads, given nlayer=10.

The sequential time with the big_array assignment is about 40s. I did the one-thread parallel test but I forgot the time (I'm at home now). nrow and ncolumn are quite small for this trial problem: 10 and 20. So big_array is not big here.

The real problem I am trying to solve has around 500 for nrow and ncolumn and about 100 to 200 for nlayer. The idea is to reduce the runtime of a half-week job to half a day or so on a 16-processor machine.

I tested the 10-layer trial problem in parallel with 10 threads, without the big_array assignment line, and it only took 4 to 9 seconds. So speedup isn't a problem from what I can see.

Are you actually outputting any results? If not, it is possible that the compiler optimisation is eliminating dead code and not doing all the computation, except in the case where big_array is being assigned to, and is in a parallel region (which would require interprocedural analysis).
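To illustrate the dead-code concern (a minimal sketch, not the poster's code): if nothing observable depends on the inner loop's result, an optimising compiler is entitled to delete the whole loop, so the timed region measures almost nothing.

```fortran
! Minimal sketch: with the assignment commented out, nothing observable
! depends on the loop below, and -O2/-O3 may remove it entirely.
m = 0
do l = 1, nn
   if (coord(1,l) > 0.0) m = code_num(l)
end do
! big_array(i,j,k) = m   ! commented out -> the loop above is dead code
```

Printing big_array (or a checksum of it) at the end of the program is enough to keep the computation alive in both versions and make the timings comparable.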

To Fernando: the rest of the code is rather simple; I can post it tomorrow. It just reads a file of grid coordinates into grd_x, grd_y and grd_z, and reads another big coordinate file into coord(3,nn). The big coordinate file contains ~10 million lines of coordinate data (700 MB), which is what makes the nn loop time-consuming.

To Mark: the purpose of this code is to update a grid property index array (big_array(i,j,k)) by comparing each cell's spatial location and volume against a huge database array, coord(3,nn) and code_num(nn). So my output is big_array. If the assignment line is commented out, the code is meaningless.

BTW: all the arrays are dynamic (allocatable); they are allocated before the parallel region.

jiwa wrote:To Mark: the purpose of this code is to update a grid property index array (big_array(i,j,k)) by comparing each cell's spatial location and volume against a huge database array, coord(3,nn) and code_num(nn). So my output is big_array. If the assignment line is commented out, the code is meaningless.

Sorry, what I meant was: does your test code actually write out the values in big_array (or something that depends on them)?

It just reads a file of grid coordinates into grd_x, grd_y and grd_z, and reads another big coordinate file into coord(3,nn). The big coordinate file contains ~10 million lines of coordinate data (700 MB), which is what makes the nn loop time-consuming.

Do not worry about the input files, since: a) we are interested only in performance right now, and b) the processing requirements of the code you posted are almost data-independent. Thus, we can fill in the arrays with constant values. The important thing, I think, is to have a run similar to the one you reported, i.e. with the actual values of nlayer, nrow, ncolumn and nn (from the original post).

I still think that removing the assignment to big_array is causing the compiler to optimise away most of the code (I've seen compilers do strange things in this situation before, such as only executing every nth iteration of the innermost loop).

I strongly suspect the lack of scaling of the code is due to memory bandwidth contention: the code is basically just repeatedly trawling through the coord array with no re-use. On an AMD Interlagos system (which has a better memory subsystem than the i5) with the PGI compiler I get:

You might be able to improve the performance by swapping the loop order so that the do l=1,nn loop is outermost, which will require expanding volume_max into a 3-D array. Then the coord array is trawled only once instead of nlayer*nrow*ncolumn times.
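A sketch of that reordering, under stated assumptions: volume_max is taken to hold a running per-cell criterion (expanded here to volume_max(nlayer,nrow,ncolumn)), and cell_of and some_volume are placeholders for the poster's actual geometry test:

```fortran
! Reordered so coord is streamed through exactly once.
do l = 1, nn
   ! hypothetical: map point coord(:,l) to the cell (i,j,k) containing it
   call cell_of(coord(:,l), i, j, k)
   if (some_volume(l) > volume_max(i,j,k)) then   ! assumed update criterion
      volume_max(i,j,k) = some_volume(l)
      big_array(i,j,k)  = code_num(l)
   end if
end do
```

With this ordering each coord element is touched once, so the run time is bounded by a single pass over the 700 MB file's data rather than one pass per grid cell.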