OpenMP® Forum

Discussion on the OpenMP specification run by the OpenMP ARB.

I think I have a problem with "false sharing". I factor a large, sparse, banded matrix with a Cholesky factorisation. It works with OpenMP, but it does not scale very well. Using 8 cores instead of 4 gives no(!) speedup.

In "Using OpenMP" I read what false sharing is and under what circumstances it probably occurs: using shared data, writing to array locations that lie close to each other, and doing this very often.

I have exactly these conditions. The book recommends using private data instead of shared data. Of course I could copy the local parts of the matrix into threadprivate arrays and do the computation there, but the private data of the threads will be lost after the parallel region is finished.

I can't find a way to copy the private data back, to gather it into one array that still exists after the end of the parallel region. Is there a possibility?

One solution might be to declare a shared array with an extra, outer, dimension of size the number of threads. If each thread uses its thread ID to index this extra dimension, then false sharing should be largely avoided, and the data will persist beyond the parallel region.

If you could post the main computational loop(s) from the code, I may be able to advise you better!

MarkB wrote:One solution might be to declare a shared array with an extra, outer, dimension of size the number of threads. If each thread uses its thread ID to index this extra dimension, then false sharing should be largely avoided, and the data will persist beyond the parallel region.

Thank you, that could work. But it would depend on the length of a cache line. Do you know how long it is on an Intel i7 board? And which cache is important: L1, L2, or L3? I don't know much about these caches.

HeinzM wrote:Hi Mark, thank you, that could work. But it would depend on the length of a cache line. Do you know how long it is on an Intel i7 board? And which cache is important: L1, L2, or L3? I don't know much about these caches.

You can get false sharing between L2 caches on a chip and between L3 caches on multiple sockets, though the latter is likely to be the worst in terms of performance hits. I believe the line size is 64 bytes in both cases.

! first computing the addresses, the start and the end of the dot-products

ADRI= ......
ADRK= ......
ADRHI= ......
L= ......

! the following loops do the most work:
!
! dot-products of the columns of the matrix
! The length L of the dot-product is in the range from about 1000 to 0,
! depending on the finite-element net.
! Length always decreases to 0 because of the triangular form of the
! matrix. Therefore I used the if-clause in the OpenMP directive.
!
! the vectors FR(ADRI+II) and FR(ADRK+II) do not overlay each other

IF (L.gt.0) THEN
   S=0.0
   DO II=0,L-1
      S=S+FR(ADRI+II)*FR(ADRK+II)
   END DO
   ! in the next line I assumed false sharing, because I update the
   ! matrix element. But it seems not to be so.
   FR(ADR)=(FR(ADR)-S)*FR(ADRHI)
ELSE
   FR(ADR)=FR(ADR)*FR(ADRHI)
END IF

I think it's unlikely that false sharing is a big problem in this code: the accesses to the shared array are relatively infrequent. I don't think the EQUIVALENCE is a problem either.

Let's try to eliminate some other possible causes of the bad scaling:

Are you convinced that the amount of time spent outside of parallel regions is insignificant? It may be worth confirming this by putting a timer around the parallel region and summing up the time spent in it.

Is the amount of time in each parallel region long enough to offset the associated overheads? How long does the call to LE0130Sl take sequentially, and how many times is the parallel construct encountered (i.e. how many iterations are there in the DO I=IA,N loop)?

Is the load balance OK? A guided schedule may not work very well for loops where the amount of work decreases with iteration number, as it issues large chunks first. A quick test would be to run the parallel loop backwards (i.e. DO K=KE,KA,-1).

Ultimately, the code likely spends most of its time doing dot-products which are very memory bandwidth intensive. You may simply be hitting the hardware limits here. Are you on a multi-socket system? If so, then NUMA effects can cause serious bandwidth degradation. Is the array FR initialised sequentially or in parallel? In the former case, most likely all the data ends up being allocated on one socket, which can result in a bottleneck.

MarkB wrote:Are you convinced that the amount of time spent outside of parallel regions is insignificant? It may be worth confirming this by putting a timer around the parallel region and summing up the time spent in it.

Yes, the amount of time spent outside of the parallel region is very low, about 1%.

MarkB wrote:Is the amount of time in each parallel region long enough to offset the associated overheads? How long does the call to LE0130Sl take sequentially, and how many times is the parallel construct encountered (i.e. how many iterations are there in the DO I=IA,N loop)?

The parallel construct is encountered once for each line of the sparse matrix, i.e. several hundred thousand times. This could be the bottleneck.

MarkB wrote:Is the load balance OK? Guided schedule may not work very well for loops where the amount of work decreases with iteration number, as it issues large chunks first. A quick test would be to run the parallel loop backwards (i.e. DO K=KE,KA,-1 ).

Running the loop in reverse is even a little bit slower, so the load balance seems to be OK.

MarkB wrote:Ultimately, the code likely spends most of its time doing dot-products which are very memory bandwidth intensive. You may simply be hitting the hardware limits here. Are you on a multi-socket system? If so, then NUMA effects can cause serious bandwidth degradation. Is the array FR initialised sequentially or in parallel? In the former case, most likely all the data ends up being allocated on one socket, which can result in a bottleneck.

No, I have no multi-socket system and no NUMA. I use a simple Intel i7 board with 4 cores and hyperthreading. Could hyperthreading be the bottleneck?

The first vector within the dot-product is the same for all dot-products in the loop. So I tried to copy it into its own allocatable vector outside the matrix stored in FR(....). I thought that simultaneous reads of the same array elements from several threads could slow things down. I did get a little speedup, but only 3%. The code became unreadable, though, so I dropped it.

I have no more ideas for achieving better scaling. It seems I have to accept that this particular code doesn't yield more. With four threads it speeds up by a factor of three. Eight threads even slow it down a little bit.

Another code of mine, which does no dot-products and no matrix-vector operations, scales well on the same hardware. Eight threads give a speedup of nearly seven.

HeinzM wrote:The parallel construct is encountered once for each line of the sparse matrix, i.e. several hundred thousand times. This could be the bottleneck.

The typical overhead for a parallel region is in the tens of microseconds range (it depends a bit on the implementation, the hardware, and the number of threads). If you know how long the whole call takes, you should be able to estimate whether this is part of the problem or not.

HeinzM wrote:No, I have no multi-socket system and no NUMA. I use a simple Intel i7 board with 4 cores and hyperthreading. Could hyperthreading be the bottleneck?

With a bandwidth-intensive code, you are unlikely to see much benefit from hyperthreading, since running one thread per core can already use up all the available bandwidth. Adding an extra thread per core can even cause some slowdown due to additional contention.

HeinzM wrote:The first vector within the dot-product is the same for all dot-products in the loop. So I tried to copy it into its own allocatable vector outside the matrix stored in FR(....). I thought that simultaneous reads of the same array elements from several threads could slow things down.

Simultaneous repeated reads should be no problem: each thread will get a copy of this data in its local cache.

My feeling is that this code is simply limited by the memory bandwidth, so there is likely not much you can do to improve the scaling.