OpenMP® Forum

Discussion on the OpenMP specification run by the OpenMP ARB. OpenMP and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board in the United States and other countries. All rights reserved.

where config->max_threads is a value passed as an argument to the program.

I've been testing with a few different machines: an AMD machine with two Magny-Cours processors (for a total of 24 cores), a Nehalem machine with 12 cores (also a total of 24 hardware threads, due to Hyper-Threading), and a Sandy Bridge machine (again, 12 cores, 24 threads with HT).

The execution time results I get are understandable when I'm using between 1 and 24 threads. I get the best speedups when using around 12 threads, but speed drops above that, probably because Hyper-Threading is not a good fit for this algorithm (or, in the AMD case, because the second CPU starts being used).

PS: links to plots of each result are at the bottom of the post.

The strange thing happens when I try to use more than 24 threads. On the Sandy Bridge machine, the result more or less stabilizes (it has around the same execution times as the 20-24 thread runs). They're not as good as with 12 threads, but that is to be expected.

On both the AMD and the Nehalem machines, the results seem to indicate that when I ask for more than 24 threads, OpenMP truncates the count to 12, since the execution times suddenly drop to the same values as in the 12-thread runs.

I tried playing around a bit with the scheduling type, and printing omp_get_num_threads() inside the parallel region, but that didn't seem to indicate that the number of threads was being truncated. What exactly is going on here?
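For reference, the check I was doing boils down to something like this (a minimal standalone sketch, not the actual application; delivered_threads is a hypothetical helper):

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks so the sketch also compiles without OpenMP support. */
static void omp_set_num_threads(int n) { (void)n; }
static int  omp_get_num_threads(void)  { return 1; }
#endif

/* Hypothetical helper: ask for `requested` threads and report how many
   the runtime actually delivers inside the parallel region. */
int delivered_threads(int requested)
{
    int delivered = 1;
    omp_set_num_threads(requested);  /* plays the role of config->max_threads */
    #pragma omp parallel
    {
        #pragma omp single
        delivered = omp_get_num_threads();
    }
    return delivered;
}
```

If the runtime really were truncating the request, something like delivered_threads(32) would come back as 12 on those machines; in my runs it didn't.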

1) Having more OpenMP threads than hardware threads/cores can improve load balancing, but my feeling is that this probably isn't the case here.

2) NUMA effects: if the initialisation of your main data structures is done outside of parallel regions, then the usual default allocation policy is "first touch", which means that all the data will be allocated on the socket/NUMA domain where the master thread is executing. This can cause a memory bottleneck when all threads inside a parallel region access the data. When you have more OpenMP threads than hardware threads/cores, it is possible that the master thread migrates during the initialisation, so the data gets spread across multiple sockets/NUMA domains. You could try doing the data initialisation in parallel, or, under Linux, using numactl -i nodes to change the allocation policy to round robin.
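By way of illustration, parallel first-touch initialisation can look something like this (a hypothetical sketch for a plain array of doubles, not your actual data structures):

```c
#include <stdlib.h>

/* Allocate an array and initialise it inside a parallel region, so that
   under Linux's first-touch policy each page lands in the NUMA domain of
   the thread that first writes it, rather than all on the master's socket. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    /* schedule(static) gives each thread one contiguous block, matching the
       access pattern a later static loop over the same data would have. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;  /* the first write decides which node backs the page */

    return a;
}
```

Without OpenMP the pragma is simply ignored and the loop runs serially, so the sketch is harmless either way.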

If you want to be certain that the OpenMP implementation is giving you the number of threads you requested, you can set OMP_DYNAMIC=false.

In any case, it probably doesn't really matter that much in practice: it might be more productive to try and figure out why you are only getting ~4x speedup on 12 threads!

Hope that helps,
Mark.

Thanks, but that wasn't really the answer I was looking for. I understand the effects of using a NUMA machine. That can explain the low speedups, but not the strange "jump" shown in the plots when using more than 24 threads. I would expect some kind of curve, not that sudden constant value. I know the speedups aren't good, and that's exactly why I was trying to measure the scalability. The next step would be to re-implement the algorithm (the current code is not mine, I'm just profiling it).

naps62 wrote: Thanks, but that wasn't really the answer I was looking for.

Sorry!

naps62 wrote: I understand the effects of using a NUMA machine. That can explain the low speedups, but not the strange "jump" shown in the plots when using more than 24 threads. I would expect some kind of curve, not that sudden constant value.

I was trying to suggest that it could explain the jump if, up to 24 threads, the OS scheduler keeps the master thread in the same place, whereas above 24 threads it starts to migrate around due to the oversubscription, and hence changes the way memory is allocated.

OK, I misunderstood; I thought you were trying to explain the low speedups, sorry. Still, if that were the case, wouldn't it negatively affect the performance of the program? You can notice that with more than 24 threads (or at least, 24 "requested" threads), I get almost the best performance, and results quite similar to using 12 threads, which seems to be the optimum. Hence my theory that OpenMP was truncating the number of threads to 12 for some reason.

PS: I still have to try your suggestion of setting OMP_DYNAMIC=false, but I might not be able to do that until tomorrow. I just didn't understand what that OMP_DYNAMIC variable is. An environment variable? A #define?

I just have some questions, right now:
1) What are 511, 611 and 711, and Cornell, Kitchen and Luxball?
2) What's the value of max (or its most likely value)?
3) Why did you use schedule(guided)? Can you try with schedule(static, max/(omp_get_num_threads()*10))?
4) Do you know the core microarchitecture of each AMD Magny-Cours core?
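For clarity, question 3 amounts to something like the following sketch (hypothetical code: I compute the chunk with omp_get_max_threads() before the loop, since omp_get_num_threads() returns 1 outside a parallel region, and the squaring body is just a placeholder for the real ray-tracing work):

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_max_threads(void) { return 1; }  /* serial fallback */
#endif

/* Static schedule whose chunk is roughly one tenth of each thread's share
   of the `max` iterations, as an alternative to schedule(guided). */
void run_static_chunks(int max, double *out)
{
    int chunk = max / (omp_get_max_threads() * 10);
    if (chunk < 1)
        chunk = 1;  /* the chunk size must be positive */

    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < max; i++)
        out[i] = (double)i * (double)i;  /* placeholder workload */
}
```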

Fernando.

1) 511, 611 and 711 are the names of the machines (AMD, Nehalem and Sandy Bridge, respectively); I didn't think it would be relevant, sorry. Cornell, Kitchen and Luxball are the different 3D scenes I'm testing (it's a ray-tracing application).

2) By max I guess you meant config->max_threads? I explained in the first post that it is the argument I give to the program with the number of requested threads. It should probably be called config->num_threads instead of max_threads. I will try your suggestion when I can (I don't currently have access to the machines).

3) schedule(guided) was already being used in the code (as I said, this was not implemented by me, I'm just profiling it). I did try to run without that option, but with no conclusive results regarding my question.

naps62 wrote: Still, if that were the case, wouldn't it negatively affect the performance of the program? You can notice that with more than 24 threads (or at least, 24 "requested" threads), I get almost the best performance, and results quite similar to using 12 threads, which seems to be the optimum. Hence my theory that OpenMP was truncating the number of threads to 12 for some reason.

I'm obviously not explaining this very well! For more than 12 threads, having the data distributed across all the memory will likely give better performance than having it all allocated in one NUMA domain, because it avoids the bottleneck of all threads accessing the same physical memory device. It is possible that such data distribution will only happen if the cores are oversubscribed, and the OS scheduler migrates the threads around, instead of keeping them always on the same core.

OMP_DYNAMIC is an environment variable. Alternatively you can insert a call to omp_set_dynamic().
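Concretely, that means something like this before launching the program (a hypothetical shell session; OMP_NUM_THREADS is shown here as an alternative to passing the count through the program's own argument):

```shell
# Forbid the runtime from adjusting the team size behind your back,
# and request the thread count via the standard environment variable.
export OMP_DYNAMIC=false
export OMP_NUM_THREADS=32
echo "OMP_DYNAMIC=$OMP_DYNAMIC, requesting $OMP_NUM_THREADS threads"
```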

OK, you were explaining it well, I just wasn't understanding it the same way. It makes sense that it gives better performance, and I'd thought of that in a previous test where I ran with just [2, 4, 8, 16, 24, 32] threads (it's the same principle employed by a GPU, where threads "hide" each other's latency). But I'm still confused by the unusually constant values for all executions with more than 24 threads; I would at least expect some kind of curve. Also, this does not happen on the 711 machine (the Sandy Bridge), which is even weirder.

Anyway, this is just pure speculation. All I wanted was a sure way to force the number of threads, and to make sure OpenMP was not doing some magic behind my back. I'll try your suggestions as soon as possible (probably tomorrow morning) and will post the conclusions.

naps62 wrote: 2) By max I guess you meant config->max_threads? I explained in the first post that it is the argument I give to the program with the number of requested threads. It should probably be called config->num_threads instead of max_threads. I will try your suggestion when I can (I don't currently have access to the machines).

Uh, no, sorry, I meant the "max" in

for (uint i = 0; i < max; i++)

I'll see about the AMD microarchitecture and its possible relationship to the number of threads (unlikely, I think).