I cannot reproduce the problem; the part of the code using OpenMP prints the correct number of threads and apparently uses the correct number of threads. I added some prints in the for loop to see which thread executes each iteration of the loop, and all threads appear.

I was afraid you were going to say that... Thanks for testing! I tried stripping out some of the unused baggage from my makefile (GSL FFTs, Toeplitz and filtering stuff that support other code in my build tree) just in case one of these unused things was causing the problem, but it made no difference. I'll keep looking at my code/makefiles and try different size problems to see if I can figure out what is happening. Out of curiosity, are you running Linux and/or gcc? If so, what versions? Any reason to think it would matter?

I'm running Linux (Ubuntu 4.11) with gcc/gfortran 4.5.2 and GotoBLAS2, but I don't think it makes a difference. I will give it another shot with a similar problem I had before with MKL, so I can see whether I still have it and whether it is somehow related.

Hi to all, we can confirm the behaviour found by srdegraaf, since I also combine OpenMP and PLASMA 2.4.0 (Fedora 12 + ATLAS + gcc 4.4.4 + OpenMP 3.0). We use both within a field simulation program employing boundary elements. In our case, a single process

1.) runs the OpenMP-accelerated calculation of the elements of the dense matrix,
2.) runs the LU factorization by means of PLASMA_dgetrf() and possibly PLASMA_dgetrs(),
3.) runs the OpenMP-accelerated evaluation of the solution.

My observation: 1) and 2) run in parallel, 3) runs in just one thread. If I choose LAPACK for 2), then 1) and 3) run in parallel.
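For clarity, the structure of the process is roughly the following (a sketch only; the assembly and evaluation loops are stand-ins for our real boundary-element code, and the PLASMA call arguments are omitted, see the PLASMA reference for the exact signatures):

    #include <omp.h>
    #include <plasma.h>

    void bem_solve(int n, double *A, double *b)
    {
        int i, j;

        /* 1) OpenMP-accelerated assembly of the dense matrix */
        #pragma omp parallel for private(j)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                A[(size_t)j * n + i] = 0.0;      /* stand-in for the BEM integral */

        /* 2) LU factorization (and solve) with PLASMA */
        PLASMA_Init(omp_get_num_procs());
        /* PLASMA_dgetrf(...);  PLASMA_dgetrs(...);  -- arguments omitted */
        PLASMA_Finalize();

        /* 3) OpenMP-accelerated evaluation of the solution;
              this is the loop that runs in only one thread after PLASMA */
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            b[i] *= 2.0;                         /* stand-in for the evaluation */
    }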

I have put omp_get_num_procs() before and after 2), and surprisingly I get 16 before and 1 after. Thus PLASMA eats CPUs.

Regards, Stephan

I ran an experiment/comparison of the OpenMP thread info printed out when using LAPACK (serial) or PLASMA (parallel) to do the initial linear algebra part of my algorithm. Within my #pragma omp parallel for loop I printed out several variables: the index of the loop (nx), the operating system's idea of which thread is running (OSthread) as reported by syscall(SYS_gettid), OpenMP's idea of which thread is running (OMPthread) as reported by omp_get_thread_num(), as well as OpenMP's report of the number of threads (OMPnumthreads) and the maximum available number of threads (OMPmaxthreads) at each point in the loop. Specifically, I was looking to see whether there was a difference in the behaviour of the OSthread and OMPthread variables.
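For reference, the instrumentation inside the loop was essentially of this form (a sketch; the real loop body does the actual work):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <omp.h>

    void instrumented_loop(int N)
    {
        int nx;
        #pragma omp parallel for
        for (nx = 0; nx < N; nx++) {
            long OSthread      = syscall(SYS_gettid);    /* OS's view of the thread     */
            int  OMPthread     = omp_get_thread_num();   /* OpenMP's view of the thread */
            int  OMPnumthreads = omp_get_num_threads();  /* current team size           */
            int  OMPmaxthreads = omp_get_max_threads();  /* max threads available       */
            printf("nx=%d OSthread=%ld OMPthread=%d OMPnumthreads=%d OMPmaxthreads=%d\n",
                   nx, OSthread, OMPthread, OMPnumthreads, OMPmaxthreads);
            /* ... actual work for index nx ... */
        }
    }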

What I expected to see with PLASMA (where OpenMP fails) was that there would be 24 different values of OMPthread in use, but only one value of OSthread. Further, since things are apparently running "sequentially", I expected to see the print statements come out in some kind of sequential order. This did NOT happen. I see 24 different OSthread values and a "random" interleaving of the print statements. All of this suggests, to me, that parallelism is actually happening, as it should. (Below are snippets of the printouts for both the PLASMA and LAPACK "experiments".) HOWEVER, THESE PRINTOUT LINES DO NOT COME OUT IN A STEADY STREAM. INSTEAD, THEY COME IN BURSTS, AS IF THE LINUX SCHEDULER IS ONLY ALLOWING ONE PHYSICAL CORE TO BE USED AT A TIME. OpenMP and the OS both seem to think that 24 cores/threads are available, yet gkrellm/top shows that only one is actually being used. Interestingly, the core that is busy does not seem to hop around amongst the CPUs shown in the gkrellm display as it often does when running a single-threaded application.

When I do this using LAPACK (where OpenMP works), I also see 24 values of OSthread and OMPthread being used (also shown below) and a "random" ordering of the lines, but the printout comes out in a steady stream, and gkrellm/top shows that all 24 cores/threads are being used fully.

My knowledge of how all this works is (obviously) limited. However, I've shown these results to a colleague who is quite knowledgeable in these matters, and he suspects that the PLASMA_Finalize() routine is somehow causing the Linux scheduler to restrict the number of available physical cores to one. (God knows how.)

I hope these clues, together with the corroborating "testimony" of Stephan/uhle89, help you to discover the underlying problem. Is it possible that this only happens in conjunction with using ATLAS CBLAS? (Again, God knows why.) You weren't able to duplicate the problem, but then perhaps you didn't use ATLAS. For what it's worth, I'm using gcc/gfortran version 4.5.1, not quite as new as yours.

I appreciate your efforts to track down this insidious/subtle "bug". I suspect that it will be to many people's benefit.

I confirm Stephan's observation that omp_get_num_procs() returns a different value before PLASMA_Init() and after PLASMA_Finalize(). This doesn't seem right, but it is consistent with the behaviour I described in the last post.
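The test was essentially of the following shape (a minimal sketch rather than my exact code; 24 is the core count on my machine):

    #include <stdio.h>
    #include <omp.h>
    #include <plasma.h>

    int main(void)
    {
        printf("before PLASMA_Init:     omp_get_num_procs() = %d\n", omp_get_num_procs());
        PLASMA_Init(24);
        printf("after  PLASMA_Init:     omp_get_num_procs() = %d\n", omp_get_num_procs());
        PLASMA_Finalize();
        printf("after  PLASMA_Finalize: omp_get_num_procs() = %d\n", omp_get_num_procs());
        return 0;
    }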

The problem is that PLASMA binds all the threads used by PLASMA, including the master thread. Once you enter the next OpenMP section, the threads that are created by the master thread are therefore bound to the same core and all run on core 0.
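The mechanism is easy to demonstrate without PLASMA: if the master thread's affinity mask is reduced to a single core before the first parallel region, the OpenMP worker threads it creates inherit that mask and all end up on core 0. A minimal Linux-only sketch (a hypothetical demo, not PLASMA code):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void)
    {
        /* Pin the calling (master) thread to core 0, similar to what PLASMA
           does for the thread that calls PLASMA_Init(). */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        /* Threads created by the master inherit its affinity mask:
           every line printed here reports core 0. */
        #pragma omp parallel
        printf("OMP thread %d runs on core %d\n",
               omp_get_thread_num(), sched_getcpu());
        return 0;
    }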

Here is a patch, IF you are using hwloc. I'm in a meeting today, but I will also fix the problem for the case where you are not using hwloc and generate a new release tomorrow.

+ plasma_unsetaffinity();
+
  pthread_mutex_lock(&mutextopo);
  plasma_nbr--;
  if ((topo_initialized ==1) && (plasma_nbr == 0)) {
@@ -66,7 +68,7 @@

If there are multiple instances of PLASMA then affinity will be wrong: all ranks 0 will be pinned to core 0.

Hi Mathieu + Stuart, here are further observations and reasonings:

1) I've replaced libcblas.so by libgslcblas.so, another BLAS lib hanging around. Besides the fact that PLASMA is now 4x slower (which proves that the different lib is used), the number of procs is still affected by the PLASMA usage.

2) In our case I can recover the OpenMP multithreading after PLASMA usage if I just avoid the evaluation of the result of omp_get_num_procs().

Diving into our code, I found a forced thread number limitation (useful in case of only a few iterations):
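In spirit it looks like this (a sketch with placeholder names, not our actual code):

    /* Limit the team size for short loops: never use more threads than
       iterations.  Because omp_get_num_procs() returns 1 after PLASMA,
       this clamps the whole loop to a single thread. */
    int nprocs   = omp_get_num_procs();
    int nthreads = (niter < nprocs) ? niter : nprocs;

    #pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < niter; i++) {
        /* ... work on item i ... */
    }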

However, if the execution reaches the next parallel OMP loop WITHOUT any thread number limitation, this section is executed by 16 threads but on only 4 CPUs! I guess this is the point where OMP cannot figure out how many (new) LWPs have to be forked; it just uses the number of LWPs which are still there. (I've checked that with LAPACK only: the LWP number is 4 and 16, respectively. This is OK, since LWPs are killed if they are not necessary in a parallel section/loop.)

My conclusion is that Stuart's situation differs with respect to the number of LWPs forked before the first PLASMA call. Stuart, could you please check whether executing a loop with 24 threads @ 24 CPUs before the first PLASMA call can restore the execution to 24 @ 24 after PLASMA usage?
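I mean an experiment of this shape (sketch only; the PLASMA call arguments are omitted):

    double dummy[24] = {0.0};

    /* Warm-up: force OpenMP to fork all 24 LWPs before the first PLASMA call. */
    #pragma omp parallel num_threads(24)
    dummy[omp_get_thread_num()] += 1.0;

    PLASMA_Init(24);
    /* ... PLASMA_dgetrf() / PLASMA_dgetrs() ... */
    PLASMA_Finalize();

    /* Does this loop now run with 24 threads on 24 CPUs again? */
    #pragma omp parallel for
    for (int i = 0; i < 24; i++)
        dummy[i] += 1.0;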

If I'm not wrong, a more precise description of the bug is "omp_get_num_procs() always returns 1 after PLASMA usage" in conjunction with "new LWPs are not/cannot be forked by OMP (if necessary) after PLASMA usage".

This also means that the number of threads actually created by OMP is correct after PLASMA usage.