I am writing a critical real-time application in which I need to solve small, dense, unsymmetric linear systems of sizes 100x100 to 300x300 as fast as possible. Using LAPACK's DGETRF with Intel's BLAS (MKL) showed poor performance and scalability on a multi-core system. Can PLASMA deliver better performance, i.e. better scalability, for problems this small? Thanks for any help.

carsten

I seriously doubt it. I think the problem size is simply too small. But give it a shot. Say, for the 300x300 problem, set the tile size to something small, e.g. 60 or 50. Make sure to use PLASMA with static scheduling, not dynamic:

PLASMA_Set(PLASMA_SCHEDULING_MODE, PLASMA_STATIC_SCHEDULING);

Use only one socket, i.e., 4 to 6 cores. Let us know what happens.

Good luck,
Jakub

I finally have a time slot to do some testing with PLASMA. I compared the execution time for dense systems of different sizes with different solvers. The solvers are based on LAPACK, the C++ Eigen library, a simple C++ Gaussian elimination algorithm with partial pivoting, and PLASMA. The PLASMA code looks basically as follows:

I would guess the poor results for the smaller systems come from the overhead of PLASMA's scheduler and from copying into the tile-based format. This effect is probably amplified because the smaller systems are solved several times. What puzzles me is that even for larger systems, where the overhead should be small, the performance does not scale well. Any idea what I might be doing wrong?
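To make the copy cost mentioned above concrete, here is a minimal sketch of what a conversion from LAPACK column-major storage to a tile layout involves. It is an illustration only, not PLASMA's code: the function names tile_count and to_tile_layout are hypothetical, and the zero-padded square edge tiles are a simplifying assumption that may not match PLASMA's internal format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of tiles needed to cover n rows (or columns) with tile size nb. */
size_t tile_count(size_t n, size_t nb)
{
    assert(nb > 0);
    return (n + nb - 1) / nb;
}

/*
 * Hypothetical helper: copy an n x n column-major matrix a (leading
 * dimension lda) into a tile layout. Tiles are stored one after another
 * in column-major tile order; each tile is a contiguous nb x nb
 * column-major block. Edge tiles are zero padded (an assumption made
 * here for simplicity). Returns a malloc'd buffer of nt*nt*nb*nb
 * doubles that the caller must free, or NULL on allocation failure.
 */
double *to_tile_layout(const double *a, size_t n, size_t lda, size_t nb)
{
    assert(a != NULL && n > 0 && lda >= n && nb > 0);
    size_t nt = tile_count(n, nb);
    double *t = calloc(nt * nt * nb * nb, sizeof *t);
    if (t == NULL)
        return NULL;

    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i) {
            size_t ti = i / nb, tj = j / nb; /* which tile */
            size_t ii = i % nb, jj = j % nb; /* position inside the tile */
            double *tile = t + (tj * nt + ti) * nb * nb;
            tile[jj * nb + ii] = a[j * lda + i];
        }
    }
    return t;
}
```

With n = 300 and nb = 60 this gives a 5 x 5 grid of 25 tiles. A conversion like this touches every one of the n*n entries once in each direction, while the LU factorization itself costs roughly (2/3)*n^3 floating-point operations, so the relative weight of the copy grows as n shrinks.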

I performed some additional tests using the provided timing test code time_sgetrf_incpiv.c. To get comparable results, I made an additional version in which I replaced PLASMA_sgetrf_incpiv( .. ) with the original LAPACK routine sgetrf_(...). Generally, I think it would be helpful to also have the original LAPACK routines in the timing code, to compare the performance directly. The results on my system were: