All examples in the test_basic/ folder are linked with the parallel BLAS implementation. Examples in test_cham/ are linked with chameleon; since chameleon is itself linked with the sequential BLAS/LAPACK implementation (the multi-threaded parallelism is generated by the chameleon runtime itself), these examples are also linked with the sequential CBLAS/LAPACK implementation. Consequently, it only makes sense to compare kernel performance between test_basic/ and test_cham/ for gels or the incremental QR factorization, because these are the only kernels that use chameleon. Comparing any other kernel amounts to comparing multi-threaded kernels against sequential ones.

A workaround is to set the OpenMP number of threads to 1 before calling chameleon kernels and reset it back to the number of available cores after the chameleon kernel calls are completed. This way chameleon can theoretically work correctly, while the other LAPACK/BLAS kernels still take advantage of multi-threaded parallelism. For this to work in practice, OpenMP thread binding must be performed correctly. In particular, it must be ensured that the threads are not all bound to the same core during chameleon kernel calls, otherwise the whole point of multi-threaded parallelism is lost (all threads would execute on the same core, one after another). To achieve this, at least one kernel with omp_num_threads set to the maximum number of available cores must be called before the first chameleon kernel, so that the threads get bound correctly in the first parallel region.

* QR factorization and DeflatedRestarting
** description
The QRDR algorithm is a little different from the others. The factorize_last_column (incremental QR) call was put in HessQRDR::notify_orthogonalization_end(). The reason is that, if it were not put there, no other call would factorize the first block column of the Hessenberg (H1new). This is not a problem in QRIBDR, since solve is called on F1new in order to evaluate the R criterion and detect inexact breakdown. The problem is that the actual QR factorization is then not performed during the call that measures the least-squares time, i.e. the time is not measured properly.
** TODO ? IDEA1: solve this problem by adding a notify_restart_end call to the Hessenbergs!?
This does not solve the problem: the factorization time is still not measured properly.
** TODO ? IDEA2: add the code to measure time directly in the Hessenberg classes
Inconvenient: code must be added for all Hessenbergs. If not done correctly, this could be problematic in IBDR and QRIBDR, because IB_update is done one iteration after the corresponding R_criterion.
* IB+DR with inexact breakdown on R0 and RHS update
The IB+DR algorithm theoretically handles inexact breakdown on R0 before the first restart (after the first restart there is no R0 anymore). In the IB-only algorithm, inexact breakdown on R0 typically occurs after several restarts (usually when IB occurred during the previous restart). In order to test inexact breakdown on R0 for IB+DR, bound right-hand sides must be passed on purpose as input to the algorithm. The fact that the algorithm can handle both inexact breakdown on R0 and IB+DR restarts is the reason why there are two ways of updating the right-hand sides of the local GELS problem. When there is inexact breakdown on R0 (before the first restart), init_phi is called but _restarted is set to false, so compute_Lambda performs the "inexact breakdown on R0" computation: Lambda <- Phi * Lambda_1.
After a restart, init_phi_restarted is called, _restarted is set to true, and compute_Lambda performs the other computation: Lambda <- [[eye(p1); zeros(nj+p-p1, p1)], Phi] * Lambda_1.
* IB+DR+QR double update on the (local GELS) RHS
The IB+DR algorithm implies an update on the right-hand sides of the GELS problem (either inexact breakdown on R0, or the update after an IB+DR restart). The QR versions also imply an update on the right-hand sides. The solution adopted to handle this is to compute Lambda the same way as in the IB+DR versions, i.e. either (IB on R0) Lambda <- Phi * Lambda_1, or (restart) Lambda <- [[eye(p1); zeros(nj+p-p1, p1)], Phi] * Lambda_1, and then to apply all the Q^{H} (or Q^{T}) transformations coming from the incremental QR, at each iteration.