But what becomes immediately apparent is that only very large matrices (e.g. 10,000 x 10,000) benefit from the vectorized code (i.e. AVX2) in the *decompose* part. A tiny matrix, say 100x100, gives a paltry 514.36 Mflops, less than 1/200 of the speed achieved at 10,000 x 10,000.
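A quick way to see this size effect for yourself (a hypothetical benchmark sketch, not the code used for the numbers above) is to time a BLAS-backed matrix multiply at a few sizes and convert the timings to effective Gflops. Small matrices can't keep the vector units fed, so the rate collapses:

```python
import time
import numpy as np

def matmul_gflops(n, repeats=3):
    """Time an n x n matrix multiply and return effective Gflops.
    A dense matmul costs roughly 2 * n**3 floating-point operations."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b                      # dispatched to the underlying BLAS
        best = min(best, time.perf_counter() - t0)
    return (2 * n**3) / best / 1e9

for n in (100, 500, 1000):
    print(f"{n:5d}: {matmul_gflops(n):8.2f} Gflops")
```

On a typical AVX2 machine the 100x100 case lands far below the 1000x1000 rate, mirroring the Mflops-vs-Gflops gap reported above (exact numbers depend entirely on your CPU and BLAS build).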

The other apparent thing is the *solve* part of the computation: while the *decompose* part, which is dominated by matrix multiplication (e.g. DGEMM), can reach speeds of 128 Gflops, the *solve* part *did not benefit* from all that AVX2 vectorized code, showing little improvement across the different matrix sizes!
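The imbalance is unsurprising from the operation counts alone: dense LU decomposition costs roughly (2/3)n^3 flops and is built from matmul-like kernels that vectorize well, while the triangular solves cost only about 2n^2 flops per right-hand side and are largely sequential (each unknown depends on the previously computed ones). A small sketch of the arithmetic:

```python
def lu_flops(n):
    # Dense LU factorization: roughly (2/3) * n**3 floating-point ops
    return (2 * n**3) / 3

def solve_flops(n, nrhs=1):
    # Forward + back substitution: roughly 2 * n**2 ops per right-hand side
    return 2 * n**2 * nrhs

for n in (100, 10_000):
    ratio = lu_flops(n) / solve_flops(n)
    print(f"n={n}: decompose/solve flop ratio ≈ {ratio:,.0f}")
# At n=10,000 the factorization does ~3,333x more arithmetic than the solve,
# so even a solve that ignores AVX2 entirely adds little to the total run time.
```

The ratio grows as n/3, which is why the benchmark's total time is dominated by decompose at large sizes, and why the solve phase barely moves when vector units are enabled.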

This has major implications. You may have a good CPU with AVX2 and the like, or a big GPU that can process thousands of vectorized/parallel floating-point calculations per clock cycle.

But if your problem is small (e.g. 100x100), or simply cannot exploit vectorized code, much of that capacity may go *unused*: all that expensive vectorized hardware, the AVX2 units or a pricey GPU card with thousands of SIMD cores, will simply *not* help.

I'll consider writing some BOINC applications for OpenCL AFTER I find and take an online OpenCL class aimed at programming GPUs, not FPGAs. I cannot travel enough to attend in-person classes instead.

BOINC now supports use of CUDA and OpenCL; I've seen no information on whether it can also handle the various other ways of programming GPUs that have already been mentioned in this thread. I'd expect library compatibility problems to block use of some of them at least until future versions of BOINC build in replacements for the incompatible sections of the libraries.