Thanks, Peter, for the reproducer and for bringing this matter to our attention. You are right – daxpy is parallelized, while daxpby is not. We will look into it. Any further information you can provide about your use cases (number of elements, number of threads, platforms used, etc.) would be helpful. Thank you!

I've tested this on a Xeon E5-2660 system running Linux 2.6.38.2, a Xeon X5690 system running Linux 3.4.63, and a Xeon Phi coprocessor. All three showed the same problem.

I believe that I can work around the problem by chunking the vector manually, iterating over the chunks in an OpenMP parallel for loop, and calling daxpby for each chunk.

In general, we expect daxpby performance to be identical to daxpy performance when b = 1.0, and we expect performance to scale similarly for b != 1.0. (In fact, I'm a bit surprised that daxpy isn't just a wrapper for daxpby.)