ACML 5.1.0 and zgemm performance

Basically the same question as in my other post: is this a known issue, is it possible to estimate when a fix will be available, and are there any other suggestions?

Maybe this does not provide much insight, but one example of a problematic input size for zgemm in ACML 5.1.0 is

transa=C transb=N m=166 n=6 k=5124 lda=5240 ldb=5240 ldc=166.
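For anyone who wants to compare timings between versions, here is a rough sketch of the arithmetic for this call. It is plain Python, not part of ACML, and the function names are made up for illustration. It assumes the usual 8*m*n*k flop-count convention for a complex matrix multiply and 16 bytes per double-complex element.

```python
# Problem size quoted in the post (transa=C, transb=N).
M, N, K = 166, 6, 5124
LDA, LDB, LDC = 5240, 5240, 166

BYTES_PER_COMPLEX16 = 16  # double-precision complex: 2 x 8 bytes


def zgemm_flops(m, n, k):
    """Real flops for one zgemm call, using the common 8*m*n*k convention."""
    return 8 * m * n * k


def buffer_bytes(ld, cols):
    """Bytes spanned by a column-major complex matrix with leading dimension ld."""
    return ld * cols * BYTES_PER_COMPLEX16


def gflops(m, n, k, seconds):
    """Achieved GFLOP/s for a measured wall-clock time in seconds."""
    return zgemm_flops(m, n, k) / seconds / 1e9


if __name__ == "__main__":
    # With transa=C, A is stored as a k-by-m array (lda >= k);
    # B is k-by-n (ldb >= k); C is m-by-n (ldc >= m).
    print("flops per call:", zgemm_flops(M, N, K))
    print("A bytes:", buffer_bytes(LDA, M))
    print("B bytes:", buffer_bytes(LDB, N))
    print("C bytes:", buffer_bytes(LDC, N))
```

Passing the measured time of one call to gflops() for each ACML version gives figures that can be compared directly.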

This, as far as I can tell, only affects zgemm and not dgemm (I have not tested sgemm or cgemm).

It seems ACML-5.1.0 ends up using far more instructions for the same zgemm call than ACML-4.4.0. For example, there seems to be a huge number of extra branches when running the 5.1.0 version. The non-fma4 version seems to behave the same with respect to the instruction and branch counters, but of course performs worse.