For Coldfire it probably makes sense to create two variants of vector_fixmul_scalar for the two different shifts used (16 and 24), since the fixmulshift function is pretty slow with its 2 branches (otoh, those branches could be eliminated here anyway since the shift is known to be less than 31).

edit: the Coldfire fixmul32 function in libwma/wmafixed.h does the same as the fixmul16 in this patch but is 2 cycles faster and uses one register less.

I did not notice that only two variants of vector_fixmul_scalar() are used. You are right, we should definitely define a 24-bit and a 16-bit shift variant of vector_fixmul_scalar().
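A minimal sketch of what the two fixed-shift variants could look like in plain C. The (dst, src, mul, len) signature and the FIXMUL_CONST helper are assumptions for illustration and are not taken from the patch. The point is only that a compile-time shift needs no range checks and therefore no branches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fixed-point multiply with a compile-time shift.
 * Assumes an arithmetic right shift for negative values (as GCC does). */
#define FIXMUL_CONST(x, y, shift) \
    ((int32_t)(((int64_t)(x) * (int64_t)(y)) >> (shift)))

/* Hypothetical signature; the real vector_fixmul_scalar() may differ. */
static void vector_fixmul_scalar_16(int32_t *dst, const int32_t *src,
                                    int32_t mul, size_t len)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = FIXMUL_CONST(src[i], mul, 16);   /* shift fixed at 16 */
}

static void vector_fixmul_scalar_24(int32_t *dst, const int32_t *src,
                                    int32_t mul, size_t len)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = FIXMUL_CONST(src[i], mul, 24);   /* shift fixed at 24 */
}
```

Whether this beats a generic shift on Coldfire would still have to be measured on the target.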

I will change Coldfire's fixmul16() to your proposed implementation. Is this also valid for other codec implementations (e.g. atrac, mpc)? Or does this faster implementation rely on any knowledge about the codec's fixed-point representation (e.g. 14 bits fract part)...

It does what the one in this patch does: it takes the lower 16 bits of the high half of the 64-bit result and makes them the high 16 bits of the 32-bit final result, then takes the high 16 bits of the lower half and makes them the lower 16 bits of the final 32-bit result. In other words, it computes (int32_t)((int64_t)x*y>>16). You are right, though, that it will not work for shifts other than 16.
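To illustrate the equivalence described above, here is a plain C sketch (not the Coldfire asm). The function names fixmul16_ref, fixmul16_halves and fixmul16_check are made up for this example. It assumes two's-complement truncation on narrowing conversions and an arithmetic right shift, as GCC provides.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: full 64-bit product shifted right by 16. */
static int32_t fixmul16_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> 16);
}

/* Same result assembled from the two 32-bit halves of the product. */
static int32_t fixmul16_halves(int32_t x, int32_t y)
{
    int64_t  prod = (int64_t)x * y;
    uint32_t hi   = (uint32_t)((uint64_t)prod >> 32);  /* high half */
    uint32_t lo   = (uint32_t)prod;                    /* low half  */

    /* low 16 bits of hi become the top, high 16 bits of lo the bottom */
    uint32_t res = (hi << 16) | (lo >> 16);
    return (int32_t)res;
}

/* Hypothetical helper: checks that both formulations agree for one input. */
static void fixmul16_check(int32_t x, int32_t y)
{
    assert(fixmul16_ref(x, y) == fixmul16_halves(x, y));
}
```

For example, 1.5 * 2.5 in 16.16 format is fixmul16_halves(0x18000, 0x28000), which yields 0x3C000 (3.75).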

Distinguish between:
a) models with large IRAM -> put <WMAProDecodeCtx.tmp> in IRAM.
b) models with normal IRAM -> cannot put <WMAProDecodeCtx.tmp> in IRAM, but move several window tables to IRAM as the second-best option.
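The a)/b) split above could be expressed with a compile-time switch roughly like this. Every name here (EXAMPLE_IRAM_BYTES, EXAMPLE_LARGE_IRAM_THRESHOLD, EXAMPLE_IRAM_ATTR, the buffer names and sizes) is a placeholder invented for the sketch, not a real build macro or codec symbol, and the threshold value is an arbitrary assumption.

```c
#include <assert.h>
#include <stdint.h>

/* All names and sizes below are hypothetical placeholders. */
#ifndef EXAMPLE_IRAM_BYTES
#define EXAMPLE_IRAM_BYTES (48 * 1024)            /* assumed "normal" IRAM */
#endif
#define EXAMPLE_LARGE_IRAM_THRESHOLD (96 * 1024)  /* assumed cut-off */

/* On a real target this would expand to a section attribute;
 * it is left empty here so the sketch builds on a host. */
#define EXAMPLE_IRAM_ATTR

#if EXAMPLE_IRAM_BYTES >= EXAMPLE_LARGE_IRAM_THRESHOLD
/* a) large IRAM: the big scratch buffer goes to IRAM */
static int32_t example_tmp[2048] EXAMPLE_IRAM_ATTR;
static int32_t example_window[128];
enum { EXAMPLE_TMP_IN_IRAM = 1, EXAMPLE_WINDOWS_IN_IRAM = 0 };
#else
/* b) normal IRAM: only the smaller window tables go to IRAM */
static int32_t example_tmp[2048];
static int32_t example_window[128] EXAMPLE_IRAM_ATTR;
enum { EXAMPLE_TMP_IN_IRAM = 0, EXAMPLE_WINDOWS_IN_IRAM = 1 };
#endif

/* Touches both buffers and reports which placement was compiled in. */
static int example_tmp_in_iram(void)
{
    assert(EXAMPLE_TMP_IN_IRAM + EXAMPLE_WINDOWS_IN_IRAM == 1);
    example_tmp[0] = example_window[0];
    return EXAMPLE_TMP_IN_IRAM;
}
```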

Minor change: we do not need vector_fixmul_scalar_16(); use vector_fixmul_scalar_24() instead. vector_fixmul_scalar_16() was only used to scale with sqrt(2), and we can afford higher precision for this rare use case.
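A rough host-side sketch of why 24 fractional bits help for the sqrt(2) scalar: it compares the rounding error of the constant in the two formats. The helpers to_fixed, const_error and scale_by_sqrt2_q24 are made up for this example and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SQRT2 1.4142135623730951   /* sqrt(2) as a double */

/* Round a positive real value to fixed point with frac_bits fractional bits. */
static int32_t to_fixed(double v, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits < 31);
    return (int32_t)(v * (double)((int64_t)1 << frac_bits) + 0.5);
}

/* Absolute error of sqrt(2) after rounding to the given format. */
static double const_error(int frac_bits)
{
    double back = (double)to_fixed(SQRT2, frac_bits)
                / (double)((int64_t)1 << frac_bits);
    double err = back - SQRT2;
    return err < 0 ? -err : err;
}

/* Scale one sample by sqrt(2) using a scalar with 24 fractional bits. */
static int32_t scale_by_sqrt2_q24(int32_t x)
{
    return (int32_t)(((int64_t)x * to_fixed(SQRT2, 24)) >> 24);
}
```

With these definitions const_error(24) comes out smaller than const_error(16), at the cost of 8 fewer bits of headroom in the scalar.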

v07: Approach using a larger fract part for mdct and windowing, which preserves higher precision. The larger fract part is introduced by using fixmul16 instead of fixmul24 in the vector_fixmul_scalar() function. The asm for Coldfire has not been changed yet; Coldfire in particular should speed up a bit through this change as well.

ToDo: Somebody needs to change the Coldfire asm and do some cleanup of this change.

Doing this with the current Coldfire fixmul16 asm is pretty trivial, but I think there's a better way: switching the emac to integer mode, because we can then get the result with only one multiplication instead of the 2 used in the fixmul16 function. Otoh, if this multiplication overflows 48 bits, that will not work very well I think. Do you know if this can overflow into the top 16 bits of the 64-bit result?

Hi, I tested a version similar to your version 1 and also another approach, switching the emac to unsigned integer mode and using that to get the lower half of the result, but both turned out to be slower than the C loop using the fixmul16 macro, so I think we can leave this as it is and close this task.
edit: my idea above about using the extension word of the emac to get the full result with one multiplication cannot work.

No, version 2 is not correct: we cannot get away with doing only one multiply because the multiplier yields only a 40-bit result. I was confused and thought it gave a 48-bit result, but that is not the case.
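For reference, a small host-side check of how many bits a 32x32 signed product can need. It takes the 40-bit and 48-bit figures from the comments above at face value and says nothing about how the emac behaves internally. fits_signed and product_fits are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if v is representable as a signed two's-complement
 * value that is `bits` wide, 0 otherwise. */
static int fits_signed(int64_t v, int bits)
{
    assert(bits >= 1 && bits <= 64);
    if (bits == 64)
        return 1;
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    int64_t min = -max - 1;
    return v >= min && v <= max;
}

/* Does the full product of two 32-bit operands fit in `bits` bits? */
static int product_fits(int32_t x, int32_t y, int bits)
{
    return fits_signed((int64_t)x * y, bits);
}
```

Two operands of magnitude 2^24 already give a product of 2^48, which fits neither a 40-bit nor a 48-bit signed result, so a single multiply is only safe when the operand ranges are known to be small.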