It works, but it's also 20% slower. That's why I haven't tried to read all 4 values in one step in the asm.

And regarding your advice not to use x86/64 instructions to fill the SSE2 registers: to my shame, I don't know how to do it using SSE2 instructions. This is my first SSE2 code, and I searched on Google but couldn't find any information. I tried all the SSE2 mov* instructions in the form "mov* xmm0, [xmm2 + xmm4]", but the compiler didn't like any of them.
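As far as I know, SSE2 memory operands can only be addressed through general-purpose registers, so a form like "[xmm2 + xmm4]" has no encoding, which would explain the rejection. A minimal sketch of filling one register from four separate addresses with SSE2 moves and unpacks is below, written with C intrinsics rather than asm. The function name load4_scattered and its pointer parameters are hypothetical and not taken from the code under discussion.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: gather four 32-bit pixels from unrelated addresses
   into one XMM register, lowest lane first. */
static inline __m128i load4_scattered(const uint32_t *p0, const uint32_t *p1,
                                      const uint32_t *p2, const uint32_t *p3)
{
    /* Each of these is a 32-bit move into the low lane (MOVD);
       the upper three lanes are zeroed. */
    __m128i a = _mm_cvtsi32_si128((int)*p0);
    __m128i b = _mm_cvtsi32_si128((int)*p1);
    __m128i c = _mm_cvtsi32_si128((int)*p2);
    __m128i d = _mm_cvtsi32_si128((int)*p3);

    __m128i ab = _mm_unpacklo_epi32(a, b);  /* PUNPCKLDQ:  [p0, p1, 0, 0] */
    __m128i cd = _mm_unpacklo_epi32(c, d);  /* PUNPCKLDQ:  [p2, p3, 0, 0] */
    return _mm_unpacklo_epi64(ab, cd);      /* PUNPCKLQDQ: [p0, p1, p2, p3] */
}
```

If the four values happen to sit next to each other in memory, a single unaligned load (_mm_loadu_si128, i.e. MOVDQU) through an ordinary pointer should do the same job in one step.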

You say to work on 16-bit values, but the problem is that in those lines of code there are four 32-bit variables: w11, w12, w21, w22. So somehow I have to work in 32 bits. That's why I used xmm0, xmm2, xmm4 and xmm6 (each one is 16x2). Plus, the result of multiplying two 32-bit values is a 64-bit value, so I had to "sacrifice" xmm2 and xmm6 by putting the 64-bit results in xmm0 and xmm4.
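A sketch of what the 16-bit route might look like, under a loud assumption: that w11, w12, w21 and w22 are weights in the range 0..256 that sum to 256, even though they are stored in 32-bit variables. In that case each product of an 8-bit channel and a weight fits in 16 bits (255 * 256 < 65536), so all four BGRA channels of a pixel can be multiplied at once and no 64-bit results are needed. The function name blend_bgra and the weight range are hypothetical; the real code may scale its weights differently.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: weighted sum of four BGRA pixels.
   ASSUMPTION: each weight is in 0..256 and the four weights sum to 256. */
static inline uint32_t blend_bgra(uint32_t p11, uint32_t p12,
                                  uint32_t p21, uint32_t p22,
                                  int w11, int w12, int w21, int w22)
{
    const __m128i zero = _mm_setzero_si128();

    /* Widen the 4 bytes of each pixel to four 16-bit lanes (PUNPCKLBW). */
    __m128i a = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)p11), zero);
    __m128i b = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)p12), zero);
    __m128i c = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)p21), zero);
    __m128i d = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)p22), zero);

    /* PMULLW keeps the low 16 bits of each product; under the assumed
       weight range nothing is lost. */
    a = _mm_mullo_epi16(a, _mm_set1_epi16((short)w11));
    b = _mm_mullo_epi16(b, _mm_set1_epi16((short)w12));
    c = _mm_mullo_epi16(c, _mm_set1_epi16((short)w21));
    d = _mm_mullo_epi16(d, _mm_set1_epi16((short)w22));

    /* The total stays below 65536 because the weights sum to 256. */
    __m128i sum = _mm_add_epi16(_mm_add_epi16(a, b), _mm_add_epi16(c, d));

    /* Divide by 256 and pack the four channels back into one 32-bit pixel. */
    sum = _mm_srli_epi16(sum, 8);
    return (uint32_t)_mm_cvtsi128_si32(_mm_packus_epi16(sum, zero));
}
```

If the weights really need more than 16 bits of range, this shortcut does not apply and the 32-bit multiply is still required.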

But the problem is that your code is so complex (for assembler, I mean) and has many variables. Even if I use pointers, it's not easy to find a way to parallelize the operations on the BGRA channels and to use ONLY SSE2 registers. A genius in assembler is needed for this. Unfortunately, I'm only a beginner.

Thank you very much for your effort and time to help me. And sorry I couldn't do it.