Either MOVMSKPS (Extract Packed Single-Precision Floating-Point Sign Mask) isn't following the specification, or I am misunderstanding something. When I use it, it appears to reverse the order of the bits it extracts from the XMM register before placing them in the low-order bits of the general-purpose register. The Intel manual and every source I can find online say that the order should be preserved, not reversed, as shown in this extraction:

I've done some further testing, and confirmed that the order of the mask bits is consistently reversed before being placed in the general purpose register.
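For reference, here is a rough Python model of what the manual describes, not code from the original test program. It assumes the four floats are laid out in memory in the order given and loaded with an aligned move, so the float at the lowest address becomes lane 0 (bits 31:0). The name movmskps_model is made up for this illustration.

```python
import struct

def movmskps_model(xmm_bytes: bytes) -> int:
    """Model of MOVMSKPS: bit i of the result is the sign bit of float lane i."""
    if len(xmm_bytes) != 16:
        raise ValueError("an XMM register holds exactly 16 bytes")
    # Lane 0 is the dword at the lowest address, i.e. bits 31:0 of the register.
    lanes = struct.unpack("<4I", xmm_bytes)
    mask = 0
    for i, lane in enumerate(lanes):
        mask |= (lane >> 31) << i  # sign bit of lane i goes to bit i
    return mask

# Four floats in memory order: the first two are negative.
data = struct.pack("<4f", -1.0, -2.0, 3.0, 4.0)
mask = movmskps_model(data)
print(f"{mask:04b}")  # prints 0011
```

Bit 0 of the mask is printed on the right, so the first float in memory ends up as the rightmost digit. If the register contents are read left to right in memory order, the mask can look reversed even though the bit order is preserved.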

I can work with it like this if it reliably continues to behave this way, but I'd like to make sure this is what's supposed to be happening. I imagine I have a little-endian problem somewhere in my understanding, but I've more than triple-checked, and it really does look like either the Intel manual or my processor is wrong in this case. (It's an Intel Core 2 Duo E8400 Wolfdale.)

uint128 would suggest that movdqa is indeed reversing the byte order, and this sounds consistent for memory/register data transfer operations on the x86, especially since you are using successive db's instead of something like do.

I was considering that too, but look at what v16_int8 shows. Seeing that, I was thinking that when GDB does a print for an XMM register, it might format the appearance of uint128 to match as it would appear if you placed it back in memory, due to little-endianness. I might be wrong there, though.

I'll try some more tests to see if I can figure out the exact operation and endianness of MOVDQA when I get back to my machine. If it really is little-endian and I've been reading GDB's "print/x $xmm0" incorrectly all this time, then I have to question how on earth the rest of my program works.

OK, I did a bit of testing, and I was indeed reading the debugger's output incorrectly. Instead of looking at the v16_int8 field of "print/x $xmm0", I should have been looking at its uint128 field. Also, little-endian ordering does apply to XMM registers. I'll probably have to watch out for whether the little-endianness reverses the order of all 16 bytes as one chunk or of smaller chunks, depending on what kind of data the instruction I'm using thinks it's dealing with.
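The difference between the two views can be sketched in Python. This is an illustration only, assuming GDB lists v16_int8 starting from the byte at the lowest address and prints uint128 as a single integer with the most significant digit first. The sample bytes are made up.

```python
# 16 bytes as they would sit in memory: lowest address holds 0x00, highest 0x0f.
data = bytes(range(16))

# Array-style view: element 0 (lowest address) is listed first, on the left.
v16_int8 = list(data)

# Integer-style view: the whole register as one little-endian 128-bit number,
# so the byte from the lowest address is the least significant and prints last.
uint128 = int.from_bytes(data, "little")

print([hex(b) for b in v16_int8])
print(f"0x{uint128:032x}")  # prints 0x0f0e0d0c0b0a09080706050403020100
```

Both lines describe the same register contents. Only the direction in which the bytes are written out differs.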

To help me test, I used MOVLPD (Move Low Packed Double-Precision Floating-Point Value) and PEXTRB (Extract Byte). I learned that the low order bytes are on the right side of what uint128 shows, and that an offset starts from the right side of what uint128 shows. I'm all good now; thank you for your help.
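The same conclusion can be modeled in Python. This is a sketch that treats the register as one 128-bit integer, the way uint128 shows it. The function names pextrb_model and low_qword_model are invented for this example and only approximate what PEXTRB and a MOVLPD store select.

```python
def pextrb_model(xmm_value: int, index: int) -> int:
    """Model of PEXTRB: byte `index` counted from the low-order end."""
    return (xmm_value >> (8 * (index & 0xF))) & 0xFF

def low_qword_model(xmm_value: int) -> int:
    """Model of the low 64 bits, which a MOVLPD store writes to memory."""
    return xmm_value & 0xFFFFFFFFFFFFFFFF

# Register loaded from memory bytes 0x00..0x0f (lowest address first).
xmm0 = int.from_bytes(bytes(range(16)), "little")

print(hex(pextrb_model(xmm0, 0)))    # prints 0x0, the rightmost byte of uint128
print(hex(pextrb_model(xmm0, 15)))   # prints 0xf, the leftmost byte of uint128
print(f"0x{low_qword_model(xmm0):016x}")  # prints 0x0706050403020100
```

Offset 0 selects the rightmost byte of the uint128 display, which is the byte that came from the lowest memory address.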