X86 Assembly/SSE

SSE stands for Streaming SIMD Extensions. It is essentially the floating-point equivalent of the MMX instructions. The SSE registers are 128 bits wide and can be used to perform operations on a variety of data sizes and types. Unlike the MMX registers, the SSE registers do not overlap with the x87 floating-point stack.

SSE, introduced by Intel in 1999 with the Pentium III, creates eight new 128-bit registers:

XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7

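Originally, an SSE register could only be used as four 32-bit single-precision floating-point numbers (the equivalent of a float in C). SSE2 expanded the capabilities of the XMM registers, so they can now also be used as:

2 64-bit double-precision floating-point numbers (the equivalent of a double in C)
2 64-bit integers
4 32-bit integers
8 16-bit integers
16 8-bit integers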
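One way to picture these interpretations is as a C union in which every member covers the same 16 bytes. The sketch below is only an illustration: the name xmm_view is hypothetical, and it assumes a platform where float is 32 bits and double is 64 bits.

```c
#include <stdint.h>

/* Hypothetical illustration type: each member is a different view
   of the same 128 bits, mirroring how an XMM register can be used. */
typedef union {
    float   f32[4];   /* four single-precision floats (SSE)  */
    double  f64[2];   /* two double-precision floats  (SSE2) */
    int64_t i64[2];   /* two 64-bit integers          (SSE2) */
    int32_t i32[4];   /* four 32-bit integers         (SSE2) */
    int16_t i16[8];   /* eight 16-bit integers        (SSE2) */
    int8_t  i8[16];   /* sixteen 8-bit integers       (SSE2) */
} xmm_view;
```

The instruction that is executed, not the register, decides which view applies: the same bits are treated as four floats by addps and as two doubles by addpd.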

The following program (using NASM syntax) performs data movements using SIMD instructions.

;; nasm -felf32 -g sseMove.asm
;; ld -g sseMove.o        (on a 64-bit host, add -m elf_i386)
global _start

section .data align=16
v1:     dd 1.1, 2.2, 3.3, 4.4       ; Four single-precision floats, 32 bits each
v1dp:   dq 1.1, 2.2                 ; Two double-precision floats, 64 bits each
v2:     dd 5.5, 6.6, 7.7, 8.8
v2s1:   dd 5.5, 6.6, 7.7, -8.8
v2s2:   dd 5.5, 6.6, -7.7, -8.8
v2s3:   dd 5.5, -6.6, -7.7, -8.8
v2s4:   dd -5.5, -6.6, -7.7, -8.8
num1:   dd 1.2
v3:     dd 1.2, 2.3, 4.5, 6.7       ; No longer 16-byte aligned
v3dp:   dq 1.2, 2.3                 ; No longer 16-byte aligned

section .bss
mask1:  resd 1
mask2:  resd 1
mask3:  resd 1
mask4:  resd 1

section .text
_start:
    ;; op dst, src

    ;;
    ;; SSE
    ;;
    ; Using movaps since the vectors are 16-byte aligned
    movaps  xmm0, [v1]      ; Move four 32-bit (single-precision) floats to xmm0
    movaps  xmm1, [v2]
    movups  xmm2, [v3]      ; Need to use movups since v3 is not 16-byte aligned
    ;movaps xmm3, [v3]      ; This would seg fault if uncommented
    movss   xmm3, [num1]    ; Move 32-bit float num1 to the least significant element of xmm3
    movss   xmm3, [v3]      ; Move first 32-bit float of v3 to the least significant element of xmm3
    movlps  xmm4, [v3]      ; Move 64 bits (two single-precision floats) from memory to the lower 64 bits of xmm4
    movhps  xmm4, [v2]      ; Move 64 bits (two single-precision floats) from memory to the higher 64 bits of xmm4

    ; Source and destination for movhlps and movlhps must be xmm registers
    movhlps xmm5, xmm4      ; Move the higher 64 bits of the source xmm4 to the lower 64 bits of the destination xmm5
    movlhps xmm5, xmm4      ; Move the lower 64 bits of the source xmm4 to the higher 64 bits of the destination xmm5

    movaps   xmm6, [v2s1]
    movmskps eax, xmm6      ; Extract the sign bits of the four 32-bit floats in xmm6 into a 4-bit mask in eax
    mov      [mask1], eax   ; Should be 8

    movaps   xmm6, [v2s2]
    movmskps eax, xmm6
    mov      [mask2], eax   ; Should be 12

    movaps   xmm6, [v2s3]
    movmskps eax, xmm6
    mov      [mask3], eax   ; Should be 14

    movaps   xmm6, [v2s4]
    movmskps eax, xmm6
    mov      [mask4], eax   ; Should be 15

    ;;
    ;; SSE2
    ;;
    movapd  xmm6, [v1dp]    ; Move two 64-bit (double-precision) floats to xmm6; movapd works since v1dp is 16-byte aligned
    ; The next two instructions together should have the same result as movapd xmm6, [v1dp]
    movhpd  xmm6, [v1dp+8]  ; Move a 64-bit (double-precision) float into the higher 64 bits of xmm6
    movlpd  xmm6, [v1dp]    ; Move a 64-bit (double-precision) float into the lower 64 bits of xmm6
    movupd  xmm6, [v3dp]    ; Move two 64-bit floats to xmm6, using movupd since v3dp is not 16-byte aligned

    ; exit(0) using the 32-bit Linux system call interface
    mov     eax, 1
    xor     ebx, ebx
    int     0x80
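The mask values that the program stores in mask1 to mask4 can be checked with a small model of movmskps in plain C. This is only a sketch of the instruction's behaviour, assuming IEEE 754 32-bit floats; movmskps_model is a hypothetical helper name, not a real intrinsic.

```c
#include <stdint.h>
#include <string.h>

/* Plain C model of movmskps: bit i of the returned mask is the
   sign bit of lane i (lane 0 is the least significant element). */
unsigned movmskps_model(const float lanes[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &lanes[i], sizeof bits);  /* reinterpret the float's bit pattern */
        mask |= (bits >> 31) << i;              /* sign bit of lane i becomes mask bit i */
    }
    return mask;
}
```

Calling it with the four values of v2s1 (only the last element negative) gives 8, and with v2s4 (all elements negative) gives 15, matching the comments in the listing.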

The following program (using NASM syntax) performs a few SIMD operations on some numbers.

global _start

section .data
v1:     dd 1.1, 2.2, 3.3, 4.4       ; first set of 4 numbers
v2:     dd 5.5, 6.6, 7.7, 8.8       ; second set

section .bss
v3:     resd 4                      ; result

section .text
_start:
    movups  xmm0, [v1]      ; load v1 into xmm0
    movups  xmm1, [v2]      ; load v2 into xmm1

    addps   xmm0, xmm1      ; add the 4 numbers in xmm1 (from v2) to the 4 numbers in xmm0 (from v1),
                            ; store in xmm0. For the first float the result will be 5.5+1.1=6.6
    mulps   xmm0, xmm1      ; multiply the four numbers in xmm1 (from v2, unchanged) with the results
                            ; of the previous calculation (in xmm0), store in xmm0.
                            ; For the first float the result will be 5.5*6.6=36.3
    subps   xmm0, xmm1      ; subtract the four numbers in xmm1 (from v2, still unchanged) from the
                            ; results of the previous calculation (in xmm0), store in xmm0.
                            ; For the first float the result will be 36.3-5.5=30.8

    movups  [v3], xmm0      ; store the result (xmm0) in v3

    ; end program: exit(0) using the 32-bit Linux system call interface
    mov     eax, 1
    xor     ebx, ebx
    int     0x80
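To see what all four elements of v3 should contain, the same sequence can be modelled one lane at a time in plain C. This is a sketch of the arithmetic only, not of how the hardware executes it; sse_sequence_model is a hypothetical name.

```c
/* Scalar C model of the addps / mulps / subps sequence above.
   Each loop iteration does for one lane what the packed
   instructions do for all four lanes at once. */
void sse_sequence_model(const float v1[4], const float v2[4], float v3[4])
{
    for (int i = 0; i < 4; i++) {
        float x = v1[i] + v2[i];   /* addps  xmm0, xmm1 */
        x = x * v2[i];             /* mulps  xmm0, xmm1 */
        x = x - v2[i];             /* subps  xmm0, xmm1 */
        v3[i] = x;                 /* movups [v3], xmm0 */
    }
}
```

With the v1 and v2 values from the listing, the four results come out at approximately 30.8, 51.48, 77.0 and 107.36, subject to single-precision rounding.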

When stepping through these programs in GDB, two commands are useful for inspecting the results:

p

short for print, prints a given register or variable. Registers are prefixed by $ in GDB, as in p $xmm0.

x

short for examine, examines a given memory address. In a command such as x/4f &v1, the "/4f" means "4 floats" (floats in GDB are 32 bits). You can use c for chars or x for hexadecimal in place of f, and any other count in place of 4. The "&" takes the address of v1, as in C.

shufps can be used to shuffle packed single-precision floats. The instruction takes three operands (shown here in Intel order): arg1, an xmm register; arg2, an xmm register or a 128-bit memory location; and IMM8, an 8-bit immediate control byte. shufps selects two elements from each of arg1 and arg2 and writes the four selected elements to arg1, which is the destination. arg2 is only read, which is why it may be a memory operand. The lower two elements of the result come from arg1 and the higher two elements from arg2.

The IMM8 control byte is split into four 2-bit fields, one for each element of the result:

Field       Selects from   Result element written
IMM8[1:0]   arg1           least significant element (bits 31-0)
IMM8[3:2]   arg1           second element (bits 63-32)
IMM8[5:4]   arg2           third element (bits 95-64)
IMM8[7:6]   arg2           most significant element (bits 127-96)

Each field is interpreted in the same way:

Field value   Source element selected
00b           least significant element (bits 31-0)
01b           second element (bits 63-32)
10b           third element (bits 95-64)
11b           most significant element (bits 127-96)

IMM8 Example

Consider the byte 0x1B:

Byte value                      0x1B
Nibble value                    0x1           0xB
2-bit integer (decimal) value   0      1      2      3
Bit value                       0  0   0  1   1  0   1  1
Bit number (0 being LSB)        7  6   5  4   3  2   1  0

The 2-bit values shown above determine which elements are copied to the result, which is written to arg1. Bits 7-4 are "indexes" into arg2, and bits 3-0 are "indexes" into arg1.

Since bits 7-6 are 0, the least significant element of arg2 is copied to the most significant element of the result, bits 127-96.

Since bits 5-4 are 1, the second element of arg2 is copied to the third element of the result, bits 95-64.

Since bits 3-2 are 2, the third element of arg1 is copied to the second element of the result, bits 63-32.

Since bits 1-0 are 3, the fourth element of arg1 is copied to the least significant element of the result, bits 31-0.

Note that when the first and second arguments are the same register, the mask 0x1B effectively reverses the order of the floats in that XMM register, since the 2-bit integers are 0, 1, 2, 3. Had they been 3, 2, 1, 0 (0xE4), the instruction would be a no-op. Had they been 0, 0, 0, 0 (0x00), it would broadcast the least significant 32 bits to all four elements.
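The element selection can be checked with a small plain C model of the instruction. This is a sketch of the documented behaviour of shufps with Intel operand order (the first operand is both a source and the destination); shufps_model is a hypothetical helper name, not a real intrinsic.

```c
/* Plain C model of "shufps arg1, arg2, imm8" (Intel operand order).
   Index 0 is the least significant element. The result is built in a
   temporary so that the model also works when arg1 and arg2 are the
   same array, as when both operands are the same register. */
void shufps_model(float arg1[4], const float arg2[4], unsigned imm8)
{
    float result[4];
    result[0] = arg1[(imm8 >> 0) & 3];   /* IMM8[1:0] selects from arg1 */
    result[1] = arg1[(imm8 >> 2) & 3];   /* IMM8[3:2] selects from arg1 */
    result[2] = arg2[(imm8 >> 4) & 3];   /* IMM8[5:4] selects from arg2 */
    result[3] = arg2[(imm8 >> 6) & 3];   /* IMM8[7:6] selects from arg2 */
    for (int i = 0; i < 4; i++)
        arg1[i] = result[i];             /* the result replaces arg1 */
}
```

Passing the same four-element array as both arguments with a control byte of 0x1B reverses it, 0xE4 leaves it unchanged, and 0x00 fills it with copies of element 0.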

SSE 4.2 adds four string and text processing instructions: PCMPISTRI, PCMPISTRM, PCMPESTRI and PCMPESTRM. These instructions take three parameters: arg1, an xmm register; arg2, an xmm register or a 128-bit memory location; and IMM8, an 8-bit immediate control byte. These instructions perform arithmetic comparisons between the packed contents of arg1 and arg2. IMM8 specifies the format of the input and output as well as the operation of two intermediate stages of processing. The results of stage 1 and stage 2 of intermediate processing will be referred to as IntRes1 and IntRes2 respectively. These instructions also provide additional information about the result through overloaded use of the arithmetic flags (AF, CF, OF, PF, SF and ZF).

The instructions proceed in multiple steps:

1. arg1 and arg2 are compared.

2. An aggregation operation is applied to the result of the comparison, with the result flowing into IntRes1.

3. An optional negation is performed, with the result flowing into IntRes2.

4. An output in the form of an index (in ECX) or a mask (in XMM0) is produced.

The IMM8 control byte is split into four groups of bit fields that control the following settings:

IMM8[1:0] specifies the format of the 128-bit source data (arg1 and arg2):

IMM8[1:0]   Description
00b         unsigned bytes (16 packed unsigned bytes)
01b         unsigned words (8 packed unsigned words)
10b         signed bytes (16 packed signed bytes)
11b         signed words (8 packed signed words)

IMM8[3:2] specifies the aggregation operation whose result is placed in intermediate result 1, referred to as IntRes1. The size of IntRes1 depends on the format of the source data: 16 bits for packed bytes and 8 bits for packed words. In the examples below, IntRes1 is written with bit 0 (the result for the first character) on the left.

IMM8[3:2] = 00b

Equal Any: arg1 is a character set and arg2 is the string to search in. IntRes1[i] is set to 1 if arg2[i] is in the set represented by arg1. The comparison is case-sensitive, so the uppercase "E" below does not match the "e" in the set:

arg1    = "aeiou"
arg2    = "Example string 1"
IntRes1 = 0010001000010000

IMM8[3:2] = 01b

Ranges: arg1 is a set of character ranges, e.g. "09az" means all characters from 0 to 9 and from a to z, and arg2 is the string to search over. IntRes1[i] is set to 1 if arg2[i] is in any of the ranges represented by arg1:

arg1    = "09az"
arg2    = "Testing 1 2 3, T"
IntRes1 = 0111111010101000

IMM8[3:2] = 10b

Equal Each: arg1 is string one and arg2 is string two. IntRes1[i] is set to 1 if arg1[i] == arg2[i].
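The Equal Any and Ranges aggregations described above can be modelled in plain C for the unsigned-byte format. This is a simplified sketch, not the instructions themselves. It assumes NUL-terminated inputs of at most 16 bytes, treats positions past the end of arg2 as 0, and writes IntRes1 as a string of '0' and '1' characters with bit 0 first. The function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Equal Any: out[i] is '1' if str[i] occurs anywhere in set. */
void equal_any_model(const char *set, const char *str, char out[17])
{
    size_t n = strlen(str);
    if (n > 16)
        n = 16;
    for (size_t i = 0; i < 16; i++)
        out[i] = (i < n && strchr(set, str[i]) != NULL) ? '1' : '0';
    out[16] = '\0';
}

/* Ranges: ranges holds (low, high) byte pairs; out[i] is '1' if
   str[i] lies inside any pair, compared as unsigned bytes. */
void ranges_model(const char *ranges, const char *str, char out[17])
{
    size_t nr = strlen(ranges);
    size_t n = strlen(str);
    if (n > 16)
        n = 16;
    for (size_t i = 0; i < 16; i++) {
        int hit = 0;
        if (i < n) {
            unsigned char c = (unsigned char)str[i];
            for (size_t j = 0; j + 1 < nr; j += 2) {
                unsigned char lo = (unsigned char)ranges[j];
                unsigned char hi = (unsigned char)ranges[j + 1];
                if (lo <= c && c <= hi)
                    hit = 1;
            }
        }
        out[i] = hit ? '1' : '0';
    }
    out[16] = '\0';
}
```

Both functions compare raw byte values, so matching is case-sensitive, as it is in the instructions.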

There are literally hundreds of SSE instructions, some of which are capable of much more than simple SIMD arithmetic. For more in-depth references take a look at the resources chapter of this book.

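You may notice that many floating-point SSE instructions end with something like PS or SD. These suffixes differentiate between different versions of the operation. The first letter describes whether the instruction is Packed or Scalar. Packed operations are applied to every element of the register, while scalar operations are applied only to the lowest element. For example, a packed add of two registers holding four floats computes v1[i] = v1[i] + v2[i] for every i from 0 to 3, whereas a scalar add computes only v1[0] = v1[0] + v2[0] and leaves the other elements of v1 unchanged. The second letter gives the element type: S for single precision and D for double precision.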
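The difference between a packed and a scalar operation can be sketched in plain C, using addps and addss as the pair of instructions. These are models of the documented behaviour, with index 0 as the least significant element; the function names are hypothetical.

```c
/* Packed single-precision add (addps): every lane of dst is updated. */
void addps_model(float dst[4], const float src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] += src[i];
}

/* Scalar single-precision add (addss): only lane 0 of dst is updated;
   lanes 1 to 3 keep their previous values. */
void addss_model(float dst[4], const float src[4])
{
    dst[0] += src[0];
}
```

Starting from dst = {1, 2, 3, 4} and src = {10, 20, 30, 40}, the packed version leaves {11, 22, 33, 44} in dst, while the scalar version leaves {11, 2, 3, 4}.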

* CMPSD and MOVSD have the same name as the string instruction mnemonics CMPSD (CMPS) and MOVSD (MOVS); however, the former refer to scalar double-precision floating-points whereas the latter refer to doubleword strings.