Introduction

The Intel Streaming SIMD Extensions (SSE) technology enhances the performance of floating-point operations. Visual Studio .NET 2003 supports a set of SSE Intrinsics which allow SSE instructions to be used directly from C++ code, without writing Assembly instructions. The MSDN SSE topics [2] may be confusing for programmers who are not familiar with SSE Assembly programming. However, reading the Intel Software manuals [1] together with MSDN makes it possible to understand the basics of SSE programming.

SIMD stands for the single-instruction, multiple-data execution model. Consider the following programming task: computing the square root of each element in a long floating-point array. The algorithm for this task may be written as follows:

foreach f in array
f = sqrt(f)

Let's be more specific:

foreach f in array
{
load f to the floating-point register
calculate the square root
write the result from the register to memory
}

Processors with Intel SSE support have eight 128-bit registers, each of which may contain 4 single-precision floating-point numbers. SSE is a set of instructions which load floating-point numbers into the 128-bit registers, perform arithmetic and logical operations on them, and write the result back to memory. Using SSE technology, the algorithm may be written as:

foreach 4 members in array
{
load 4 members to the SSE register
calculate 4 square roots in one operation
write the result from the register to memory
}
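The pseudocode above can be sketched with SSE Intrinsics as follows. This is a minimal illustration, not code from the demo projects: the function name sqrt_in_place_sse is hypothetical, and the sketch assumes that the array is 16-byte aligned and that its length is a multiple of 4.

```cpp
#include <xmmintrin.h>  // SSE intrinsics and the __m128 type
#include <cstddef>

// Hypothetical helper: replaces each element with its square root.
// Assumes `data` is 16-byte aligned and `count` is a multiple of 4.
void sqrt_in_place_sse(float* data, std::size_t count)
{
    for (std::size_t i = 0; i < count; i += 4)
    {
        __m128 v = _mm_load_ps(data + i);  // load 4 floats into an SSE register
        v = _mm_sqrt_ps(v);                // calculate 4 square roots in one operation
        _mm_store_ps(data + i, v);         // write the 4 results back to memory
    }
}
```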

A C++ programmer using SSE Intrinsics does not need to manage registers. The programmer works with the 128-bit __m128 type and a set of functions that perform arithmetic and logical operations. It is up to the C++ compiler to decide which SSE registers to use and how to optimize the code. SSE technology is useful when the same operation is applied to each element of a long floating-point array.

SSE Programming Details

Include Files

All SSE Intrinsics and the __m128 data type are defined in the xmmintrin.h file:

#include <xmmintrin.h>

Since SSE instructions are compiler intrinsics and not functions, there are no lib-files.

Data Alignment

Each float array processed by SSE instructions should have 16-byte alignment. A static array is declared using the __declspec(align(16)) keyword:

__declspec(align(16)) float m_fArray[ARRAY_SIZE];

A dynamic array should be allocated using the _aligned_malloc function:

m_fArray = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);

An array allocated by the _aligned_malloc function is released using the _aligned_free function:

_aligned_free(m_fArray);
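The allocation, alignment check and release can be combined in a short sketch. The helper names below are hypothetical. On Visual C++ the sketch uses _aligned_malloc and _aligned_free as described above; for other compilers it assumes _mm_malloc and _mm_free from xmmintrin.h as a substitute, which is not part of the original article.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(_MSC_VER)
#include <malloc.h>     // _aligned_malloc, _aligned_free
#else
#include <xmmintrin.h>  // _mm_malloc, _mm_free (assumed substitute on GCC/Clang)
#endif

// Hypothetical helper: allocates `count` floats on a 16-byte boundary.
float* alloc_float_array16(std::size_t count)
{
#if defined(_MSC_VER)
    return static_cast<float*>(_aligned_malloc(count * sizeof(float), 16));
#else
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
#endif
}

// Hypothetical helper: releases memory obtained from alloc_float_array16.
void free_float_array16(float* p)
{
#if defined(_MSC_VER)
    _aligned_free(p);
#else
    _mm_free(p);
#endif
}

// Returns true if the address is a multiple of 16.
bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

A check such as is_aligned16 may be useful in debug builds before a pointer is passed to code that uses aligned loads.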

__m128 Data Type

Variables of this type are used as operands of SSE Intrinsics. Their contents should not be accessed directly. Variables of type __m128 are automatically aligned on 16-byte boundaries.
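A minimal sketch of this rule: data enters and leaves __m128 variables only through load, set and store intrinsics, never through direct member access. The function name add_offset4 is hypothetical, and both pointers are assumed to be 16-byte aligned.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: adds `offset` to four consecutive floats.
// Assumes `src` and `dst` are 16-byte aligned.
void add_offset4(const float* src, float offset, float* dst)
{
    __m128 values  = _mm_load_ps(src);         // 4 floats from memory
    __m128 offsets = _mm_set1_ps(offset);      // the same value in all 4 positions
    __m128 sum     = _mm_add_ps(values, offsets);
    _mm_store_ps(dst, sum);                    // results are read back through a float array
}
```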

Detection of SSE Support

SSE instructions may be used only if they are supported by the processor. The Visual C++ CPUID sample [4] shows how to detect support for SSE, MMX and other processor features. It is done using the cpuid Assembly instruction. See the details in this sample and in the Intel Software manuals [1].
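As a rough illustration of such a check (not the code of the CPUID sample [4]), the sketch below reads the feature flags returned in EDX by CPUID function 1, where bit 23 reports MMX, bit 25 SSE and bit 26 SSE2. The function names are hypothetical. The sketch assumes a compiler that provides the __cpuid intrinsic (Visual C++ releases newer than the one this article targets) or the __get_cpuid helper from cpuid.h (GCC/Clang).

```cpp
#if defined(_MSC_VER)
#include <intrin.h>  // __cpuid (assumed available; newer Visual C++ only)
#else
#include <cpuid.h>   // __get_cpuid (GCC/Clang)
#endif

// Hypothetical helper: returns the EDX feature flags of CPUID function 1,
// or 0 if that function is not available.
unsigned int cpuid_feature_edx()
{
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    return static_cast<unsigned int>(regs[3]);
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return edx;
#endif
}

bool has_mmx()  { return (cpuid_feature_edx() & (1u << 23)) != 0; }
bool has_sse()  { return (cpuid_feature_edx() & (1u << 25)) != 0; }
bool has_sse2() { return (cpuid_feature_edx() & (1u << 26)) != 0; }
```

A program could call has_sse() once at startup and fall back to the plain C++ code path when it returns false.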

SSETest Demo Project

The SSETest project is a dialog-based application which performs the following calculation with three float arrays:

ARRAY_SIZE is defined as 30000. The source arrays are filled using the sin and cos functions. The Waterfall chart control written by Kris Jearakul [3] is used to show the source arrays and the result of the calculation. The calculation time (ms) is shown in the dialog. The calculation may be done in one of three ways: C++ code, C++ with SSE Intrinsics, or inline Assembly with SSE instructions.

Now let's rewrite this function using the SSE Intrinsics. To find the required SSE Intrinsics I use the following approach:

Find the Assembly SSE instruction in the Intel Software manuals [1]. First I look for the instruction in Volume 1, Chapter 9, and then find its detailed description in Volume 2. This description also contains the name of the corresponding C++ Intrinsic.

Search for the SSE Intrinsic name in the MSDN Library.

Some SSE Intrinsics are composite and cannot be found this way. They should be looked up directly in the MSDN Library (the descriptions are very short but readable). The results of such a search may be shown in the following table:

The function that uses inline Assembly is not shown here. Anyone who is interested may read it in the demo project. Calculation times on my computer:

C++ code - 26 ms

C++ with SSE Intrinsics - 9 ms

Inline Assembly with SSE instructions - 9 ms

Execution time should be estimated in the Release configuration, with compiler optimizations.

SSESample Demo Project

The SSESample project is a dialog-based application which performs the following calculation with a float array:

fResult[i] = sqrt(fSource[i]*2.8)
i = 0, 1, 2 ... ARRAY_SIZE-1

The program also calculates the minimum and maximum values in the result array. ARRAY_SIZE is defined as 100000. The result array is shown in the listbox. The calculation time (ms) for each way is shown in the dialog:

C++ code - 6 ms on my computer;

C++ code with SSE Intrinsics - 3 ms;

Inline Assembly with SSE instructions - 2 ms.

The Assembly code performs better here because of its intensive use of the SSE registers. However, C++ code with SSE Intrinsics usually performs as well as Assembly code or better, because it is difficult to write Assembly code which runs faster than the optimized code generated by the C++ compiler.
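The SSESample calculation, fResult[i] = sqrt(fSource[i]*2.8) together with the minimum and maximum of the result, could be sketched with SSE Intrinsics as shown below. This is an illustration under stated assumptions rather than the code of the demo project: the names MinMax and scaled_sqrt_sse are hypothetical, both arrays are assumed to be 16-byte aligned, the length is assumed to be a non-zero multiple of 4, and the source values are assumed to be non-negative.

```cpp
#include <xmmintrin.h>
#include <cfloat>   // FLT_MAX
#include <cstddef>

struct MinMax { float min; float max; };  // hypothetical result type

// Hypothetical helper: dst[i] = sqrt(src[i] * 2.8f); returns min and max of dst.
// Assumes 16-byte aligned arrays and `count` a non-zero multiple of 4.
MinMax scaled_sqrt_sse(const float* src, float* dst, std::size_t count)
{
    const __m128 factor = _mm_set1_ps(2.8f);
    __m128 vmin = _mm_set1_ps(FLT_MAX);   // running minimum of each of the 4 lanes
    __m128 vmax = _mm_set1_ps(-FLT_MAX);  // running maximum of each of the 4 lanes

    for (std::size_t i = 0; i < count; i += 4)
    {
        __m128 v = _mm_sqrt_ps(_mm_mul_ps(_mm_load_ps(src + i), factor));
        _mm_store_ps(dst + i, v);
        vmin = _mm_min_ps(vmin, v);
        vmax = _mm_max_ps(vmax, v);
    }

    // Reduce the 4 per-lane results to single values in plain C++.
    alignas(16) float lo[4];
    alignas(16) float hi[4];
    _mm_store_ps(lo, vmin);
    _mm_store_ps(hi, vmax);
    MinMax result = { lo[0], hi[0] };
    for (int k = 1; k < 4; ++k)
    {
        if (lo[k] < result.min) result.min = lo[k];
        if (hi[k] > result.max) result.max = hi[k];
    }
    return result;
}
```

The minimum and maximum are tracked per lane inside the loop and combined only once at the end, so the loop body itself contains no branches.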

I'm very new to this topic and have a question. When using SSE, does the number of iterations of each loop always have to be a multiple of 4?
Let's say you need to do a check (an if statement inside the loop) at every iteration. Is there a way to use SSE? Or is there any use in using it?

Well, actually, arrays do not need to be a multiple of 4. What you can do is for the portion that is a multiple of 4, do the SSE instructions, and with what's left over, do the regular way without SSE (which will be at max 3 iterations). This lets your optimization be dynamic across multiple array sizes.

So say you want to mess with an array of size 37. The first 36 you do with the SSE implementation, the last 1 you do with the normal implementation (without SSE).

It was a great question that wasn't addressed in the article. It's best practice to assume when creating such a function using SSE that it allows for arrays of any size.
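The approach described in this reply can be sketched as follows, using the square root task from the article as the example. The function name sqrt_any_size is hypothetical. The sketch uses the unaligned load and store intrinsics so that it makes no assumption about alignment; with 16-byte aligned arrays, _mm_load_ps and _mm_store_ps could be used instead.

```cpp
#include <xmmintrin.h>
#include <cmath>
#include <cstddef>

// Hypothetical helper: dst[i] = sqrt(src[i]) for an array of any length.
void sqrt_any_size(const float* src, float* dst, std::size_t count)
{
    const std::size_t vectorCount = count - (count % 4);  // largest multiple of 4
    std::size_t i = 0;

    // Main part: 4 elements per iteration with SSE.
    for (; i < vectorCount; i += 4)
    {
        __m128 v = _mm_loadu_ps(src + i);        // unaligned load
        _mm_storeu_ps(dst + i, _mm_sqrt_ps(v));  // unaligned store
    }

    // Remainder: at most 3 elements, processed the regular way.
    for (; i < count; ++i)
        dst[i] = std::sqrt(src[i]);
}
```

For an array of size 37, the first loop handles elements 0 to 35 and the second loop handles element 36.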

I have never tried to make an __m128[] array, so I don't know exactly whether it is aligned or not. What is the purpose of such an array? We need __m128 variables to work with SSE registers; the input and output vectors should be kept in float arrays.

I executed your sample apps, and there is a significant performance boost when using SSE instead of just C++. However, the functions I've written with SSE intrinsics have been taking 2-3 times as long to execute as their C++ counterparts. Do you know what might cause this?

Below is a function I wrote to get the minimum and maximum values of an array. This executes in roughly 80-90 microseconds on an array of 640 numbers. The C++ function that does the same thing takes 28-31 microseconds. What gives? The SSE version has to do the memcpy to get the input array aligned correctly, but this only accounts for about 26 microseconds of the difference. I realize that I'm using shorts instead of floats, but it should still work. I converted your SSESample program to use shorts and only calculate the min and max of the input array. The SSE code executed less than twice as fast as the C++ code after that, but it was still faster.

640 is not a significant number of elements for SSE. You need to do this for very long arrays, which are used in image processing, graphics, 3D etc.
My second sample shows how to find the minimum and maximum, and I don't see anything similar in your code. Does it give the right result? Instead of copying the whole array to an aligned array, you need to start from the first aligned member of the input array.
Anyway, you need to use MMX for these short numbers; take a look at my MMX article. On Pentium 4 you can use SSE2.
Sorry that I didn't try to understand your code; SSE programming takes a lot of time. I can try to do this, but the code must be clear, without float-short tricks.

Thanks for your response. I realize that 640 is not a lot of elements, but this function is called many, many, times and it is slowing down my app. I do use code similar to yours to find the min and max, except that I'm using _mm_min/max_pi16 instead of _mm_min/max_ps. It does return the correct result; I've checked it against the C++ version of the function. There aren't min and max functions in MMX, but I was able to get it working by using the greater than function. Unfortunately, it takes more instructions and is a little slower than SSE. I don't know what you mean by "float-short" tricks in my code. There were no floats at all in the code that I posted. You don't have to read my code if you don't want to. The example I posted isn't the only time I've had SSE code run slower than C++. I just thought you or someone else might have some ideas why SSE in general would run slower than normal C++ code.

How do you determine which element of an array is the first aligned input array member?

Tests must be done in the Release configuration. Again, there is no need to use SSE for small arrays. It doesn't matter that you call the function many times. The array must be very long to get a performance boost from SSE. In your case, use C++ code.
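On the question of finding the first aligned member: one possible way, shown here for a float array, is to look at the array address modulo 16. This is a minimal sketch with a hypothetical function name, and it assumes the array has the natural 4-byte alignment of float. The elements before the returned index would be processed with regular C++ code, and the aligned SSE loads would start at that index.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the index of the first element whose address
// is a multiple of 16, or `count` if the array ends before such an element.
// Assumes `p` has the natural 4-byte alignment of float.
std::size_t first_aligned_index(const float* p, std::size_t count)
{
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t misalignment = address % 16;  // bytes past the previous 16-byte boundary
    if (misalignment == 0)
        return 0;                                   // already aligned
    const std::size_t skip = (16 - misalignment) / sizeof(float);
    return skip < count ? skip : count;
}
```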

sorry to bother u again with beginner's questions, but i'm quite stuck.
I have a class using SSE. I'm declaring a member private variable:

__declspec(align(16)) unsigned char m_nodes[ARRAY_SIZE];

later on i try to use it in an asm block,

movaps xmm0, [esi]

with esi pointing to the array base address. This however throws an exception, which is because the array is not aligned (the base address should be a multiple of 16, am i right?).
I can't figure it out. why isn't my array aligned?
another, final, question: do you know, or can u point me to the actual performance difference between movaps and movups

A really interesting and enlightening article. I have a small question: as I understand it, MMX uses the mm0-mm7 registers, which are actually the CPU floating-point registers, whereas SSE/2 uses the xmm0-xmm7 registers, which were defined especially for SIMD purposes. And finally, the question(s) -

Does this mean that I can use both types of registers simultaneously?

Does this mean that I can do without the EMMS instruction when writing pure SSE/2 code?

The EMMS instruction must be used to clear the MMX™ technology state at the end of all MMX™ technology routines and before calling other procedures or subroutines that may execute floating-point instructions. If a floating-point instruction loads one of the registers in the FPU register stack before the FPU tag word has been reset by the EMMS instruction, a floating-point stack overflow can occur that will result in a floating-point exception or incorrect result.

SSE doesn't require this instruction.
I don't have experience in using SSE2.

I'm trying to measure some code (I'm a beginner with SSE), and when compiling the code below in Release (optimized for speed) in VC++ 2003 the optimizer does some weird things (put a breakpoint at the start of main and you will see).
// SSE.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include

// *** BEGIN OF INCLUDE SECTION 1
// *** INCLUDE THE FOLLOWING DEFINE STATEMENTS FOR MSVC++ 5.0