Vectorization with SSE

Using OpenCL, CUDA, or OpenACC to take advantage of the computing power offered by GPGPUs is often the most effective way to accelerate computationally expensive code nowadays. However, not all machines have powerful GPGPUs. On the other hand, all modern x86 processors from Intel and AMD support vector instructions in the form of Streaming SIMD Extensions (SSE through SSE4), and most new processors support Advanced Vector Extensions (AVX). Using these instructions can improve the performance of single precision code by as much as a factor of four with SSE and eight with AVX (AVX-512 will increase this to a factor of 16, but at the time of writing this is only for Intel MIC cards). In some cases, compilers can automatically vectorize pieces of code to take advantage of the CPU's vector units. Furthermore, OpenMP 4.0 provides a simd directive that asks the compiler to vectorize specific loops, but writing code with vector operations in mind will generally yield better results. In this post, I will briefly explain how to use SSE to vectorize C and C++ code using GCC 4.8. Most of the instructions should also apply to LLVM/Clang and Microsoft Visual Studio.
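For comparison, the compiler-driven route can look like the following minimal sketch. The function name saxpy is only illustrative. With GCC 4.9 or later, -fopenmp-simd (or -fopenmp) enables the pragma; without it the pragma is ignored and the loop is left to the ordinary auto-vectorizer (-O3, or -O2 -ftree-vectorize).

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i].
 * "restrict" promises that x and y do not overlap, and the pragma asks
 * the compiler to vectorize the loop. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Whether the loop is actually vectorized depends on the compiler, the flags, and the target CPU, which is one reason to write the vector operations by hand as described in the rest of this post.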

SSE registers are 128 bits (16 bytes) wide. This means that each register can store 4 single precision floating point numbers, 2 double precision floating point numbers, or other types that add up to 16 bytes (the double precision and integer operations were added in SSE2). Using SSE, we can perform 4 single precision or 2 double precision floating point operations simultaneously on one CPU core using one instruction. Note: A superscalar processor, containing multiple floating point units, can also perform multiple floating point operations simultaneously. However, a separate instruction is issued for each operation, and the programmer has little, if any, control over which operations are performed simultaneously. Using SSE, we also have more control over the cache and prefetching. It's even possible to bypass the cache entirely with non-temporal (streaming) stores, which can be useful in cases where cache pollution would cause a performance hit. With SSE, we can also eliminate some branches by computing a mask and blending the results, thereby reducing the chance of a branch misprediction, which would necessitate a pipeline flush. This can potentially improve performance further.
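As a minimal sketch of the cache-bypassing idea, the function below fills an array using streaming stores. The name fill_streaming is hypothetical, and the sketch assumes that dst is 16-byte aligned and that n is a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Fill dst[0..n) with value using non-temporal stores.
 * Assumptions: dst is 16-byte aligned and n is a multiple of 4. */
void fill_streaming(float *dst, size_t n, float value)
{
    __m128 v = _mm_set1_ps(value);      /* value copied into all 4 lanes */
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);      /* store that avoids polluting the cache */
    _mm_sfence();                       /* order the streamed stores before later stores */
}
```

This kind of store tends to pay off only when the data will not be read again soon; if the array is reused right away, ordinary stores are usually the better choice.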

First-generation SSE instructions can be accessed using the header xmmintrin.h. SSE2 and SSE3 instructions can be used by including emmintrin.h and pmmintrin.h, respectively. To access all vector extensions, including SSE4 and AVX, use immintrin.h. You also need to consult your compiler's documentation to learn which flags / switches are required to enable the instructions (with GCC, for example, -msse2, -msse4.2, or -mavx). In the code example below, I demonstrate...

initialization of a vector at compile time

initialization of a vector from an array

storing vector data in an array

vector addition

vector multiplication

vector division

multiplication by a scalar

using element-wise comparison to create a mask

bitwise AND, OR and ANDNOT (note that ANDNOT computes (NOT a) AND b, which is not the same as NAND)

how to replace a conditional statement using the mask

the vector square root

the approximate inverse square root

the approximate reciprocal instruction

how to use the shuffle instruction and the SHUFFLE macro.
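Several of the operations listed above can be sketched as follows. This is an illustrative sketch rather than the original code sample, and the function names (arithmetic4, safe_sqrt4, approx4) are hypothetical. It uses only intrinsics from xmmintrin.h.

```c
#include <xmmintrin.h>  /* first-generation SSE intrinsics */

/* Element-wise arithmetic on 4 floats: out = ((a + b) * a / b) * s */
void arithmetic4(const float *a, const float *b, float s, float *out)
{
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats (no alignment required) */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vs = _mm_set1_ps(s);      /* copy the scalar into all 4 lanes */
    __m128 r  = _mm_add_ps(va, vb);  /* vector addition */
    r = _mm_mul_ps(r, va);           /* vector multiplication */
    r = _mm_div_ps(r, vb);           /* vector division */
    r = _mm_mul_ps(r, vs);           /* multiplication by a scalar */
    _mm_storeu_ps(out, r);           /* store the 4 results in an array */
}

/* Branch-free version of: out[i] = (x[i] >= 0) ? sqrt(x[i]) : fallback */
void safe_sqrt4(const float *x, float fallback, float *out)
{
    __m128 vx   = _mm_loadu_ps(x);
    __m128 vf   = _mm_set1_ps(fallback);
    /* Each mask lane is all ones where x >= 0 and all zeros elsewhere. */
    __m128 mask = _mm_cmpge_ps(vx, _mm_setzero_ps());
    __m128 root = _mm_sqrt_ps(vx);   /* NaN in negative lanes, masked out below */
    /* (mask AND root) OR ((NOT mask) AND fallback) */
    __m128 r = _mm_or_ps(_mm_and_ps(mask, root), _mm_andnot_ps(mask, vf));
    _mm_storeu_ps(out, r);
}

/* Fast approximations with roughly 12 bits of precision. */
void approx4(const float *x, float *inv_sqrt, float *recip)
{
    __m128 vx = _mm_loadu_ps(x);
    _mm_storeu_ps(inv_sqrt, _mm_rsqrt_ps(vx));  /* approximately 1/sqrt(x) */
    _mm_storeu_ps(recip,    _mm_rcp_ps(vx));    /* approximately 1/x */
}
```

If the arrays are known to be 16-byte aligned, _mm_load_ps and _mm_store_ps can replace the unaligned variants used here.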

The code sample compiles on g++ 4.8. Some minor changes may be needed for other compilers. Some of these changes are mentioned in the comments. Click here for a version of the example that compiles in g++ 4.7 and clang++ 3.0 (and probably other compilers).

To understand the shuffle command in part L of the example, refer to this documentation for a more general usage of the shuffle operation.
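As a rough illustration of how the selector works, here is a minimal sketch with hypothetical function names. In _mm_shuffle_ps(a, b, _MM_SHUFFLE(z, y, x, w)), the two low lanes of the result are a[w] and a[x], and the two high lanes are b[y] and b[z]. The selector must be a compile-time constant.

```c
#include <xmmintrin.h>

/* out = {in[3], in[2], in[1], in[0]} */
void reverse4(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* out = {a[0], a[2], b[1], b[3]}: low lanes from a, high lanes from b */
void mix4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_shuffle_ps(va, vb, _MM_SHUFFLE(3, 1, 2, 0)));
}

/* Sum of the 4 lanes, built from two shuffle-and-add steps. */
float hsum4(const float *in)
{
    __m128 v = _mm_loadu_ps(in);
    /* swap neighbours: {v1, v0, v3, v2} */
    __m128 t = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    /* swap halves: every lane now holds v0 + v1 + v2 + v3 */
    t = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(t);         /* extract lane 0 */
}
```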

Studying the example above should give you a basic sense of how to use some of the SSE instructions. Notice the definition of the v4sf data type, which is a vector of size 16 bytes. Also note that each operation performed above on vectors has the form _mm_operation_ps(), where the ps suffix stands for "packed single precision", but there are other types of operations (for example, _pd for packed double precision and _ss for a single scalar value). If you read through the xmmintrin.h header, you will see that there are many more operations available. Reading through the header and doing a few web searches for specific intrinsic functions is an easy way to learn about the details of SSE.
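The sketch below assumes a v4sf definition along the lines of the GCC vector extension (also accepted by Clang, but not by Microsoft Visual Studio) and shows two of the other operation families from SSE2. The function names are hypothetical, and the integer example assumes a 32-bit int.

```c
#include <emmintrin.h>  /* SSE2; also pulls in xmmintrin.h */

/* GCC/Clang vector extension: 4 floats packed into 16 bytes. */
typedef float v4sf __attribute__((vector_size(16)));

/* out = a * b + c, written with ordinary operators on the vector type. */
void madd4(const float *a, const float *b, const float *c, float *out)
{
    v4sf va = (v4sf)_mm_loadu_ps(a);
    v4sf vb = (v4sf)_mm_loadu_ps(b);
    v4sf vc = (v4sf)_mm_loadu_ps(c);
    v4sf r  = va * vb + vc;          /* element-wise, 4 lanes at once */
    _mm_storeu_ps(out, (__m128)r);
}

/* _pd: packed double precision, 2 lanes per register. */
void add2_double(const double *a, const double *b, double *out)
{
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

/* _epi32: packed 32-bit integers, 4 lanes per register. */
void add4_int(const int *a, const int *b, int *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```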

For an idea of how SSE can be used to solve a real problem, consider the following partial example. Suppose we decide to step some particles forward in space using

Ignore the fact that this isn't the best expression to use, in general. We could create particle packets containing four particles, like this:
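A minimal sketch of such a packet, assuming a simple forward Euler update x = x + v * dt purely for illustration. The ParticlePacket type and the step_packets function are hypothetical names, not code from the original post.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical packet: 4 particles stored component-wise, so that one
 * register holds the same coordinate of all 4 particles. */
typedef struct {
    __m128 x, y, z;      /* positions */
    __m128 vx, vy, vz;   /* velocities */
} ParticlePacket;

/* Assumed update rule: position += velocity * dt, 4 particles at a time. */
void step_packets(ParticlePacket *p, size_t n_packets, float dt)
{
    const __m128 vdt = _mm_set1_ps(dt);
    for (size_t i = 0; i < n_packets; ++i) {
        p[i].x = _mm_add_ps(p[i].x, _mm_mul_ps(p[i].vx, vdt));
        p[i].y = _mm_add_ps(p[i].y, _mm_mul_ps(p[i].vy, vdt));
        p[i].z = _mm_add_ps(p[i].z, _mm_mul_ps(p[i].vz, vdt));
    }
}
```

Because the struct contains __m128 members, an array of packets has to be 16-byte aligned; when allocating on the heap, something like _mm_malloc(n * sizeof(ParticlePacket), 16) is a safe choice.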

Using OpenMP to spread the work over several threads would further improve performance.

Note: SSE2 and SSE3 merely add more instructions on top of SSE. On the other hand, the newer AVX instructions use 256 bit registers, so they are able to operate on 8 single precision floats or 4 double precision floats simultaneously. AVX2 extends the integer instructions to 256 bits as well, while AVX-512 will have 512 bit registers! AVX is quite similar to SSE in terms of usage. For more information, see avxintrin.h.
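To give a feel for the similarity, here is a minimal sketch (the function name add8 is hypothetical) that adds 8 floats with a single AVX instruction when the compiler has AVX enabled, for example with -mavx in GCC, and otherwise falls back to two SSE additions. It relies on the __AVX__ macro, which GCC and Clang define when AVX code generation is turned on.

```c
#include <immintrin.h>

/* out[i] = a[i] + b[i] for i = 0..7 */
void add8(const float *a, const float *b, float *out)
{
#ifdef __AVX__
    /* One 256-bit operation covers all 8 floats. */
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
#else
    /* Two 128-bit operations, 4 floats each. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
}
```

The intrinsic names differ only in the _mm256 prefix and the __m256 type; the load, operate, store pattern is the same.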

This entry was posted on Friday, October 18th, 2013 at 12:55 am and is filed under regular update.