SSE2 Vectorization of Alphablend Code

Introduction

In this article, we will explore vectorizing the pixel alphablending code in my earlier article. I believe most C/C++ programmers have tinkered with SIMD instructions like MMX and Streaming SIMD Extensions 2 (SSE) or even Advanced Vector Extensions (AVX). Most programmers, I believe, gave up after their first SIMD attempt, after that effort produced an mediocre or worse performance than the original non-vectorizing code. The main reason for slow performance is CPU cycles are typically wasted in packing the primitive data into the vector prior to SIMD execution and unpacking them into their proper place in the structure after execution. We will dive deeper into this in the next section. The main reason that SIMD is slow to be adopted by programmers around the world, is because SIMD code is harder to write and the resultant code, even with using intrinsics, are not readable and even less understandable.

In this article, we will only focus on SSE2; the reason being MMX is a 64bit SIMD technology, which does not offer significant speed up compared to SSE2 (mostly 128 bit SIMD from Intel) on today's 64-bit processors, and Advanced Vector Extensions (AVX) (256 bit SIMD from Intel) was not chosen for this article because not many mainstream users own these latest Intel processors that can exploit AVX: SSE2 has been around since 2001. (Okay, I admit, the real reason is actually because I do not own AVX supported processor. I promise to update the article when I have access to a PC with an AVX-enabled processor.) All, but the oldest, personal computers should have housed an x86/x64 processor, which support SSE2. Rather than using inline assembly code, we will use SSE2 intrinsics provided by Visual Studio to assist us in calling the SSE2 instructions: Currently, Microsoft 64 bit C/C++ compiler does not permit inline assembly code.

Structure of Arrays

Let us imagine we have a structure type named SomeStruct as shown below and we have an array of SomeStruct, essentially, an array of structures.

struct SomeStruct
{
int aInteger;
double aDouble;
short aShort;
};

To use SSE2 arithmetic instructions on aInteger, we must first pack 4 aInteger into a 128 bit vector of integer type, __m128i.

After we have the computation result, we must unpack 4 aInteger into its individual structure objects in the array.

This is usually what kills the performance! To enhance performance, programmers are advised to pack their data into structure of arrays rather than array of structures; the same advice is applicable to GPGPU programming so as to allow memory coalescing. It is often not possible to alter the existing data domain/model to unnaturally suit the SIMD data packing requirement. In the structure shown below, arrInteger, arrDouble and arrShort are usually pointers to dynamic allocated arrays whose size is only known during runtime but I show them here to be of array type so as to emphasize they are of array type rather than pointer type.

As the reader may have noticed, integer, double and short data type have different sizes. 4 integers would fit into __m128i and 2 doubles would fit into __m128i and 8 shorts would fit into __m128i. This might present difficulties, when programmers are writing code to loop through the __m128i arrays to do computation. For our case, this is not a problem, because we are only using unsigned short type in this alphablend article.

Explanation of the Code

For our SSE2 alphablend code we will, as I have just said, use unsigned short type to store each color channel. The reader may ask why use unsigned short to store a value, which only ranges from 0 to 255? The answer is due to the multiplication used, the intermediate values would most likely exceed 255 while the final result still stays within 255. In normal coding, we can use byte variables to do the same computation, without using 16-bit variable. The reason is because the byte value is automatically promoted to integer (word size of the platform) before any computation takes place and is also automatically converted back to byte after computation. We have declared __m128i array to store unsigned short for each color channel of 2 alphablending images. m_fullsize is to store the full size of the array in bytes. m_arraysize is the array size in terms of a 128-bit sized element. m_remainder is to hold the number to indicate which of the unsigned short are valid in the last 128-bit element if the number of unsigned short required is not cleanly divisible by 8 (8 x 16 bits=128-bits). For example, 4 integers fits into the __m128i, if we have 6 integers to computer, array of 2 __m128i element is needed but on the last __m128i element, only 2 integers is used, thus m_remainder would reflect 2.

With Get128BitElement method, we can get __m128i element from the __m128i array, using a bigIndex, which counts in terms of __m128i elements. bigSize is the size of __m128i array in 128-bit chunks. smallRemainder is the number of valid unsigned short element in the last __m128i element. When we detect that it is the last element and the number of short integer is no clearly divisible by 8, we need to set the unused shorts in the last __m128i element to 1s, so as to prevent division-by-zero error. It is a good practice to do this for every last element returned, regardless of if it is integer type or float type.

We must take care to allocate our __m128i array with _aligned_malloc function so that they will align with 16 bytes boundary as required by 128-bit SSE2 instructions. We calculate and store the remainder if the number of unsigned short required is not cleanly divisible by 8 (as explained before)(Note: there are 8 unsigned short (16 bits data type) in 128-bit).

Lastly, we must remember to deallocate the __m128i arrays with _aligned_free in destructor of the parent class. This is a good practice to remember to write deallocation code before we begin our main coding.

Now, we have come to the meat of our article: using SSE2 to do alphablending. The formula used is ((SrcColor * Alpha) + (DestColor * InverseAlpha)) >> 8. The code listing below is heavily commented. We use _mm_set1_epi16 instruction to set the same short value (0) in every short element in the __m128i element. _mm_mullo_epi16 instruction is used to multiply the unsigned short and only retain the 16 bits if the final result is overflowed into 32 bits. We used _mm_add_epi16 instruction to adding of unsigned shorts. _mm_srli_epi16 instruction is to do shift right by 8 times to simulate division by 256. We can get away without using division instruction, which is a blessing because SSE2 does not support division for integers. After SSE2 operations, we need to unpack the data back to ARGB format. This unpacking offsets some of the performance gains.

Benchmark Results

This is timing of doing alphablending 1000 times on a Intel i7 870 at 2.93 Ghz. The performance characteristic is different from what is posted in the earlier article because the benchmarks are done on different PCs. The earlier PC used, has 'retired' from useful service in this world.

Below is the number of times of speedup we got when compared to the unoptimized version. But comparing the SSE2 speedup to the 2nd runner up (Optimized Code with Shift), the speedup is only 2.2 times, not the 8 times that we expected. So what is the memory consumption that is needed for this SSE2 speedup? 2 bytes for each color channel instead of 1 byte. The original implementation needs 4 bytes to store a color pixel, ARGB. (4 bytes is used because the processor can load and store 32 bit data faster than 24 bit data, which may required shifting it to proper 32 bit boundary before loading inside the 32 bit register). The SSE2 version needs 6 bytes to store a color pixel, RGB (because 2 bytes are used to store each color channel). Individual alpha channel is not stored because they are not used. SSE2 version would need 50% more memory to perform its magic.

Conclusion

We have used SSE2 to speed up from the previous fastest performance by 2 times. The speedup is not dramatic. It is probably not worth the amount of extended effort and unreadable code to do this. To achieve better readability, we could use the vector class, written by Agner Fog of Copenhagen University College of Engineering, which basically extends the ones listed in dvec.h header. These vector class support overloaded operators for -, +, *, / and so on. And they also support the missing operations like multiplication and division for some integer types in SSE2 though those operations are sometimes implemented using the scalar version of instruction (not SSE2, so they have no speed advantage). This eliminates the programmer's effort to implement these unsupported/missing operations himself/herself. Any suggestion to improve the performance further is welcome. We could ramp up performance by using OpenMP or Parallel Patterns Library (PPL) in tandem with SSE2. But that is an article for another day. Any reader interested in a future article about utilizing OpenCL to do alphablending, please drop me a message below. OpenCL is a heterogeneous technology; OpenCL kernel can be run on either CPUs or GPUs.

Related Articles

Downloads

Comments

Interesting

Posted by circular
on 09/15/2012 06:07am

Thanks for this article. So the key to have some optimization is to pack values of one channel in the same array. Something that should be mentionned is that aligning data and directly loading values with movaps would be faster.

Top White Papers and Webcasts

Live Event Date: March 19, 2015 @ 1:00 p.m. ET / 10:00 a.m. PT
The 2015 Enterprise Mobile Application Survey asked 250 mobility professionals what their biggest mobile challenges are, how many employees they are equipping with mobile apps, and their methods for driving value with mobility.
Join Dan Woods, Editor and CTO of CITO Research, and Alan Murray, SVP of Products at Apperian, as they break down the results of this survey and discuss how enterprises are using mobile application management and private …

This report outlines the future look of Forrester's solution for security and risk (S&R) executives working on building an identity and access management strategy for the extended enterprise. We designed this report to help you understand and navigate the major business and IT trends affecting identity and access management (IAM) during the next five years. IAM in 2012 has become a tool not just for security but also for business agility. Competitive challenges push businesses into the cloud and encourage …