Visual Studio 2010 Built-in CPU Acceleration


While writing the sample code for this post, I was amazed to see how simple it was to achieve a more than 20x performance improvement with so little effort.

The motivation is a very heavy video-processing algorithm created for HD TV. HD means high resolution, which means many pixels to compute, and it also means that every object in a frame contains several times more pixels than in a regular video image. After getting a working C# implementation (you can try it here: http://www.asyncop.com/clearimage/), the next step was a C++ DirectShow codec implementation. The C++ implementation worked very well at a 640 x 480 video resolution and 30 frames per second, i.e. 640 x 480 x 30 = 9,216,000 pixels to process every second. On top of that, the purpose of the algorithm is to produce a 'Photoshop'-like effect by detecting important facial features and smoothing out unimportant imperfections, which requires massive amounts of processing power. Moving to HD means at least 1080 x 720 x 30 = 23,328,000 pixels to process, but it also means that every object in the image is made of approximately 4 times more pixels and is surrounded by more pixels. Processing power requirements increase dramatically.

Here is an example of the output as generated by the less sophisticated C# online implementation:

Source:

Product:

I will now survey the alternatives we considered, with pros and cons for each.

When I became an Intel Black Belt this year I received an impressive glass certification plate, licenses for Intel software products, and a brand-new laptop: a Dell Latitude 6410 with a Core i7, running the 64-bit version of Windows 7. My old laptop had a dual-core CPU with a slow hard drive, so I didn't even try to install Visual Studio 2010 on it. The new laptop is a completely different story: it runs Visual Studio 2010 with no problems. In fact, an image of my old laptop is running as a Virtual PC inside my new laptop, and it executes much faster than the original device did (easily done; email me and I'll tell you how).

The C++ DirectShow implementation of the algorithm was first tested on this new Core i7 machine. This finally allowed live encoding of 640 x 480 video input, but was far from enough for live encoding of HD video. Here are the possible alternatives we considered:

Logical Optimizations of the Code: This means selecting the best effective internal resolution, color sensitivity, brightness compensation, linear adjustments, etc. Reasoning: This is a way to eliminate pointless calculations and the processing of data that is not really relevant to the viewer. Pros: This allows maximum performance for a given hardware/infrastructure. Cons: It requires great effort to produce; the high risk of bugs means a long QA cycle, and overlooking a scenario is very costly, which means a long design process. Product Quality: Very good. Estimated Cost: 6 months.

Parallel Software Design: The algorithm is written sequentially. A good parallel design will allow the best utilization of the CPU cores, with minimal dependencies and inter-locking. Reasoning: Every CPU has multiple cores; making the most of them can multiply performance several times over. Pros: CPUs are expected to keep growing in core count, which will allow performance improvements simply by buying new hardware. A good parallel design will also prepare the software for use on multiple machines, for example as part of an HPC array. Cons: This process is sensitive to, and must take into account, the usage of internal data structures. Every change in the internal usage of data members and cache will force a redesign, so this stage must come last, after the mathematics and all other features are fully closed. It is also possible that this will not be enough, since this solution is limited by the number of CPU cores. Product Quality: Good, but not as good as with logical optimizations. Estimated Cost: 2 weeks for every fix to the algorithm.

Distribution: It is possible to take the implementation of the algorithm as-is and distribute it with minimal parallel optimizations. Reasoning: Use enough computers to get enough processing power, even if the algorithm is not optimized at all. Higher resolution required? Just add more machines. Pros: Very easy to increase processing power. Cons: Requires the software to support distribution, and there may be a barrier due to the massive amounts of data exchanged between machines; latency is not guaranteed. Product Quality: Poor, with great increases in storage, power consumption, heat, system cost, and LAN infrastructure. Estimated Cost: 2 months of R&D and a dramatic increase in product costs.

At this point we went back to the drawing board. We tried to think of a way to enter the market with what we had, or with minimal R&D effort, until we had a more concrete opportunity. We moved our efforts to the business arena, waiting for a good reason to invest in R&D. As you can see above, this is not a small effort. But then...

Then I came across the Visual Studio 2010 CPU Acceleration support. The following steps are amazingly simple and fast to implement. Let's start by adding this option to the list:

Visual Studio 2010 Built-in CPU Acceleration: This option means getting the best performance possible without going into the implementation details. Reasoning: You can take my implementation of the algorithm and, with minimal code modifications and project settings, squeeze out as much performance as possible. Pros: Fast, simple to implement, and out-of-the-box; there is very little chance of bugs, and any bugs are relatively simple to detect. Cons: This is not a real optimization of the code. The code will still perform redundant operations, and the parallelization is far from optimal. Product Quality: As good as the latest implementation we currently have. Estimated Cost: Faster than writing this blog post (less than a day).

To simplify the process I will now create a new project and write down the steps I am taking, one by one. I will not use the real algorithm code. Instead I will take a simple implementation of an operation. A sample video frame of 1080 x 720 pixels is used as input. For every pixel we calculate the square root, multiply the result by itself, and calculate the square root again: P = sqrt( P ); P = P * P; P = sqrt( P );.
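Sketched as plain C++ (the function name and layout are mine; the original listing was shown as an image), the per-pixel operation looks like this:

```cpp
#include <cmath>

// Scalar baseline of the benchmark: for every pixel take the square
// root, square the result, then take the square root again.
void process_frame(float* pixels, int count) {
    for (int i = 0; i < count; ++i) {
        float p = pixels[i];
        p = std::sqrt(p);
        p = p * p;
        pixels[i] = std::sqrt(p);
    }
}
```

For non-negative input the net result is simply sqrt(P); the point is not the math but giving the CPU a measurable amount of repetitive floating-point work.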

The compiler generates a call to a CRT function that performs the sqrt calculation; in fact there are several internal calls before the code returns. Total processing time is 57.361 sec (57 seconds and 361 milliseconds). Here is the assembly generated. The assembly code is hard to follow because this is compiled as Release, not Debug, so compiler optimizations are in effect.

First Optimization: OpenMP

With OpenMP enabled, all four CPU cores are continuously busy and we get a time of 32.276 sec. If we had optimized the code ourselves (as discussed in the parallel-design section above) we should be able to get closer to 1/3 or even 1/4 of the original processing time.
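The OpenMP step itself (shown as a screenshot in the original) boils down to a single pragma over the pixel loop, enabled with the /openmp compiler switch (Project Properties / Configuration Properties / C/C++ / Language). A sketch over the per-pixel operation:

```cpp
#include <cmath>

// The same per-pixel work, parallelized with one OpenMP pragma.
// Without /openmp (or -fopenmp on GCC) the pragma is simply ignored
// and the loop runs sequentially.
void process_frame_omp(float* pixels, int count) {
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        float p = pixels[i];
        p = std::sqrt(p);
        p = p * p;
        pixels[i] = std::sqrt(p);
    }
}
```

Each iteration touches only its own pixel, so there are no data dependencies between iterations and the pragma is safe to apply as-is.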

OpenMP is simple, and most of you have probably heard of it before. Now here is something even simpler that may surprise you.

Second Optimization: SSE2

Instead of adding SSE support to the OpenMP version, I will revert to the original code (before OpenMP) and add SSE support. SSE stands for Streaming SIMD Extensions, where SIMD is Single Instruction, Multiple Data. There are several generations of these extensions, the latest at the time of writing being AVX. The idea is that most CPU (assembly) operations operate on a single data variable: a float, byte, word, double word, etc. SIMD operations started with MMX and allow, for example, incrementing eight variables by one with a single CPU instruction. This works by using special CPU registers that can store more than a single variable. Here is the procedure: in Project Properties / Configuration Properties / C/C++ / Code Generation, set "Enable Enhanced Instruction Set" to "Streaming SIMD Extensions 2 (/arch:SSE2)".

This activates the SIMD feature in the compiler. Rebuilding the application produces new code: the CRT function calls are replaced with single CPU instructions. Note that we are back to the original (pre-OpenMP) source code; only the compiler output has changed. Now the assembly code looks like this:

The part of the code that remained the same belongs to the for loop. We made zero modifications to the source. Now let's see how the new code behaves. The original version took 57.361 sec. Running the same application with SSE2 turned on resulted in the amazing time of 15.320 sec!

Microsoft Visual Studio 2010 can also automatically generate AVX code. For this you need to cancel SSE2 support by setting "Enable Enhanced Instruction Set" back to "Not Set", then go to Project Properties / Configuration Properties / C/C++ / Command Line and manually add "/arch:AVX" to "Additional Options".

But wait... there is even more! The compiler does not use all of these instructions automatically. What we just did was ask the compiler to use CPU acceleration for the square-root computation, one value at a time. But the CPU can also perform this operation on several floating-point variables at the same time, using the same CPU core.

SSE Packed Operations

We add support for CPU intrinsic functions to our code with the header #include <emmintrin.h>.

Now we modify the code so that the buffer of pixels, which is an array of floating-point values, is used as an array of packed floating-point values. Packed means that several different and unrelated values are packed into a single variable. Here is the modified code:

The type __m128 is recognized by the compiler as a packed type of four float values. The CPU now operates on four pixels per instruction instead of one, so we advance the for-loop variable x by 4 instead of 1. You can learn more about these intrinsic functions by pressing F1 in Visual Studio 2010 or by searching the MSDN Library, which has a full listing of the supported operations.

Executing the code now gives the amazing result of 3.729 sec!

Adding OpenMP support reduces this even further, down to 2.246 sec. Such a reduction means that only two CPU cores are enough: we can tell OpenMP to use only two cores and still get the same result, leaving two cores free for other applications. (We do this by including <omp.h> and calling omp_set_num_threads(2); before our code.)

Conclusion

Instead of a massive code redesign and optimization effort, which requires a good understanding of the logic behind the code and is very bug-prone, we changed a few compiler switches and let the compiler replace C Runtime (CRT) Library calls with CPU instructions, without even needing to understand what the code does, and reduced the execution time in the example above from 57.361 seconds to 2.246 seconds. This can be done very quickly by a programmer who has very little knowledge of the code's implementation.

Fast and simple code changes resulted in a 25x increase in performance, with two CPU cores left free for other work. That is roughly the equivalent of a 50-machine cluster!

Were you counting on parallel programming to make your application perform? Don't underestimate AVX!

5 comments

You need to specify the type of CPU acceleration that you are targeting. If the hardware support is not available, you will get an exception, raised by a CPU interrupt, saying that the application tried to execute an invalid instruction.
Some applications should assume the lowest common hardware (Pentium + SSE2); others can assume newer systems. For example, a client application cannot assume AVX and GPU acceleration, but you may be able to guarantee the minimal requirements.
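To fail gracefully rather than crash, the application can test for the instruction set at runtime before choosing a code path. A sketch (the helper is mine, not from the post; SSE2 support is reported in bit 26 of EDX for CPUID leaf 1):

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>    // __get_cpuid
#endif

// Returns true when the CPU reports SSE2 support (CPUID leaf 1, EDX bit 26).
bool cpu_has_sse2() {
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);
    return (info[3] & (1 << 26)) != 0;
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1u << 26)) != 0;
#else
    return false;   // unknown compiler or non-x86 target: assume no SSE2
#endif
}
```

The SSE2 and scalar code paths then need to live in separately compiled translation units, so the compiler does not leak SSE2 instructions into the fallback path.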
DirectShow is an old technology which is (unfortunately) becoming obsolete, replaced by Media Foundation. If you have a multi-stage algorithm you might be better off copying the input buffer to an intermediate buffer which is aligned correctly, possibly reordering the pixel data for best performance (for example, instead of RGB-RGB- use RRGGBB--).
The alignment note is very valid. You need to verify input buffers from third-party code, and you need to set the project settings to align to 16 bytes (if using 16-byte operations). A #pragma in the C++ code is also a possibility.

It's irresponsible to suggest these optimizations without noting that:
A) Executing /arch:SSE2 or /arch:AVX compiled code without the required CPU instructions being available results in a crash.
B) Using 128-bit vector intrinsics requires memory aligned to 16 byte boundaries. As it is, the only reason this code runs is that the current implementation of operator new[] switches to an alternate allocation strategy for large blocks that happens to give the needed alignment. In a DirectShow filter you would need to specifically request this alignment from the allocator.

While these changes might be quick, additional time will be required when QA reports that the code now crashes on certain CPUs and with certain combinations of DirectShow filters.