
The compiler can perform autovectorization, deciding on its own that some loops should be vectorized while others shouldn't. But should we trust it? In this blog, Jeff Cogswell looks at how to coerce the compiler into vectorizing loops it would otherwise skip.

In December, we were exploring how vectorization works. When we left off, we had looked at how to get our loops to vectorize automatically. The C++ compiler decided, to the best of its knowledge, which loops could be vectorized. This time, let's look at ways to go beyond what the compiler wants to do and force vectorization as much as we can. Then, next time, we'll start looking at the generated assembly code to see exactly what we ended up with.

First, start up Visual Studio, and create a new C++ console project. Once the project is created, switch to the Intel C++ compiler by clicking the Project menu -> Intel Composer 2013 XE SP1 -> Use Intel C++.

Next, replace the existing main .cpp code with this:

#include "stdafx.h"
#include <iostream>

using namespace std;

int main() {

    float *a = new float[128000];
    float *b = new float[128000];

    for (int i = 0; i < 128000; i++) {
        a[i] = i % 128;
        b[i] = a[i] * 2;
    }

    delete[] a;
    delete[] b;
}

Next, we want the compiler to tell us when a loop is vectorized. I described the procedure in this blog. Follow those instructions on how to set the diagnostic level, but this time use the level called "Loops successfully and unsuccessfully vectorized." (Note that this is a different level than we used in the blog describing the steps; there we used "Loops successfully vectorized.")
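If you prefer to set the option by hand rather than through the project properties, this diagnostic level corresponds (to the best of my knowledge for the Composer XE 2013 toolchain; check your version's documentation) to the /Qvec-report2 switch on the Intel compiler's command line:

```shell
# Release-style optimized build with vectorization diagnostics
# reporting both vectorized and non-vectorized loops.
icl /O2 /Qvec-report2 ConsoleApplication1.cpp
```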

Then let's see what we get. Make sure your project is set to Release, not Debug. Go ahead and compile the code. In the compiler output, I see a remark that the loop was not vectorized because vectorization "seems inefficient."

That's fine. Let's see if we can force vectorization to happen anyway. There's a pragma called "simd" that tells the compiler to vectorize the loop regardless. (Remember, SIMD stands for Single Instruction, Multiple Data.)

That's all it took. Now the loop is vectorized. But why didn't the compiler want to vectorize for us? The message said it "seems inefficient" to vectorize. That may well be the case. Vectorization comes with some overhead to set it up. And the question is whether the amount of time saved in using vectorization is worth the time it takes to set it up.

Let's investigate a bit further to try to figure out why the compiler didn't like the loop we gave it. Notice I'm using the modulo operator, represented by the percent sign. Let's replace that with a division. Change your loop to this:

for (int i = 0; i < 128000; i++) {
    a[i] = i / 2;
    b[i] = a[i] * 2;
}
And remove the pragma line. Compile again. This time the loop gets vectorized automatically without us having to manually force it. It's not that it was impossible previously; it's just that the compiler thought it wouldn't be efficient. Yet this time, the compiler seems to think differently.

Now try removing the second line that's inside the loop, so your loop looks like this:

for (int i = 0; i < 128000; i++) {
    a[i] = i / 2;
}

This time there's no message at all. Not only was the loop not vectorized, but the compiler didn't even consider the loop for vectorization. Add the pragma back in before the loop, and here's the message you'll see:

ConsoleApplication1.cpp(16): warning #13379: loop was not vectorized with "simd"

The compiler flat-out refuses to vectorize it. So, what is going on? It all comes down to optimization. The compiler is optimizing the code for you; vectorization is just one optimization among the others taking place. The compiler simply knows that some code won't perform any better vectorized.

Conclusion

But does the compiler really know best? It certainly has detailed knowledge of what it's doing inside and can, in many cases, make these decisions for us. In the next blog, let's explore some further optimizations and move on to the generated assembly code. As we do, we'll see what changes we can make to our code to get the compiler to vectorize it without us having to force vectorization through a pragma.
