File size

121.8 MB

The C++ compiler in Visual Studio 11 includes a new feature called auto-vectorization. It analyses the loops in C++ code and tries to make them run faster by using the vector registers, and vector instructions, inside the processor. This short talk explains what's going on.

Jim, whatever happened to Phoenix? The last time it was spoken of (by yourself) it seemed it was on the verge of being used in the x86 backend for VC but not x64. That was years ago. I used the framework two or three years back and even then it seemed 'dead', although very powerful. LLVM has gone on to show the importance of a like-framework.

Does the array size need to be known at compile time? No. But indexing into that array is limited to forms like [i * K + j], where K must be a compile-time constant. If you are flattening a 2-D array onto one dimension, then K is the row length.
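To make that indexing form concrete, here is a minimal sketch. The function name `scale_rows` and the value of `K` are made up for illustration; they do not come from the talk or the blog. The number of rows is only known at run time, while the row length K is a compile-time constant. Whether a particular compiler actually vectorizes this loop is not guaranteed.

```cpp
#include <cstddef>

// Hypothetical row length. What matters is that it is a compile-time constant.
constexpr std::size_t K = 8;

// Scales every element of a rows-by-K matrix stored in one flat array.
// 'rows' is a runtime value; only K is fixed at compile time.
void scale_rows(float* a, std::size_t rows, float factor) {
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < K; ++j) {
            a[i * K + j] *= factor;  // index has the form [i * K + j]
        }
    }
}
```

A caller would pass a buffer of at least rows * K floats, for example the data() pointer of a std::vector<float> of that size.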

Do you need to use the index syntax (array[i]) or can you use pointers? What about iterators? Pointers work.

Can the compiler vectorize operations on a std::vector<>? Sometimes.
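For readers who want to try the pointer and std::vector cases themselves, here is a small sketch of the same element-wise addition written both ways. The function names `add_ptr` and `add_vec` are illustrative only, and nothing here guarantees that either loop gets vectorized; "sometimes" still applies.

```cpp
#include <cstddef>
#include <vector>

// Pointer form: c[i] = a[i] + b[i] for n elements.
void add_ptr(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// std::vector form of the same loop.
// Assumes b has at least as many elements as a.
std::vector<float> add_vec(const std::vector<float>& a,
                           const std::vector<float>& b) {
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```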

Are there any operations that will prevent the compiler from vectorizing, e.g. branching, trigonometry functions, etc.? Conditionals, yes. Trig functions, no.

If the compiler detects a cross-iteration dependency on one of the many operations in a loop, will it split the work into one vectorized loop and one scalar loop? For certain patterns, yes. But I'd shy away from saying we've conquered this in the first release.
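The splitting asked about here can be sketched by hand. This is an illustration of the idea, not the compiler's actual output, and the function names `fused` and `split` are invented for the example. In the original loop, the first statement depends on the previous iteration and the second does not. After splitting, only the second loop is a candidate for vectorization.

```cpp
#include <cstddef>
#include <vector>

// Original loop. a[i] needs a[i - 1], so that statement carries a
// dependency across iterations; the statement writing c[i] does not.
// All three vectors are assumed to have the same size.
void fused(std::vector<float>& a, const std::vector<float>& b,
           std::vector<float>& c) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] = a[i - 1] + b[i];
        c[i] = b[i] * 2.0f;
    }
}

// The same work split into two loops.
void split(std::vector<float>& a, const std::vector<float>& b,
           std::vector<float>& c) {
    // Dependent statement: has to run in order, one element at a time.
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] = a[i - 1] + b[i];
    }
    // Independent statement: iterations could run several at a time.
    for (std::size_t i = 1; i < c.size(); ++i) {
        c[i] = b[i] * 2.0f;
    }
}
```

Both versions compute the same values; the split only changes which parts of the work are eligible for vector instructions.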

As Charles mentioned, please check out the blog - it answers most of these topics in more detail. (Episode 6 was published earlier today.)

@Charles: Great, that blog was exactly what I was looking for! Now if only an almighty moderator could add a link to this blog as a "see also" in the video description so that other curious minds that have had their interest piqued can dig deeper, that would be wonderful.

I have a question about AVX (YMM registers) support: does the new autovectorizer also support wider 256-bit YMM registers (Sandy Bridge+)?

VC 10 also had support for SSE2 and AVX (via a command line) instructions. From the talk it seemed as if there wasn't any support for SIMD instructions.

Also the floating point addition was a bit misleading, since *AX and *BX registers are integer registers, while ADDPS means "Add Packed Single-Precision FP Values". You should have used an x87 FPU or SSE2 registers (as 32-bit float) as your nonvectorized example.

1) The first example showed the values 1.10 and 1.20 being loaded into RAX and RBX. These are integer registers, and 1.10 and 1.20 are not integers. Oops.

2) The optimization shown only works if there is guaranteed to be no aliasing between 'a' and 'b'. If 'a' and 'b' are pointers then the optimization cannot be done. This seems like an extremely important point and it makes me sad that it is glossed over.

3) Jim goes to great lengths to explain that this optimization will not change the results at all. Given that his example is a moderately complex example using floating-point math (sin and cos) it is extremely unlikely that his guarantee is correct. If nothing else, the change from x87 to SSE *will* change the results, even if the vectorization itself does not.

4) He suggested that the only reason he got a 2.9x speedup instead of a 4x speedup was because of compiler limitations. That suggests that developers should always expect a 4x speedup but that is not practical for all sorts of reasons such as processor detection overhead, memory bandwidth, etc.

1) Yes, the real story would compute these floats in the low 32-bits of the XMM registers. See the How it Works blog post for an explanation that addresses this 'untruth'.

2) Yes, aliasing complicates the story. However, pointers don't always prevent vectorization. In general, the auto-vectorizer will include runtime checks against aliasing. Where possible, via whole-program-optimization, it can sometimes prove the lack of aliasing, and therefore elide the runtime checks. (I mentioned aliasing a couple of times during the blog; next episode covers it in a little more depth)
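As a rough picture of what a runtime check against aliasing amounts to, here is a hand-written sketch. The helper `no_overlap` and the function `add_one` are hypothetical names for this example, and the code the compiler really generates will differ. The idea is to test whether the two ranges overlap and only take the fast path when they do not.

```cpp
#include <cstddef>
#include <functional>

// True if the ranges [a, a + n) and [b, b + n) share no elements.
// std::less_equal gives a well-defined order even for unrelated pointers.
bool no_overlap(const float* a, const float* b, std::size_t n) {
    std::less_equal<const float*> le;
    return le(a + n, b) || le(b + n, a);
}

// dst[i] = src[i] + 1 for n elements, with an explicit aliasing check.
void add_one(float* dst, const float* src, std::size_t n) {
    if (no_overlap(dst, src, n)) {
        // No aliasing: iterations are independent, so this copy of the
        // loop is the one a vectorizer could turn into SIMD code.
        for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] + 1.0f;
    } else {
        // Possible aliasing: keep strict element-by-element order.
        for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] + 1.0f;
    }
}
```

The two loops are identical in source form on purpose. The difference is in what the compiler is allowed to assume about each one.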

3) In VS11, default floating-point calculations will use SSE instructions. Auto-vectorization will produce the same results as the scalar SSE instructions. (Yes, results might differ between 32-bit SSE and 80-bit x87. I should have been explicit - I was comparing scalar SSE versus vector SSE versions of the program.) In other cases, such as reductions, auto-vectorization CAN produce results different from scalar SSE code (due to non-associativity), but we only perform auto-vectorization for such cases under the /fp:fast flag.
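The non-associativity point for reductions can be seen without any vector hardware. The sketch below uses illustrative function names and assumes float arithmetic is evaluated in IEEE single precision, as with scalar SSE. It sums the same array once in source order and once as four interleaved partial sums, which is roughly how a 4-lane vector reduction groups the additions.

```cpp
#include <cstddef>

// Sum in source order, as scalar code would.
float sum_sequential(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Sum as four interleaved partial sums, mimicking four SIMD lanes.
// Assumes n is a multiple of 4.
float sum_four_lanes(const float* x, std::size_t n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i + 3 < n; i += 4) {
        for (std::size_t k = 0; k < 4; ++k) lane[k] += x[i + k];
    }
    // Combine the lanes at the end.
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For the input {1e8f, 1, 1, 1, -1e8f, 1, 1, 1}, the sequential sum loses the first three 1s (they are too small to change 1e8f in single precision) and comes out as 3, while the lane-wise sum cancels 1e8f against -1e8f within one lane and comes out as 6. Neither order is wrong as such; they are just different, which is why this transformation sits behind /fp:fast.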

4) I was speaking about this particular example - future compiler improvements should raise the speedup above 2.9X, heading towards 4X. For other loops, there are, of course, many factors that limit the speedup. (The topic of a future blog post, already drafted but not published).

I'd encourage folks to read the auto-vectorization blog - it includes about 6 posts now, allowing more time to dig into details than the brief 15-minutes available in this video.

No, I'm afraid the auto-vectorizer doesn't use AVX in this first release. (But high up on our TODO list!)

In VS2010, the /arch:SSE and /arch:SSE2 switches, on x86, tell the compiler to use XMM registers for floating-point calculations, rather than x87 registers. Similarly, the /arch:AVX switch tells the compiler to emit AVX instructions, rather than SSE instructions. But in all those cases, it emits just SCALAR instructions. It is only with the advent of the auto-vectorizer that the compiler emits SIMD instructions that make full use of the wide XMM vector registers.

*AX, *BX. Yes, sorry. No excuse! (As I replied to Bruce, above, I was more careful with this example in the blog)
