Auto-Parallelization and Auto-Vectorization

The /Qpar compiler switch enables automatic parallelization of loops in your code. When you specify this flag without changing your existing code, the compiler evaluates the code to find loops that might benefit from parallelization. The compiler is conservative in selecting the loops that it parallelizes, for two reasons. First, it might find loops that don't do much work and therefore won't benefit from parallelization. Second, every unnecessary parallelization can cause a thread pool to be spawned, extra synchronization, or other processing that would slow performance instead of improving it. For example, consider a loop whose upper bound, u, is not known at compile time.
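A loop of this kind might look like the following minimal sketch. The arrays A, B, and C, their size N, and the function name loop_test are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical global arrays; the size is arbitrary.
constexpr int N = 100000;
int A[N], B[N], C[N];

// The trip count u is only known at run time, so the compiler can't
// tell whether the loop does enough work to be worth parallelizing.
void loop_test(int u) {
    assert(u >= 0 && u <= N);  // stay within the arrays
    for (int i = 0; i < u; ++i)
        A[i] = B[i] * C[i];
}
```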

Because u could be a small value, the compiler won't automatically parallelize this loop. However, you might still want it parallelized because you know that u will always be large. To enable the auto-parallelization, specify #pragma loop(hint_parallel(n)) immediately before the loop, where n is the number of threads to parallelize across. For example, #pragma loop(hint_parallel(8)) asks the compiler to attempt to parallelize the loop across 8 threads.
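The sketch below shows where the hint goes. The arrays, their size, and the function names are assumptions made for illustration. The second function, loop_test_dynamic, applies the same hint to a loop whose bound is re-read from a hypothetical upper_bound() function on every iteration, a case where the hint may not be enough. The pragma is specific to the Microsoft compiler and is meant to be used together with /Qpar; other compilers typically ignore it, possibly with a warning.

```cpp
#include <cassert>

// Hypothetical global arrays; the size is arbitrary.
constexpr int N = 100000;
int A[N], B[N], C[N];

void loop_test(int u) {
    assert(u >= 0 && u <= N);  // stay within the arrays
    // Ask the compiler to parallelize the next loop across 8 threads.
#pragma loop(hint_parallel(8))
    for (int i = 0; i < u; ++i)
        A[i] = B[i] * C[i];
}

// Hypothetical stand-in for a function whose result the compiler
// can't predict; it is called again on every iteration below.
int limit = N;
int upper_bound() { return limit; }

void loop_test_dynamic() {
    // Same hint, but the loop bound can change from call to call.
#pragma loop(hint_parallel(8))
    for (int i = 0; i < upper_bound(); ++i)
        A[i] = B[i] * C[i];
}
```

The hint is a request rather than a command, so the compiler may still decline to parallelize either loop.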

Some loops can't be parallelized even when you supply the hint. For example, if the loop condition calls a function such as upper_bound() on every iteration, the value that the function returns might change every time it's called. Because the upper bound cannot be known, the compiler can emit a diagnostic message that explains why it can't parallelize the loop. To see which loops were parallelized and which were not, specify one of the /Qpar-report options when you compile at the command prompt.
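As a rough illustration, the following sketch pairs a hinted loop with an unhinted loop that does little work per iteration, and lists command lines that request each reporting level in a trailing comment. The array, the function name, and the file name mylooptest.cpp are assumptions; the messages that the compiler actually prints depend on the compiler version and are not reproduced here.

```cpp
// Hypothetical global array; the size is arbitrary.
constexpr int N = 2000;
int A[N];

void test() {
    // Hinted loop. A thread count of 0 requests the maximum
    // number of threads available at run time.
#pragma loop(hint_parallel(0))
    for (int i = 0; i < 1000; ++i)
        A[i] = A[i] + 1;

    // Unhinted loop; the compiler decides on its own whether
    // parallelizing it is worthwhile.
    for (int i = 1000; i < 2000; ++i)
        A[i] = A[i] + 1;
}

// Possible command lines (the file name is an assumption):
//   cl /c /O2 /Qpar /Qpar-report:1 mylooptest.cpp
//   cl /c /O2 /Qpar /Qpar-report:2 mylooptest.cpp
```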

The two /Qpar-report (Auto-Parallelizer Reporting Level) options differ in what they output. /Qpar-report:1 outputs parallelizer messages only for loops that are successfully parallelized. /Qpar-report:2 outputs parallelizer messages for both successful and unsuccessful loop parallelizations.

The Auto-Vectorizer analyzes loops in your code and, where it can, uses the vector registers and instructions on the target computer to execute them. This can improve the performance of your code. The compiler targets the SSE2, AVX, and AVX2 instructions in Intel or AMD processors, or the NEON instructions on ARM processors, according to the /arch switch.
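A loop with independent iterations over contiguous arrays is a typical candidate for vectorization. The sketch below is only an illustration; the arrays X, Y, and Z, their size, and the function name are assumptions, and whether a given compiler actually vectorizes the loop depends on its version and options.

```cpp
// Hypothetical global arrays; the size is arbitrary.
constexpr int N = 1024;
float X[N], Y[N], Z[N];

// Each iteration is independent of the others, so several elements
// can be processed by one vector instruction (for example, four
// floats fit in a 128-bit SSE2 register).
void add_arrays() {
    for (int i = 0; i < N; ++i)
        Z[i] = X[i] + Y[i];
}
```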

The Auto-Vectorizer may generate different instructions than specified by the /arch switch. These instructions are guarded by a runtime check to make sure that the code still runs correctly. For example, when you compile with /arch:SSE2, SSE4.2 instructions may be emitted. A runtime check verifies that SSE4.2 is available on the target processor and jumps to a non-SSE4.2 version of the loop if the processor does not support those instructions.

By default, the Auto-Vectorizer is enabled. If you want to compare the performance of your code with and without vectorization, you can use #pragma loop(no_vector) to disable vectorization of any given loop.
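One way to set up such a comparison is to keep two copies of the same loop, one of them marked with the pragma. This is a minimal sketch; the arrays and function names are assumptions, and the pragma is specific to the Microsoft compiler (other compilers typically ignore it).

```cpp
// Hypothetical global arrays; the size is arbitrary.
constexpr int N = 1024;
float X[N], Y[N];
float Zvec[N], Zscalar[N];

// Left to the Auto-Vectorizer's default behavior.
void add_default() {
    for (int i = 0; i < N; ++i)
        Zvec[i] = X[i] + Y[i];
}

// Same work, with vectorization disabled for this loop only.
void add_no_vector() {
#pragma loop(no_vector)
    for (int i = 0; i < N; ++i)
        Zscalar[i] = X[i] + Y[i];
}
```

Both functions compute the same result, so timing them against each other isolates the effect of vectorization on this loop.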

As with all pragma directives, the alternate pragma syntax __pragma(loop(no_vector)) is also supported.
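Unlike #pragma, the __pragma keyword can appear inside a macro definition, which is one reason to prefer it. The sketch below wraps the directive in a hypothetical NO_VECTOR macro and guards it with _MSC_VER, because __pragma is a Microsoft-specific keyword; the macro name, array, and function are assumptions.

```cpp
// Expand to the directive on the Microsoft compiler and to
// nothing elsewhere, so the code still builds with other compilers.
#ifdef _MSC_VER
#define NO_VECTOR __pragma(loop(no_vector))
#else
#define NO_VECTOR
#endif

// Hypothetical global array; the size is arbitrary.
constexpr int N = 256;
int values[N];

void increment_all() {
    NO_VECTOR
    for (int i = 0; i < N; ++i)
        values[i] += 1;
}
```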

As with the Auto-Parallelizer, you can specify the /Qvec-report (Auto-Vectorizer Reporting Level) command-line option to report either successfully vectorized loops only (/Qvec-report:1) or both successfully and unsuccessfully vectorized loops (/Qvec-report:2).
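To try the two reporting levels, a test file can contain one loop that is a likely vectorization candidate and one that is not. In the sketch below, the second loop reads the result of the previous iteration, a loop-carried dependence that generally prevents vectorization. The arrays, the function name, and the file name myvectest.cpp are assumptions, and the compiler's actual messages are not reproduced here.

```cpp
// Hypothetical global arrays; the size is arbitrary.
constexpr int N = 1024;
int P[N], Q[N];

void test_vec() {
    // Independent iterations: a likely vectorization candidate.
    for (int i = 0; i < N; ++i)
        P[i] = P[i] + 1;

    // Each iteration depends on the one before it.
    for (int i = 1; i < N; ++i)
        Q[i] = Q[i - 1] + 1;
}

// Possible command lines (the file name is an assumption):
//   cl /c /O2 /Qvec-report:1 myvectest.cpp
//   cl /c /O2 /Qvec-report:2 myvectest.cpp
```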