Introducing a new, advanced Visual C++ code optimizer

We are excited to announce the preview release of a new, advanced code optimizer for the Visual C++ compiler backend. It provides many improvements for both code size and performance, bringing the optimizer to a new standard of quality expected from a modern native compiler.

This is the first public release and we are encouraging people to try it and provide suggestions and feedback about potential bugs. The official release of the new optimizer is expected to be Visual Studio Update 3, while the release available today is unsupported and mostly for testing purposes.

How to try it out

The compiler bits with the new optimizer are very easy to get: just install the latest VisualCppTools package using NuGet. Details about how to do this are available in this blog post. Once installed, compile your applications the usual way – the optimizer is enabled by default on all architectures.

Reporting bugs and suggestions

We are hoping to get as much feedback as possible about bugs you have found or suggestions you may have. If you believe you found a bug, you can confirm it’s caused by the new optimizer by using the following undocumented flag to disable it: -d2SSAOptimizer-

Attach details such as the compiler version, compile flags, and the source code that reproduces the bug in the form of pre-processed files or a linkrepro. Bruce Dawson’s blog has a great post about producing high-quality bug reports.

Why a new optimizer?

The main motivation for a new optimizer framework was the desire to have more aggressive optimizations, such as ones that take advantage of more compile-time information and modern compiler developments. The design of some of the older optimization passes made it difficult to implement more advanced transformations and to make improvements at a faster pace. As the new framework was intended to be the basis of many future optimization efforts, a core design objective was to make it easier to implement, test and measure new optimizations.

Some of the main goals of the project:

Improving the code quality for both scalar and vector code

There are many cases where both performance and code size can be improved, sometimes quite substantially. The framework attempts to solve several deficiencies of the old optimizer:

The old expression optimizer has a small set of known transformations and a limited view of the function – this prevents discovering all the expressions that could be optimized.

Many small optimizations based on identifying patterns – known as peephole optimizations – are either missing or implemented only for certain target architectures.

Vector code – either from intrinsics or generated by the auto-vectorizer – can be optimized better.

The new optimizer takes advantage of the Static Single Assignment (SSA) form, which allows handling more complex expressions that potentially span the entire function. Another advantage of the SSA form is that it makes it possible to write simpler and more efficient algorithms, eliminating the need for more complicated and slower techniques such as data-flow analysis.

Peephole optimizations can now be implemented in a target-independent way, using a pattern matching system that is very fast (based on template meta-programming) and requires little code to be written. This allowed adding a large number of patterns in a fraction of the time it would take with the usual way of identifying patterns.

The same pattern matching mechanism can be used for vector operations, making it possible to optimize expressions using both integer and float vector operations as easily as expressions with scalar operations. Note that this feature is not yet complete and not yet enabled.

Designing a framework that allows easy development, with less potential for mistakes

Being able to quickly prototype ideas and move to a reliable implementation is one of the main advantages of the new framework. It includes various helpers for easier manipulation of the SSA form, pattern matching of expressions, building new expressions and doing safety checks in the presence of pointer aliasing and exception handling.

Performing better static analysis of the code

The new optimizer also adds new static analysis modules, including ones that can identify when a value is Boolean (exactly 0 or 1), when a value is always positive, and when a value cannot be zero. It also has a powerful module that can estimate the known one/zero bits of a value and the ranges a value could fall into. The results are used either as preconditions for certain optimizations, to eliminate some useless operations completely, or to transform operations into a form that can be optimized better.
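To make this concrete, here is a small hypothetical snippet (not from the post) annotated with the kinds of facts such analysis modules could derive:

```cpp
#include <cassert>

// Hypothetical example: facts the static analysis modules described
// above could derive about each intermediate value.
unsigned analyzeMe(unsigned x) {
    unsigned boolean = (x > 5);      // Boolean: exactly 0 or 1
    unsigned low     = x % 16;       // range [0, 15]; top 28 bits known zero
    unsigned nonzero = (x & 7) + 1;  // range [1, 8]; can never be zero
    return boolean + low + nonzero;
}
```

Knowing, for instance, that `nonzero` can never be zero would let a later division by it skip a zero check.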

Strong emphasis on testing and correctness

Given the large scope of the project, ensuring and maintaining correctness was a top priority. This was achieved by using formal verification, testing with randomly-generated programs (fuzz testing) and popular programs and libraries such as Chrome, Firefox, CoreCLR and Chakra. See the Testing approach section below for more details.

Examples of implemented optimizations

The following is an example that illustrates just a few of the many new transformations the new optimizer implements. This sort of code is often found in codecs:
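The original code sample did not survive in this copy of the post; a plausible stand-in (hypothetical, but consistent with the two patterns discussed next) is:

```cpp
#include <cassert>

// Hypothetical codec-style snippet: a parity test selecting between
// two constants. The optimizations described below can reduce it to
// the branch-free form 2 + (a & 1) * 2.
int test(int a) {
    return a % 2 != 0 ? 4 : 2;
}
```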

The execution time with the old optimizer is approximately 5 cycles in the best case (this assumes out-of-order execution and perfect branch prediction) and at least 10 cycles in the worst case. With the new optimizer, execution time is always 2 cycles. Obviously, there are also important savings in code size.

Very interesting results can be achieved by combining multiple smaller transformations. In this case, there are two patterns applied to produce the final result:

a % 2 == 0 -> (a & 1) == 0. Since the remainder is compared to zero, the sign of a does not affect the compare result, and the remainder can be replaced by an AND.

a<bool> ? C1 : C2 -> C2 + a*(C1-C2). A ternary question operation selecting between two constants. The first requirement is that the condition value is Boolean, which the static analysis package can determine. The second is that C1-C2 is a power of two, so that a shift or LEA is generated instead of a multiplication.
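The second identity is easy to check by hand for a Boolean condition; a small sketch:

```cpp
#include <cassert>

// For Boolean a (exactly 0 or 1): a ? C1 : C2 == C2 + a * (C1 - C2).
// With C1 = 4 and C2 = 2, C1 - C2 = 2 is a power of two, so the
// multiplication can become a shift or fold into an LEA.
int selectBranchy(int a, int c1, int c2) { return a ? c1 : c2; }
int selectArith(int a, int c1, int c2)   { return c2 + a * (c1 - c2); }
```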

Let’s see a few more examples of interesting optimizations and patterns that are implemented. Focus was put especially on operations that were previously not optimized very well, such as comparisons, conversions, divisions, question and control-flow dependent expressions (PHI operations in SSA form). Although some examples might seem unlikely to be written like that in the source code, they do appear quite often after inlining and other transformations.

The SSA form exposes larger expressions, which can span the entire function – this allows discovering more optimization opportunities, especially when combined with expression reassociation. Dozens of new patterns have also been added.

Most new float optimizations are enabled only under -fp:fast, but some of them are valid under the default -fp:precise.
More information about the optimizations allowed under the different floating-point models is available in the documentation: Microsoft Visual C++ Floating-Point Optimization.

Optimizing control-flow dependent expressions

I mentioned above that the SSA format simplifies handling larger, more complex expressions. One advantage is that it makes it easier to reason about variables that are either redefined, or defined with different values based on the path taken in the function. As its name implies, SSA solves this by creating a different version of the variable each time it is redefined; if there are points in the function where a variable has more than one possible value, a pseudo-operation known as PHI is inserted, merging all values.

Although building the SSA format is quite complicated, the example below should be simple enough to get a good intuition about SSA and the role of the PHI operations:
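The original example is not reproduced in this copy; a hypothetical function in the same spirit, with three variables redefined on both paths of an if-then-else (SSA renaming shown in comments), would be:

```cpp
#include <cassert>

// Hypothetical reconstruction (invented values): three variables are
// assigned on both paths, so SSA inserts a PHI for each at the join.
int test(int a) {
    int x, y, z;
    if (a > 3) {                 // x1 = 4; y1 = 4; z1 = a & 0xFF00
        x = 4; y = 4; z = a & 0xFF00;
    } else {                     // x2 = 9; y2 = 2; z2 = a << 4
        x = 9; y = 2; z = a << 4;
    }
    // x3 = PHI(x1, x2); y3 = PHI(y1, y2); z3 = PHI(z1, z2)
    int p = x * y;               // possible values {16, 18}; minimum 16
    int q = z & 0xF;             // low 4 bits are zero on both paths
    if (p >= 16 && q == 0) return 1;
    return 0;
}
```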

As can be seen from the renamed versions (indicated by the number suffix), after the if-then-else statement each of the three variables can have two different values, depending on the runtime result of a > 3, making it necessary to insert PHI operations.

The new optimizer is able to take advantage of the PHI operations and turn the entire function into the equivalent of return 1, all other code being removed by Dead Code Elimination. That’s 1 instruction compared to the 18 that were generated before on x64.
For p1 >= 16, the optimizer computes every possible value and compares it with 16, which is the minimum possible value. For q1 == 0, it checks whether the low bits are known to be zero in both z1 and z2.

The old expression optimizer is not able to reason about the larger expressions that involve these PHI operations – this causes it to miss many optimization opportunities, like the ones exemplified above. In the new optimizer, every operation and static analysis supports PHI. A few more examples:

The following is an interesting case found in Mozilla Firefox. A Boolean expression, spanning an if-then-else statement, is used in a negated form if(!expr). The new algorithm that tries to cancel an inverted Boolean operation by inverting every subexpression did the following transformation, eliminating the inversion:

(phi 0, (x ? 1 : 0)) ^ 1 -> phi 1, (x ? 0 : 1)

Better conditional move generation

Converting branches to CMOV produces more compact code that usually executes faster. The late CMOV generation phase is augmented by generating question operations in the new optimizer; in doing so, already-existing transformations can be applied to the newly detected CMOV patterns, simplifying things even further.

CMOV performance can sometimes be hard to estimate, especially on modern CPUs with good branch prediction. To help in cases where a branch would be faster, when profile information is available, the CMOV is not generated if the branch is highly predictable (heavily biased as either taken or not-taken).
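A classic CMOV candidate (a sketch, not from the post) is a maximum function, where the selected values are cheap and the branch may be unpredictable:

```cpp
#include <cassert>

// Sketch: this select typically compiles to cmp + cmovg on x86-64
// instead of a conditional jump, unless profile data says the branch
// is highly predictable.
int maxOf(int a, int b) {
    return a > b ? a : b;
}
```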

Improved optimization of compare operations

Comparisons are the operations with the most improvements. Since reducing the number of branches benefits both code size and performance, the focus was mainly on branch folding (eliminating a branch by proving that it is either always taken or never taken). Besides the usual tests for comparing constants, static analysis is used to estimate value ranges and known one/zero bits, making it possible to handle more complicated cases. Among the dozens of transformations that simplify comparisons, the following are examples that can reduce execution time substantially:

Folding comparisons and branches: comparisons are folded using both known bit and value range information. For example, given a == b, if a is known to have a bit set at a position where it is definitely not set in b, the two values cannot be equal. This can be applied to other conditions such as less-than by checking the sign bit. When using value ranges, every range of a is compared with every range of b.
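A sketch of the known-bits case (hypothetical example): bit 0 is provably one in one operand and provably zero in the other, so the equality can be folded to false:

```cpp
#include <cassert>

// Bit 0 is known to be one in a (x | 1) and known to be zero in
// b (y << 1), so a == b can be folded to false at compile time.
bool neverEqual(unsigned x, unsigned y) {
    unsigned a = x | 1;
    unsigned b = y << 1;
    return a == b;
}
```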

Improved overflow checks: optimizing a + C1 < C2 into a < C2 - C1 is not valid in general, since a + C1 might overflow, giving a different result. Using the known bits or value ranges, it can be proven that the addition does not overflow. In practice, this usually happens when a is a zero-extension from a smaller type.
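For instance (a sketch under the stated zero-extension assumption):

```cpp
#include <cassert>
#include <cstdint>

// a comes from a zero-extended 8-bit value, so a is in [0, 255] and
// a + 10 cannot overflow; a + 10 < 20 can then safely become a < 10.
bool inRange(uint8_t byte) {
    uint32_t a = byte;
    return a + 10 < 20;
}
```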

Discovering Boolean and positive values: used as pre-conditions for various optimizations, such as the ones applied on question operations. Another example is eliminating an ABS intrinsic if the value is already positive.

Removing redundant AND/OR instructions, eliding useless conversions:

a % C -> 0 if C is a power of two and the low bits in a are zero (a is a multiple of C)
a & C -> 0 if all bits that are one in C are known to be zero in a
a | C -> a if all bits that are one in C are known to be one in a
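A small sketch combining the three listed patterns (hypothetical code):

```cpp
#include <cassert>

// a = x << 4 has its low 4 bits known zero, and b = x | 0xFF has its
// low 8 bits known one, enabling all three folds listed above.
unsigned demo(unsigned x) {
    unsigned a = x << 4;
    unsigned r1 = a % 16;        // a is a multiple of 16       -> 0
    unsigned r2 = a & 0xF;       // one-bits of 0xF are zero in a -> 0
    unsigned b = x | 0xFF;
    unsigned r3 = b | 0x0F;      // one-bits of 0x0F are one in b -> b
    return r1 + r2 + (r3 - b);   // always 0
}
```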

Improved Common Subexpression Elimination

Common Subexpression Elimination is an optimization that eliminates redundant operations by replacing them with the result of previous ones that compute the same value – this happens much more often than one may expect. The existing algorithm is augmented with one based on Global Value Numbering, which increases the number of expressions that are found to be equivalent. Although this is a quite simple initial implementation that will be made more powerful, it shows significant improvements for both code size and performance.

Eliminating redundant operations before doing the expression optimization also exposes more opportunities.
For example, (a + b) - c -> a if b is found to be equivalent to c.
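As a sketch (hypothetical code), value numbering proving b and c equivalent is what makes the fold possible:

```cpp
#include <cassert>

// b and c compute the same value, so Global Value Numbering assigns
// them the same value number and (a + b) - c collapses to just a.
int simplify(int a, int x) {
    int b = x * 2;
    int c = x + x;
    return (a + b) - c;
}
```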

Taking advantage of signed integer overflow being undefined

Historically, Visual C++ did not take advantage of the fact that the C and C++ standards consider the result of overflowing signed operations undefined. Other compilers are very aggressive in this regard, which motivated the decision to implement some patterns which take advantage of undefined integer overflow behavior. We implemented the ones we thought were safe and didn’t impose any unnecessary security risks in generated code.

A new undocumented compiler flag has been added to disable these optimizations, in case an application that is not standard-conformant fails: -d2UndefIntOverflow-. Due to security concerns, we have also seen cases where these patterns should not be optimized, even though following the C and C++ standards would allow us to, since a potential signed addition overflow is undefined behavior.

Overflow tests of this sort, such as a + b < a (and the similar ones with subtraction), are frequently used to check for overflow in places such as file readers and memory allocators. While this use is non-conformant with the standard and a well-known issue, enabling these transformations could potentially break the security of those applications.
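The shape of such a check, as a sketch (hypothetical code; the addition invokes undefined behavior exactly when it wraps, which is why the standard would allow folding the test away):

```cpp
#include <cassert>

// Non-conformant overflow check of the kind found in file readers and
// memory allocators: it relies on signed wrap-around, which the C and
// C++ standards leave undefined.
bool additionWouldWrap(int a, int b) {
    return a + b < a;   // could legally be folded to: return b < 0;
}
```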

Impact on code size

For most applications code size is reduced, but it can also increase due to interactions with other optimizations. For example, a smaller function is more likely to be inlined into multiple places, resulting in a size increase overall.

Below are some code size results from compiling several large applications on x64:

Application     Old optimizer (bytes)    New optimizer (bytes)    Reduction
Windows         1,112,545,269            1,112,096,059            438 KB
SQL Server      64,078,336               64,032,256               46 KB
Chakra          5,963,621                5,952,997                10 KB

The following table lists the number of instructions, split by category, for the Windows Kernel built for x64 with link-time code generation and profile information. It can be seen that the number of more expensive instructions, such as branches, divisions and multiplications, is reduced. The increase in CMOV and SETcc is a result of more branches being converted to conditional code.

Instruction type    Old optimizer    New optimizer    Difference
CONVERSION          28075            27301            -774
LEA                 87658            87395            -263
SHIFT               15266            15194            -72
SETcc               2222             2345             +123
JUMP                19797            19791            -6
BRANCH              143795           142591           -1204
MUL                 2115             1990             -125
DIV                 541              530              -11
CMOV                4192             5913             +1721

Impact on compiler throughput

For all these improvements, compile time remains mostly the same, with about +/- 2% difference, depending on the application being compiled. For example, Google Chrome shows a compile time slowdown of 1.7%, while compiling the Windows Kernel shows a 2.6% speed-up. The speed-up can be explained by having less code go through the old, slower optimization passes.

Testing approach

Based on previous experience and the scope of the project, it was clear from the start that extensive testing needed to take a central role to ensure correctness. Several testing approaches were used, some to prevent mistakes in the first place, others to catch implementation problems:

Preventing implementation bugs by formally verifying the patterns

Most patterns are quite simple, such as x & 0 => 0. But there are also patterns that require validation that is not always obvious, leaving room for mistakes. The most common validation bugs are:

Failing to check for input preconditions, such as requiring positive numbers, powers of two, numbers with the top N bits zero, etc.

Failing to differentiate between signed and unsigned operations. This is especially dangerous for instructions such as CMP, DIV/REM and SHR.

Alive, a formal verification tool by Nuno Lopes from Microsoft Research, was used to ensure the patterns and preconditions are correct before implementing them. It uses a language similar to LLVM IR and the Z3 theorem prover to verify whether an input pattern is equivalent to the output pattern; if not, it prints a counterexample. Alive has already been used by the LLVM community with great success to discover many bugs. More details about Alive can be found on John Regehr’s blog: ALIVe: Automatic LLVM InstCombine Verifier.

Covering and testing as many patterns as possible using random tests

Csmith is a random C program generator that has been used to discover a large number of bugs in various compilers. More than 15 million programs generated using Csmith have been tested, revealing several bugs in the new optimizer, plus bugs in other optimizer components. Very helpful in dealing with the huge failing tests was C-Reduce: it was able to reduce 200 KB tests down to 2-3 KB, making it much easier to spot the code with the bug.

Testing every three-instruction expression

Opt-fuzz, a tool by John Regehr from the University of Utah, is able to generate every small integer expression with N instructions and a limited number of possible constants as LLVM IR. The Clang/C2 project made it possible to test all 250+ million tests generated for three-instruction expressions, which revealed several subtle bugs.

Using instrumentation and runtime checks

Complex components, such as the Bit Estimator and Value Numbering, were tested by instrumenting the compiled code with calls to a runtime library that verifies if the compile-time, static analysis results are actually valid. For example, in the case of the Bit Estimator, it would verify that the bits that were estimated to be always zero are zero at runtime. In the case of Value Numbering, it would ensure that two instructions that were assigned the same value number have the same value at runtime.
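A sketch of what such runtime verification calls might look like (all names are invented for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical runtime-library checks the instrumented code could call.
// The compiler would insert these calls next to the values it analyzed.
void verifyKnownZeroBits(uint64_t value, uint64_t claimedZeroMask) {
    // Bits the Bit Estimator claimed are always zero must be zero.
    assert((value & claimedZeroMask) == 0);
}

void verifySameValueNumber(uint64_t a, uint64_t b) {
    // Two instructions assigned the same value number must agree at runtime.
    assert(a == b);
}
```

For example, after computing a = x << 4 the instrumented code could call verifyKnownZeroBits(a, 0xF).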

Testing with popular open-source projects

Exposing the compiler to more real-world code proved to be an effective way to find more bugs. This includes building and testing Google Chrome, Mozilla Firefox, CoreCLR and Chakra.

Future improvements

As I mentioned at the start of the blog post, the framework is designed to be the place where many of the future optimizer features will be implemented. Below are some of the optimizations that are very likely to be part of the next major Visual Studio release – it does not include any of the longer-term projects that are planned:

Complete and enable the optimization of vector operations

Better optimization of Boolean expressions in C++ code

Removal of operations with no effect on the expression result

Merging similar branches

Several Bit Estimator improvements

Closing remarks

Please try building and testing your applications with the new optimizer and report any problems that you might find. We are looking forward to your suggestions and opinions in the comments section. Let us know if you have examples of cases that could be optimized better and are not yet handled.

We are glad to finally be able to share this exciting new work with you! This marks the start of many optimizer improvements that will be added in the future releases of the compiler – we will keep you posted.

1. __assume is not used by the bit estimator, but a planned improvement is to use the alignment info from variables to prove that the bottom N bits are zero. Alignment info is also propagated between functions with LTCG, so if all callers use a pointer argument with the same alignment, the alignment is known for the parameter.
2. The vectorizer uses runtime checks – one example is the one for alignment, mentioned above. The time spent in the checks is tiny compared to the actual vector loops; I’m not sure it’s worth adding a non-standard extension in this case.

I also think that taking conditions inside __assume would help some people.
Therefore I have related questions:

1. Would bit/range estimators work in the code like this:
if (x & 0xFFFF != 0)
__assume(false);
It seems that __assume(false) should always work, and supposedly the new optimizer can derive information about bits from the fact that the branch is marked as dead?

2. If the code above works well, then it seems a bit strange that __assume(x & 0xFFFF == 0) does not work.

BTW, this is similar to the way __builtin_unreachable() is used in GCC.

Performance depends on the application – it does improve several benchmarks by a double-digit percentage.
We did not test if there is a throughput improvement of the compiler itself, it’s an interesting idea. I think a significant improvement is unlikely though, the slow parts, in the backend at least, are more of an algorithmic nature. We do plan to replace some of the old and slower optimizations with new ones implemented in this framework.

That would be another way to write it.
In my example, doing an arithmetic right shift on the sign bit produces a bitmask that is either all-ones or all-zeros. An AND with the bitmask then produces the expected result.

I am surprised that after all the heavy lifting, the measured differences between the old and the new optimizer seem quite marginal. But maybe only the comparison methodology is at fault: a saved BRANCH/DIV/MUL inside an inner loop is great; in a process startup routine (where it would be executed just once), it is immaterial.

Of course, optimizing away some instructions in the right place can give significant improvements in certain benchmarks.
If you refer to the code size wins: they are a bit hard to measure because inlining behavior changes substantially – more inlining is done (especially in C++ code with STL), the code tends to increase if a function is inlined into multiple places. It still manages to save some code size, even though this wasn’t a priority.
The instruction type statistics are also influenced by inlining. Another thing to consider is that I showed results for the kernel, which is likely more carefully written than the average code.

I posted this on HN[1] as well but would you consider grouping “-d2SSAOptimizerUndefinedIntOverflow-” (that is disabling the optimizations) when the /sdl flag is passed to the compiler? I think this would make sense since if you’re using /sdl you would really rather it just generate safer, more secure code, and furthermore if you’re using /sdl you probably don’t care about the likely trivial performance gain from it being enabled. If other possible unsafe optimizations (e.g. overflow, aliasing, etc) are being added then you may want to consider grouping them under /sdl as well.

I realize that it’s been designed to try and only use this optimization in safe areas but really regardless of the guarantee here I’m not sure that it can always be trusted to do the right thing. I also wondered about making this option more than a hidden one but I’ll accept simply grouping it under /sdl.

Perhaps in a world where I can write all the code for every dependency that my already obviously perfect code can use then maybe that would work.

Meanwhile in the real world where people are humans, make mistakes and compile code they didn’t write themselves I’m not sure how I could possibly ensure that your assumption is always true. I would rather the compiler not optimize out a check that it may deem unnecessary for a near trivial performance gain.

I don’t have an issue with the optimization, it’s when I’m explicitly compiling for security where I would rather it not be enabled.

Not sure about number of levels, but it does change the inlining behavior quite a lot: by becoming smaller, functions are more likely to get inlined. C++ code using STL seems to benefit the most from it.

While testing for code generation bugs introduced by the new optimizer, I noticed that the compiler does not respect the /Zc:throwingNew option anymore. At first I thought it was due to the new optimizer, but after a few more tests, it seems the version in Update 2 already has this behavior.

In short, the compiler inserts a check for a null pointer after each call to operator new and, if it is zero, calls into the __imp__invalid_parameter_noinfo_noreturn function.

Hey, I just tried a small example and as far as I can tell throwingNew is still being honored. I don’t repro the problem you describe. If you could attach a fully self contained example to the bug you linked that would be helpful.

Here is my example:

#include <cstdio>

struct Y {
__declspec(noinline) Y() { printf("Y"); }
};

Y* foo() {
return new Y();
}

The null check in foo() is omitted correctly when compiling with /Zc:throwingNew /EHsc /c, and is present without /Zc:throwingNew.

A check for a value returned by operator new is inserted in the following code:

// #define _CRT_SECURE_INVALID_PARAMETER(expr)
#include <vector>

int main()
{
std::vector<int> v(100);
}

Unless the first line is uncommented. So it actually looks like /Zc:throwingNew does not play nicely with secure CRT features. Or the optimizer is not smart enough to eliminate parameter checks after operator new.

How else (besides NuGet) can I get my hands on the optimizer?
NuGet does not seem to work for me. I always get the following errors:
[Package source] An error occurred while sending the request.
The remote name could not be resolved: ‘packagesource’
[Package source 2] An error occurred while sending the request.
The remote name could not be resolved: ‘packagesource2’
[Package source 3] An error occurred while sending the request.
The remote name could not be resolved: ‘packagesource3’
[Package source 4] An error occurred while sending the request.
The remote name could not be resolved: ‘packagesource4’

Package source should be set to “http://vcppdogfooding.azurewebsites.net/nuget/”.
Another way to get the toolset using Nuget CLI is as below.
“nuget install VisualCppTools -source http://vcppdogfooding.azurewebsites.net/nuget/ -Prerelease”

The optimizer is already integrated and enabled by default; for now it is available as a NuGet package, which allows us to release compiler versions externally that are very close to the latest development version. It will be released officially as part of VS 2015 Update 3.

I have noticed a couple of sequences that MSVC’s lower level 64-bit codegen doesn’t seem to do.

The first is that a bitwise OR of a 16-bit value with a constant never emits an or r16, imm16 instruction; instead, it emits a mov r32, imm32; or r16, r16 (where the second r16 in the or is the lower half of the r32 in the mov).

The second is that doing a bitwise AND of a 64-bit value with 0xFFFFFFFFFFFF0000 to clear the lower 16 bits generates and r64, imm32, where imm32 is sign-extended to 64 bits. GCC 4.9 generates the much shorter sequence that achieves the same result: xor r16, r16.

The last that I’ve noticed so far is that MSVC seems to be over-aggressive when zero-extending a 32-bit value to 64 bits, and will emit the zero-extension even when the upper 32 bits of the value are going to be destroyed because it’s being shifted left by 32 bits. In the particular case I noticed, this meant it was generating a mov eax, eax, which would be correct if the top 32 bits weren’t clobbered in the instruction right after it.
uint32_t b;
uint64_t a = ((uint64_t)b << 32) | b;

Do you have examples for the first and last cases? I don’t see a difference between VC++ and what GCC and LLVM generate.
I tested using these examples: https://godbolt.org/g/hX8hYb

The second case must be handled long after the optimizer runs, when the code is closer to the target architecture and the sub-registers are exposed. I’m going to let the team working on that part know about this issue.

I actually managed to trigger both the first and third cases in the same function, which I use for case-insensitive hashing of identifiers (in the context of a compiler and a C-like language):

```cpp
size_t CaselessIdentifierHasher::doIdentifierHash(const char* s, size_t len) {
    const char* cStr = s;
```

There's also the issue of the dec for i being done with a separate test i, i later, when it could be done by moving the dec to where the test is, where it would actually get fused into a single micro-op. But I haven't been able to find a way to structure that loop so that any compiler would generate that :(

1) I echo the request for benchmark performance results. If you going to claim better optimization then back it up with some numbers please.
2) It looks like this will do a better job of removing dead code. Any chance we can get some output somewhere that this happened? It’s all too common that I find a bug in some code through inspection, spend a significant amount of time trying to reproduce it, and find that I can’t because it’s dead code. Gosh… wish the compiler or linker had told me!

To answer your questions:
1) There are multiple benchmarks with double-digit percent performance improvements – some are part of SPEC2k and SPEC2k6, some are from other popular applications and benchmarks. The improvements you might get depend a lot on the application and where the optimizations kick in – if it’s in a hot loop, it’s very likely to see an improvement. If not, you’re not seeing much. On average, code size does tend to decrease. The performance improvement can also depend on how an application is compiled – we saw different numbers when comparing -O2, -O2 with LTCG and -O2 with LTCG and profile info for the same application.
2) Reporting code that gets deleted during optimization is much too noisy and of little use. For example, after inlining it’s quite common to delete large parts of the inlined function (think branch/switch on a parameter that was replaced by a constant).

And if the entire function is deleted everywhere in the program? Don’t you think that would be useful information to the developer? That’s one more function I don’t have to maintain, test, or worry about.

BTW: I know that information is available by the linker if /verbose is used. However, it’s very hard to dig through and only works on release builds.

Let me put it another way: how do you know an optimization did anything if you don’t test it? Do you just look at the machine code and decide it’s better? That seems flawed. If you do test for performance changes before and after the optimization, where are those numbers? Maybe the problem is that you don’t have a micro-benchmark test framework. They can be difficult to produce, but you only have to do it once.

So I’m not buying the excuses. Either each optimization made it better, or it didn’t. If you don’t know…that’s a huge problem. If you do know, share it please.

First, nobody stated they did not test it (not in this reply, nor in the post). Second, most of those optimizations are provable (especially if the number of ops goes down). And the rest can be shown through various estimates, based on known qualities (e.g., multiplication is always faster than division, per the recommendations in Intel’s and AMD’s optimization guides).

Also looking at final assembler output can tell a lot about changes. (even can help spot some kinds of bugs)

Also your entire post is addressing problems not in evidence and seems to not address post at all. (including alleged excuses)

Guess I have more to say…. Do you know what vapor-ware is? It’s when someone says they have a piece of software that does something but they are unable or unwilling to demonstrate that it actually does what they say.

So, I applaud the change you are making and _hope_ that it results in faster code but until it is _demonstrated_ to do so…it’s just words (vapor).

I AM addressing the post directly. It’s just words until you can show (with numbers) that your change makes a demonstrable difference. You’ve done that with the size of the executable, but not with the speed.


Pentalon

@DavidLarson:
If you don’t trust them now when they say that they believe they are making improvements based on their private experiments, then why would you suddenly start trusting them when they post benchmark numbers about those improvements? Wouldn’t those numbers just be more unverified words from your perspective?

Also, it’s not even formally released yet. How are they supposed to demonstrate it? And what would constitute a valid demonstration to you? You clearly don’t trust them already. Do you need a live video to ensure they aren’t lying? How do you know the video isn’t spoofed?

If you want verification, and you don’t trust them, then you should do the verification yourself, or seek it from another party that you do trust. I don’t think it’s nice to harass them about their veracity when they’re just trying to let us know generally what’s going on, what’s coming down the line, and asking us for help in checking it out.

There are a couple of things to remember, especially with the whole “vapourware” comments.
First, the optimizer is available in the daily compiler builds, so it exists. This contradicts the possibility of it being vapourware.
Secondly, benchmarks at this point would be worthless beyond showing that the optimizer does something. Until the optimizer code is complete and working as intended, benchmarks won’t be representative of the final code. Since they stated that they are targeting Update 3, the most useful version of the compiler to test would be the Update 3 compiler.

Because when I go to buy a car, I avoid the sales people who say, “We don’t publish performance numbers, go find out for yourself!” or “Going from zero to 60 is an unrealistic scenario, so we’re not going to do any benchmarks”.

No offense, but this NuGet thing is really a pain and has really odd behavior.
Yes, I read those “help” and blog links, but none of it seems to work.
Should this work as described on Windows 8.1, or must it be Windows 10?

After messing with the thing for some time, I figured out the easiest way was to open up a project in VS2015 and use “Manage NuGet Packages…”. The result is that it installs (all 1.23 GB of it) into the root of my C++ project folder: “\packages\VisualCppTools.14.0.24127-Pre”.
Is this how it should be installed?
Is there a way to make it the global compiler choice, at least temporarily?

NuGet seems to be more for downloading and associating a library with a project, or all projects in a solution.
One way to make the compiler from the package the default – and also available from the command line – is to copy the package content to the Program Files directory. It’s quite an ugly solution, but it seems to work well:

– The default binaries are located in C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin. Make a backup of the entire folder.
– Copy all files from packages\VisualCppTools.14.0.24127-Pre\lib\native\bin over, overwriting everything.
– You should be able now to compile any project using the new compiler.
– When you are done, restore the files you backed up.


I have been running some tests with Update 3 and the performance of some of my x64 C/C++ benchmarks nearly doubled or tripled over the older compiler. Great work!!
However, it seems the optimizer is very sensitive to certain compiler switches. For example, if I turn off the buffer security check (/GS-), it seems to disable the new optimizations. If I select /O3 over /O2, I actually see the performance benefit go away as well, surprisingly. The best performance comes from just leaving everything at the defaults for a new Release build…

Glad to hear about the improvements you’re seeing!
What exactly are you comparing? Update 3 optimized vs. Update 2 (or earlier) optimized? Note that there is no /O3 option for VC++: /O2 is optimize for speed, /O1 is optimize for size. /GS- should not affect things very much, but it might change the code enough for the optimizer to miss some cases.

What are the chances of getting something like GCC’s __builtin_expect (aka likely/unlikely)? That’s still my most-missed compiler/optimiser feature, and it’s a very easy way to tell the compiler which blocks are less likely to execute (and thus better hidden behind a branch than the others).

Prior arguments against implementing it seem to be that “people would get this wrong, use POGO instead”, but POGO seems a lot more cumbersome and error-prone to me. It’s typically very obvious to a human which code paths are less likely (especially error-handling paths; where it’s not obvious, the hints can simply be omitted), and it seems much more sensible to bake this knowledge into the code itself as it’s written, rather than relying on a profile run to replicate it every time.

The argument for not implementing this feature is that it is non-standard. MSVC is pushing to implement standards features, not extend the language in ways that are incompatible with other compilers. (We’ve done too much of that in our past.)