How we test the compiler performance

The C++ back-end team is very conscious of the performance of our product. Today I will present an overview of how we define “performance of our product” and how we measure it. Along the way I hope to introduce some new ideas that you can use to test your own product’s performance. You can read Alex Thaman’s blog post on “How we test the compiler backend” for some background on compiler testing. Additionally, you can read Asmaa Taha’s blog post on VCBench, our performance test automation system.

Questions before answers

Some of the questions that you should answer before you start measuring the performance of your product are:

·What features of my product will customers need to be performant?

·What scenarios can I run to reliably measure performance?

·What metrics can I measure that will produce actionable results?

·How can I execute the tests to minimize variability?

Picking what features to measure

Understanding customer needs is the first step towards figuring out what feature areas they will need to be performant. Some of the avenues that we identify performance scenarios through are:

·Connect bugs reported by you (we pay a lot of attention to Connect bugs)

·Feedback from internal Microsoft teams

·Common historical performance bug reports

Based on this data we break down performance of the Visual C++ compiler into three areas:

·Compiler Throughput – seconds to compile a set of sources with a set of compiler parameters

oEspecially important for customers who have to wait for builds to finish

·Generated Code Quality – seconds to run generated executable with a fixed workload

oEspecially important for technical computing, graphics, gaming and other performance intensive applications

·Generated Code Size – number of bytes in the executable section(s) of the binary

oEspecially important for applications that need to run on less memory

When deciding whether or not to add an optimization algorithm we look at its impact on these three areas across a number of benchmarks. In a perfect world the Visual C++ compiler could take infinite time to compile and generate the perfect binary. Realistically we make tradeoffs to generate the most optimized binary possible in a reasonable amount of time. We try to make sure that each optimization algorithm we use has an appropriate ratio of compilation time cost to code quality and code size benefit.

What scenarios do we run/what metrics do we measure?

Results for correctness tests are easy to report — pass or fail and in the case of fail give enough information about what failed to dig into the issue. Performance tests are trickier. Instead of a black or white pass/fail, you have “we’re fairly sure there is no appreciable change”, “we’re fairly sure there is a significant change”, or “there is too much variation to tell whether something changed”. To make matters worse, some areas of performance are changing while others are not; some are improving and some are regressing. On one extreme you can report thousands of different metrics for each tiny part of the performance that can change. This will drown your developers in numbers so that it takes them forever to interpret them. On the other extreme if you report too few numbers or the wrong numbers you will be hiding important changes and therefore missing regressions.

In general you want to report as few metrics as possible that give you the fullest coverage of your identified scenarios. Given that [large] caveat, here are the metrics that we have decided to monitor and report:

Throughput

We measure the compilation time of certain critical parts of Windows and SQL as well as smaller projects on a daily basis in order to track our throughput. On a less regular basis we time how long it takes to build all of Windows. We gather separate metrics for the front-end, the back-end and the linker. Together these sum to the entire compilation time, but tracking them separately lets us be more granular in triaging where a performance regression came from.
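To illustrate why the per-phase split helps with triage, here is a minimal Python sketch. The timings and the function name are invented for illustration only and are not taken from our real runs.

```python
def phase_deltas(baseline, changed):
    """Return the change in seconds for each compilation phase."""
    return {phase: changed[phase] - baseline[phase] for phase in baseline}


# Hypothetical timings in seconds for one project (made-up numbers).
baseline = {"front-end": 120.0, "back-end": 200.0, "linker": 40.0}
changed = {"front-end": 121.0, "back-end": 230.0, "linker": 40.0}

deltas = phase_deltas(baseline, changed)

# The total tells us that something regressed...
total_delta = sum(changed.values()) - sum(baseline.values())

# ...while the per-phase numbers tell us where to start looking.
worst_phase = max(deltas, key=deltas.get)

print(f"total change: {total_delta:+.1f}s, largest contributor: {worst_phase}")
```

With only the total, the regression would have to be bisected across the whole toolchain; with the split, it can be routed straight to the team that owns the phase.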

Code Quality / Code Size

We build and run a set of industry standard integer and floating point benchmarks in order to monitor the code size and code quality of our optimized code generation. Each benchmark is real world code, but with the special constraints of being CPU bound (no waiting on user input or heavily reading/writing to the disk). Some of the benchmarks we run exercise: cryptographic algorithms, XML string processing, compression, mathematical modeling, and artificial intelligence. We also measure NT/SQL for code size changes.

For code size we use “dumpbin.exe /headers <generated_binary>” and then accumulate the virtual size of all the sections marked as executable. We do this instead of the simpler “how big is the entire binary” because if 98% of a binary is data (read: strings, images, etc) and we double code size then it only reads as a 2% code size regression. This is an example of where carefully choosing what metric you report allows us to more accurately see how our optimizations affect the size of the executable code sections of a binary. Dumpbin.exe is a tool that ships with Visual Studio.
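As a rough sketch of the accumulation step described above, the Python below sums the virtual size of every section whose flags include the executable bit (0x20000000, IMAGE_SCN_MEM_EXECUTE in the PE format). The SAMPLE text is a hand-written, abbreviated imitation of "dumpbin.exe /headers" output, not captured from a real binary, and the function name is made up for illustration. Check the layout against real output from your toolset before relying on it.

```python
import re

# Section flag bit that marks a section as executable in the PE format.
IMAGE_SCN_MEM_EXECUTE = 0x20000000

# Hand-written, abbreviated imitation of "dumpbin /headers" section output.
# This is an assumption about the layout, not a real capture.
SAMPLE = """\
SECTION HEADER #1
   .text name
    1A00 virtual size
    1000 virtual address
60000020 flags
         Code
         Execute Read

SECTION HEADER #2
  .rdata name
     C40 virtual size
    3000 virtual address
40000040 flags
         Initialized Data
         Read Only

SECTION HEADER #3
   .orpc name
     200 virtual size
    5000 virtual address
60000020 flags
         Code
         Execute Read
"""


def executable_code_size(dumpbin_text):
    """Sum the virtual size (in bytes) of all executable sections."""
    total = 0
    # Everything before the first section header is ignored.
    blocks = re.split(r"^SECTION HEADER #\d+\s*$", dumpbin_text, flags=re.M)[1:]
    for block in blocks:
        size = re.search(r"^\s*([0-9A-Fa-f]+) virtual size", block, flags=re.M)
        flags = re.search(r"^\s*([0-9A-Fa-f]+) flags", block, flags=re.M)
        if size and flags and int(flags.group(1), 16) & IMAGE_SCN_MEM_EXECUTE:
            total += int(size.group(1), 16)  # dumpbin prints sizes in hex
    return total


print(executable_code_size(SAMPLE))
```

In practice the text would come from running the tool, for example with subprocess.run(["dumpbin.exe", "/headers", path], capture_output=True, text=True).stdout, rather than from an embedded string.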

For code quality we use whatever metric the benchmark deems as appropriate for measuring the quality of the generated code. In many instances this is execution time, but in some cases it may be something like “abstraction penalty” or “iterations per second”. It is important to note whether a larger number in the metric is better or worse so that developers know whether they are improving or regressing it!

Minimizing variability

One of the biggest problems in tracking performance is that, with the exception of code size, the results that are produced vary from one run to the next. Unlike correctness test cases which pass or fail, performance tests are susceptible to machine variances that can hide a performance change or show a performance change when there is none. To increase the fidelity of our results we run performance tests multiple times and merge the results into an aggregate value. This improves the accuracy of the results and reduces the influence of outliers.

Use an appropriate aggregation method

We have found that performance results do not usually have a normal distribution; they are much closer to a skewed distribution. Because of this, using the median to aggregate results is better than using the mean of all of the results, since the median is less susceptible to large outliers. You can study the characteristics of the results you are getting out of your performance runs and determine the best method of aggregation. It may turn out that using the minimum, maximum or a quartile result is best if you only have outliers in one direction. Our team primarily uses the median.
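A small Python sketch of the difference, using made-up timings in which one run was disturbed by machine noise. The aggregate helper and its method names are illustrative assumptions, not part of our tooling.

```python
from statistics import mean, median

# Hypothetical timings (seconds) for one benchmark; one run hit machine noise.
runs = [10.1, 10.0, 10.2, 10.1, 14.9, 10.0, 10.1]


def aggregate(samples, method="median"):
    """Collapse a list of iteration results into a single number."""
    methods = {"median": median, "mean": mean, "min": min, "max": max}
    return methods[method](samples)


print("median:", aggregate(runs))          # barely affected by the 14.9 run
print("mean:  ", aggregate(runs, "mean"))  # pulled upward by the 14.9 run
print("min:   ", aggregate(runs, "min"))   # an option when noise only adds time
```

The single slow run moves the mean by more than half a second while the median stays at the typical value, which is the behavior you want when outliers come from the machine rather than from the code under test.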

Remove outliers

Here are some simple methods to remove outliers:

·Remove the top and bottom X% of results

·Remove all results outside of X standard deviations from the sample mean/median

In addition to throwing away general outliers, there can be a significant difference in the performance of a binary when it is not already loaded in the cache. Throwing out the first iteration as a warm up run can counteract this. Note: If your customer usage scenario is that they will primarily be executing it on a cold cache then you need to be concerned with this number!
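The three filters above (warm-up removal, percentage trimming, and a standard deviation cutoff) can be sketched in Python as follows. The function names, default values and sample timings are assumptions chosen for illustration.

```python
from statistics import median, stdev


def drop_warmup(samples, warmup=1):
    """Discard the first iteration(s), which may have run on a cold cache."""
    return samples[warmup:]


def trim_percent(samples, percent):
    """Remove the top and bottom `percent` of results (by count)."""
    ordered = sorted(samples)
    k = int(len(ordered) * percent / 100)
    return ordered[k:len(ordered) - k]


def within_sigma(samples, sigmas):
    """Keep results within `sigmas` standard deviations of the median."""
    center = median(samples)
    spread = stdev(samples)
    return [s for s in samples if abs(s - center) <= sigmas * spread]


# Hypothetical timings: a slow first (cold) run and one noisy run later on.
runs = [13.0, 10.1, 10.0, 10.2, 10.1, 14.9, 10.0, 10.1, 10.2, 10.0, 10.1]

steady = drop_warmup(runs)
print(trim_percent(steady, 10))
print(within_sigma(steady, 2))
```

Note that percentage trimming always discards results, even from a clean run, while the standard deviation cutoff only discards results that are far from the rest. Which is preferable depends on how your results are distributed.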

Stabilize the machine before running the benchmark

The average Windows machine has a lot of services running on it, from SQL Server to IIS to anti-virus. Shutting down as many applications and services as possible before executing your benchmark will help to reduce variability. Anti-virus in particular is important to disable because of all the places it can hook into the system and introduce additional overhead. Disabling your network adapters will also help to reduce noise.

Making results actionable

At this point you have a set of benchmarks that run for a number of iterations on a stabilized machine, and you are aggregating the set of result iterations into two numbers that are your baseline results and your changed results. We now need to compare these results and determine whether they are:

1.The same (no performance change)

2.Different (a statistically significant change)

a.Significant improvement

b.Significant regression

3.Unactionable (we cannot tell if they are the same or different)

If the results are accurate enough, or separated enough, it can be easy to eyeball the numbers and tell which category they fall in. However, if the numbers are fairly close or you are looking to fully automate this step, you will need a more concrete algorithm for categorizing the results. This is where confidence intervals are useful: they allow you to say with a certain confidence level that two sets of results are the same, that they are different, or that you need more iterations to tell one way or the other.
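One possible way to automate the categorization is sketched below in Python. This is not a description of our production algorithm. It assumes a confidence interval on the difference of the two sample means, uses a normal approximation (with only a handful of iterations a Student's t interval would be more appropriate), and introduces a hypothetical tolerance parameter to separate "same" from "unactionable".

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def classify(baseline, changed, confidence=0.95, tolerance=0.01,
             lower_is_better=True):
    """Classify a change as improvement, regression, same, or unactionable.

    tolerance is the largest uncertainty (as a fraction of the baseline mean)
    that we are willing to call "no change"; it is an assumed knob.
    """
    diff = mean(changed) - mean(baseline)
    # Standard error of the difference between two independent sample means.
    se = sqrt(stdev(baseline) ** 2 / len(baseline)
              + stdev(changed) ** 2 / len(changed))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * se
    low, high = diff - half_width, diff + half_width

    if low > 0 or high < 0:
        # The interval excludes zero: a statistically significant change.
        got_worse = (diff > 0) == lower_is_better
        return "regression" if got_worse else "improvement"
    if half_width <= tolerance * mean(baseline):
        # The interval includes zero and is narrow: call it the same.
        return "same"
    # The interval includes zero but is wide: run more iterations.
    return "unactionable"


base = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9]      # hypothetical seconds
slower = [11.0, 11.1, 10.9, 11.0, 11.1, 10.9]
noisy = [10.5, 8.0, 12.5, 9.0, 11.0, 10.0]

print(classify(base, slower))  # clearly separated samples
print(classify(base, noisy))   # too much variation to tell
```

The lower_is_better flag matters for metrics such as "iterations per second", where a larger number is an improvement rather than a regression.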

Further Reading

For more information on confidence intervals, refer to your favorite statistics book. A good book that covers confidence intervals, as well as a lot more about performance analysis, is The Art of Computer Systems Performance Analysis by Raj Jain.

Thank you for your time,

~Pete Steijn

Software Development Engineer in Test

VC++ Code Generation Team – Performance

Good job Steijn, numbers speak for themselves and VS2010 compilation times are much reduced, especially on the x64 architecture (16% faster code with /GL and PGO). I will run some tests to see MSBuild in action.

What about compiling something like Boost as part of your throughput tests?

I obviously haven't seen either the Windows or SQL source code, but I'm guessing that it's not exactly heavy on template instantiation and metaprogramming. Wouldn't it be relevant to test compiler throughput with more varied workloads?

I am very interested in the topic of ensuring that the performance of my program is not getting worse when new features or validation are added.

I can recognize the compromise of quality vs. speed. But I see no explanation of how you actually set up the testing system.

Right now I measure CPU time of the TestRunners I have and then use CruiseControl.net to maintain testrunner results. Once a month I compare the current test results with the previous month's, and look for interesting changes.

There are several problems with this method:

– The computer running the tests might be doing different things (virtual machine), or running at different clock frequencies. Running the same test twice gives different timing results.

– Intensive integration tests are usually very good at showing deviations. But at the same time integration tests cover too much, so it is difficult to see whether a small change is interesting.

– As the number of features increases, the number of tests increases as well, and the manual task of comparing test results becomes more difficult.

– When comparing tests, one needs to be aware of the code changes since the last test result. Some of the slowdowns are acceptable because of the quality they introduce.

– When I have finally found a test where the changes are actually interesting, I need to find the cause. Then I need to compile the before version and the after version, and use AQTime to compare the profiler results and see which functions have gotten worse.

It is very time consuming to ensure the performance of the program, because there are many false positives. Usually the result is that it is the customers who are reporting the performance issues.

I would like to hear more about the experience you have in this area, and what tools you are using to guarantee that performance has not become worse because of a code check-in.

I would be interested in your results. Please make sure to keep the distinction between compilation time and generated code quality. The latter is where we claim the 16% improvement in VS 2010 "/O2 /GL /link /LTCG:PGO" versus VS 2008 "/O2".

@jalf

Templates, and thus template metaprogramming, are handled entirely by the compiler front end, so they are not seen at all by the back-end code generation team. That being said, we work closely with our partner front end team to ensure the performance of the compiler as a whole.

We are constantly investing in expanding our performance testing. Ironically, we helped the front end team bring online a few precompiled header and template throughput tests this week.

@Rolf

We use pools of machines with the exact same hardware specification and a custom script to fetch workloads to run. These machines are brought up from scratch to make sure that they have the absolute minimum on them.

Because of the number of results that we put out (>1000 per run due to a large matrix of optimization switches and benchmarks), it is very important to minimize false positives.

You are quite right that different machines will give very different performance results; even machines with the exact same hardware configuration can vary by 0.5-1.0% in absolute performance. Even the same machine's performance can vary over time due to reboots, Windows Update, etc. To solve this issue we always re-run our baseline at the same time and on the same machine as the benchmark we are executing.

In terms of viewing results, we have written an interface that presents the performance results in a set of tabs and tables. We have over 1000 benchmark results to sift through for each run that we do, so it is important that finding the actual differences is efficient. From this interface our devs can quickly drill down into the individual iteration results for a benchmark to determine whether the results are clustered; if not, this indicates machine noise causing a false positive.

I have a question about MSBuild automatic build dependencies. For example, if project A depends on project B (via a project reference), then when I build project A, project B will be built automatically by MSBuild. But there are times when I just want to build project A.

For big projects, no one cares about compilation speed, because people use Incredibuild. Link times are what kill me. These are presumably memory and disk access bound and often can't be parallelised. Yes, part of the onus should lie with users structuring their projects well, but I also don't want to have to wait 15 minutes for a release exe.

First, do you have tests set up to compare build performance to other compilers? In particular, Clang seems to offer some pretty impressive build times. Is that something you're keeping an eye on/consider worth competing with?

And second, I saw a report recently that link-time code gen was several times slower than simply including all the code in a single translation unit. At a high level the result is the same, so why the big performance difference?

– the VC++ front end can parse multiple source files in parallel with the /MP switch (MSDN Link) (Blog Link)

– the linker writes to the binary and the PDB in two separate threads (this is on by default in VS 2010, undocumented feature)

Multi-threading the compiler has the same issues as multi-threading any large real-world scientific computing application.

@Al the Pal

Currently, if you compile with LTCG for the improved code quality, there is no incremental link. For LTCG builds (read: most likely your release configuration) we do all code generation at link time, which is most likely why you are seeing such a long build time. This is primarily CPU-bound, and definitely affected by the throughput of our compiler back-end.

@jalf

1) We test build performance on large real world code projects. Since many of our internal projects only build with VC++, we cannot compare build performance against our competitors on these projects.

2) Can you provide a link to the report or a repro case of this behavior? 'psteijn .at. microsoft .dot. com' or log the issue through http://connect.microsoft.com/

@davie.jiang If there is no change in project B, the time spent building project B is minimal. On the other hand, there is a "build only this project" feature available in the IDE. You can go to Solution Explorer, right click on ProjectA, and choose "Project Only" -> Build only ProjectA.

This is the C++ intellisense parsing engine. It runs out of proc from Visual Studio and should mainly be busy while first building up intellisense for your project (take a look at the status bar) and then incrementally as you edit.

@Fast C++

Thanks for the feedback on build times. It would be great if you could share a project for us to look at, as we're constantly looking for new performance tests that exercise poor behavior.