Whole Program Optimization with Visual C++ .NET

Visual C++ .NET goes to an entirely new level with Whole Program Optimization. This article discusses everything that the compiler can do using this new framework for optimization and how little the developer must do.


Introduction

The charter of compiler optimization has always been to
produce the fastest running programs possible. Developers trying to write
performance-tuned programs are in a continuous endurance trial of writing code
in ways that lead to optimization opportunities for the compiler. Historically,
compilers have introduced scalar optimizations that work only on isolated
pieces of a program, usually only inside functions. Visual C++ .NET goes
to an entirely new level with Whole
Program Optimization. This article discusses everything that the compiler
can do using this new framework for optimization and how little the developer
must do.

Before Visual C++ .NET

The first things that a person typically learns in a C++
course are using the compiler to compile code and understanding that the compiler
is responsible for creating object files for each source file. The linker then brings
all object files together into something useful, such as an executable or a DLL.
From the beginning, the compiler is at a disadvantage: it can only see a small
piece of the program at any point in time. Unable to see the other pieces, the
compiler must take a conservative approach, which results in slower programs. A
classic example of this is calling conventions.

The interfaces to each module of a program need to remain
consistent. Common calling conventions are cdecl,
stdcall, and fastcall. Mixing and matching calling conventions inside a program
was possible, but it required the developer to annotate the function signature
with keywords. The developer was not necessarily the best person to decide
which calling convention was best for each function. The
compiler could not really make this decision either, however, without breaking
the interface to other modules. There are many similar examples where programs
could be improved if the compiler had access to the whole program. For example,
inlining could only happen inside individual object files. This program
generates two unnecessary functions:

Both the functions set_i
and print_i are inline
candidates. Unfortunately, when the compiler is working on main.cpp, it does
not have access to the implementation in myclass.cpp. Developers can work around
this by putting inline candidates in the header file, but it is better coding
practice to leave the header file free of implementation details. In addition,
not every user-declared inline candidate should be inlined, and conversely,
some functions not marked with the inline keyword are great inline candidates.
Again, the compiler is always at a disadvantage because it does not
have access to the entire program.

With Visual C++ .NET

Link time code generation (LTCG), the Visual C++ .NET
framework that makes whole program optimization possible, mitigates the
difficulty a compiler has in performing optimizations. As the name implies,
code generation does not occur until the linking stage. The steps that the
compiler uses during an LTCG build can be summarized as follows:

1. The compiler takes each source file and does the usual parsing and type
checking. It then generates an intermediate representation of the source file
and hands that off to the optimizer and the code generator.

2. Instead of optimizing the intermediate representation, as it would normally
do without LTCG, the compiler puts the intermediate representation into an
object file. Note that the compiler does essentially nothing to the code at
this point. Instead of containing assembly language, the object file holds a
higher-level view of the program.

3. The linker now starts as usual, trying to pull all the object files together
to form a program. Because the object files do not contain assembly code, the
linker must invoke the compiler to finish the job of compiling the code. The
linker has the compiler optimize and generate code for one function at a time. The
compiler can ask the linker for information about other parts of the program
and thus make informed decisions rather than always assuming the worst case.

The linking stage will take longer than usual, but the
compiling stage will be much faster. Also note that the object files produced
by the compiler through LTCG are not as portable as object files that contain
assembly code. The intermediate representation stored in LTCG object files is
likely to change with each version of Visual C++, so these object files would
need to be regenerated every time that the compiler is upgraded. This situation
only presents itself if the developer is trying to produce a .lib file. For
that reason, unless the plan is to regenerate a new library for each future version
of Visual C++, publicly distributing static-link libraries using LTCG for the object
files is not recommended. Another consequence of including intermediate
representations of the code in the object files, rather than assembly code, is
that tools such as dumpbin.exe and editbin.exe do not work.

Optimizations
Available to Whole Program Optimization

Cross-module inlining

As the previous example showed,
cross-module inlining is perhaps the best reason to use whole program
optimization. Instead of placing implementation details in header files,
developers can now keep things neatly organized in an appropriate source file. It
is not necessary to mark functions with the inline keyword, because the
compiler can determine if it is beneficial to inline that function. This will
happen when using the /Ob2 switch, which is implied by both /O1 and /O2. Sometimes,
the release build in Visual Studio .NET will include /Ob1 on the command line;
to enable cross-module inlining, do not include /Ob1, which only allows user-declared
inline candidates to be inlined.

Cross-module bottom-up
information

Often, an individual optimization the compiler could perform is completely
safe, but the compiler's information about the program is too conservative, so
it opts not to perform the optimization in order to stay correct. The compiler
always generates information from the bottom of the call-tree up. With whole
program optimization, the scope of this information includes the entire
program: information collected about each function's register usage, memory
usage, and data used to improve inlining heuristics. With accurate information,
the compiler does not need to make pessimistic decisions about whether to apply
a given optimization.

Region based stack
double alignment

Just as integers and pointers
should be 4-byte aligned, doubles should be 8-byte aligned. By default, the
stack in Win32 is 4-byte aligned. Misaligning data types results in significant
performance loss. Without whole program optimization, the compiler has to generate
code to dynamically align doubles on a per-function basis. Doing this is a
challenge; the compiler cannot assume the position of the current stack frame. With
whole program optimization, the compiler knows much more about the call-tree,
and therefore, it can align the stack frame in a root function and keep things
aligned through nested calls. Individual functions are then spared the cost of
computing the alignment of their own stack frames.

Custom calling
convention

As previously mentioned, a single
calling convention is not the best for every function. For example, functions
passing only a few small arguments benefit greatly from fastcall, but using fastcall
also strains the optimizer. The compiler is certainly a better judge of when to
use a particular calling convention. With whole program optimization, the
compiler knows about all the call sites for a particular function. This lets
the compiler customize the calling convention. For example, function arguments
could be passed through an available register rather than on the stack. Functions
that are exposed outside the program, as would happen in a DLL, will
necessarily retain their default calling convention.

Improved memory
disambiguation for non-address taken globals

Before whole program optimization,
the compiler had a hard time optimizing global variables. Optimizing them is
worthwhile because global variables live in memory and are highly susceptible
to cache misses. Unfortunately, because it does not have access to the whole program,
the compiler often must assume that global variables can be written to through
an assignment to a pointer. With whole program optimization, the compiler and
the linker can determine with much better accuracy whether the address of a global
variable is ever taken, so the compiler knows exactly which pointers to the
global variable can exist. If no pointer can modify the variable, it can be
treated more like a local variable and opened up to standard code optimizations.

Small TLS offset encoding

The x86 instruction set uses
smaller instruction encodings when an offset is within 128 bytes of a pointer. When
organizing the layout of variables in thread-local storage, it is better to
place frequently used variables in the first 128 bytes of storage. The linker
is the utility that organizes the layout for thread-local variables. Determining
which variables are more frequently used requires knowing about the whole
program. Knowing the position of the variables in thread-local storage allows
the compiler to use a smaller instruction encoding for the variable offset. If
a program is heavily threaded, whole program optimization could dramatically
reduce the image size.

Using Whole Program
Optimization

Fortunately, developers need to do very little to enable
whole program optimization. On the command-line, adding the /GL switch is all
that is needed. When the /c switch is used to separate the compiling and
linking stage, the linker will need the /LTCG switch when any object files were
compiled with the /GL switch. In the Visual Studio integrated development
environment, enable whole program optimization by setting the Whole Program
Optimization property on the General property page of the project's
configuration properties.
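The switches above can be sketched on the command line as follows (the source file names are hypothetical):

```shell
rem Single-step build: /GL makes the compiler emit IR object files,
rem and the linker applies link time code generation automatically.
cl /O2 /GL main.cpp myclass.cpp

rem Separate compile and link (/c): the linker must be given /LTCG
rem explicitly, because the object files contain IR, not machine code.
cl /c /O2 /GL main.cpp myclass.cpp
link /LTCG /OUT:app.exe main.obj myclass.obj
```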

Using whole program optimization restricts the ability to
use other features of Visual C++ .NET. When compiling with the /GL switch, edit
and continue (/ZI), automatic precompiled headers (/YX), and targeting the .NET
common language runtime (/clr) are not available.

In real-world code, whole program optimizations have boosted
performance as much as 10% to 15%. Of course, this can vary; some programs will
benefit more than others. On x86 architectures, 3% to 5% improvement is common.

Common Questions About
Whole Program Optimization

Can whole program
optimization be used on some files, but not others?

Yes. Each source file that is
compiled with the /GL switch produces an object file that will use whole
program optimization. If an object file is not compiled with /GL, it will
contain optimized assembly code using the traditional approach to compiling.
Mixing object files built with and without /GL does not have any known issues.

Can I generate
assembly files? What do they look like?

Assembly files (.asm) can be
generated with LTCG, but because code generation does not happen until link
time, the assembly files are not produced until link time either. The .asm
files produced with LTCG look just like those produced without LTCG, but they
cannot be consumed by MASM.

What does this do to
overall build time?

Overall build time does not change
significantly. The time saved in the compiling stage is shifted to the
linking stage, which now includes optimization and assembly code generation.

Conclusion

Link time code generation is a framework that enables whole
program optimization. For developers, this means that the Visual C++ team is continuously
examining even more ways to improve code through this framework. At the moment,
whole program optimizations in Visual C++ .NET provide a significant advance
toward making C and C++ programs the best that they can be.

The
information contained in this document represents the current view of Microsoft
Corporation on the issues discussed as of the date of publication. Because
Microsoft must respond to changing market conditions, it should not be
interpreted to be a commitment on the part of Microsoft, and Microsoft cannot
guarantee the accuracy of any information presented after the date of
publication.

This
White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES,
EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying
with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be
reproduced, stored in or introduced into a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopying, recording, or
otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.

Microsoft
may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except
as expressly provided in any written license agreement from Microsoft, the
furnishing of this document does not give you any license to these patents,
trademarks, copyrights, or other intellectual property.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

We originally had the flag turned on to gain performance.
However, when we made our application mixed-mode, we had to turn it off (as the article says, this flag doesn't work when targeting the .NET runtime with /clr).

This was for the project for the application's main executable. If we have the flag turned on for all of the other projects in the solution, will we benefit much?

What I meant to say, is that this article (http://msdn.microsoft.com/msdnmag/issues/05/01/COptimizations/) says that whole program optimisations should now work in VS 2005. And obviously, if we've got the feature available to improve performance then we'd like to use it.

I think the compiler's confused about how pmField is defined in the different headers, but I'm not sure how to cure this. Has anyone else come across this?

Is there Microsoft documentation that explains in sufficient detail how to read, interpret, and use the information in the map file? Specifically, how much memory is being used by (a) different parts of the code in the program, and (b) statically allocated data. From a code optimization viewpoint, it helps to know where resources are being used (or misused).

Hi there,
Unfortunately the IR has no specs -- that's because it changes on a nearly daily basis. This is why creating .lib files is not recommended: the .obj files in the static link library will only work with the compiler that produced them. They have nowhere near the portability of .obj files that contain assembly.

There is some excellent research being done in this area as well. Take a look at the open source LLVM compiler: http://llvm.cs.uiuc.edu. You will find LLVM's IR much more stable and very easy to work with.

Does this change mean that it would also be possible to put template source code in a source (CPP) file and have the linker generate the code?

It would be really great if template source code could be put in a source file, as implementation details would not have to be exposed in the header in order to generate the final code.

In fact, if we have a lot of template code, it could even help the compiler perform better, as it would do the actual instantiation once instead of once for each source file... and have fewer lines of code to read and process at compilation time.

Unfortunately, it does not mean template code can be put into the CPP files. Believe me -- I feel your pain. The problem is that the template specialization is done along with the parsing and type checking. All the template specialization happens before the code generator and optimizer even has a look at your program.

Fortunately, the only problem this creates is having to put implementation details in the header. Indeed, there is a slightly longer parsing time, but the parser, type checker, and template specialization is pretty darned fast (especially compared to everything else the optimizer does).

And luckily, whole program optimization might be able to speed up the code generation of template code. Without LTCG, the compiler emits each template specialization into a communal (COMDAT) section, so if two object files contain the same specialization, only one copy makes it into the final image produced by the linker. The concept is similar for whole program optimization -- specializations still go into communal sections, and now the compiler knows it only has to generate code for each one once.

This sounds great, I am looking forward to trying it out. I have one question, though. In a large project I am working on, I make an "on-the-fly assembled" BitBlt function by writing code bytes into data memory and then calling a pointer to them. I do this because this function needs to fit a lot of different situations and optimizing for every scenario imaginable would make an enormous amount of code.
I know that this is a hack and to prevent failure, I always push/pop everything and I don't call anything from within my dynamic function.
So my question is how the new compiler/linker system would respond to this kind of code. I don't know much about compilers, but I assume that it can tell that I am calling code in the data segment. It works perfectly with VS60, but I don't know what happens with the optimization around the call. To encapsulate it, I placed my call in a small C++ function, which I assumed would satisfy the VS60 optimizer.