Building Enterprise Applications with Sun Studio Profile Feedback

Large, CPU-intensive applications may perform better when built with profile feedback. Profile feedback optimization requires that the application be built twice: once to collect the profile data, and again to use that profile to generate optimal code. This requirement may deter many software vendors from building their applications with profile feedback, because generating a profile for each software release can be impractical. However, it is possible to use old profiles to minimize the overhead of profile feedback builds in a development environment, without compromising the advantages of feedback-directed optimization.

This article introduces all the stages of profile feedback with examples, and offers some tips for making profile feedback builds feasible.

0. Introduction

In general, compilers generate object code based on pre-defined heuristics and the optimization flags supplied during compilation. However, since compilers cannot predict the dynamic behavior of the code, they have to rely on heuristics for the best possible guess; hence the generated code may or may not perform well with typical workloads.

Processor stalls are one of the problems that can occur with large applications containing huge numbers of instructions. Since the processor cannot hold all the instructions on chip at any given time, it has to wait while some instructions are fetched from memory. So it is up to the developer to lay out the high-level instructions carefully to reduce processor stalls and improve performance. As developers are usually not the end users, it is a cumbersome exercise for them to gather the application's run-time data, identify the hot code (where the application spends most of its time), and re-write or re-arrange blocks of code to improve run-time performance. Programmers can be relieved of such tasks by using the feedback-based optimization technique supported by the Sun Studio compilers. When the run-time behavior of the code is available in the form of a profile, the compiler can lay out the object code so that the on-chip (Level-1, or L1) cache and memory are used efficiently at run time.

Note that the Sun Studio C, C++, and Fortran compilers can all generate optimal code using profile feedback data. Even though the examples in this article are written in C, the methodology is the same for all applications, regardless of the high-level language used to develop them. Also, the steps outlined in this article can be used to build any kind of application, not just enterprise applications as the title suggests.

1. Feedback-Based Optimization

In some situations, the desired code improvements may not be achieved directly with the compiler's classical optimization flags. For example, a hot routine may not be auto-inlined by the compiler at optimization level 4 (-O4) or higher if its inclusion violates a threshold heuristic defined in the compiler. In this case, using profile feedback data may help inline the hot code.

Feedback-based optimization (FBO) is the term used to describe any technique that alters a program based on information gathered at run time. It is also widely known as feedback-directed optimization (FDO) and profile feedback optimization (PFO). The idea behind this technique is to supply the compiler with information about the run-time behavior of the program. The compiler instruments the object code, the program is profiled, and the profile data is then used by the compiler to generate optimal code that runs faster.

When the profile data is available, the compiler's front end reads the execution counts of each block from the profile feedback file and attaches them to the program's intermediate representation (IR). This is done up front, before any kind of optimization. Many compiler optimizations subsequently use the execution counts from the IR. Based on the profile data, the compiler can perform optimizations of the following types:

Code layout: Arrange code so that the frequently executed code in a routine is grouped together. The goal is to reduce instruction cache (I$) misses and to improve instruction fetch by using profile information to guide the layout of code in memory. The article Improving Code Layout Can Improve Application Performance explains code reordering using profile feedback.

Inlining: Inline routines that are frequently called. Rarely executed functions may not be inlined even if they are eligible for auto-inlining. Inlining eliminates the cost of the call to the routine, and exposes further opportunities for optimization.

Global instruction scheduling: Group instructions with no dependencies together to avoid processor stalls in the pipeline.

Delay slot scheduling: On processors with branch/call delay slots, such as SPARC, the instruction in the slot immediately following a branch is executed as if it were located before the branch takes effect. With profile data, the scheduler can reduce branch penalties by filling delay slots with instructions from the more frequently executed blocks.

Branch prediction: Using profile data, the compiler can minimize pipeline stalls on processors that support static branch prediction (such as SPARC) by setting the branch prediction bit in the opcodes of conditional branch instructions.

Typical steps involved in using the profile feedback mechanism are as follows:

Build the application with the -xprofile=collect compiler option. In this step, the object code is instrumented to gather profile data; i.e., counters are inserted into the object code to record the number of times each piece of code is executed. Instrumented objects can also be referred to as profiled objects. Instrumented code runs slower than non-instrumented code, so use instrumented code only to collect profile data.

When the instrumented binaries are run, the application may appear unchanged to the end user, but profile data is collected as a side effect of execution. This data is used by the compiler in the use phase of FBO to generate highly optimized binaries.

FBO requires the code to be compiled at optimization level 2 or above. If no optimization level is specified on the compile line, the compiler uses level 2 optimization (i.e., -O2) by default. The compiler may suppress certain optimizations under the -xprofile=collect option in order to record accurate information about the run-time behavior of the code. However, it is recommended to specify exactly the same compiler flags, except for the value of -xprofile, in both phases of feedback-based optimization.

The following example shows the steps involved in generating instrumented binaries with the Sun Studio C compiler.

To enable profile data collection, compile this code with the -xprofile=collect and -xO2 options.

% cc -o bubblesort -xO2 -xprofile=collect bubblesort.c

Use -xprofile_ircache[=path] with the -xprofile=collect|use option to improve compilation time during the use phase by reusing compilation data (intermediate representation, or IR) saved from the collect phase. Be aware that the saved data can increase disk space requirements considerably.
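For example, a collect-and-use pair using an IR cache might look like this (the cache path /tmp/ircache is hypothetical):

```
% cc -o bubblesort -xO2 -xprofile=collect -xprofile_ircache=/tmp/ircache bubblesort.c
    (training run goes here)
% cc -o bubblesort -xO2 -xprofile=use:bubblesort -xprofile_ircache=/tmp/ircache bubblesort.c
```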

Run the instrumented binary (that is, the binary compiled with -xprofile=collect) with one or more representative workloads. If the workload is representative, then the branches that are normally taken in the training run are the branches normally taken in the real workload.

In general, if you run your program with only a single input file, then you can just run that input file and you will have collected good profile data. However, if you are creating a general-purpose application whose varied inputs cause execution of different parts of your program, you should choose several kinds of representative sample inputs. Using only certain kinds of input will bias the compiler toward favoring the executed paths of the program over the non-executed paths. So it is important to find one training workload, or a combination of them, that gives the best possible results in almost all scenarios.

In this phase, the compiler-instrumented code collects the branch frequencies for all branches, and the counts for all basic blocks. As a side effect of the execution, a directory named after the program, with a .profile extension, is created. The feedbin file under the <program>.profile directory holds the execution frequencies of the various blocks, for later use by the optimizer when the source code is compiled again with the -xprofile=use option. The feedbin file can be referred to as the profile feedback file.

The profile data collection is additive. That is, if you run the profiled executable more than once, with similar or different inputs, the data from the most recent run is added to the data collected from previous runs. Therefore, the profile data will be an aggregate of all your runs of the profiled executable.

But you do need to exercise caution here. If you have profile data from earlier training runs, and you recompile the program with -xprofile=collect and re-run it, the compiler-instrumented code that writes out the profile data will detect it as a different program, and overwrite the old data.

By default, the <program>.profile directory is created in the directory from which the executable is run. If you wish to change the directory in which the profile data resides, you can use the SUN_PROFDATA_DIR environment variable, as shown in the following example.
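A minimal sketch (the /tmp/profdata path and the bubblesort binary are assumptions carried over from the earlier example):

```shell
# Redirect the profile directory to /tmp/profdata instead of the
# current working directory.
SUN_PROFDATA_DIR=/tmp/profdata
export SUN_PROFDATA_DIR
mkdir -p "$SUN_PROFDATA_DIR"

# Running the instrumented binary now writes its profile there:
#   ./bubblesort
# so the feedback data ends up under:
echo "$SUN_PROFDATA_DIR/bubblesort.profile"
```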

By default, the profiler thread creates one profile feedback file (i.e., feedbin) for each profiled executable. The default behavior is good enough for small programs or applications with very few executables. However, for large applications with tens of executables, having many profile feedback files poses a slight inconvenience in the use phase, where each feedback file must be specified on the compile line with the -xprofile=use:<path_to_profdir> option to produce optimal binaries.

For example, if the application consists of twenty executables, we need twenty -xprofile=use flags on the compile line, as shown below:
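The compile line would carry one -xprofile=use flag per profile directory; the directory names below are hypothetical:

```
% cc -xO2 -c common.c \
      -xprofile=use:/profiles/app1.profile \
      -xprofile=use:/profiles/app2.profile \
      ...
      -xprofile=use:/profiles/app20.profile
```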

If the make file grabs all compiler options from environment variables like CFLAGS, it may not be possible to specify all instances of -xprofile=use in a single CFLAGS, due to the underlying shell's restrictions on the number of characters per variable.

The compile line may also become too long and unwieldy with so many instances of -xprofile=use.

To get around these inconveniences, it is recommended to use the compiler-supported environment variables SUN_PROFDATA_DIR and SUN_PROFDATA in the profile data collection phase, to request that the profiler write all the profile data from the different profiled processes into a single feedbin file, instead of creating one per executable. If these environment variables are set, the profiler writes the profile data into the file pointed to by SUN_PROFDATA, under the directory SUN_PROFDATA_DIR. That is, the profile data from all processes is written into $SUN_PROFDATA_DIR/$SUN_PROFDATA.

The following trivial example illustrates the behavior with the SUN_PROFDATA and SUN_PROFDATA_DIR environment variables.
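A sketch of the consolidated setup (the paths are hypothetical; the executables are assumed to have been built with -xprofile=collect):

```shell
# Ask the profiler to write all profile data into a single location.
SUN_PROFDATA_DIR=/tmp/consolidate
SUN_PROFDATA=singlefeedbin.profile
export SUN_PROFDATA_DIR SUN_PROFDATA
mkdir -p "$SUN_PROFDATA_DIR"

# Each instrumented process run now contributes to the same profile:
#   ./mtserver
#   ./bubblesort
# and all of the data lands under:
echo "$SUN_PROFDATA_DIR/$SUN_PROFDATA"
```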

At run time, the profiler thread reads the values of SUN_PROFDATA and SUN_PROFDATA_DIR and writes all profile feedback data from the different profiled processes into a single feedbin file under the /tmp/consolidate/singlefeedbin.profile directory.

Note, however, that writing to a single profile feedback file helps only when several instrumented objects serve as dependencies for several profiled processes. The purpose of the above example is only to show how to request that the profiler write the profile data into a single feedback file.

1.2.2 Asynchronous profile data collection

By default, profile data collection is synchronous. The profiler thread waits for shared library finalization (if any), and for the process to call exit(), before writing all the profile data to the feedback file. In effect, this requires that the process exit in order to get the profile data. As a result, multi-threaded applications may experience some profile data loss due to race conditions among multiple threads. Also, there is no guarantee that all applications, especially multi-threaded applications, are designed to terminate gracefully. If a profiled process does not call exit() but is terminated in some other way, for example with SIGKILL, it is unlikely that a usable profile can be obtained from that process. If the profiled process dynamically loads and unloads other libraries with the dlopen() and dlclose() library calls, indirect call profiling comes into play, with its own share of problems in collecting the profile data.

To alleviate the problems described above, we need a mechanism to collect the profile data from a running process without requiring it to terminate gracefully. An asynchronous profile data collection feature was added in the Sun Studio 11 compiler release, and was then backported to the Sun Studio 9 and 10 releases. Applying patch 115983-06 (or later) to Studio 9, or 117832-06 (or later) to Studio 10, gives you the ability to control the way the profile data is collected. As a result, the chances of getting a good profile from single- or multi-threaded applications are high, irrespective of how the profiled processes exit.

1.2.2.1 Enabling asynchronous profile data collection

Asynchronous profile collection is not enabled by default. To enable it, set the SUN_PROFDATA_ASYNC_INTERVAL environment variable before running the application. If SUN_PROFDATA_ASYNC_INTERVAL has been set to a positive integer value n at the startup of an application, the profiler thread collects a periodic profile snapshot every n seconds, and subsequently updates the corresponding feedbin file. That is, n is the time interval, in seconds, between periodic profile snapshots.

When data for a snapshot is collected, the profiler updates a single profile directory whose name is of the form: <procname>.<hostname>.<pid>[.profile]

where:
<procname> is the name of the process being profiled
<hostname> is the host name of the machine executing the profiled process
<pid> is the process id of the profiled process

.profile will be appended to the name of the profile directory unless a <dir_name> is specified using the value of the environment variable SUN_PROFDATA.

Note that the profiler thread collects profile snapshots only for the process in which it was initiated. Forked processes will not inherit the profiler thread.

The collected profile data can be used in the use phase of profile feedback by specifying the compiler option -xprofile=use:<procname>.<hostname>.<pid>. The profile directory can be renamed as you wish before specifying it in the -xprofile=use option.

1.2.2.2 Multiple profile snapshots per process

Asynchronous profile collection also enables the collection of profile data more than once per process. If the environment variable SUN_PROFDATA_ASYNC_SEQUENCE is defined and set to an integer value num_snapshots ≥ 1, the profiler generates a sequence of distinct profile snapshots whose names are of the form: <procname>.<hostname>.<pid>.<n>[.profile]

where:
<n> is a positive integer in the range [1..num_snapshots].

Subsequent profile snapshots update the <procname>.<hostname>.<pid>[.profile] directory for the remaining lifetime of the process.

The time sequence of profile snapshots generated by setting SUN_PROFDATA_ASYNC_SEQUENCE can be used to determine how long profile data should be collected from a given application in order to obtain good performance with -xprofile=use.

Here's an example:

Let's assume that the program mtserver is compiled with -xprofile=collect. The asynchronous profile data collection can be done as follows:
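A sketch of the environment setup; the values and paths come from this example (mtserver itself is assumed to be the instrumented binary):

```shell
# Asynchronous collection: one snapshot every 30 seconds, with the first
# three snapshots kept in their own numbered profile directories.
SUN_PROFDATA_DIR=/tmp/profile
SUN_PROFDATA_ASYNC_INTERVAL=30
SUN_PROFDATA_ASYNC_SEQUENCE=3
export SUN_PROFDATA_DIR SUN_PROFDATA_ASYNC_INTERVAL SUN_PROFDATA_ASYNC_SEQUENCE

# Start the instrumented server (pid 8529 in this example):
#   ./mtserver &
```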

This example collects a snapshot of profile data from process 8529 every 30 seconds, for as long as it runs. The first 3 snapshots are saved in their own profile directories: /tmp/profile/mtserver.v890appserv.8529.1.profile, /tmp/profile/mtserver.v890appserv.8529.2.profile and /tmp/profile/mtserver.v890appserv.8529.3.profile. Subsequent snapshots then update the feedback directory /tmp/profile/mtserver.v890appserv.8529.profile.

To get any warning messages during profile data collection, define the environment variable SUN_PROFDATA_VERBOSE. For multi-threaded programs, observe that the thread count increases by one when the program is compiled with -xprofile=collect. The extra thread that you didn't create is the profiler thread; the compiler adds the necessary code to create this thread as part of its instrumentation.

1.3. Re-build the application with profile feedback

Once you have gathered the profile data from the profiled process, feed it to the compiler with the flag -xprofile=use:<path_to_profdir>. The compiler uses this data to do a better job of optimizing the application code. Make sure to give the profile data directory; if you use only -xprofile=use, the compiler does not know what the profile data directory is called, and therefore looks for a.out.profile by default. Note that it is not necessary to add .profile when specifying the profile data directory name in -xprofile=use. In the bubble sort example, it is valid to specify either -xprofile=use:bubblesort.profile or -xprofile=use:bubblesort on the compile line.
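Continuing the bubble sort example, the use-phase compile line might look like:

```
% cc -o bubblesort -xO2 -xprofile=use:bubblesort bubblesort.c
```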

Except for the -xprofile option, which changes from -xprofile=collect to -xprofile=use, the source files and other compiler options must be exactly the same as those used for the compilation of the profiled objects. The same version of the compiler must be used for both the collect and use builds.

If both -xprofile=collect and -xprofile=use are specified on the same compile line, the rightmost -xprofile option is applied.

If you are compiling the object file with -xprofile=use in a directory that is different from the directory in which the object file was previously compiled with -xprofile=collect, make sure to add the -xprofile_pathmap=<collect_prefix>:<use_prefix> option on the compile line, so the compiler can find the profile data for the object file. <collect_prefix> is the prefix of the pathname of the directory in which the object file was compiled using -xprofile=collect; <use_prefix> is the prefix of the pathname of the directory in which the object file is to be compiled using -xprofile=use. Refer to the C compiler options reference for detailed information about the -xprofile_pathmap compiler option.
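A sketch, assuming the collect build happened under /build/collect and the use build happens under /build/use (both paths hypothetical):

```
% cd /build/use
% cc -xO2 -xprofile=use:mtserver.profile \
     -xprofile_pathmap=/build/collect:/build/use -c a.c
```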

Important Note:
Measure the application performance with profile feedback and compare it with the baseline numbers before you put this into a build environment. Because profile feedback requires compiling the entire application code twice, it is intended to be used only after other debugging and tuning is finished, as one of the last steps before putting the application into production or releasing it to customers.

When the compiler encounters multiple profiles on the compile line, all the profile data is merged before any code transformations based on the profile feedback data are performed.

1.3.2 Extracting execution counts

If you are curious about the compiler code transformations performed based on the profile feedback data, use the following code generator (cg) options to dump the execution count of each basic block in an assembly listing.
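For instance, assuming the cg options are passed through the compiler driver's -Wc flag (verify the pass-through flag for your compiler version and platform), the compile line might look like:

```
% cc -xO2 -xprofile=use:bubblesort -Wc,-assembly -Wc,-Qcg-V -c bubblesort.c
```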

The -assembly option generates a .s file with the same basename and dirname as the object file (e.g., bubblesort.o will be accompanied by bubblesort.s in the same directory). The -Qcg-V option adds more information, as assembler comments, to the generated .s file. If -xprofile=use has been specified, this information includes execution counts derived from the <path_to_profdir>.

The code coverage analysis tool, tcov, can be used to find the execution frequency of blocks and instructions. If the source code is compiled with the -g or -g0 debug options, the Sun Studio er_src utility can be used to read the compiler-inserted commentary about the code transformations.

Please refer to the Sun Studio Performance Analyzer documentation for more detailed information about these tools.

2. Building Patches For An Enterprise Application

There is one frequently asked question when considering the profile feedback mechanism for building applications: is it necessary to go through the entire profile feedback life cycle whenever changes are made to the source code? The simple answer is: no. The following explains a simple way to avoid building the entire application with -xprofile=collect when there aren't many changes in the code base.

If the application is very big and only a few objects have changed, profile only those objects that will be re-built for the patch. However, in order to collect a meaningful profile, there need to be -xprofile=collect versions of all object files comprising a re-built executable or shared library. For example, if the executable mtserver is built by linking the object files a.o and b.o, re-compile those objects with -xprofile=collect, and re-link to build a new copy of mtserver. Then: (i) replace the old binaries in the previously saved collect build with the newly built binaries; (ii) re-run the training run, and collect the profile data for the entire build; (iii) finally, re-compile all object files comprising the binary (executable or library) with -xprofile=use, and re-link to build the actual binary to be shipped to the customer as a patch.

Here's an example:

Assume that a shared library libABC.so was built with profile feedback, by linking the objects A.o, B.o and C.o. If the objects A and B were modified or enhanced later, re-build libABC.so with profile feedback, as outlined below:

Replace libABC.so in the previous full collect build with the newly built libABC.so. The assumption here is that the full collect build of the application that was used for collecting the profile data when building the previous version of the application is still available.

Collect profile data for the entire application with the training run, preferably with the workload used in previous training run(s).

Compile the objects A and B again with the -xprofile=use compiler flag, and with the new profile data from step #4.

Re-link the objects A.o (new), B.o (new) and C.o (old) to build libABC.so. Make sure to specify the -xprofile=use compiler flag on the link line, along with the new profile data from step #4.

Release libABC.so as a patch to the customers.
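The steps above might be sketched as follows; the compile and link lines are illustrative, and the -KPIC and -G flags are the usual Sun Studio options for building shared objects (paths are hypothetical):

```
% cc -xO2 -xprofile=collect -KPIC -c A.c B.c
% cc -G -o libABC.so -xO2 -xprofile=collect A.o B.o C.o
% cp libABC.so /builds/app-collect/lib/      (refresh the saved collect build)

    (re-run the training workload against the full collect build)

% cc -xO2 -xprofile=use:/profiles/app.profile -KPIC -c A.c B.c
% cc -G -o libABC.so -xO2 -xprofile=use:/profiles/app.profile A.o B.o C.o
```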

Repeat the above steps for all binaries (executables or shared libraries) that will be released as a patch. Naturally, step #4 needs to be done only once, even if multiple binaries need to be re-built for release as part of a patch. If several binaries need to be re-built due to the changes in the source code, consider building the whole application with -xprofile=collect, instead of building only those binaries (as explained in the above example) that go into the patch.

In general, it is desirable to collect profile data whenever there are changes in the code base. However, doing so may not be feasible for very large applications built with profile feedback. So it is suggested to skip the profile data collection and use the existing profile data, reducing the overhead to some extent, when the source code changes are limited to very few lines. Be aware that the gains from profile feedback may diminish over time when previously collected profile data is used despite a large number of changes in the code base. So, for optimal performance, collect the profile data again for the whole application when the number of source code changes becomes large enough to release a bigger patch; that is, when distributing a large number of modified binaries.

3. Compiling Modified Source With Old Profile Data

It is important to know how a simple change in the source code affects feedback-based optimization in the presence of old profile data. Assume that a program was linked with a library, libstrimpl.so, that implements string comparison, __strcmp, and string length calculation, __strlen.

The library was extended with a new routine for string reversal, __strreverse, for its next release. Let's see what happens if we skip the profile data collection for this library after integrating the code for the __strreverse routine. Since the programmer may not care much about the organization of independent routines within the source file, the new routine can be placed anywhere (top, middle, or end) in the source file.

Case 1: The routine was added at the bottom of the file, i.e., after all existing routines

If you do not want to collect profile data for the new code, appending the new code at the bottom of the source file is the recommended way. By doing so, the existing profile data remains consistent and can be used by the compiler to optimize the untouched (existing) code, as before. Since there is no profile feedback data available for the new routine, the compiler simply performs its other optimizations, as it usually does without the -xprofile compiler option.

Case 2: The routine was added somewhere in the middle of the source file

The compiler reads the line numbers of the blocks and their execution counts from the feedback (feedbin) file. As a result, introducing new code in a routine makes that routine's profile data inconsistent. Also, since the position of all other routines underneath the newly introduced code may change, their profile data becomes inconsistent as well. Hence the compiler ignores the profile data of such routines to avoid introducing functional errors.

The same explanation holds true even when the new code is added at the top of the source file, above all existing routines. Such an action leaves all the profile data for this object in an unusable (inconsistent) state. Observe the warnings in the following example for a clear understanding.

The bottom line: if the plan is to skip profile data collection in favor of using old profile data from previous training run(s), always add new code at the bottom of the source file (unless it needs to be placed elsewhere to avoid compilation errors), to keep the data consistent for at least the majority of the existing code.

4. Other Compiler Options That Could Use Profile Data

The compiler option -xipo performs crossfile optimization, that is, optimizations that extend across multiple source files. One example of this kind of optimization is inlining a routine from one source file into code from another source file. In the presence of profile feedback, the compiler has a much better model of the set of routines that are worth inlining.

The -xlinkopt option causes the compiler to perform link-time optimization. This final phase of compilation uses all the knowledge of the generated code to do some final tweaking of the code layout. This is useful for large codes where performance can be gained by keeping all the frequently executed code together.
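Both options can be combined with profile feedback on the same compile and link lines; a hedged sketch, with file names that are illustrative only:

```
% cc -o mtserver -xO2 -xipo -xlinkopt -xprofile=use:mtserver.profile a.c b.c
```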

5. Profile Data Portability Across Different Platforms

In order to reduce the build-time overhead of profile feedback, it would be desirable to use the profile data collected on one platform when building the application on other platforms, provided the application code is portable. However, at the time of this writing, profile data collected with Sun Studio compilers on SPARC platforms is not compatible with profile data collected on x86/x64 platforms. That is, profile data collected on one platform cannot be used on another platform.

6. Alternatives To Feedback-Based Optimization

Sun introduced a static optimizer, binopt, as part of the Sun Studio 11 compiler suite. binopt works directly on binaries. If feedback-based optimization is either not feasible, or didn't help much due to non-representative workloads used in the training run(s), binopt can be used as an alternative to improve the performance of the application.

Acknowledgements

The techniques described in this article are derived from earlier work done by Vinod Grover and Chris Aoki, and the author wishes to acknowledge their input.

About The Author

Giri Mandalika is a software engineer in Sun Microsystems' Market Development Engineering group, working with independent software vendors to make sure their products run well on Sun platforms. He holds a Master's degree in Computer Science from The University of Texas at Dallas.