Back to the Basics: Compiler optimization for smaller, faster embedded application code

In spite of the increased speed of microprocessors, performance has not become irrelevant for a number of reasons. First, the average case response time of a real-time system is irrelevant. It is the worst-case response time that guards against a dropped call, a misprint, or other error conditions in one’s product. In other words, performance is necessary to minimize the response time of one’s system, thereby achieving product reliability.

Second, increased performance allows more features to be implemented without compromising the integrity of the system. Conversely, performance may be used to select a cheaper microprocessor or one that consumes less power.

There are many factors that determine the performance of a system. The choice of hardware can mean the difference between a few MIPS and a few hundred. But good data structures and algorithms are also essential, and bookshelves have been written on this topic. There are simple guidelines that can be followed to make sure that one’s code can be readily optimized. But a good compiler is essential, one that is extensible to allow the compiler to evolve to ever changing machine architectures, improvements in optimization techniques, and newer languages.

In order to allow for the maximum flexibility, today’s compilers consist of a pipeline of stages. Each stage operates independently of each other, allowing for newer stages to be implemented without affecting the rest of the stages in the pipeline.

The first stage is a language-specific front end that parses the source code and converts it to a language independent intermediary representation (IR). Certain language-specific optimizations are also performed during this stage. New languages may be added by writing a new language front end without needing to modify all of the rest of the components of the compiler system.

The second stage is the global and optimizer. This stage converts the intermediary representation of the program into a functionally equivalent intermediary representation that executes faster or uses less code space. These optimizations are largely independent of the target processor, although they rely on target-specific “backend” hooks that describe aspects of the processor, such as the relative cost of performing certain operations, its condition code architecture, its addressing modes, the characteristics of its register set, etc.

The third stage of the compiler is a code generator, which converts the intermediary representation into assembly instructions that are suitable for the target architecture. Register allocation, instruction scheduling, and peephole optimizations are also performed during this stage. Although this stage is where most of the target-specific tricks are performed, a compiler may be crafted such that many of these optimizations may be performed by machine-independent passes that utilize abstraction layers to allow for a common algorithm.

After the third stage, the compilation process is complete. However, “compiling” is often times used to refer to the entire process of building one’s application. When compilation is considered in this larger sense, a fourth stage consists of assembling the output from the compiler. Assembly source files supplied by the user bypass the first three stages and go directly to this fourth stage. The fifth stage is linking the program together. While activities such as creating archives, incremental linking, and binary code generation differ in these final two stages, being aware of these stages can be helpful in understanding some of the topics we will be discussing.

The great strength in this pipeline of stages is the extensibility that it provides. New microprocessor architectures may be added by just writing a new backend. The global optimizations and support for different languages are obtained on this new architecture. Another advantage is that improvements to the global optimizer generally leverage all of the many different target architectures that modern compilers support. It is impractical for most organizations to have the resources to maintain compilers for several different languages on many different architectures if each compiler was a unique body of code. The shared code in the first two stages as well as any shared code in the third allows for many compilers to be maintained by a group of a practical size.

Although optimization happen throughout the compilation process, most of the “standard” optimizations, including constant propagation, partial redundancy elimination, loop unrolling, strength reduction, and inlining are traditionally performed during the second stage. This is not to say that all implementations of these optimizations are the same. The textbooks and academic papers often don’t bother to take many real-world issues, such as aliasing, into account, leaving the implementer with the task of extending the concept to handle the general case while preserving correctness.

Intermodule Inlinining
Inlining replaces function calls with a copy of the function being called. This is typically done to improve execution speed, although selective use can be helpful for size optimizations, as well. Consider the following example of inlining:

Traditional inlining only works within the same source file. Recently, more compilers have started to be capable of inlining across different modules to achieve an additional performance increase. There are two ways that this could be implemented.

The first technique dumps the intermediate representation of each source module into a database. Later, this information is used as part of a global inlining strategy during a main second stage of the compiler.

The second technique reads in multiple source files at once and re-interprets them as a single source file, emitting a single object file.

Data Allocation
The C language guarantees that within a structure, the members must follow each other in the order they were declared in memory. However, the language provides no guarantees about the order in which different variables are placed in memory. The compiler can take advantage of this freedom in a couple of ways.

Usage-based Allocation
This optimization is particularly effective on architectures where it is expensive to load the address of a global variable, although it may be used on any architecture. Consider the ARM architecture. Loading the address of a global variable normally requires 8 bytes – 4 bytes to put the 32 bit address into a literal pool located in the text section and another 4 bytes to load the contents of the literal pool using a PC-relative addressing mode. For example, the code:

int foo() { extern int x; return x; }

may be compiled into:

One trick is to allocate variables into the data section according to their usage patterns. The idea is that “register plus offset” addressing modes may be used based on an address, stored in a register, that is picked to be close to a number of variables that are used in a section of code. For example:

int sum() { extern int x, y; return x+y; }

could be compiled into:

Of course, this requires that x and y may be allocated immediately after each other.

This optimization may be enhanced if uninitialized global variables are allocated using a strict reference/definition model rather than the common variable model that most toolchains use. When a strict reference/definition model is used, it may even be beneficial to move certain uninitialized variables into the .data section rather than the .bss section when the savings from being able to get its address “for free”, as above, outweigh the cost of the additional ROM image for the zeros that would not have been necessary had the variable stayed in .bss.

As a practical matter, much existing code relies on the common variable model, so this must be an option that the user selects rather than the only option, as some toolchains naively assume is good enough for their users.

Small Data Area
Certain architectures allocate data into a Small Data Area (SDA). Typically, a general purpose register is reserved to point into this area, and “register plus offset” addressing modes are used to reference the data within this section.
For example, accessing a normal global variable like:

int foo(int x){ extern int glob[]; return glob[0];}

would generate the following PowerPC code:

lis r3, r3, %hi(glob)lwz r3, %lo(glob)(r3) blr

When glob is allocated to SDA, the code would instead look like:

lwz r3, %sdaoff(glob)(r13)blr

Using a small data area, usually results in smaller, faster code on the PowerPC, MIPS, V800, SPARC, and other RISC architectures, although it can be a dubious tradeoff on other architectures with fewer registers or smaller offsets on their “register plus offset” addressing mode. Furthermore, since the register needs to be preserved across all routines, the Small Data Area optimization becomes an ABI issue.

The problem with SDA allocation is that it becomes difficult to select what variables to put into these sections. Since the offset of the addressing mode is limited, only a certain number of variables can be put into the section until it overflows. Typically, the user selects a threshold value, such as 4 bytes, for SDA variables. Variables whose size are equal to or less than the threshold value are allocated into the SDA section while variables larger than the threshold are not.

The problem is that this approach requires the user to try multiple values for the SDA threshold until one works. If too large of a value is chosen, the toolchain reports a failure at link time, and the user has to recompile his entire application with a smaller SDA threshold. If too small of a value is chosen, the user is getting suboptimal code.

Another approach is to manually select what variables to allocate into SDA and to indicate this to the compiler using a special pragma or keyword. The problem with this approach is that it is both time consuming and difficult for the user to figure out the variables to allocate to SDA. The system also breaks down as the project grows in size, since all different parts of the project must coordinate to only use a certain amount of SDA room.

A handful of compilers have started to employ other whole-program or two-pass techniques to get the benefits of SDA without any of the guesswork or trial-and-error required by these traditional approaches.

Either way, the SDA optimization can be a big savings in terms of code speed and code size.

Greg Davis is Senior Engineer / Engineering Manager, Compiler Development at
Green Hills Software, Inc.. This article is based on material developed for Class 411 on compiler optimization presented at the Fall 2005 Embedded Systems Conference.