The increasing acceptance of Linux among
developers and researchers has yet to be matched by a similar
increase in the number of available development tools. The recently
released Intel C++ and Fortran compilers for Linux aim to bridge
this gap by providing application developers with highly
optimizing compilers for the Intel IA-32 and Itanium processor
families. These compilers provide strict ANSI support, as well as
optional support for some popular extensions. This article focuses
on the optimizations and features of the compiler for the Intel
IA-32 processors. Throughout the rest of this article, we refer to
the Intel C++ and Fortran compilers for Linux on IA-32 collectively
as “the Intel compiler”.

The Intel compiler optimizes a program at all levels, from
high-level loop and interprocedural optimizations to standard
compiler data flow optimizations, in addition to efficient
low-level optimizations, such as instruction scheduling, basic
block layout and register allocation. In this article, we mainly
focus on compiler optimizations unique to the Intel compiler. For
completeness, however, we also include a brief overview of some of
the more traditional optimizations supported by the Intel
compiler.

Traditional Compiler Optimizations

Decreasing the number of instructions that are dynamically
executed and replacing instructions with faster equivalents are
perhaps the two most obvious ways to improve performance. Many
traditional compiler optimizations fall into this category: copy
and constant propagation, redundant expression elimination, dead
code elimination, peephole optimizations, function inlining, tail
recursion elimination and so forth.
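As a minimal sketch of several of these traditional optimizations at work, consider the following hypothetical C function (the names and values are illustrative, not from the article):

```c
#include <assert.h>

/* Hypothetical code as a compiler might see it before optimization. */
static int before(int n) {
    int a = 4;          /* constant propagation: a is known to be 4 */
    int b = a;          /* copy propagation: uses of b become uses of a */
    int unused = n * n; /* dead code: the result is never read */
    (void)unused;
    if (0) {            /* dead code: this branch can never execute */
        return -1;
    }
    return b * n;       /* folds to 4 * n after propagation */
}

/* What the optimizations effectively reduce the function to. */
static int after(int n) {
    return 4 * n;
}
```

The transformations must preserve the observable result; `before` and `after` return the same value for every input.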

The Intel compiler provides a rich variety of both types of
optimizations. Many local optimizations are based on the
static-single-assignment (SSA) form. Redundant (or partially
redundant) expressions, for example, are eliminated according to
Chow's algorithm (see Resource 6), where an expression is
considered redundant if it is unnecessarily calculated more than
once on an execution path. For instance, in the statement:

x[i] += a[i+j*n] + b[i+j*n];

the expression i+j*n is redundant and needs to be calculated
only once. Partial redundancy occurs when an expression is
redundant on some paths but not necessarily all paths. In the code:

if (c) {
    x = y+a*b;
} else {
    x = a;
}
z = a*b;

the expression a*b is partially redundant. If the else branch is
taken, a*b is only calculated once; but if the then branch is
taken, it is calculated twice. The code can be modified as follows:

t = a*b;
if (c) {
    x = y+t;
} else {
    x = a;
}
z = t;

so there is only one calculation of a*b, no matter which path is
taken.
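The transformation is semantics-preserving: wrapping both forms of the example above in functions (a sketch, with the branch condition and operands passed in as parameters) shows that the hoisted temporary produces the same results on either path:

```c
/* The example as written, with a*b potentially computed twice. */
static int original_form(int c, int y, int a, int b, int *z) {
    int x;
    if (c) {
        x = y + a * b;
    } else {
        x = a;
    }
    *z = a * b;
    return x;
}

/* After partial-redundancy elimination: a*b is computed exactly once. */
static int hoisted_form(int c, int y, int a, int b, int *z) {
    int t = a * b;
    int x;
    if (c) {
        x = y + t;
    } else {
        x = a;
    }
    *z = t;
    return x;
}
```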

Clearly, this transformation must be used judiciously as the
increase in temporary values, ideally stored in registers, can
increase lifetimes and, hence, register pressure. An algorithm
similar to Chow's algorithm (see Resource 9) is used to eliminate
dead stores, in which a store is succeeded by another store to the
same location before a fetch, and partially dead stores, which are
dead along some but not necessarily all paths. Other optimizations
based on the SSA form are constant propagation (see Resource 7) and
the propagation of conditions. Consider the following
example:

if (x>0) {
    if (y>0) {
        . . .
        if (x == 0) {
            . . .
        }
    }
}

Since x>0 holds within the outermost if, we know that x != 0
unless x is changed, and therefore the code within the
inner if is dead. Although this and the previous example may seem
contrived, such situations are actually quite common in the
presence of address calculations, macros or inlined functions.
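To see how a macro can create such a situation, consider this hypothetical clamping macro (the macro and function names are invented for illustration):

```c
/* A defensive macro like this often expands into a comparison that
   the surrounding code has already decided. */
#define CLAMP_POS(v) ((v) > 0 ? (v) : 0)

static int scaled(int x) {
    if (x > 0) {
        /* After macro expansion, the test (x) > 0 is repeated here.
           Condition propagation proves the "else 0" arm dead and lets
           the compiler drop the second comparison entirely. */
        return 2 * CLAMP_POS(x);
    }
    return 0;
}
```

The programmer never wrote a redundant test, yet one appears after expansion; the same happens with inlined bounds checks and address computations.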

Powerful memory disambiguation (see Resource 8) is used by
the Intel compiler to determine whether memory references might
overlap. This analysis is important to enhance, for instance,
register allocation and to enable the detection and exploitation of
implicit parallelism in the code, as discussed in the following
sections. The Intel compiler also provides extensive
interprocedural optimizations, including manual and automatic
function inlining, partial inlining where only the hot parts of a
routine are inlined, interprocedural constant optimizations and
exception-handling optimizations. With the optional “whole
program” analysis, the data layout of certain data structures,
such as COMMON BLOCKS in Fortran, may be modified to enhance memory
accesses on various processors. For example, the data layout could
be padded to provide better data alignment. In addition, to make
more intelligent decisions about when and where to inline, the
Intel compiler relies on two types of profiling
information: static profiling and dynamic profiling. Static
profiling refers to information that can be deduced or estimated at
compile time. Dynamic profiling is information gathered from actual
executions of a program. These two types of profiling are discussed
in the next section.
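The memory-disambiguation point above can be illustrated in C itself. In the sketch below (a hypothetical loop, not from the article), if the compiler can prove that dst and src never overlap, it can keep values in registers and exploit the loop's implicit parallelism; C99's restrict qualifier states this guarantee explicitly, while analysis such as the compiler's disambiguation tries to prove it when the programmer does not:

```c
/* If dst and src may alias, every store to dst[i] could change a later
   src element, forcing conservative, serialized code.  With restrict
   (or with successful disambiguation), the iterations are independent. */
static void scale(float *restrict dst, const float *restrict src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = 2.0f * src[i];
    }
}
```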

Comments

I have tried both gcc and icc 7.0 on cache-intensive code. Also examined the intermediate assembly code. Same code, same performance (better comments for icc), provided that you compile (under gcc) for the right processor type. Default processor is 386 (!!!) for some distributions (e.g., Mandrake), pentium for others (e.g., RedHat). Be careful, the performance advantage can be up to 40%.
Of course, there is no OpenMP support in gcc. However, when the Intel people dare to make measurements with hyperthreading enabled (please read their papers carefully), I will convince myself that it MIGHT be useful... :)

"dare to say the current gcc has most of this stuff already implemented."

Not true, although you'll find some things that work better in GCC. The Intel compiler is specifically optimized for IA, while gcc has to run on a lot of different architectures. Your mileage will vary depending on what you're doing.

GCC vs. the Intel Compiler definitely falls into the category of "use the right tool for the right job." Of course, the proprietary nature of the Intel tool will be an obstacle for some, but you can definitely get some performance benefits from using a compiler that is specifically optimized for the architecture.
