Overview of OpenMP Software Execution

The actual execution model of OpenMP applications is described in the
OpenMP specifications (see, for example, the OpenMP Application Program
Interface, Version 2.5, Section 1.3). The specification, however, does not
describe some implementation details that may be important to users, and the
actual implementation at Sun Microsystems is such that the directly recorded
profiling information does not easily allow the user to understand how the
threads interact.

As any single-threaded program runs, its call stack shows its current
location and a trace of how it got there, starting from the initial
instructions in a routine called _start, which calls main,
which in turn calls various subroutines within the program. When
a subroutine contains a loop, the program executes the code inside the loop
repeatedly until the loop exit criterion is reached. Execution then proceeds
to the next sequence of code, and so forth.

When the program is parallelized with OpenMP (or by autoparallelization),
the behavior is different. An intuitive model of that behavior has the main,
or master, thread executing just as a single-threaded program. When it reaches
a parallel loop or parallel region, additional slave threads appear, each
a clone of the master thread, with all of them executing the contents of the
loop or parallel region, in parallel, each for different chunks of work. When
all chunks of work are completed, all the threads are synchronized, the slave
threads disappear, and the master thread proceeds.
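
As a concrete illustration of the model, consider a minimal OpenMP program
(hypothetical code, invented for illustration; with the Sun compilers it
would be compiled with the -xopenmp flag):

#include <stdio.h>

static double a[1000];

int main(void)
{
    int i;

    /* Serial execution: only the master thread is running. */
    for (i = 0; i < 1000; i++)
        a[i] = i;

    /* Parallel loop: slave threads appear here, and the master and
       slave threads each execute a share of the iterations. All
       threads synchronize at the implicit barrier at the end of the
       loop, after which only the master thread proceeds. */
    #pragma omp parallel for
    for (i = 0; i < 1000; i++)
        a[i] = a[i] * 2.0 + 1.0;

    /* Serial execution again. */
    printf("a[42] = %f\n", a[42]);
    return 0;
}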

When the compiler generates code for a parallel region or loop (or any
other OpenMP construct), the code inside it is extracted and made into an
independent function, called an mfunction. (It may also be referred to as
an outlined function, or a loop-body-function.) The name of the function encodes
the OpenMP construct type, the name of the function from which it was extracted,
and the line number of the source line at which the construct appears. The
names of these functions are shown in the Analyzer in the following form,
where the name in brackets is the actual symbol-table name of the function:

foo -- OMP parallel region from line 9 [_$p1C9.foo]

There are other forms of such functions, derived from other source constructs,
for which the OMP parallel region in the name is replaced
by MP construct, MP doall, or OMP
sections. In the following discussion, all of these are referred
to generically as "parallel regions".
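
The following sketch suggests what this extraction looks like. The names
fake_omp_fork and foo_omp_region are invented for illustration; the real
runtime entry points in libmtsk.so and the generated mfunction names are
compiler-internal, and the real runtime runs the mfunction in several
threads rather than serially as this stub does:

#include <stdio.h>

/* Hypothetical stand-in for the OpenMP runtime's fork entry point.
   The real runtime invokes the mfunction from the master and each
   slave thread, handing each thread successive chunks of the
   iteration space; this stub just runs the whole range at once. */
static void fake_omp_fork(void (*mfunc)(int, int, double *),
                          int n, double *a)
{
    mfunc(0, n, a);
}

/* The outlined "mfunction": the body of the parallel loop,
   extracted into an independent function. Each invocation
   processes one chunk [lo, hi) of the original loop. */
static void foo_omp_region(int lo, int hi, double *a)
{
    int i;

    for (i = lo; i < hi; i++)
        a[i] *= 2.0;
}

/* The user-written foo, whose parallel loop now compiles into a
   call through the runtime, passing the mfunction. */
static void foo(int n, double *a)
{
    fake_omp_fork(foo_omp_region, n, a);
}

int main(void)
{
    double a[4] = { 1.0, 2.0, 3.0, 4.0 };

    foo(4, a);
    printf("a[0] = %f\n", a[0]);
    return 0;
}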

Each thread executing the code within the parallel loop can invoke its
mfunction multiple times, with each invocation doing a chunk of the work within
the loop. When all the chunks of work are complete, each thread calls synchronization
or reduction routines in the library; the master thread then continues, while
the slave threads become idle, waiting for the master thread to enter the
next parallel region. All of the scheduling and synchronization are handled
by calls to the OpenMP runtime.
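
For example, with a dynamic schedule each thread returns to the runtime
after every chunk, so a single thread can invoke its mfunction many times
during one parallel loop (a sketch; the chunk size of 100 is arbitrary):

#include <stdio.h>

int main(void)
{
    int i;
    double total = 0.0;

    /* Each thread asks the OpenMP runtime for a chunk of 100
       iterations, invokes the mfunction to process it, and returns
       to the runtime for the next chunk until none remain. All
       threads then meet at the implicit barrier, where the
       reduction is completed. */
    #pragma omp parallel for schedule(dynamic, 100) reduction(+:total)
    for (i = 0; i < 10000; i++)
        total += i * 1e-3;

    printf("total = %f\n", total);
    return 0;
}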

During its execution, the code within the parallel region might be doing
a chunk of the work, synchronizing with other threads, or picking up additional
chunks of work to do. It might also call other functions, which
may in turn call still others. A slave thread (or the master thread) executing
within a parallel region might, either directly or from a function it calls,
act as a master thread and enter its own parallel region, giving rise to
nested parallelism.

The Analyzer collects its data by statistical sampling of call stacks.
It aggregates the data across all threads, and shows metrics of performance,
based on the type of data collected, against functions, callers and callees,
source lines, and instructions. It presents information on the performance
of OpenMP programs in either of two modes, User mode and Machine mode. (A
third mode, Expert mode, is supported, but it is identical to User mode.)

User Mode Display of OpenMP Profile Data

The User mode presentation of the profile data attempts to present the
information as if the program really executed according to the model described
in Overview of OpenMP Software Execution.
The actual data captures the implementation details of the runtime library, libmtsk.so, which do not correspond to the model. In User mode,
the presentation of profile data is altered to match the model better, and
differs from the recorded data and the Machine mode presentation in three ways:

Artificial functions are constructed representing the state
of each thread from the point of view of the OpenMP runtime library.

Call stacks are manipulated to report data corresponding to
the model of how the code runs, as described above.

Two additional metrics of performance are constructed for
clock-based profiling experiments, corresponding to time spent doing useful
work and time spent waiting in the OpenMP runtime.

Artificial Functions

Artificial functions are constructed and put onto the User mode
call stacks reflecting events in which a thread was in some state within the
OpenMP runtime library.

The following artificial functions are defined; each is followed by
a description of what it represents:

<OMP-overhead> -- executing in the OpenMP runtime library itself

<OMP-idle> -- slave thread, waiting for work

<OMP-reduction> -- thread performing a reduction operation

<OMP-implicit_barrier> -- thread waiting at an implicit barrier

<OMP-explicit_barrier> -- thread waiting at an explicit barrier

<OMP-lock_wait> -- thread waiting for a lock

<OMP-critical_section_wait> -- thread waiting for entry to a critical section

<OMP-ordered_section_wait> -- thread waiting for its turn to enter an ordered section

When a thread is in an OpenMP runtime state corresponding to one of
those functions, the corresponding function is added as the leaf function
on the stack. When a thread’s leaf function is anywhere in the OpenMP
runtime, it is replaced by <OMP-overhead> as the leaf
function. Otherwise, all PCs from the OpenMP runtime are omitted from the
user-mode stack.

User Mode Call Stacks

The easiest way to understand this model is to look at the call stacks of an OpenMP program at various points in its execution.
This section considers a simple program that has a main program that calls
one subroutine, foo. That subroutine has a single parallel
loop, in which the threads do work, contend for, acquire, and release a lock,
and enter and leave a critical section. An additional set of call stacks is
shown, reflecting the state when one slave thread has called another function,
bar, which enters a nested parallel region.
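
No listing of that program is given here; the following sketch, with all
loop bounds and computations invented for illustration, shows the structure
assumed in the discussion. For bar's inner region to create additional
threads, nested parallelism must be enabled, for example by setting the
OMP_NESTED environment variable to TRUE:

#include <stdio.h>
#include <omp.h>

static omp_lock_t lock;
static double total;
static double maxval;

/* Called from within the parallel loop in foo; its parallel region
   is nested inside the region in foo. */
static void bar(void)
{
    #pragma omp parallel
    {
        /* ... work in the nested parallel region ... */
    }
}

static void foo(int n)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        double t = i * 0.5;       /* a chunk of useful work */

        omp_set_lock(&lock);      /* threads contend for the lock */
        total += t;
        omp_unset_lock(&lock);

        #pragma omp critical      /* threads enter one at a time */
        {
            if (t > maxval)
                maxval = t;
        }

        if (i == n - 1)
            bar();                /* one thread enters a nested region */
    }
}

int main(void)
{
    omp_init_lock(&lock);
    foo(1000);
    omp_destroy_lock(&lock);
    printf("total = %f, max = %f\n", total, maxval);
    return 0;
}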

In this presentation, all the inclusive time spent in a parallel region
is included in the inclusive time in the function from which it was extracted,
including time spent in the OpenMP runtime, and that inclusive time is propagated
all the way up to main and _start.

The call stacks that represent the behavior in this model appear as
shown in the subsections that follow. The actual names of the parallel region
functions are of the following form, as described above:

foo -- OMP parallel region from line 9 [_$p1C9.foo]

For clarity, the following shortened forms are used in the descriptions:

foo -- OMP...
bar -- OMP...

In the descriptions, call stacks from all threads are shown as they would
appear at a single instant during execution of the program. The call stack
for each thread is shown as a stack of frames, with the leaf PC at the top,
matching the data shown when you select an individual profile event in the
Analyzer Timeline tab for a single thread. In the Timeline tab, each frame
is shown with a PC offset, which is omitted below. The stacks from all the
threads are shown here in a horizontal array, whereas in the Analyzer Timeline
tab the stacks for the threads appear in profile bars stacked vertically.
Furthermore, in the representation shown here, the stacks for all the threads
appear as if they were captured at exactly the same instant, while in a real
experiment the stacks are captured independently in each thread, and may be
skewed relative to each other.

The call stacks shown represent the data as it is presented with a view
mode of User in the Analyzer or in the er_print utility.

Before the first parallel region

Before the first
parallel region is entered, there is only the one thread, the master thread.

Master
foo
main
_start

Upon entering the first parallel region

At this
point, the library has created the slave threads, and all of the threads,
master and slaves, are about to start processing their chunks of work. All
threads are shown as having called into the code for the parallel region, foo-OMP..., from foo at the line on which the
OpenMP directive for the construct appears, or from the line containing the
loop statement that was autoparallelized. The code for the parallel region
in each thread is calling into the OpenMP support library, shown as the <OMP-overhead> function, from the first instruction in the parallel
region.

Master          Slave 1         Slave 2         Slave 3
<OMP-overhead>  <OMP-overhead>  <OMP-overhead>  <OMP-overhead>
foo-OMP...      foo-OMP...      foo-OMP...      foo-OMP...
foo             foo             foo             foo
main            main            main            main
_start          _start          _start          _start

The window in which <OMP-overhead> might
appear is quite small, so that function might not appear in any particular
experiment.

While executing within a parallel region

All four
of the threads are doing useful work in the parallel region.

Master      Slave 1     Slave 2     Slave 3
foo-OMP...  foo-OMP...  foo-OMP...  foo-OMP...
foo         foo         foo         foo
main        main        main        main
_start      _start      _start      _start

While executing within a parallel region between chunks of work

All four of the threads are executing within the parallel region, but
one has finished a chunk of work and is obtaining its next chunk.

Master          Slave 1         Slave 2         Slave 3
<OMP-overhead>
foo-OMP...      foo-OMP...      foo-OMP...      foo-OMP...
foo             foo             foo             foo
main            main            main            main
_start          _start          _start          _start

While executing in a critical section within the parallel region

All four of the threads are executing, each within the
parallel region. One of them is in the critical section, while one of the
others is running before reaching the critical section (or after finishing
it). The remaining two are waiting to enter the critical section themselves.

Master                       Slave 1                      Slave 2                      Slave 3
                             <OMP-critical_section_wait>  <OMP-critical_section_wait>
foo-OMP...                   foo-OMP...                   foo-OMP...                   foo-OMP...
foo                          foo                          foo                          foo
main                         main                         main                         main
_start                       _start                       _start                       _start

The data collected does not distinguish between the
call stack of the thread that is executing in the critical section and that
of a thread that has not yet reached, or has already passed, the critical
section.

While executing around a lock within the parallel region

A section of code around a lock is completely analogous to a critical
section. All four of the threads are executing within the parallel region.
One thread is executing while holding the lock, one is executing before acquiring
the lock (or after acquiring and releasing it), and the other two threads
are waiting for the lock.

Master           Slave 1          Slave 2          Slave 3
                 <OMP-lock_wait>  <OMP-lock_wait>
foo-OMP...       foo-OMP...       foo-OMP...       foo-OMP...
foo              foo              foo              foo
main             main             main             main
_start           _start           _start           _start

As in the critical section example, the data collected
does not distinguish between the call stack of the thread that is executing
while holding the lock and that of a thread executing before it acquires the
lock or after it releases it.

Near the end of a parallel region

At this point,
three of the threads have finished all their chunks of work, but one of them
is still working. The OpenMP construct in this case implicitly specified a
barrier; if the user code had explicitly specified the barrier, the <OMP-implicit_barrier> function would be replaced by <OMP-explicit_barrier>.

Master                  Slave 1                 Slave 2                 Slave 3
<OMP-implicit_barrier>  <OMP-implicit_barrier>  <OMP-implicit_barrier>
foo-OMP...              foo-OMP...              foo-OMP...              foo-OMP...
foo                     foo                     foo                     foo
main                    main                    main                    main
_start                  _start                  _start                  _start

Near the end of a parallel region, with one or more reduction variables

At this point, two of the threads have finished all their chunks of work
and their parts of the reduction computation, and are waiting at the barrier;
one is performing its reduction computation; and one is still working on a
chunk.

Master                  Slave 1                 Slave 2                 Slave 3
<OMP-reduction>         <OMP-implicit_barrier>  <OMP-implicit_barrier>
foo-OMP...              foo-OMP...              foo-OMP...              foo-OMP...
foo                     foo                     foo                     foo
main                    main                    main                    main
_start                  _start                  _start                  _start

While one thread is shown in the <OMP-reduction> function, the actual time spent doing the reduction is usually
quite small, and is rarely captured in a call stack sample.

At the end of a parallel region

At this point,
all threads have finished all chunks of work within the parallel region, and
have reached the barrier.

Master                  Slave 1                 Slave 2                 Slave 3
<OMP-implicit_barrier>  <OMP-implicit_barrier>  <OMP-implicit_barrier>  <OMP-implicit_barrier>
foo-OMP...              foo-OMP...              foo-OMP...              foo-OMP...
foo                     foo                     foo                     foo
main                    main                    main                    main
_start                  _start                  _start                  _start

Since all the threads have reached the barrier, they
may all proceed, and it is unlikely that an experiment would ever find all
the threads in this state.

After leaving the parallel region

At this point,
all the slave threads are waiting for entry into the next parallel region,
either spinning or sleeping, depending on the various environment variables
set by the user. The program is in serial execution.

Master      Slave 1     Slave 2     Slave 3
foo         <OMP-idle>  <OMP-idle>  <OMP-idle>
main
_start

While executing in a nested parallel region

All four of the threads are working, each within the outer parallel region.
One of the slave threads has called another function, bar,
which has entered a nested parallel region, and an additional slave thread
has been created to work with it.

Master      Slave 1     Slave 2     Slave 3     Slave 4
            bar-OMP...                          bar-OMP...
            bar                                 bar
foo-OMP...  foo-OMP...  foo-OMP...  foo-OMP...  foo-OMP...
foo         foo         foo         foo         foo
main        main        main        main        main
_start      _start      _start      _start      _start

OpenMP Metrics

When processing a clock-profile event for an OpenMP program, two metrics
corresponding to the time spent in each of two states of the OpenMP system
are shown: "OMP work" and "OMP wait".

Time is accumulated in "OMP work" whenever a thread is executing user
code, whether serially or in parallel. Time is accumulated in "OMP
wait" whenever a thread is waiting for something before it can proceed, whether
the wait is a busy-wait (spin-wait) or a sleep. The sum of these two metrics
matches the "Total LWP Time" metric in the clock profiles.

Machine Presentation of OpenMP Profiling Data

The real call stacks of the program during various phases of execution
are quite different from the ones portrayed above in the intuitive model.
The Machine mode of presentation shows the call stacks as measured, with no
transformations done, and no artificial functions constructed. The clock-profiling
metrics are, however, still shown.

In each of the call stacks below, libmtsk represents
one or more frames in the call stack within the OpenMP runtime library. The
details of which functions appear, and in which order, change from release
to release, as does the internal implementation of the code for a barrier
or for performing a reduction.

Before the first parallel region

Before the first
parallel region is entered, there is only the one thread, the master thread.
The call stack is identical to that in User mode.

Master
foo
main
_start

During execution in a parallel region

Master      Slave 1     Slave 2     Slave 3
foo-OMP...
libmtsk
foo         foo-OMP...  foo-OMP...  foo-OMP...
main        libmtsk     libmtsk     libmtsk
_start      _lwp_start  _lwp_start  _lwp_start

In Machine mode, the slave threads are shown as starting
in _lwp_start, rather than in _start where
the master thread starts. (In some versions of the thread library, that
function may appear as _thread_start.)

At the point at which all threads are at a barrier

Master      Slave 1     Slave 2     Slave 3
libmtsk
foo-OMP...
foo         libmtsk     libmtsk     libmtsk
main        foo-OMP...  foo-OMP...  foo-OMP...
_start      _lwp_start  _lwp_start  _lwp_start

Unlike when the threads are executing in the parallel
region, when the threads are waiting at a barrier there are no frames from
the OpenMP runtime between foo and the parallel region
code, foo-OMP.... The reason is that the real execution
does not include the OMP parallel region function; instead, the OpenMP runtime
manipulates registers so that the stack unwind shows a call from the
last-executed parallel region function to the runtime barrier code. Without
this manipulation, there would be no way to determine which parallel region
is related to the barrier call in Machine mode.