Performance Considerations

A shared object can be used by multiple applications within the same
system. The performance of a shared object affects the applications that use
the shared object, and the system as a whole.

Although the code within a shared object directly affects the performance of
a running process, the performance issues discussed here relate to the runtime
processing of the shared object. The following sections investigate this processing in more
detail by looking at aspects such as text size and purity, together
with relocation overhead.

Analyzing Files With elfdump

Various tools are available to analyze the contents of an ELF file,
including the standard Unix utilities dump(1), nm(1), and size(1). Under Oracle Solaris,
these tools have been largely superseded by elfdump(1).

The ELF format organizes data into sections. Sections are in turn allocated to
units known as segments. Some segments describe how portions of a file
are mapped into memory. See mmap(2). These loadable segments can be displayed by
using the elfdump(1) command and examining the LOAD entries.

There are two loadable segments in the shared object libfoo.so.1, commonly referred to
as the text and data segments. The text segment is mapped to allow
reading and execution of its contents, PF_X PF_R. The data segment is mapped
to also allow its contents to be modified, PF_W. The memory size,
p_memsz, of the data segment differs from the file size, p_filesz. This
difference accounts for the .bss section, which is part of the data
segment, and is dynamically created when the segment is loaded.
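
The relationship between the two sizes can be sketched in C. This is only an illustration: the struct below is a hypothetical stand-in for the two program header fields named above, not the system's ELF program header type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two program header fields discussed above. */
struct load_segment {
    uint64_t p_filesz;  /* bytes the segment occupies in the file */
    uint64_t p_memsz;   /* bytes the segment occupies in memory */
};

/*
 * Bytes of the segment that have no backing in the file and are
 * created when the segment is loaded, such as .bss.
 */
uint64_t
zero_fill_size(const struct load_segment *seg)
{
    /* The file size of a segment cannot exceed its memory size. */
    assert(seg->p_memsz >= seg->p_filesz);
    return seg->p_memsz - seg->p_filesz;
}
```

For a data segment with an assumed p_filesz of 0x200 and p_memsz of 0x260, the function reports 0x60 bytes that exist only in memory.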

Programmers usually think of a file in terms of the symbols that
define the functions and data elements within their code. These symbols can
be displayed using the -s option to elfdump. For example.

The output from elfdump(1) in the above example shows the association of the
functions _init, foo, and _fini to the sections .init, .text and .fini.
These sections, because of their read-only nature, are part of the text
segment.

Similarly, the data arrays data, and bss are associated with the sections
.data and .bss respectively. These sections, because of their writable nature, are part
of the data segment.

Underlying System

When an application is built using a shared object, the entire loadable
contents of the object are mapped into the virtual address space of
that process at runtime. Each process that uses a shared object starts by
referencing a single copy of the shared object in memory.

Relocations within the shared object are processed to bind symbolic references to
their appropriate definitions. This results in the calculation of true virtual addresses
that could not be derived at the time the shared object was generated
by the link-editor. These relocations usually result in updates to entries within
the process's data segments.

The memory management scheme underlying the dynamic linking of shared objects shares
memory among processes at the granularity of a page. Memory pages can
be shared as long as they are not modified at runtime. If a
process writes to a page of a shared object when writing a
data item, or relocating a reference to a shared object, it generates
a private copy of that page. This private copy will have no effect
on other users of the shared object. However, this page has lost
any benefit of sharing between other processes. Text pages that become modified
in this manner are referred to as impure.
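
Because sharing is lost a page at a time, even a small write can cost a full private page, or two if the write straddles a page boundary. The following sketch counts the pages touched by a write. The function name and parameters are hypothetical, and the page size is passed in because it differs between platforms.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of pages that become private when len bytes are written
 * starting at byte offset 'offset' within a mapping.
 */
size_t
private_pages(size_t offset, size_t len, size_t pagesize)
{
    assert(pagesize > 0);
    if (len == 0)
        return 0;

    size_t first = offset / pagesize;              /* page of first byte */
    size_t last = (offset + len - 1) / pagesize;   /* page of last byte */
    return last - first + 1;
}
```

With an assumed 8192-byte page, a 4-byte write at offset 100 touches one page, while the same write at offset 8190 touches two.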

The segments of a shared object that are mapped into memory fall into
two basic categories: the text segment, which is read-only, and the data
segment, which is read-write. See Analyzing Files With elfdump for how to obtain this information from
an ELF file. An overriding goal when developing a shared object is to
maximize the text segment and minimize the data segment. This optimizes the
amount of code sharing while reducing the amount of processing needed to
initialize and use a shared object. The following sections present mechanisms that can
help achieve this goal.

Lazy Loading of Dynamic Dependencies

You can defer the loading of a shared object dependency until the
dependency's first reference, by establishing the object as lazy loadable. See Lazy Loading of Dynamic Dependencies.

For small applications, a typical thread of execution can reference all the
application's dependencies. The application loads all of its dependencies whether the dependencies
are defined lazy loadable or not. However, under lazy loading, dependency processing can
be deferred from process startup and spread throughout the process's execution.

For applications with many dependencies, lazy loading often results in some dependencies
not being loaded at all. Dependencies that are not referenced for a
particular thread of execution are not loaded.

Position-Independent Code

The code within a dynamic executable is typically position-dependent, and is tied to
a fixed address in memory. Shared objects, on the other hand, can
be loaded at different addresses in different processes. Position-independent code is not
tied to a specific address. This independence allows the code to execute efficiently
at a different address in each process that uses the code. Position-independent
code is recommended for the creation of shared objects.

The compiler can generate position-independent code under the -K pic option.

If a shared object is built from position-dependent code, the text segment
can require modification at runtime. This modification allows relocatable references to be
assigned to the location at which the object has been loaded. The relocation of
the text segment requires the segment to be remapped as writable. This
modification requires a swap space reservation, and results in a private copy of
the text segment for the process. The text segment is no longer
sharable between multiple processes. Position-dependent code typically requires more runtime relocations than
the corresponding position-independent code. Overall, the overhead of processing text relocations can cause
serious performance degradation.

When a shared object is built from position-independent code, relocatable references are
generated as indirections through data in the shared object's data segment. The code
within the text segment requires no modification. All relocation updates are applied
to corresponding entries within the data segment. See Global Offset Table (Processor-Specific) and Procedure Linkage Table (Processor-Specific) for
more details on the specific indirection techniques.
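
The idea of indirection through the data segment can be modeled in plain C. This is a simulation under stated assumptions, not the processor-specific mechanism: offset_table, relocate_table, and bump_counter are hypothetical names, and a real table is filled in by the runtime linker rather than by a function call.

```c
#include <assert.h>

/* Stand-in for a data item whose address is unknown until load time. */
static int external_counter = 41;

/*
 * Hypothetical one-entry offset table. In a real shared object this
 * table resides in the writable data segment.
 */
static int *offset_table[1];

/* Plays the role of the runtime linker applying a relocation. */
void
relocate_table(void)
{
    offset_table[0] = &external_counter;
}

/*
 * Plays the role of position-independent text. The code names only a
 * table slot, never the address of the data, so it needs no update.
 */
int
bump_counter(void)
{
    assert(offset_table[0] != 0);
    return ++*offset_table[0];
}
```

Only offset_table is written during the simulated relocation. The body of bump_counter is identical wherever the data ends up.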

The runtime linker attempts to handle text relocations should these relocations exist.
However, some relocations cannot be satisfied at runtime.

The x64 position-dependent code sequence typically generates code which can only be
loaded into the lower 32–bits of memory. The upper 32–bits of any
address must all be zeros. Since shared objects are typically loaded at the
top of memory, the upper 32–bits of an address are required. Position-dependent
code within an x64 shared object is therefore insufficient to cope with relocation
requirements. Use of such code within a shared object can result in
runtime relocation errors.

Position-independent code can be loaded in any region in memory, and hence
satisfies the requirements of shared objects for x64.

This situation differs from the default ABS64 mode that is used for
64–bit SPARCV9 code. This position-dependent code is typically compatible with the full
64–bit address range. Thus, position-dependent code sequences can exist within SPARCV9 shared objects.
Use of either the ABS32 mode, or ABS44 mode for 64–bit SPARCV9
code, can still result in relocations that cannot be resolved at runtime.
However, each of these modes requires the runtime linker to relocate the
text segment.

Regardless of the runtime linker's facilities, or differences in relocation requirements, shared
objects should be built using position-independent code.

You can identify a shared object that requires relocations against its text segment.
The following example uses elfdump(1) to determine whether a TEXTREL dynamic
entry exists.

Note - The value of the TEXTREL entry is irrelevant. The presence of this
entry in a shared object indicates that text relocations exist.

To prevent the creation of a shared object that contains text relocations,
use the link-editor's -z text flag. This flag causes the link-editor to generate diagnostics
indicating the source of any position-dependent code used as input. The following
example shows how position-dependent code results in a failure to generate a shared
object.

Two relocations are generated against the text segment because of the position-dependent
code generated from the file foo.o. Where possible, these diagnostics indicate any symbolic
references that are required to carry out the relocations. In this case,
the relocations are against the symbols foo and bar.

Text relocations within a shared object can also occur when hand-written
assembler code is included and does not include the appropriate position-independent prototypes.

Note - You might want to experiment with some simple source files to determine
coding sequences that enable position-independence. Use the compiler's ability to generate intermediate
assembler output.

SPARC: -K pic and -K PIC Options

The global offset table is an array of pointers whose entry size
is fixed: 4 bytes for 32–bit objects and 8 bytes for 64–bit objects. The following
code sequence makes reference to an entry under -K pic.

ld [%l7 + j], %o0 ! load &j into %o0

Here, %l7 is the precomputed value of the symbol _GLOBAL_OFFSET_TABLE_ of the
object making the reference.

This code sequence provides a 13–bit displacement constant for the global offset
table entry. This displacement therefore provides for 2048 unique entries for 32–bit
objects, and 1024 unique entries for 64–bit objects. If the creation of an
object requires more than the available number of entries, the link-editor produces
a fatal error.
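
The entry counts follow from simple arithmetic: a 13–bit displacement spans 2^13 = 8192 bytes, which is divided by the size of one entry. The helper below is a hypothetical illustration of that calculation only.

```c
#include <assert.h>

/*
 * Number of table entries reachable through a displacement of 'bits'
 * bits when each entry occupies entry_size bytes.
 */
unsigned long
got_entries(unsigned bits, unsigned long entry_size)
{
    assert(bits > 0 && bits < 32 && entry_size > 0);

    unsigned long span = 1UL << bits;   /* bytes covered by the displacement */
    return span / entry_size;
}
```

got_entries(13, 4) yields 2048 for 32–bit objects, and got_entries(13, 8) yields 1024 for 64–bit objects.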

You can investigate the global offset table requirements of an object using
elfdump(1) with the -G option. You can also examine the processing of
these entries during a link-edit using the link-editor's debugging tokens -D got,detail.

Ideally, frequently accessed data items benefit from using the -K pic model. You
can reference a single entry using both models. However, determining which relocatable
objects should be compiled with which option can be time consuming, and the
performance improvement realized is small. A recompilation of all relocatable objects with the
-K PIC option is typically easier.

Remove Unused Material

The inclusion of functions and data that are not used by the
object being built is wasteful. This material bloats the object, which can
result in unnecessary relocation overhead and associated paging activity. References to unused dependencies
are also wasteful. These references result in the unnecessary loading and processing
of other shared objects.

Unused sections are displayed during a link-edit when using the link-editor's debugging token
-D unused. Sections identified as unused should be removed from the link-edit. Unused
sections can be eliminated using the link-editor's -z ignore option.

The link-editor identifies a section from a relocatable object as unused under
the following conditions.

The section is allocatable

No other sections bind to (relocate to) this section

The section provides no global symbols

You can improve the link-editor's ability to eliminate sections by defining the
shared object's external interfaces. By defining an interface, global symbols that are
not defined as part of the interface are reduced to locals. Reduced symbols
that are unreferenced from other objects are now clearly identified as candidates
for elimination.

Individual functions and data variables can be eliminated by the link-editor if
these items are assigned to their own sections. This section refinement is
achieved using compiler options such as -xF. Earlier compilers only provided for the
assignment of functions to their own sections. Newer compilers have extended the
-xF syntax to assign data variables to their own sections. Earlier compilers required
C++ exception handling to be disabled when using -xF. This restriction has
been dropped with later compilers.

If all allocatable sections from a relocatable object can be eliminated, the
entire file is discarded from the link-edit.

In addition to input file elimination, the link-editor also identifies unused dependencies.
A dependency is deemed unused if the dependency is not bound to
by the object being produced. An object can be built with the -z ignore
option to eliminate the recording of unused dependencies.

The -z ignore option applies only to the files that follow the option on
the link-edit command line. The -z ignore option is cancelled with -z record.

Maximizing Shareability

As mentioned in Underlying System, only a shared object's text segment is shared
by all processes that use the object. The object's data segment typically is
not shared. Each process using a shared object generates a private memory
copy of its entire data segment as data items within the segment
are written to. Reduce the data segment, either by moving data elements that
are never written into the text segment, or by removing the data
items completely.

The following sections describe several mechanisms that can be used to reduce
the size of the data segment.

Move Read-Only Data to Text

Data elements that are read-only should be moved into the text segment using
const declarations. For example, the following character string resides in the .data
section, which is part of the writable data segment.

char *rdstr = "this is a read-only string";

In contrast, the following character string resides in the .rodata section, which
is the read-only data section contained within the text segment.

const char *rdstr = "this is a read-only string";

Reducing the data segment by moving read-only elements into the text segment
is admirable. However, moving data elements that require relocations can be counterproductive.
For example, examine the following array of strings.

This definition ensures that the strings and the array of pointers to
these strings are placed in a .rodata section. Unfortunately, although the user
perceives the array of addresses as read-only, these addresses must be relocated at
runtime. This definition therefore results in the creation of text relocations. Representing
the array as:

const char *rdstrs[] = { ..... };

ensures the array pointers are maintained in the writable data segment where
they can be relocated. The array strings are maintained in the read-only
text segment.
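
The two forms can be set side by side in C. The first declaration is an assumed example of a fully const definition of the kind described above, not a reproduction of the original listing. The name rdstrs_all_const and the accessor rdstr_at are hypothetical, and actual section placement depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pointers and strings are both const. A compiler may place the whole
 * array in read-only data, yet each element is an address that must
 * be relocated when the shared object is loaded.
 */
const char *const rdstrs_all_const[] = {
    "this is a read-only string",
    "this is another read-only string"
};

/*
 * Only the strings are const. The pointers stay in writable data,
 * where they can be relocated without touching the text segment.
 */
const char *rdstrs[] = {
    "this is a read-only string",
    "this is another read-only string"
};

#define NUM_RDSTRS (sizeof (rdstrs) / sizeof (rdstrs[0]))

/* Bounds-checked accessor, present only to exercise the array. */
const char *
rdstr_at(size_t i)
{
    assert(i < NUM_RDSTRS);
    return rdstrs[i];
}
```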

Note - Some compilers, when generating position-independent code, can detect read-only assignments that result in
runtime relocations. These compilers arrange for placing such items in writable segments.
For example, .picdata.

Collapse Multiply-Defined Data

Data can be reduced by collapsing multiply-defined data. A program with multiple
occurrences of the same error messages can be better off defining
one global datum and having all other instances reference it. For example.
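
A minimal sketch of this approach follows. The message text and the function names foo_report and bar_report are hypothetical. Both functions format into a caller-supplied buffer so that the shared definition is the only copy of the string.

```c
#include <assert.h>
#include <stdio.h>

/* One definition of the message, referenced by every caller. */
static const char Errmsg[] = "prog: error encountered: %d";

/* Formats the shared message on behalf of one caller. */
int
foo_report(char *buf, size_t len, int error)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, Errmsg, error);
}

/* A second caller reuses the same datum instead of its own literal. */
int
bar_report(char *buf, size_t len, int error)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, Errmsg, error);
}
```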

The main candidates for this sort of data reduction are strings. String
usage in a shared object can be investigated using strings(1). The following example generates
a sorted list of the data strings within the file libfoo.so.1. Each
entry in the list is prefixed with the number of occurrences of
the string.

$ strings -10 libfoo.so.1 | sort | uniq -c | sort -rn

Use Automatic Variables

Permanent storage for data items can be removed entirely if the associated
functionality can be designed to use automatic (stack) variables. Any removal of
permanent storage usually results in a corresponding reduction in the number of runtime
relocations required.
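
As a small illustration, the hypothetical function below needs a scratch buffer only for the duration of one call, so the buffer is declared as an automatic variable rather than with permanent storage.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Returns the number of characters needed to print value in decimal.
 * The scratch buffer is automatic: it exists only for this call and
 * reserves no space in the object's data segment.
 */
int
decimal_width(int value)
{
    char scratch[32];
    int n = snprintf(scratch, sizeof (scratch), "%d", value);

    assert(n > 0 && (size_t)n < sizeof (scratch));
    return n;
}
```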

Allocate Buffers Dynamically

Large data buffers should usually be allocated dynamically rather than being defined
using permanent storage. Often this results in an overall saving in memory,
as only those buffers needed by the present invocation of an application are
allocated. Dynamic allocation also provides greater flexibility by enabling the buffer's size
to change without affecting compatibility.
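
One way to follow this advice is sketched below. The work_buffer type and its functions are hypothetical names chosen for the illustration. The buffer size is chosen by the caller at runtime and is not fixed in the object.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical handle for a buffer whose size is decided at runtime. */
struct work_buffer {
    char *data;
    size_t size;
};

/* Allocates a zero-filled buffer. Returns 0 on success, -1 on failure. */
int
work_buffer_init(struct work_buffer *wb, size_t size)
{
    assert(wb != NULL && size > 0);

    wb->data = calloc(size, 1);
    if (wb->data == NULL) {
        wb->size = 0;
        return -1;
    }
    wb->size = size;
    return 0;
}

/* Releases the buffer and leaves the handle in a known empty state. */
void
work_buffer_free(struct work_buffer *wb)
{
    assert(wb != NULL);

    free(wb->data);
    wb->data = NULL;
    wb->size = 0;
}
```

An invocation that never calls work_buffer_init never pays for the buffer, unlike a buffer defined with permanent storage.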

Minimizing Paging Activity

Any process that accesses a new page causes a page fault, which
is an expensive operation. Because shared objects can be used by many
processes, any reduction in the number of page faults that are generated by
accessing a shared object can benefit the process and the system as
a whole.

Organizing frequently used routines and their data to an adjacent set of
pages frequently improves performance because it improves the locality of reference. When
a process calls one of these functions, the function might already be in
memory because of its proximity to the other frequently used functions. Similarly,
grouping interrelated functions improves locality of reference. For example, if every call to
the function foo() results in a call to the function bar(), place
these functions on the same page. Tools like cflow(1), tcov(1), prof(1) and
gprof(1) are useful in determining code coverage and profiling.

Isolate related functionality to its own shared object. The standard C library
has historically been built containing many unrelated functions. Only rarely, for example,
will any single executable use everything in this library. Because of widespread use,
determining which set of functions is really the most frequently used is
also somewhat difficult. In contrast, when designing a shared object from scratch, maintain
only related functions within the shared object. This improves locality of reference
and has the side effect of reducing the object's overall size.

Relocations

In Relocation Processing, the mechanisms by which the runtime linker relocates dynamic executables and
shared objects to create a runnable process were covered. Relocation Symbol Lookup and When Relocations Are Performed
categorized this relocation processing into two areas to simplify and help illustrate
the mechanisms involved. These same two categorizations are also ideally suited for considering
the performance impact of relocations.

Symbol Lookup

When the runtime linker needs to look up a symbol, by default
it does so by searching in each object. The runtime linker starts
with the dynamic executable, and progresses through each shared object in the same
order that the objects are loaded. In many instances, the shared object
that requires a symbolic relocation turns out to be the provider of
the symbol definition.

In this situation, if the symbol used for this relocation is not
required as part of the shared object's interface, then this symbol is
a strong candidate for conversion to a static or automatic variable. A
symbol reduction can also be applied to remove symbols from a shared
object's interface. See Reducing Symbol Scope for more details. By making these conversions, the link-editor
incurs the expense of processing any symbolic relocation against these symbols during
the shared object's creation.
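
The conversion to static can be illustrated as follows. All names here are hypothetical. Only foo_open is intended as part of the object's interface, so the state and helper it relies on are given file scope.

```c
#include <assert.h>

/* Internal state: file scope only, so no global symbol is created. */
static int next_handle = 1;

/*
 * Internal helper: static, so references to it are resolved when the
 * object is built and require no symbol lookup at runtime.
 */
static int
allocate_handle(void)
{
    return next_handle++;
}

/* Hypothetical public interface of the shared object. */
int
foo_open(void)
{
    int handle = allocate_handle();

    assert(handle > 0);
    return handle;
}
```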

The only global data items that should be visible from a shared
object are those that contribute to its user interface. Historically this has
been a hard goal to accomplish, because global data are often defined to
allow reference from two or more functions located in different source files.
By applying symbol reduction, unnecessary global symbols can be removed. See Reducing Symbol Scope.
Any reduction in the number of global symbols exported from a shared
object results in lower relocation costs and an overall performance improvement.

The use of direct bindings can also significantly reduce the symbol lookup
overhead within a dynamic process that has many symbolic relocations and many dependencies.
See Appendix D, Direct Bindings.

When Relocations Are Performed

All immediate reference relocations must be carried out during process initialization before
the application gains control. However, any lazy reference relocations can be deferred
until the first instance of a function being called. Immediate relocations typically result
from data references. Therefore, reducing the number of data references also reduces
the runtime initialization of a process.

Initialization relocation costs can also be deferred by converting data references into
function references. For example, you can return data items by a functional
interface. This conversion usually results in a perceived performance improvement because the initialization
relocation costs are effectively spread throughout the process's execution. Some of the
functional interfaces might never be called by a particular invocation of a process,
thus removing their relocation overhead altogether.

The advantage of using a functional interface can be seen in the
section, Copy Relocations. This section examines a special, and somewhat expensive, relocation mechanism employed
between dynamic executables and shared objects. It also provides an example of
how this relocation overhead can be avoided.

Combined Relocation Sections

The relocation sections within relocatable objects are typically maintained in a one-to-one
relationship with the sections to which the relocations must be applied. However,
when the link-editor creates an executable or shared object, all but the procedure
linkage table relocations are placed into a single common section named .SUNW_reloc.

Combining relocation records in this manner enables all RELATIVE relocations to be grouped
together. All symbolic relocations are sorted by symbol name. The grouping of
RELATIVE relocations permits optimized runtime processing using the DT_RELACOUNT/DT_RELCOUNT .dynamic entries. Sorted
symbolic entries help reduce runtime symbol lookup.
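
The ordering described above can be sketched with a sort. This is an illustration only: struct reloc is a hypothetical, simplified record, and the function does not reflect how the link-editor is actually implemented. The returned count plays the role that the text assigns to DT_RELACOUNT or DT_RELCOUNT.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified relocation record. */
struct reloc {
    int is_relative;      /* nonzero for a RELATIVE relocation */
    const char *symbol;   /* symbol name; unused for RELATIVE entries */
};

/* RELATIVE entries first, then symbolic entries ordered by name. */
static int
reloc_cmp(const void *a, const void *b)
{
    const struct reloc *ra = a;
    const struct reloc *rb = b;

    if (!ra->is_relative != !rb->is_relative)
        return ra->is_relative ? -1 : 1;
    if (ra->is_relative)
        return 0;
    return strcmp(ra->symbol, rb->symbol);
}

/* Sorts the records and returns the number of leading RELATIVE entries. */
size_t
combine_relocs(struct reloc *relocs, size_t n)
{
    size_t count = 0;

    if (n == 0)
        return 0;
    assert(relocs != NULL);

    qsort(relocs, n, sizeof (*relocs), reloc_cmp);
    while (count < n && relocs[count].is_relative)
        count++;
    return count;
}
```

Once sorted, the leading RELATIVE entries can be processed in one pass without inspecting symbols, and relocations against the same symbol sit next to each other.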

Copy Relocations

Shared objects are usually built with position-independent code. References to external data items
from code of this type employ indirect addressing through a set of
tables. See Position-Independent Code for more details. These tables are updated at runtime with
the real address of the data items. These updated tables enable access
to the data without the code itself being modified.

Dynamic executables, however, are generally not created from position-independent code. Any references
to external data they make can seemingly only be achieved at runtime
by modifying the code that makes the reference. Modifying a read-only text segment
is to be avoided. The copy relocation technique can resolve such a reference without modifying the text.

Suppose the link-editor is used to create a dynamic executable, and a reference
to a data item is found to reside in one of the
dependent shared objects. Space is allocated in the dynamic executable's .bss, equivalent
in size to the data item found in the shared object. This space
is also assigned the same symbolic name as defined in the shared
object. Along with this data allocation, the link-editor generates a special copy
relocation record that instructs the runtime linker to copy the data from the
shared object to the allocated space within the dynamic executable.

Because the symbol assigned to this space is global, it is used
to satisfy any references from any shared objects. The dynamic executable inherits
the data item. Any other objects within the process that make reference to
this item are bound to this copy. The original data from which
the copy is made effectively becomes unused.

The following example of this mechanism uses an array of system error
messages that is maintained within the standard C library. In previous SunOS
operating system releases, the interface to this information was provided by two global
variables, sys_errlist[], and sys_nerr. The first variable provided the array of error
message strings, while the second conveyed the size of the array itself. These
variables were commonly used within an application in the following manner.

The link-editor has allocated space in the dynamic executable's .bss to receive
the data represented by sys_errlist and sys_nerr. These data are copied from the
C library by the runtime linker at process initialization. Thus, each application
that uses these data gets a private copy of the data in
its own data segment.

There are two drawbacks to this technique. First, each application pays a
performance penalty for the overhead of copying the data at runtime. Second,
the size of the data array sys_errlist has now become part of
the C library's interface. Suppose the size of this array were to
change, perhaps as new error messages are added. Any dynamic executables that
reference this array have to undergo a new link-edit to be able to
access any of the new error messages. Without this new link-edit, the
allocated space within the dynamic executable is insufficient to hold the new data.

These drawbacks can be eliminated if the data required by a dynamic
executable are provided by a functional interface. The ANSI C function strerror(3C)
returns a pointer to the appropriate error string, based on the error number
supplied to it. One implementation of this function might be:
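
A sketch of such an implementation follows. It is not the C library's source: the function is named my_strerror to avoid clashing with the real strerror(3C), and the three-entry table is a shortened, illustrative stand-in for the full message list.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative table only; local to this file and not exported. */
static const char *errlist[] = {
    "Error 0",
    "Not owner",
    "No such file or directory"
};
static const int nerr = sizeof (errlist) / sizeof (errlist[0]);

/*
 * Returns the message for errnum, or NULL if errnum is out of range.
 * Callers see only this function, never the table or its size.
 */
const char *
my_strerror(int errnum)
{
    assert(nerr > 0);

    if (errnum < 0 || errnum >= nerr)
        return NULL;
    return errlist[errnum];
}
```

Because the table and its size are static, messages can be added without changing anything that a dynamic executable has recorded.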

The error routine in foo.c can now be simplified to use this
functional interface. This simplification in turn removes any need to perform the
original copy relocations at process initialization.

Additionally, because the data are now local to the shared object, the
data are no longer part of its interface. The shared object therefore
has the flexibility of changing the data without adversely affecting any dynamic executables
that use it. Eliminating data items from a shared object's interface generally
improves performance while making the shared object's interface and code easier to maintain.

ldd(1), when used with either the -d or -r options, can verify any
copy relocations that exist within a dynamic executable.

For example, suppose the dynamic executable prog had originally been built against
the shared object libfoo.so.1 and the following two copy relocations had been recorded.

ldd(1) shows that the dynamic executable will copy as much data as
the shared object has to offer, but only accepts as much as
its allocated space allows.
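
The effect of a size mismatch can be modeled as a bounded copy. The function below is a hypothetical simulation of the behavior just described, not the runtime linker's code: the amount copied is limited both by what the shared object offers and by the space that was allocated in the dynamic executable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies the shared object's data into the space reserved in the
 * dynamic executable. Returns the number of bytes actually copied.
 */
size_t
apply_copy_reloc(void *exec_space, size_t exec_size,
    const void *lib_data, size_t lib_size)
{
    assert(exec_space != NULL && lib_data != NULL);

    size_t n = lib_size < exec_size ? lib_size : exec_size;
    memcpy(exec_space, lib_data, n);
    return n;
}
```

If the shared object's data item has grown since the executable was built, the bytes beyond exec_size never reach the executable's copy.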

Copy relocations can be eliminated by building the application from position-independent code.
See Position-Independent Code.

Using the -B symbolic Option

The link-editor's -B symbolic option enables you to bind symbol references to their global
definitions within a shared object. This option is historic, in that it
was designed for use in creating the runtime linker itself.

Defining an object's interface and reducing non-public symbols to local is preferable
to using the -B symbolic option. See Reducing Symbol Scope. Using -B symbolic can often result
in some non-intuitive side effects.

If a symbolically bound symbol is interposed upon, then references to the
symbol from outside of the symbolically bound object bind to the interposer.
The object itself is already bound internally. Essentially, two symbols with the same
name are now being referenced from within the process. A symbolically bound
data symbol that results in a copy relocation creates the same interposition
situation. See Copy Relocations.

Note - Symbolically bound shared objects are identified by the .dynamic flag DF_SYMBOLIC. This
flag is informational only. The runtime linker processes symbol lookups from these objects
in the same manner as any other object. Any symbolic binding is
assumed to have been created at the link-edit phase.

Profiling Shared Objects

The runtime linker can generate profiling information for any shared objects that are
processed during the running of an application. The runtime linker is responsible
for binding shared objects to an application and is therefore able to
intercept any global function bindings. These bindings take place through .plt entries. See
When Relocations Are Performed for details of this mechanism.

The LD_PROFILE environment variable specifies the name of a shared object to
profile. You can analyze a single shared object using this environment variable.
The setting of the environment variable can be used to analyze the use
of the shared object by one or more applications. In the following
example, the use of libc by the single invocation of the command ls(1)
is analyzed.

$ LD_PROFILE=libc.so.1 ls -l

In the following example, the environment variable setting is recorded in a configuration
file. This setting causes any application's use of libc to accumulate the
analyzed information.

# crle -e LD_PROFILE=libc.so.1
$ ls -l
$ make
$ ...

When profiling is enabled, a profile data file is created, if it
does not already exist. The file is mapped by the runtime linker.
In the previous examples, this data file is /var/tmp/libc.so.1.profile. 64–bit libraries require an
extended profile format and are written using the .profilex suffix. You can
also specify an alternative directory to store the profile data using the LD_PROFILE_OUTPUT
environment variable.

This profile data file is used to deposit profil(2) data and call
count information related to the use of the specified shared object. This
profiled data can be directly examined with gprof(1).

Note - gprof(1) is most commonly used to analyze the gmon.out profile data created
by an executable that has been compiled with the -xpg option of cc(1).
The runtime linker's profile analysis does not require any code to be
compiled with this option. Applications whose dependent shared objects are being profiled should
not make calls to profil(2), because this system call does not provide
for multiple invocations within the same process. For the same reason, these
applications must not be compiled with the -xpg option of cc(1). This compiler-generated
mechanism of profiling is also built on top of profil(2).

One of the most powerful features of this profiling mechanism is to
enable the analysis of a shared object as used by multiple applications.
Frequently, profiling analysis is carried out using one or two applications. However, a
shared object, by its very nature, can be used by a multitude
of applications. Analyzing how these applications use the shared object can offer
insights into where energy might be spent to improve the overall performance of
the shared object.

The following example shows a performance analysis of libc over the building
of several applications within a source hierarchy.

The special name <external> indicates a reference from outside of the address
range of the shared object being profiled. Thus, in the previous example,
1634 calls to the function open(2) within libc occurred from the dynamic executables, or
from other shared objects, bound with libc while the profiling analysis was
in progress.

Note - The profiling of shared objects is multithread safe, except in the case
where one thread calls fork(2) while another thread is updating the profile
data information. The use of fork1(2) removes this restriction.