[Sorry for the 3-way crosspost!]
One of the big holes in the MIPS ABI has always been the lack of support
for non-PIC executables. Any call that might be to a DSO must be made
indirectly via $25, and any data that might be defined in a DSO must
be accessed via the GOT. MIPS has no PLTs or copy relocations.
There has been talk of changing this at various times over the years.
In true bus style, nothing much happened for a long time, then two
implementations came along at once. I implemented non-PIC support for
Specifix, as part of a more general project to allow MIPS16 code to be
used on GNU/Linux. At the same time, CodeSourcery implemented it for
Sourcery G++. I only found out about CS's version recently, after
finishing the Specifix one, and I think the same is true in reverse.
Oh well!
I suppose the good news is that we can pick the best bits of each
implementation as the official one. I'll describe my implementation
below, then compare it to what I understand CS's version to be.
CS folks: please correct me if I'm wrong. Dan said that he'd be
submitting CS's version too.
First of all, I should emphasise that this is intended to be a pure
ABI extension. Existing objects should continue to work, and existing
ET_REL objects should be link-compatible with the new objects.
Example
-------
To take a concrete example, suppose an executable has code like this:
extern void foo (int);
extern int x;
void bar (void) { foo (x); }
The compiler has no information about where "foo" and "x" are defined,
so for safety's sake, it must assume that they might be defined in a
shared library. We are currently forced to generate code like this:
bar:
.set noreorder
lui $28,%hi(__gnu_local_gp)
addiu $28,$28,%lo(__gnu_local_gp)
lw $25,%call16(foo)($gp)
jr $25
lw $4,%got(x)($gp)
.set reorder
(This is the "-mno-shared" form. The "-mshared" version would replace
the first two instructions with ".cpload $25" and expect $25 to be valid
on entry.)
This is very inefficient if "x" and "foo" turn out to be defined in the
executable itself. In contrast, the non-PIC implementation of "bar"
is simply:
bar:
.set noreorder
lui $4,%hi(x)
j foo
lw $4,%lo(x)($4)
.set reorder
So the aim is to allow this non-PIC version of "bar" to be used in
dynamic executables. It is the static linker's responsibility to
ensure that "bar" works when:
- "x" is defined by a shared library
- "foo" is a PIC function in the same executable
- "foo" is defined by a shared library
It needs help from the dynamic linker to handle the first and third
cases.
Copy relocations
----------------
As on most other SVR4 targets, we want to use dynamic relocations
for full-word references like:
.data
.word x
However, we want to use copy relocations if the reference is in a
read-only section, such as:
.set mips16
lw $2,1f
jr $31
.align 2
1: .word x
or if the reference is not a full-word one:
lui $2,%hi(x)
addiu $2,$2,%lo(x)
Fortunately, VxWorks has already allocated an R_MIPS_COPY relocation type.
We can simply extend it to GNU objects as well.
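As a rough sketch of the rule above, for a single reference to data that
is defined in a shared library. The enum and function names are made up
for illustration and are not taken from the patches; a real linker would
make the final decision per symbol rather than per reference.

```c
#include <stdbool.h>

/* Illustrative names only; not from any real linker source. */
enum data_ref_strategy {
    USE_DYNAMIC_RELOC,  /* keep a full-word dynamic relocation */
    USE_COPY_RELOC      /* fall back to R_MIPS_COPY */
};

/* Pick a strategy for one reference to DSO-defined data.
   A dynamic relocation is only usable when the reference is a
   full word and the dynamic linker is able to write to it.  */
enum data_ref_strategy
choose_data_ref_strategy(bool is_full_word, bool in_read_only_section)
{
    if (is_full_word && !in_read_only_section)
        return USE_DYNAMIC_RELOC;
    return USE_COPY_RELOC;
}
```

So ".word x" in .data keeps its dynamic relocation, while the MIPS16
literal-pool and %hi/%lo examples both end up needing a copy relocation.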
PLTs
----
MIPS has traditionally not used PLTs. Instead, it has a special
form of lazy binding stub that is local to the object; unlike a
PLT entry, this stub does not participate in name lookup.
These stubs can only be used when all references are through
R_MIPS*_CALL* relocations. The GOT slot starts off pointing
at the stub, then the stub redirects it to the real function.
References like:
.word f
prevent lazy binding.
In contrast, PLTs would allow function references of the forms:
j f
jal f
lui $2,%hi(f)
addiu $2,$2,%lo(f)
They would also allow references of the form:
.word f
to be lazily bound.
Adding PLTs gives us three possible ways of referring to an
externally-defined function:
- a full-word dynamic relocation (R_MIPS_REL32, possibly combined
with R_MIPS_64)
These relocations can only be used if all references are
32-bit ones. They prevent lazy binding, so we only use
them for traditional PIC objects.
- a traditional MIPS lazy-binding stub
These stubs can only be used if all references to the function
are through R_MIPS*_CALL* relocations. However, they are the
most efficient way of handling that situation. Once the
function has been resolved, all calls go directly to the
real target.
- a PLT stub
PLTs are the fallback, and provide a second, more general,
form of lazy binding.
As before, we can appropriate the VxWorks R_MIPS_JUMP_SLOT relocation
and use it for GNU objects too.
Many (but not all) targets put .got.plt in the main .got section.
I don't think it makes sense to do this for MIPS. We never use
$gp-relative accesses for .got.plt, so putting it in .got would
steal valuable room in the primary GOT.
DT_JMPREL, DT_PLTREL and DT_PLTSZ describe .rel.plt in the usual way.
Objects without PLTs do not have these tags.
At the moment, there are two sorts of GOT header:
- When the top bit of _GLOBAL_OFFSET_TABLE_[1] is clear:
  - _GLOBAL_OFFSET_TABLE_[0] points to the resolver for the
    traditional lazy-binding stubs.
  - _GLOBAL_OFFSET_TABLE_[1] is the first local or global
    GOT entry.
  This is the traditional SVR4 GOT. glibc still supports it.
- When the top bit of _GLOBAL_OFFSET_TABLE_[1] is set:
  - _GLOBAL_OFFSET_TABLE_[0] points to the resolver for the
    traditional lazy-binding stubs.
  - _GLOBAL_OFFSET_TABLE_[1] is a module pointer.
  - _GLOBAL_OFFSET_TABLE_[2] is the first local or global
    GOT entry.
  This is a GNU extension that the linker has used for a long time.
  glibc keeps the top bit of _GLOBAL_OFFSET_TABLE_[1] set,
  but uClibc does not.
We need a further GOT entry for resolving PLTs, so the obvious thing
is to reserve _GLOBAL_OFFSET_TABLE_[2]. There are then three GOT layouts:
- When the top bit of _GLOBAL_OFFSET_TABLE_[1] is clear:
  Layout as before.
- When the top bit of _GLOBAL_OFFSET_TABLE_[1] is set and
  there is no DT_JMPREL tag:
  Layout as before.
- When the top bit of _GLOBAL_OFFSET_TABLE_[1] is set and
  there is a DT_JMPREL tag:
  - _GLOBAL_OFFSET_TABLE_[0] points to the resolver for the
    traditional lazy-binding stubs.
  - _GLOBAL_OFFSET_TABLE_[1] is a module pointer.
  - _GLOBAL_OFFSET_TABLE_[2] points to the PLT resolver.
  - _GLOBAL_OFFSET_TABLE_[3] is the first local or global
    GOT entry.
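To make the three layouts concrete, here is a small sketch of how a
dynamic linker might find the first local or global GOT entry. It
assumes 32-bit GOT words (o32), and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return the index of the first local or global GOT entry.
   got1 is the value stored in _GLOBAL_OFFSET_TABLE_[1];
   has_jmprel says whether the object has a DT_JMPREL tag.
   Assumes 32-bit GOT words.  */
unsigned int
first_real_got_index(uint32_t got1, bool has_jmprel)
{
    bool gnu_header = (got1 & UINT32_C(0x80000000)) != 0;

    if (!gnu_header)
        return 1;               /* traditional SVR4 header */
    return has_jmprel ? 3 : 2;  /* GNU header, with or without PLTs */
}
```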
The PLT resolver needs to obtain two bits of information:
- the module pointer
- the target function's index in .got.plt/.rel.plt
ARM is another target that places .got.plt separately from .got.
Loosely following its example, I used this resolver interface:
$14 : the start of .got.plt
$15 : &_GLOBAL_OFFSET_TABLE_[2]
$24 : the .got.plt entry for the target function
Thus "$15 - sizeof (void *)" points to the module pointer and
"($24 - $14) / sizeof (void *)" is the PLT index.
Note that I chose to pass $24 instead of the relocation index itself
because we can then use a 4-instruction PLT entry without imposing
any limit on the _number_ of PLT entries.
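For illustration, the arithmetic the resolver has to do with these
registers looks roughly like this. The sketch assumes o32, so addresses
and GOT slots are both 4 bytes; the function names are made up.

```c
#include <stdint.h>

/* Size of one GOT slot; 4 bytes is an assumption that holds for o32. */
#define GOT_SLOT_SIZE 4u

/* Address of the slot that holds the module pointer, given $15
   (which is &_GLOBAL_OFFSET_TABLE_[2]).  */
uint32_t
module_pointer_slot(uint32_t reg15)
{
    return reg15 - GOT_SLOT_SIZE;
}

/* PLT index, given $14 (start of .got.plt) and $24 (the .got.plt
   entry for the target function).  */
uint32_t
plt_index(uint32_t reg14, uint32_t reg24)
{
    return (reg24 - reg14) / GOT_SLOT_SIZE;
}
```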
Although the dynamic linker could work out the executable's module
pointer without $15, I thought it was better to have an interface
that would work for shared libraries too, in case we ever do want to
use PLTs for shared libraries in future.
The PLT entry for a function "f" is:
lui $24,%hi(.got.plt slot for f)
lw $25,%lo(.got.plt slot for f)($24)
jr $25
addiu $24,$24,%lo(.got.plt slot for f)
The PLT header itself is:
lui $15,%hi(&_GLOBAL_OFFSET_TABLE_[2])
lw $25,%lo(&_GLOBAL_OFFSET_TABLE_[2])($15)
addiu $15,$15,%lo(&_GLOBAL_OFFSET_TABLE_[2])
lui $14,%hi(.got.plt)
jr $25
addiu $14,$14,%lo(.got.plt)
The header is followed by 8 bytes of padding so that each PLT entry
is 16-byte aligned.
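As a sketch of what the static linker has to emit for each PLT entry,
the following fills in the four instruction words for a given .got.plt
slot address. It assumes 32-bit addresses; the helper names are
hypothetical, and the opcodes are just the standard MIPS32 encodings of
the four instructions shown for the PLT entry.

```c
#include <stdint.h>

/* %hi(): the upper 16 bits, rounded up when bit 15 is set because
   the matching %lo() value is sign-extended by lw and addiu.  */
static uint32_t
hi16(uint32_t addr)
{
    return ((addr + 0x8000u) >> 16) & 0xffffu;
}

/* %lo(): the low 16 bits.  */
static uint32_t
lo16(uint32_t addr)
{
    return addr & 0xffffu;
}

/* Fill insn[0..3] with the PLT entry for the .got.plt slot at SLOT.  */
void
encode_plt_entry(uint32_t slot, uint32_t insn[4])
{
    insn[0] = 0x3c180000u | hi16(slot);  /* lui   $24,%hi(slot)      */
    insn[1] = 0x8f190000u | lo16(slot);  /* lw    $25,%lo(slot)($24) */
    insn[2] = 0x03200008u;               /* jr    $25                */
    insn[3] = 0x27180000u | lo16(slot);  /* addiu $24,$24,%lo(slot)  */
}
```

Four 4-byte instructions give the 16-byte entry size mentioned above.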
This PLT entry is deliberately not compatible with MIPS I. I thought
that fitting the PLT entry into a 16-byte cache line was more important
than supporting such an obsolete ISA level. Hopefully anyone who still
uses MIPS I won't mind sticking to the traditional scheme. (Hi Maciej!)
As on other SVR4 targets, PLT entries have type STT_FUNC and belong
to SHN_UNDEF. Unfortunately, as Nigel Stephens pointed out when
discussing this a while ago with Dan Jacobowitz and me, this clashes
with the traditional MIPS lazy-binding stubs, which also use the
STT_FUNC/SHN_UNDEF combination. I followed Nigel's suggestion of
adding an STO_MIPS_PLT st_other value to distinguish PLTs from
traditional stubs.
Linking PIC and non-PIC in the same object
------------------------------------------
Most targets allow PIC and non-PIC to be linked together, and it would
be awkward if MIPS didn't. This means that the static linker has to
cope with things like:
a.s (non-PIC):
jal foo
b.s (PIC):
foo:
.cpload $25
...
We can handle this situation in two ways. If the target function
"foo" starts a section, and the section is not too heavily-aligned,
we can insert:
lui $25,%hi(1f)
addiu $25,$25,%lo(1f)
1:
immediately before it. This code goes in a new section and is padded
with leading nops if necessary. "foo" then resolves to the "lui"
instruction, so that all references to "foo" have the same address.
I think this is an important optimisation. In practice, most uses
of PIC in executables will come from static libraries, which usually
have one function per section.
However, if "foo" doesn't start a section, the linker must create
a separate trampoline of the form:
foo:
lui $25,%hi(.pic.foo)
j .pic.foo
addiu $25,$25,%lo(.pic.foo)
where .pic.foo is the original PIC form of "foo". Again, "foo"
resolves to this trampoline, so that all references to "foo" have the
same address.
These trampolines all go in a separate section at the beginning of .text.
They are padded with a nop so that each one is aligned to 16 bytes.
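For illustration, a separate trampoline could be encoded like this.
This is only a sketch: it assumes 32-bit addresses, the function name
is made up, and it does not check that the target is within reach of
the "j" instruction.

```c
#include <stdint.h>

/* Fill insn[0..3] with a 16-byte trampoline that loads TARGET into
   $25 and jumps to it.  TARGET is the address of the original PIC
   function (".pic.foo" in the text).  Note that "j" can only reach
   targets in the same 256MB region as its delay slot; this sketch
   does not verify that.  */
void
encode_pic_trampoline(uint32_t target, uint32_t insn[4])
{
    uint32_t hi = ((target + 0x8000u) >> 16) & 0xffffu;  /* %hi(target) */
    uint32_t lo = target & 0xffffu;                      /* %lo(target) */

    insn[0] = 0x3c190000u | hi;                           /* lui   $25,%hi(target)     */
    insn[1] = 0x08000000u | ((target >> 2) & 0x03ffffffu); /* j     target              */
    insn[2] = 0x27390000u | lo;                           /* addiu $25,$25,%lo(target) */
    insn[3] = 0x00000000u;                                /* nop (padding to 16 bytes) */
}
```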
ld -r
-----
It should be possible to use "ld -r" to link PIC and non-PIC together
into a relocatable object. The result is clearly a non-PIC object,
so what do we do with PIC functions? One option would be to add
"la $25" prefixes or trampolines to all of them, but that would be
inefficient.
I thought it would be better to mark PIC functions with a new st_other value,
STO_MIPS_PIC. This allows the final link to distinguish between PIC and
non-PIC functions in the same input file.
n32 and n64 GP-load sequences
-----------------------------
n32 and n64 use the idiom:
lui $28,%hi(%neg(%gp_rel(foo)))
addiu $28,$28,%lo(%neg(%gp_rel(foo)))
addu $28,$28,$25
to load the value of _gp. Such a reference to foo should not be
redirected to an "la $25" stub.
Choice of new STO_* values
--------------------------
For reasons I don't understand, STO_MIPS16 is defined as 0xf0,
taking up 4 of the 8 bits in st_other. Visibility accounts for
2 more. That gives us enough bits to treat STO_MIPS_PLT and
STO_MIPS_PIC as orthogonal to both MIPS16ness and visibility,
but we wouldn't have any room left over.
Fortunately, STO_MIPS_PLT, STO_MIPS_PIC and STO_MIPS16 are
mutually-exclusive, so we can simply reinterpret the top 4
bits of st_other as an enum. 0x0c would then be free for
future extensions.
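A sketch of how st_other would then be split up. Only the 0xf0 value
for STO_MIPS16 comes from the text above; the values used here for the
PLT and PIC kinds are placeholders picked purely for illustration, and
the helper names are made up.

```c
/* Bit fields of st_other under the "top 4 bits are an enum" scheme. */
#define STO_VISIBILITY_MASK  0x03u  /* ELF symbol visibility */
#define STO_SPARE_MASK       0x0cu  /* left free for future extensions */
#define STO_KIND_MASK        0xf0u  /* mutually-exclusive MIPS kinds */

#define STO_MIPS16           0xf0u  /* value given in the text */
#define EXAMPLE_STO_MIPS_PLT 0x10u  /* placeholder, NOT the real value */
#define EXAMPLE_STO_MIPS_PIC 0x20u  /* placeholder, NOT the real value */

/* Extract the MIPS-specific kind from st_other. */
unsigned int
sto_kind(unsigned char st_other)
{
    return st_other & STO_KIND_MASK;
}

/* Extract the visibility bits from st_other. */
unsigned int
sto_visibility(unsigned char st_other)
{
    return st_other & STO_VISIBILITY_MASK;
}
```

Because the kinds are compared as whole values rather than tested as
individual bits, a symbol can be MIPS16, PLT or PIC, but never two of
them at once.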
Identifying non-PIC relocatable objects
---------------------------------------
MIPS has two PICness flags: EF_MIPS_PIC and EF_MIPS_CPIC ("calls PIC").
We can therefore mark non-PIC abicalls objects as:
(flags & (EF_MIPS_PIC | EF_MIPS_CPIC)) == EF_MIPS_CPIC
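Spelled out as a self-contained check on an ELF header's e_flags field.
The two flag values are the standard ones from <elf.h>, repeated here so
that the sketch builds on its own; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard ELF header flag values for MIPS. */
#define EF_MIPS_PIC   0x00000002u
#define EF_MIPS_CPIC  0x00000004u

/* True if E_FLAGS describes a non-PIC abicalls object:
   CPIC is set but PIC is not.  */
bool
is_nonpic_abicalls(uint32_t e_flags)
{
    return (e_flags & (EF_MIPS_PIC | EF_MIPS_CPIC)) == EF_MIPS_CPIC;
}
```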
Assembler directives and command-line interface
-----------------------------------------------
The EF_MIPS_CPIC combination is generated by the assembler directives:
.abicalls
.option pic0
There is no command-line flag to select this mode, so I added one
called -call_nonpic. I'm not too tied to that name though.
GCC command-line interface
--------------------------
GCC 4.2 has an "-mno-shared" option. As its name implies, this option
can only be used to compile executables. It applies on top of
"-mabicalls" and allows GCC to use absolute references for things that
it can prove are defined by the executable itself. Functions compiled
with "-mno-shared" do not require $25 to be valid on entry, so the
compiler can also use direct jumps and calls to functions in the same
object file.
Of course, as in the example above, there are many cases in which
the compiler cannot prove that something is defined by the executable.
The non-PIC support is designed to plug that gap. So, from a conceptual
point of view, the new functionality is really a special "-mno-shared" mode;
it allows the compiler to use absolute references for all data except TLS.
I decided to add a new pair of GCC options, "-mgnu-plts" and
"-mno-gnu-plts", that apply on top of "-mno-shared". There are
then four basic forms of o32, n32 and n64 code:
(1) -mno-abicalls
For non-dynamic objects like the Linux kernel.
(2) -mabicalls -mno-shared -mgnu-plts
For dynamic executables. Code only uses the global offset
table for some models of thread-local storage; it uses absolute
accesses for everything else. If it has to call functions
indirectly, such as for:
void foo (void (*t) (void)) { t (); }
it continues to call through $25.
This combination requires support from both the static and
dynamic linkers.
(3) -mabicalls -mno-shared -mno-gnu-plts
For dynamic executables. Code uses absolute accesses for
objects that are defined in the executable itself. It uses
direct calls for functions in the same translation unit.
This combination requires support from the static linker.
It does not affect the ABI of the final executable.
(4) -mabicalls -mshared
For any dynamic object. This is the traditional SVR4 mode.
Note that shared library code must still be compiled with
"-fpic" or "-fPIC".
Having four types of executable is complicated, and is not the intended
user interface. From the user's perspective, GCC should have the
following target-independent interface:
- shared library code is compiled with "-fpic" or "-fPIC";
- position-independent executables are compiled with
"-fpie" or "-fPIE"; and
- for efficiency, position-dependent executables are compiled
without any of these "-f" flags.
In the last case, GCC should default to the most efficient executable
model available. Since 4.3, GCC's configure script automatically checks
whether the static linker supports "-mno-shared". It can therefore
choose between (3) and (4) without direct intervention.
However, GCC's configure script cannot know whether the dynamic linker
supports (2), so we really need a new configure option to choose between
"(2)" and "(3) or (4)". I therefore added "--with-gnu-plts" and
"--without-gnu-plts", where the latter is the default.
Non-dynamic executables like the Linux kernel remain a special case.
They must continue to be compiled with "-mno-abicalls".
Linker interface
----------------
The linker should use the new extensions when linking a non-PIC
CPIC executable, but not otherwise. It might be useful to forbid
the use of copy relocs and PLTs altogether, so I made -znocopyreloc
turn off the extensions.
Linker errors
-------------
Shared-library code must be compiled with "-fpic" or "-fPIC".
However, because MIPS compilers have traditionally used (4) as the
default executable mode, the sorts of failure you get by forgetting
"-fpic" or "-fPIC" have tended to be very subtle. For example,
suppose we have:
void foo (void) { ... }
void bar (void) { ... foo (); ... }
Without "-fpic" or "-fPIC", GCC will think it can inline foo() into
bar(). It will then be impossible for an executable (or for another
shared library) to override foo() properly.
The failure for other targets is more drastic, so most cross-target
build systems already do the right thing. However, some MIPS-specific
build systems might not.
Changing the default executable mode from "(4)" to "(3) or (2)"
makes the MIPS failure mode as drastic as it is for other targets.
Unfortunately, the MIPS linker has traditionally not checked for
accidental uses of absolute code in shared libraries; it would link
the following code as a shared library without any diagnostic at all:
lui $4,%hi(x)
addiu $4,$4,%lo(x)
The resulting DSO would treat "x" as having the value 0.
This seems too dangerous, so I made the linker complain about any
relocation that it cannot resolve itself and that it cannot implement
dynamically. The errors have the form:
non-dynamic relocations refer to dynamic symbol FOO
Comparison with the CS implementation
-------------------------------------
I think the main differences with CS's implementation are:
- CS treat .got.plt as part of .got. See above for why I think it
should be separate. Note that the PLT header is the same size for
both implementations, so the extra parameters don't cost much.
- CS PLT entries pass a PLT index rather than a .got.plt address.
This makes no difference for most objects, but a longer stub
is needed if there are more than 0x10000 PLT entries.
- I couldn't see any specific support for ld -r in the CS version.
- The CS version always uses separate "la $25" trampolines,
rather than adding instructions to the beginning of a function.
This is an implementation rather than an ABI detail though.
- CS support MIPS I, at the cost of using the start of the next
PLT entry as a delay slot instruction.
- STO_MIPS_PLT is separate from STO_MIPS16.
This comparison is based on 4.2-129 and I've probably got it wrong.
I'm not sure if CS's version supports n32 and n64, but adding
it wouldn't be a big issue.
Patches
-------
In case it helps the discussion, I've attached patches for binutils,
gcc and glibc. I'm not asking for approval though.
Each set of patches has prerequisites that haven't been applied yet.
I've therefore attached them in the form of a bzipped quilt.
The patches with .clog changelog files are the ones related to
non-PIC support.
The glibc patches are based on EGLIBC 2.6.
Richard