PowerPC assembly

Introduction to assembly on the PowerPC

Hollis Blanchard
Published on July 01, 2002

High-level languages offer great advantages in general by hiding many
mundane and repetitive details from programmers, allowing them to
concentrate on their goals. However, sometimes programmers must use a
lower-level language, such as when writing code that deals directly with
hardware or that is extremely performance sensitive. Assembly language is the
programming language closest to the hardware, which makes it a natural
last resort in such situations.

This article assumes a basic understanding of computer design (for example, you should know that a
processor has registers and can access memory) and of operating systems
(system calls, exceptions, process stacks). This article should be useful to PowerPC
programmers unfamiliar with assembly as well as programmers who already
know ia32 assembly and want to broaden their horizons.

Introduction to PowerPC

The PowerPC Architecture Specification, released in 1993, is a 64-bit
specification with a 32-bit subset. Almost all PowerPCs generally
available (with the exception of late-model IBM RS/6000 and all IBM
pSeries high-end servers) are 32-bit.

PowerPC processors have a wide range of implementations, from high-end server
CPUs such as the Power4 to the embedded CPU market (the Nintendo
Gamecube uses a PowerPC). PowerPC processors have a strong embedded presence because
of good performance, low power consumption, and low heat dissipation. The
embedded processors integrate I/O such as serial and Ethernet controllers,
and can otherwise differ significantly from the "desktop" CPUs. For
example, the 4xx series PowerPC processors lack floating point, and also use a
software-controlled TLB for memory management rather than the inverted
pagetable found in desktop chips.

PowerPC processors have 32 (32- or 64-bit) GPRs (General Purpose Registers) and
various others such as the PC (Program Counter, also called the IAR/Instruction
Address Register or NIP/Next Instruction Pointer), LR (link register), CR
(condition register), etc. Some PowerPC CPUs also have 32 64-bit FPRs
(floating point registers).

RISC

PowerPC architecture is an example of a RISC (Reduced Instruction Set Computing)
architecture. As a result:

All PowerPC instructions, including those on 64-bit implementations, have a fixed length of 32 bits.

The PowerPC processing model is to retrieve data from memory, manipulate it in registers, then store it back to memory. There are very few instructions (other than loads and stores) that manipulate memory directly.
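
As a minimal sketch of this load, modify, store model, consider the C function below. The name bump is hypothetical, and the PowerPC instructions in the comment are only what a ppc32 compiler might plausibly emit; real output depends on the compiler and its options.

```c
/* Increment the integer that counter points to.
 *
 * PowerPC has no "add to memory" instruction, so a compiler has to emit
 * something along these lines (register choice is illustrative):
 *
 *     lwz   9,0(3)     # load the word at the address in gpr3 into gpr9
 *     addi  9,9,1      # modify the value in a register
 *     stw   9,0(3)     # store the result back to memory
 *     blr              # return to the caller
 */
void bump(int *counter)
{
    *counter += 1;
}
```

An ia32 compiler, by contrast, is free to do the whole update with a single instruction that operates on memory.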

Application binary interfaces (ABIs)

Technically, a developer can use any GPR for anything. For example, there
is no "stack pointer register"; a programmer could use any register for that purpose.
In practice, it is useful to define a set of
conventions so that binary objects can interoperate with different
compilers and pre-written assembly code.

Calling conventions are determined by the ABI (Application Binary
Interface) used. ppc32 Linux and NetBSD implementations use the SVR4
(System V R4) ABI, but ppc64 Linux follows AIX and uses the PowerOpen ABI.
The ABI specifies which registers are considered volatile
(caller-save) and non-volatile (callee-save) when calling subroutines, and
a lot more.

Some concrete examples of behavior specified by the SVR4 ABI:

Since the PowerPC has so many GPRs (32 compared to ia32's 8), arguments are passed in registers starting with gpr3.

Registers gpr3 through gpr12 are volatile (caller-save) registers that (if necessary) must be saved before calling a subroutine and restored after returning.

Register gpr1 is used as the stack frame pointer.

Many of the SVR4 features are identical to the PowerOpen ABI, which
greatly aids interoperability.
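
To make the register conventions concrete, here is a small sketch. The function scale_add is hypothetical; the register assignments in the comment follow the SVR4 rules listed above, plus the fact (not stated above) that an integer result is returned in gpr3.

```c
/* Under the SVR4 ABI, integer arguments arrive in consecutive registers
 * starting with gpr3:
 *
 *     a -> gpr3,  b -> gpr4,  c -> gpr5
 *
 * The result is handed back to the caller in gpr3. Because gpr3 through
 * gpr12 are volatile, the caller must assume all of them may have been
 * overwritten by the time this function returns. */
int scale_add(int a, int b, int c)
{
    return a * b + c;
}
```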

When to use assembly

All the pros and cons listed in the "Assembly HOWTO" (see Related topics for a link) apply to PowerPC.

Machine-specific registers

Sometimes you must touch CPU registers that higher-level languages are
completely unaware of. This is especially true in the course of writing an
operating system. One simple example is assigning your code its own stack
-- on a PowerPC, you must set r1. A C compiler
will only increment or decrement r1, so if your
application is running directly on the hardware, you must set r1 before calling C code. Another example is an
operating system's exception handlers, which must carefully save and
restore state one register at a time until it's safe to call higher-level
code.

Nonetheless, when faced with a situation in which you must use
low-level hardware features, you should implement as little as possible in
assembly:

C code is portable and understood by a large number of developers; assembly code (especially PowerPC assembly) is not.

Higher-level code is frequently much easier to debug than assembly.

Higher-level code is by definition more expressive than assembly; in other words you can do more with less code (and in less time).

If you find yourself writing high-level constructs such as loops or C
structures in assembly, take a step back and consider if this could be
done more easily in another language. A general rule is to use just enough
assembly to allow you to use a higher-level language.

Optimization

One of the most common reasons people want to use assembly language is to
make a slow program run faster. But in these cases, assembly should be the
absolute last place you turn.

General advice on optimization is beyond the scope of this document,
but here are some places to start:

Profile: You must profile your code before starting any optimization work. Not only will this tell you where the hotspots are (they're frequently not where you expect!), it will also give you proof that you've sped anything up once you're done. Once you find hotspots, you can begin optimizing the high-level code (rather than attempting to rewrite it in assembly).

Algorithmic optimization: No matter how tight your assembly is, if you're using an n^4 algorithm, you're still going to be incredibly slow. Some other techniques you should try first include using a more appropriate data structure. If you iterate repeatedly over a linked list, think about using a hash table, binary tree, or whatever is appropriate for your application.

Your compiler can almost always do a much better job than you can at
writing assembly! Rather than attempting to rewrite high-level code in
assembly, make judicious use of optimization options such as -O3 and C directives like __inline__. The compiler is aware of tricks like
instruction scheduling, which considers the internals of the processor and
tries to keep all pipelines full at all times. That may involve moving
loads earlier in the instruction stream than required to keep the pipeline
from stalling as the CPU waits for memory accesses to catch up. Unless
you've been coding assembly for many years, these are tasks that most
people cannot correctly perform by hand.
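
A minimal sketch of that approach, assuming a GNU-compatible compiler. Both function names are hypothetical; the point is only that the helper stays readable C while the compiler decides how to expand and schedule it.

```c
/* Hypothetical hot-path helper. __inline__ is the GNU C spelling of
 * inline; it asks the compiler to consider expanding the body at each
 * call site instead of branching to a subroutine. */
static __inline__ int clamp_to_byte(int value)
{
    if (value < 0)
        return 0;
    if (value > 255)
        return 255;
    return value;
}

/* Sum the clamped samples. When built with -O3, the helper above is a
 * candidate for inlining into this loop. */
int sum_clamped(const int *samples, int count)
{
    int total = 0;
    for (int i = 0; i < count; i++)
        total += clamp_to_byte(samples[i]);
    return total;
}
```

Profiling before and after adding such a hint is the only way to know whether it helped.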

How to learn assembly

gcc is the best place to start learning assembly (for any architecture).
gcc -O3 -S file.c will produce file.s in gas-compilable format (gas is the
GNU Assembler). Open file.s in your favorite
editor and you can see the assembly output from your C code.
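
A deliberately tiny file.c is a good first experiment. The function below is hypothetical, and the instructions named in the comment are an expectation for a ppc32 target rather than captured compiler output.

```c
/* file.c: compile with
 *
 *     gcc -O3 -S file.c
 *
 * and read the resulting file.s. On ppc32 the function body is likely
 * to be little more than "addi 3,3,1" followed by "blr": the argument
 * arrives in gpr3, is incremented in place, and gpr3 is also where the
 * caller looks for the result. */
int add_one(int x)
{
    return x + 1;
}
```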

You'll probably see instructions you don't understand. You can look
them up in The PowerPC Architecture: A Specification for a New Family of RISC
Processors, 2nd. Ed and PowerPC Microprocessor Family: The Programming Environments for
32-bit Microprocessors (see Related topics for links to these documents). However, like
learning any (spoken) language, there are certain words that are important
and that you should know, and others that can be safely ignored until
you've figured out more important features of the code. A good example of
an important instruction is the branch family of instructions, such as
blr.

Assembly examples

Hello World -- ia32 assembly

Listing 1 is copied directly from the gas example in the Assembly
HOWTO, which unfortunately is completely ia32-specific. It makes two direct
system calls: the first writes to stdout; the second exits the application
(with a return code of 0). It is very unusual
to make system calls directly; normally applications link with a libc
library, which wraps all the system calls.

General notes about Listing 2

PowerPC assembly requires a destination register for all
register-to-register operations (because it is a RISC architecture). This
register is always the first in the argument list.

Under PPC Linux, system calls are made with the syscall number in gpr0 and arguments beginning with gpr3. The syscall number, order of arguments, and
number of arguments may differ under other PowerPC operating systems
(NetBSD, Mac OS, etc.), which is one reason programmers typically make
system calls through a libc library (which handles the OS-specific
details).
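
For comparison, the same two system calls can be sketched in C using glibc's generic syscall() wrapper. This assumes a Linux system with glibc; the function names direct_write and direct_exit are hypothetical. The comments show where each value would sit under the PPC Linux convention described above.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Write msg to stdout with a direct write system call.
 * Under PPC Linux the values below would be placed as follows:
 *     SYS_write   -> gpr0  (syscall number)
 *     1           -> gpr3  (file descriptor: stdout)
 *     msg         -> gpr4  (pointer to the bytes)
 *     strlen(msg) -> gpr5  (number of bytes)
 * Returns the number of bytes written, or -1 on error. */
long direct_write(const char *msg)
{
    return syscall(SYS_write, 1, msg, strlen(msg));
}

/* Terminate the process with a direct exit system call:
 *     SYS_exit -> gpr0,  code -> gpr3 */
void direct_exit(int code)
{
    syscall(SYS_exit, code);
}
```

Using the SYS_ constants rather than literal numbers is exactly the kind of OS-specific detail that libc hides.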

Register notation
PowerPC registers have numbers, not names. For the learner, this can
sometimes be confusing since literals aren't easily distinguishable from
registers. "3" could mean the value 3 or the
register gpr3, or floating point fpr3, or special purpose register spr3.
Get used to it. :)

Immediate instructions
li means "load immediate", which is a way of
saying "take this constant value known at compile time and store it in
this register". Another example of an immediate instruction is addi: for
example, addi 3,3,1 would increment the contents
of gpr3 by 1, then store the result back into gpr3. Contrast this with
add 3,3,1, which increments the contents of gpr3
by the contents of gpr1, storing the result back into gpr3.

Instructions ending in "i" are usually immediate instructions.
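
The difference between addi 3,3,1 and add 3,3,1 can be sketched with a toy register file. Everything here (Regs, emu_addi, emu_add) is a hypothetical model for illustration, not part of any real toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* A toy register file: gpr[n] models register gprN. */
typedef struct {
    uint32_t gpr[32];
} Regs;

/* addi rt,ra,si: add the constant si to the contents of register ra.
 * The special meaning of ra == 0 is deliberately not modelled here. */
void emu_addi(Regs *r, int rt, int ra, int16_t si)
{
    assert(ra != 0);
    r->gpr[rt] = r->gpr[ra] + (uint32_t)(int32_t)si;
}

/* add rt,ra,rb: add the contents of register rb to the contents of ra. */
void emu_add(Regs *r, int rt, int ra, int rb)
{
    r->gpr[rt] = r->gpr[ra] + r->gpr[rb];
}
```

With gpr1 holding 100 and gpr3 holding 5, emu_addi(&r, 3, 3, 1) leaves 6 in gpr3, while a following emu_add(&r, 3, 3, 1) leaves 106: the same final "1" is a constant in one case and a register number in the other.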

Mnemonics
li isn't really an instruction; it's actually a mnemonic. A mnemonic is a
bit like a preprocessor macro: it's an instruction that the assembler
will accept but secretly translate into other instructions. In this case,
li 3,1 is really defined as addi 3,0,1.

The sharp-eyed will notice that those instructions aren't
necessarily the same thing: addi is really adding 1 to the
contents of gpr0, storing the result into gpr3, right? That would
be true, except the PowerPC spec says gpr0 sometimes has a value, and
sometimes is read as 0, depending on the context. In this case (and the
addi description states this explicitly), the 0 means value 0 rather than
register gpr0.
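
The "0 means value 0" rule can be sketched in a few lines. The Regs type and both functions are hypothetical models of the behavior described above, not real APIs.

```c
#include <stdint.h>

/* A toy register file: gpr[n] models register gprN. */
typedef struct {
    uint32_t gpr[32];
} Regs;

/* addi rt,ra,si: when the ra field is 0, the operand is the value 0,
 * not the contents of gpr0. */
void emu_addi(Regs *r, int rt, int ra, int16_t si)
{
    uint32_t base = (ra == 0) ? 0 : r->gpr[ra];
    r->gpr[rt] = base + (uint32_t)(int32_t)si;
}

/* li rt,value is the mnemonic for addi rt,0,value. */
void emu_li(Regs *r, int rt, int16_t value)
{
    emu_addi(r, rt, 0, value);
}
```

Whatever gpr0 happens to contain, emu_li(&r, 3, 1) always leaves exactly 1 in gpr3.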

Mnemonics shouldn't matter at all to anyone other than assembler
developers, but mnemonics can be confusing when you're looking at disassembly
output. However, GNU objdump -d is quite good
at displaying the original mnemonic rather than the instruction actually
present in the file. For example, objdump will display the mnemonic nop rather than ori 0,0,0
(the actual instruction used).

Loading pointers
The most interesting part of our Hello World example is how we load the
address of msg. As mentioned earlier, PowerPC uses fixed-length
32-bit instructions (in contrast to ia32, which uses variable-length
instructions). That 32-bit instruction is just a 32-bit integer. This
integer is divided into fields of different sizes:

The number of fields and their sizes will vary by instruction, but the
important point here is that these fields take up space in the
instruction. In the case of addi, after just three fields (the opcode and
two register numbers) are placed into the instruction, there are only 16
bits left for the immediate value you're adding!
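
As a rough sketch of that layout, the functions below pack and unpack an addi instruction word. The function names are hypothetical; the field widths (6-bit opcode, two 5-bit register numbers, 16-bit immediate) and the addi opcode value of 14 come from the PowerPC architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Pack an addi instruction:
 *   bits 31..26  opcode (14 for addi)
 *   bits 25..21  destination register
 *   bits 20..16  source register
 *   bits 15..0   signed 16-bit immediate */
uint32_t encode_addi(unsigned rt, unsigned ra, int16_t si)
{
    assert(rt < 32 && ra < 32);
    return (14u << 26) | (rt << 21) | (ra << 16) | (uint16_t)si;
}

/* Recover the immediate field, sign-extended to a full integer. */
int32_t immediate_of(uint32_t insn)
{
    uint32_t low = insn & 0xFFFFu;
    return (low & 0x8000u) ? (int32_t)low - 0x10000 : (int32_t)low;
}
```

encode_addi(3, 0, 1) yields 0x38600001, which is the word an assembler emits for li 3,1. Opcode and registers use up the top 16 bits, so nothing wider than 16 bits can fit in the rest.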

That means that li can only load 16-bit immediates. You cannot
load a 32-bit pointer into a GPR with just one instruction. You must use two
instructions, loading first the top 16 bits and then the bottom. That is
exactly the purpose of the @ha ("high") and @l ("low") suffixes. (The "a"
part of @ha takes care of sign extension.) Conveniently, lis
(meaning "load immediate shifted") will load directly into the high 16
bits of the GPR. Then all that's left to do is add in the lower bits.

This trick must be used whenever you load an absolute address (or any
32-bit immediate value). The most common use is in referencing globals.
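
The arithmetic behind @ha and @l can be worked through in C. This is a sketch of what the assembler and CPU compute, using hypothetical helper names; hi16 is included only to show what an unadjusted high half looks like.

```c
#include <stdint.h>

/* @l: the low 16 bits of the address. */
uint16_t lo16(uint32_t addr)
{
    return (uint16_t)(addr & 0xFFFFu);
}

/* Unadjusted high 16 bits, for comparison. */
uint16_t hi16(uint32_t addr)
{
    return (uint16_t)(addr >> 16);
}

/* @ha: the high 16 bits, adjusted upward by one when the low half will
 * be treated as a negative number by addi. */
uint16_t ha16(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* Emulates:   lis  rt,addr@ha
 *             addi rt,rt,addr@l  */
uint32_t load_address(uint32_t addr)
{
    uint32_t reg = (uint32_t)ha16(addr) << 16;   /* lis */
    uint32_t low = lo16(addr);
    /* addi sign-extends its 16-bit immediate before adding. */
    uint32_t sext = (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
    return reg + sext;                           /* addi */
}
```

For the address 0x10018000, the low half 0x8000 is sign-extended to a negative value, so the high half must be 0x1002 rather than 0x1001 for the sum to come out right.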

Listing 4. Hello World -- PPC64 assembly

Listing 4 is almost identical to the 32-bit PowerPC example (Listing 2) above.
PowerPC was designed as a 64-bit specification with 32-bit
implementations, and not only that, PowerPC user-level programs are more
or less binary-compatible across those implementations. Under Linux, ppc32
binaries run perfectly well on 64-bit hardware (with a little munging here
and there for variable types visible to both 32-bit userland and the
64-bit kernel).

There are only two differences between the ppc32 code (Listing 2) and the ppc64 code (Listing 4). The
first is the way we load pointers, and the second
is those assembler directives about an .opd section. It's worth pointing out that the ppc32 code works
perfectly under ppc64 Linux when compiled as a ppc32 binary.

Loading pointers
On ppc32 it took two instructions to load a 32-bit immediate value into a
register. On ppc64 it takes five! Why?

We still have 32-bit fixed-length instructions, which can only load 16
bits worth of immediate value at a time. Right there you need a minimum of
four instructions (64 bits / 16 bits per instruction = 4 instructions). But
there are no instructions that can load directly into the high word of a
64-bit GPR. So we have to load up the low word, shift it to the high word,
then load the low word again.

The rotate instructions (like the rldicr seen here) are notoriously
complicated, and have jokingly been called Turing-complete. If all you
need to do is load 64-bit immediate values, don't worry about it -- just
convert these five instructions into a macro and never think about it again.
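
The five steps can be emulated in C to check that they really reassemble a 64-bit value. The instruction sequence in the comment is the commonly used one and is an assumption here; load_imm64 is a hypothetical name.

```c
#include <stdint.h>

/* Emulates the usual ppc64 sequence for loading a 64-bit immediate:
 *
 *     lis    r,value@highest    # bits 63..48
 *     ori    r,r,value@higher   # bits 47..32
 *     rldicr r,r,32,31          # move the low word up to the high word
 *     oris   r,r,value@h        # bits 31..16
 *     ori    r,r,value@l        # bits 15..0
 */
uint64_t load_imm64(uint64_t value)
{
    uint64_t highest = (value >> 48) & 0xFFFFu;
    uint64_t higher  = (value >> 32) & 0xFFFFu;
    uint64_t h       = (value >> 16) & 0xFFFFu;
    uint64_t l       =  value        & 0xFFFFu;

    /* lis: place 16 bits in the upper half of the low word. On a 64-bit
     * CPU the result is sign-extended through the high word. */
    uint64_t r = highest << 16;
    if (r & 0x80000000ULL)
        r |= 0xFFFFFFFF00000000ULL;

    r |= higher;                    /* ori */
    r = (r << 32) | (r >> 32);      /* rldicr: rotate left by 32 ... */
    r &= 0xFFFFFFFF00000000ULL;     /* ... then clear the low word   */
    r |= h << 16;                   /* oris */
    r |= l;                         /* ori */
    return r;
}
```

The rotate-and-clear step also discards any sign-extension bits left behind by lis, which is why the sequence works even when the top bit of the value is set.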

One last note: we used @h here instead of the
@ha used in the ppc32 example because we then use
ori rather than addi to supply the low 16 bits (ori does not sign-extend
its immediate, so the high half needs no adjustment). On RISC machines it's
frequently possible to accomplish something in many different ways (for
example, there are many possibilities for nop).

Function descriptors -- the .opd section
Under ppc64 Linux, when you define and call a C function foo, the symbol foo is not actually the address of the
function's code. In assembly if you try to bl
foo, you will quickly find your program crashing. The label foo is actually the address of foo's function
descriptor. Function descriptors are described in detail in the ppc64 ELF
ABI (see Related topics), but very briefly you must
have a function descriptor (which is simply a structure containing 3
pointers) if your assembly will be called from C code, because the
compiler expects it.
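
In C terms, a function descriptor might be pictured like this. The type and field names are hypothetical; the roles of the three slots (code address, TOC base, environment pointer) are those given by the ppc64 ELF ABI.

```c
#include <stdint.h>

/* Sketch of a ppc64 ELF function descriptor: three pointer-sized slots. */
typedef struct {
    uintptr_t entry;        /* address of the function's first instruction */
    uintptr_t toc;          /* TOC base value the function expects */
    uintptr_t environment;  /* environment pointer, not used by C */
} FunctionDescriptor;

/* The address a branch actually needs is the first slot of the
 * descriptor, not the address of the descriptor itself. */
uintptr_t code_address(const FunctionDescriptor *fd)
{
    return fd->entry;
}
```

Branching to the descriptor itself means executing those three pointers as if they were instructions, which is why the naive bl foo crashes.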

We don't have any C code here, but the ELF ABI also says that the ELF
file's entry point (_start by default) points to a function
descriptor. So we must have one, and that is what goes into the .opd
section.

Those assembler directives were copied almost directly from the output
of gcc -S. This is another excellent candidate
for a preprocessor macro in your assembly code.

Where to learn more

For those of you interested in learning more about PowerPC, you can
start by compiling tiny programs with gcc -S --
provided that you have a PowerPC box handy. If you do not, check out the
PPC cross-compiling mini-howto, as well as the other sites and documents
listed in Related topics. Also try experimenting with gdb's
psim (PowerPC simulator) target. It's easier than you may think!