Sharing tips and experience with the cell processor, performance, data design and game programming.

Background on Branching

By Mike Acton on April 11, 2006 1:04 AM

Branching, both conditional and unconditional, slows most
implementations. Even an unconditional branch or a correctly predicted
taken branch may cause a delay if the target instruction is not in the
fetch buffer or the cache. It is therefore best to use branch
instructions carefully and to devise algorithms that reduce branching.
Many operations that normally use branches may be performed either with
fewer or no branches.

-- From IBM's The PowerPC Compiler Writer's Guide 3.1.5

Branches represent a significant part of both performance critical
and general purpose code - as a general rule of thumb, 20% of the
instructions in typical code are branches. In inner loops and other
code sections which demand the highest performance may benefit from a
multifold increase in performance by eliminating, or reducing, branches.

This series of articles will present the types of delays that branches
may cause in program execution and some programming patterns that help
avoid those delays.

Branch Paths

Branch instructions divide the execution path in two paths, the target
path and the fall-through path. The address of the first instruction on
each path is referred to as target address and fall-through address,
respectively. The target address is where the instruction points to,
much like the label in a C goto statement. The fall-through address
represents the instruction just after the branch.

...if ( a != b ){ fall_through_address: ...}target_address:...

...beq cr0, target_addressfall_through_address:...target_address:...

There are two potential performance issues related to branch instructions, which
are based on the decisions the hardware must make in order to execute the correct
instruction stream:

Deciding which target address will be taken. For indirect branches where the
target address is not known until runtime, either the architecture will wait for
the address to become known, then start loading the instruction stream (incurring
a stall), or a form of target address prediction will be used, which will
begin loading queuing the stream at the predicted address and revert any executed
code if the prediction turns out to be incorrect.

Deciding which branch will be taken. For conditional branches, there are
both static and dynamic methods of predicting whether excecution will
continue through the target address or the fall-through address. This decision isn't
as simple to make as it may appear at a first glance. Modern processors are able
to execute more than one instruction concurrently and what the processor chooses to
do will have a great impact on overall execution performance.

Branch Types

Branch instructions can be either unconditional or conditional.

Unconditional branch, or jump, instructions, as the name would imply, are always
taken and cause the instruction queue to begin loading instructions either at the
address which is embedded in the instruction itself (called a direct branch),
or at the address previously stored in a register by the programmer or compiler (called
an indirect, or computed, branch).

On both the PPU and SPU, an unconditional branch may also jump to the link register, which is not a
special type of register, but is one register within the register file which by
convention (i.e. the ABI) has been decided to be the register in which next instruction
of the parent on the callstack is stored. The link register is normally maintained by the compiler,
but may be used directly by the programmer in assembly.

return; <-- Unconditional, indirect branch via link register

Conditional branch instructions are only taken if a particular condition is met, typically
the result of some comparison (e.g. equal-to-zero, less-than, etc.). Conditional branches generally
correspond to if, for,
while and do statements in
the C language.

On the PPU, the state of the condition is held in the Condition Register (CR) which is
then used to determine the which of the target or fall-through streams to begin queuing. The CR is
modified by either special CR-modifying versions of instructions which compare the result of the expression
to zero and store the camparison result, or more directly by special comparison instructions and CR
logic instructions.

On the SPU, there is no special condition register. Comparison instructions
store a value (zero or non-zero) into a target register specified by the programmer or compiler.
This register is then used by various branch-on-zero instructions to queue the appropriate
instruction stream.
Like unconditional branches, conditional branches may be either direct or indirect. However, the vast
majority of conditional branches will correspond to local conditional statements or loops and in which
the target address is known at compile time (i.e. direct).

On the PPU, there is additionally a special register caled the Count Register (CTR) which is designed
specifically for simplifying the prediction of local loops. The state of the CTR is held in the
Condition Register (CR) and conditional branches can test bits in the CR, for CTR being equal to or
different from zero to decide whether or not to jump. A PPU compiler will often map
for loops with static iteration counts to use the CTR.

What follows is a table of the conditions which may be used to affect the branch condition result on the PPU
depending on where the target address is stored.

Conditional branch type

Available conditions

Predefined target address

CR bits
CTR=0
CTR≠0
CTR=0 and CR bits
CTR≠0 and CR bits
Always(*1)

Target address in LR

CR bits
CTR=0
CTR≠0
CTR=0 and CR bits
CTR≠0 and CR bits
Always(*1)

Target address in CTR

CR bits
Always (*1)

*1 In those cases execution will always continue in the target address.

In summary, unconditional, direct branches are generated by the compiler for the goto, break and continue C statements.
Indirect branches are compiled for calls to function pointers and C++ virtual functions, for subroutine returns,
and for switch statements that use lookup tables or address tables.
Conditional branches are compiled for if, while, do-while, for statements and the ? ternary operator.

A branch instruction can depend on resources in order to make the processor continue execution
on the right location. A branch instruction that has all needed resources available by the time
it's processed is called resolved. Unconditional, direct branches are always resolved
(statically, at compile time) because they have no dependencies.

All other branches are called unresolved. They depend on the value in a register (e.g. LR or CTR)
to determine the target address or on a condition (e.g. CTR or CR) to determine if then branch
will or will not be taken. Unresolved branches correspond to conditional or indirect branch operations.
Those branches must be resolved (the target address or the condition - or both - must be computed)
before they are dispatched to be processed.
If an unresolved branch is processed, some prediction of the address where the program will
continue execution must take place in order not to stall, or minimize the stall, of the execution pipeline.

PPU Branch Prediction

When a branch is to be processed but is not yet resolved, the processor must do one of:

Sit still and wait until the resources needed by the branch
instruction are available and so the address where execution will
continue is known;

Continue execution simultaneously on both possible addresses - the
target and the fall-through addresses - until the correct address is
known. In this case, all the processing made on the wrong path must be
undone; or

Predict the address on which execution will continue and go to that
path until the correct address is known. In this case, if the predicted
address is found wrong all the processing must be undone back to the
branch instruction, and the branch instruction is re-evaluated.

The first option will always cause execution to stop. The second option
requires special hardware in the processor to execute the two different
paths simultaneously and requires intensive register renaming. The PPU
uses the third option. It is important therefore to understand how
prediction works in order to write more efficient code.

Branch prediction is the task of guessing the address where a branch
that is unresolved by the time it's processed will jump to. The PPU has
two branch prediction modes: static and dynamic.

1. Static branch prediction does not change the prediction as the
program runs. The compiler makes a guess about whether to predict taken
or not taken and sets a bit in the instruction word. When the
instruction runs, the hardware will always branch the same way
according to the value of the special bit.

2. Dynamic branch prediction changes the prediction as the program
runs. The hardware, not the compiler, must guess which way to branch
each time the instruction is executed.

-- From: http://lgjohn.okstate.edu/6253/lectures_old/brpredict.pdf

Static prediction comes in two flavors: prediction of the likelihood of
the branch to be taken and prediction of the target address at which
execution will continue. The prediction of the likelihood of the branch
to be taken is related to conditions that depend on CR and on CTR. A
branch which depends on CR and is to be processed but the results of
the instruction that updates CR are not yet available will cause a
stall, so the condition of the branch must be predicted. The same thing
happens with CTR.

The prediction of the address at which execution will continue is
related to target addresses that are in LR or in CTR. A branch which
depends on LR and where its value is not yet available will cause a
stall, so the continuation address must be predicted. The same thing
happens with CTR.

The prediction of the likelihood of a branch to be taken is given by
two bits a and t in the BO field of the branch instructions. It cannot
be changed while the program is executing, that's why the prediction is
called static.

a

t

Hint

0

0

No hint is given

0

1

Reserved

1

0

The branch is very likely not to be taken and execution will continue at the fall-through address

1

1

The branch is very likely to be taken and execution will continue at the target address

Dynamic prediction is performed by lookups in the Branch History Table
(BHT) and with the use of the Link Stack. Those structures are used to
predict both the likelihood of a branch being taken and the address
where execution will continue. There is no way the programmer can
interfere with dynamic branch predictions as it's done without the help
of the software, so this article will not dive into its details.

SPU Branch Prediction

The SPU doesn't have dynamic branch prediction hardware.
The rationale behind this is that that hardware would take increase both energy and chip area requirements.

The SPU static branch prediction isn't encoded in branch instructions, as it is at the PPU.
Instead, the SPU has three special instructions that provide hints of the continuation address of the branches.

Instruction

Opcode

Description

Hint for branch (r-form)

hbr branch_addr, reg

Hint that the branch instruction at branch_addr will branch to the address contained in reg.

Hint for branch (a-form)

hbra branch_addr, predicted_addr

Hint that the branch instruction at branch_addr will branch to predicted_addr.

Hint for branch relative

hbrr branch_addr, offset

Hint
that the branch instruction at branch_addr will branch to the address
of the hbrr instruction plus the signed value offset (relative address)

When a hint instruction is found, its arguments are saved in the Branch
Target Buffer (BTB). Then, every time a branch instructions enters the
issue stage of the execution pipeline, its address is compared to the
branch instruction address saved in the BTB. If they match, the SPU
uses the predicted address saved in the BTB along with the branch
address.

If a branch instruction which is not present at the BTB enters the
issue state, the SPU predicts that execution will continue at the
fall-through path. In other words, those branches behave like PPU
branches where a and t bits are set to 1 and 0, respectively. Branches
that are present at the BTB behave like PPU branches where a and t bits
are both set to 1.

Hints saved at the BTB remains valid until invalidated by a sync
instruction or replaced by another hint. Currently, the BTB has only
one entry, so only one branch hint will be active at a time, but many
branch prediction strategies can be coded in software by the use of the
hint for branch instructions.

The address where a hint instruction is placed in relation to the
branch instruction for which it provides a hint has consequences to the
processing of the hint. In short, hints too close to the branch will
have no effect, hints not far enough of the branch will be predicted,
but a stall will occur. All other hints will provide predictions
without stalling execution.