BALANCE OF POWER

Enhancing PowerPC Native Speed

DAVE EVANS

When you convert your applications to native PowerPC code, they run lightning fast. To get the
most out of RISC processors, however, you need to pay close attention to your code structure and
execution. Fast code is no longer measured solely by an instruction timing table. The PowerPC 601
processor includes pipelining, multi-issue and speculative execution, branch prediction, and a
set-associative cache. All these things make it hard to know what code will run fastest on a Power
Macintosh.

Writing tight code for the PowerPC processor isn't hard, especially with a good optimizing compiler
to help you. In this column I'll pass on some of what I've learned about tuning PowerPC code.
There are gotchas and coding habits to avoid, and there are techniques for squeezing the most from
your speed-critical native code. For a good introduction to RISC pipelining and related concepts
that appear in this column, see "Making the Leap to PowerPC" in Issue 16.

MEASURING YOUR SPEED
The power of RISC lies in the ability to execute one or more instructions every machine clock cycle,
but RISC processors can do this only in the best of circumstances. At their worst they're as slow as
CISC processors. The following loop, for example, averages only one calculation every 2.8 cycles:

By restructuring the code and using other techniques from this column, you can make significant
improvements. This next loop generates the same result, yet averages one calculation every 1.9
cycles -- about 50% faster.

The rest of this column explains the techniques I just used for that speed gain. They include
expanding loops, scoping local variables, using inline routines, and using faster math operations.

UNDERSTANDING YOUR COMPILER
Your compiler is your best friend, and you should try your hardest to understand its point of view.
You should understand how it looks at your code and what assumptions and optimizations it's
allowed to make. The more you empathize with your compiler, the more you'll recognize
opportunities for optimization.

An optimizing compiler reorders instructions to improve speed. Executing your code line by line
usually isn't optimal, because the processor stalls to wait for dependent instructions. The compiler
tries to move instructions that are independent into the stall points. For example, consider this code:

Each line depends on the previous line's result, and the compiler will be hard pressed to keep the
pipeline full of useful work. This simple example could cause 46 stalled cycles on the PowerPC 601,
so the compiler will look at other nearby code for independent instructions to move into the stall
points.
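
As a rough illustration of such a dependency chain (a sketch with made-up names and constants, not the column's original listing), every statement in chained below needs the result of the one before it. The second routine, interleaved, carries two chains that don't depend on each other, which gives a scheduler something to place in the stall points:

```c
/* Hypothetical example: every statement waits on the previous result. */
double chained(double x)
{
    double a = x * 3.0;   /* needs x */
    double b = a / 7.0;   /* needs a */
    double c = b + 1.0;   /* needs b */
    return c * c;         /* needs c */
}

/* Two independent chains written side by side. Work on y does not depend
   on work on x, so the two can be overlapped in the pipeline. */
void interleaved(double x, double y, double *rx, double *ry)
{
    double ax = x * 3.0;
    double ay = y * 3.0;
    double bx = ax / 7.0;
    double by = ay / 7.0;
    double cx = bx + 1.0;
    double cy = by + 1.0;
    *rx = cx * cx;
    *ry = cy * cy;
}
```

Both routines compute the same function of each input; only the amount of independent work available at each step differs.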

EXPANDING YOUR LOOPS
Loops are often your most speed-critical code, and you can improve their performance in several
ways. Loop expanding is one of the simplest methods. The idea is to perform more than one
independent operation in a loop, so that the compiler can reorder more work in the pipeline and
thus prevent the processor from stalling.

For example, in this loop there's too little work to keep the processor busy:

On a CISC processor the second loop wouldn't be much faster, but on the PowerPC processor the
second loop is twice as fast as the first. This is because the compiler can schedule independent
instructions to keep the pipeline constantly moving. (If the data doesn't occur in nice increments,
you can still expand the loop; just add a small loop at the end to handle the extra iterations.)

Be careful not to expand a loop too much, though. Very large loops won't fit in the cache, causing
cache misses for each iteration. In addition, the larger a loop gets, the less work can be done entirely
in registers. Expand too much and the compiler will have to use memory to store intermediate
results, outweighing your marginal gains. Besides, you get the biggest gains from the first few
expansions.
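
The kind of transformation described in this section can be sketched as follows. This is an illustrative pair of routines, not the column's original listing; the names scale_simple and scale_expanded, the element type, and the expansion factor of four are all assumptions.

```c
#include <stddef.h>

/* One multiply per iteration: little independent work to schedule. */
void scale_simple(float *dst, const float *src, float k, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Expanded four times; the four statements in the body do not depend
   on one another, so they can be scheduled to overlap. */
void scale_expanded(float *dst, const float *src, float k, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i]     * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
    /* Small cleanup loop for counts that are not a multiple of four. */
    for (; i < n; i++)
        dst[i] = src[i] * k;
}
```

Whether four is the right factor depends on the loop body and the compiler, so it is worth timing a couple of factors rather than assuming.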

SCOPING YOUR VARIABLES
If you're new to RISC, you'll be impressed by the number of registers available on the PowerPC
chip -- 32 general registers and 32 floating-point registers. By having so many, the processor can
often avoid slow memory operations. Your compiler will take advantage of this when it can, but you
can help it by carefully scoping your variables and using lots of local variables.

The "scope" of a variable is the area of code in which it is valid. Your compiler examines the scope of
each variable when it schedules registers, and your code can provide valuable information about the
usage of each variable. Here's an example:

In this loop, the global variable gArraySize is scoped for the whole program. Because we call a
subroutine in the loop, the compiler can't tell if gArraySize will change during each iteration. Since
the subroutine might modify gArraySize, the compiler has to be conservative. It will reload
gArraySize from memory on every iteration, and it won't optimize the loop any further. This is
wastefully slow.

On the other hand, if we use a local variable, we tell the compiler that gArraySize and c[i] won't be
modified and that it's all right to just keep them handy in registers. In addition, we can store data as
temporary variables scoped only within the loop. This tells the compiler how we intend to use the
data, so that the compiler can use free registers and discard them after the loop. Here's what this
would look like:

These minor changes give the compiler more information about the data, in this instance
accelerating the resulting code by 25%.
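
A minimal sketch of the two styles discussed above follows. Only gArraySize and c[i] come from the text; the routine names, the Transform subroutine, and the arithmetic are hypothetical stand-ins, and no particular speedup is implied for this exact code.

```c
long gArraySize = 0;   /* global: any routine in the program may change it */

/* Hypothetical subroutine. It stands in for a routine the compiler cannot
   see into (for example, one compiled in another file). */
double Transform(double v)
{
    return v * 0.5 + 1.0;
}

/* Global used directly in the loop test: after each call the compiler
   must assume gArraySize may have changed. */
void fill_global(double *a, const double *c)
{
    long i;
    for (i = 0; i < gArraySize; i++)
        a[i] = Transform(c[i]) + c[i];
}

/* Local copies tell the compiler the values are private to this routine. */
void fill_local(double *a, const double *c)
{
    const long size = gArraySize;   /* read the global once */
    long i;
    for (i = 0; i < size; i++) {
        const double ci = c[i];     /* temporary scoped to the loop body */
        a[i] = Transform(ci) + ci;
    }
}
```

The two routines are equivalent only if nothing called from the loop really does change gArraySize, which is exactly the promise the local copy makes on the compiler's behalf.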

STYLING YOUR CODE
Be wary of code that looks complicated. If each line of source code contains complicated
dereferences and typecasting, chances are the object code has wasteful memory instructions and
inefficient register usage. A great compiler might optimize well anyway, but don't count on it.
Judicious use of temporary variables (as mentioned above) will help the compiler understand exactly
what you're doing -- plus your code will be easier to read.

Excessive memory dereferencing is a problem exacerbated by the heavy use of handles on the
Macintosh. Code often contains double memory dereferences, which is important when memory can
move. But when you can guarantee that memory won't move, use a local pointer, so that you only
dereference a handle once. This saves load instructions and allows further optimizations.

Casting data types is usually a free operation -- you're just telling the compiler that you know you're
copying seemingly incompatible data. But it's not free if the data types have different bit sizes, which
adds conversion instructions. Again, avoid this by using local variables for the commonly cast data.
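
The handle advice above can be sketched like this. The Record type and both routine names are invented for illustration, and the handle is modeled as a plain pointer to a pointer rather than a real Memory Manager handle; the second routine is valid only under the assumption that the block cannot move while the local pointer is in use.

```c
/* Hypothetical record reached through a handle (pointer to a pointer). */
typedef struct {
    long count;
    long total;
    long values[8];
} Record;

typedef Record **RecordHandle;

/* Every access goes through the handle: two dereferences each time. */
long sum_via_handle(RecordHandle h)
{
    long i;
    (**h).total = 0;
    for (i = 0; i < (**h).count; i++)
        (**h).total += (**h).values[i];
    return (**h).total;
}

/* Dereference the handle once, then work through a local pointer and a
   local accumulator. Assumes the block does not move during the loop. */
long sum_via_pointer(RecordHandle h)
{
    Record *r = *h;
    long total = 0;
    long i;
    for (i = 0; i < r->count; i++)
        total += r->values[i];
    r->total = total;
    return total;
}
```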

I've heard many times that branches are "free" on the PowerPC processor. It's true that often the
pipeline can keep moving even though a branch is encountered, because the branch execution unit
will try to resolve branches very early in the pipeline or will predict the direction of the branch. Still,
the more subroutines you have, the less your compiler will be able to reorder and intelligently
schedule instructions. Keep speed-critical code together, so that more of it can be pipelined and the
compiler can schedule your registers better. Use inline routines for short operations, as I did in the
improved version of the first example loop in this column.
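
As a small sketch of an inline routine for a short operation, the example below assumes a compiler that supports C99's static inline; the routine names and the pixel-clamping task are hypothetical and are not taken from the column's own example.

```c
/* Short helper marked inline so the compiler may expand it at the call
   site instead of emitting a subroutine call inside the loop. */
static inline long clamp_byte(long v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* Adds delta to every pixel, clamping each result to the range 0..255. */
void brighten(unsigned char *pixels, long n, long delta)
{
    long i;
    for (i = 0; i < n; i++)
        pixels[i] = (unsigned char)clamp_byte(pixels[i] + delta);
}
```

Because the helper's body is visible at the call site, the loop stays a single block of code that the compiler can schedule as a whole.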

KNOWING YOUR PROCESSOR
As with all processors, the PowerPC chip has performance tradeoffs you should know about. Some
are processor model specific. For example, the PowerPC 601 has 32K of cache, while the 603 has
16K split evenly into an instruction cache and a data cache. But in general you should know about
floating-point performance and the virtues of memory alignment.

Floating-point multiplication is wicked fast -- up to nine times the speed of integer multiplication.
Use floating-point multiplication if you can. Floating-point division takes 17 times as long as
floating-point multiplication, so when possible multiply by a reciprocal instead of dividing.
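
Multiplying by a reciprocal might look like the sketch below; the routine names are assumptions. Note that 1.0 / scale is itself rounded, so the two routines can differ in the last bit for scale values whose reciprocal isn't exactly representable.

```c
/* One floating-point divide per element. */
void normalize_divide(double *v, long n, double scale)
{
    long i;
    for (i = 0; i < n; i++)
        v[i] = v[i] / scale;
}

/* One divide in total, hoisted out of the loop; the loop body then
   uses only multiplication. */
void normalize_reciprocal(double *v, long n, double scale)
{
    const double inv = 1.0 / scale;
    long i;
    for (i = 0; i < n; i++)
        v[i] = v[i] * inv;
}
```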

Memory accesses go fastest if addressed on 64-bit memory boundaries. Accesses to unaligned data
stall while the processor loads different words and then shifts and splices them. For example, be sure
to align floating-point data to 64-bit boundaries, or you'll stall for four cycles while the processor
loads 32-bit halves with two 64-bit accesses.
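
One way to request and check 64-bit alignment is sketched below. It assumes a C11 compiler (for _Alignas), which postdates this column; the Sample structure and the is_aligned_64 helper are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a 4-byte tag followed by floating-point data.
   _Alignas(8) asks the compiler to place value on a 64-bit boundary,
   inserting padding after tag if necessary. */
struct Sample {
    int tag;
    _Alignas(8) double value;
};

/* Returns nonzero if p lies on an 8-byte (64-bit) boundary. */
int is_aligned_64(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}
```

A check like is_aligned_64 is most useful in debugging builds, for example on pointers into hand-packed buffers, where the compiler can't arrange the alignment for you.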

MAKING THE DIFFERENCE
Native PowerPC code runs really fast, so in many cases you don't need to worry about tweaking its
performance at all. For your speed-critical code, though, these tips I've given you can make the
difference between "too slow" and "fast enough."

DAVE EVANS may be able to tune PowerPC code for Apple, but for the last year he's been repeatedly
thwarted when tuning his 1978 Harley-Davidson XLCH motorcycle. Fixing engine stalls, poor timing,
and rough starts proved difficult, but he was recently rewarded with the guttural purr of a well-tuned Harley.

Code examples were compiled with the PPCC compiler using the speed optimization option, and then
run on a Power Macintosh 6100/66 for profiling. A PowerPC 601 microsecond timing library is
provided on this issue's CD.
