Altivec Optimizations and other PPC Performance Tips

quote:Originally posted by gcc:That's nearly an 8x (7.227) improvement in this function! This is without any of hobold and BadAndy's load hoisting and loop unrolling tips.

Don't let us pressure you into an optimizing death match! :-) If some routine is fast enough after a first level of tuning, direct your attention to wherever Shikari tells you.

hmm, Death Match, you say... Well, I already have this path for coding processing routines worked out where the first passes are functional scalar code followed by an altivec version. Then the next round involves tweaking and optimizing the scalar and altivec. The scalar probably needs a lot more attention in the second pass, which brings up a question:

This is cross-platform code, the altivec will be #ifdef MACOSX, but what about the scalar optimizations like the load hoisting and loop unrolling?

I'm pretty sure x86 benefits from the loop unrolling, but what about the load hoisting? The code should play well on both architectures, but currently the code written for the x86 side is typical 'crap scalar code' that p4s eat for lunch. Maybe this could be a bit of PPC revenge?

quote:Originally posted by gcc:Well, I already have this path for coding processing routines worked out where the first passes are functional scalar code followed by an altivec version. Then the next round involves tweaking and optimizing the scalar and altivec. The scalar probably needs a lot more attention in the second pass, which brings up a question:

This is cross-platform code, the altivec will be #ifdef MACOSX, but what about the scalar optimizations like the load hoisting and loop unrolling?

I think there is a more appropriate #define than MACOSX; something like __ALTIVEC__ should be provided by compilers that adhere to the PIM. Not sure about the exact flag, though.
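
Something along these lines, in any case (treat the exact macro name as an assumption to verify against your compiler's documentation):

    #ifdef __ALTIVEC__
        /* AltiVec version of the routine goes here */
    #else
        /* portable scalar fallback */
    #endif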

quote:I'm pretty sure x86 benefits from the loop unrolling, but what about the load hoisting? The code should play well on both architectures, but currently the code written for the x86 side is typical 'crap scalar code' that p4s eat for lunch. Maybe this could be a bit of PPC revenge?

The trouble with 'x86 is the lack of registers. You can't really do any load hoisting or even loop unrolling without running out of registers. Worse, the different 'x86 implementations (Athlon, P III, P4) are affected differently by any given optimization strategy.

I suggest doing algorithmic and general optimizations for the scalar code only. For example avoid divides as much as possible, they are costly on every machine. Avoid conversions between int and float; AltiVec is the only hardware where these aren't terribly slow.

If you are serious about tuning scalar code, look into tricks like 'SIMD in a register' ... sometimes one can do several char operations in parallel with long int variables. But due to lack of hardware primitives, there are few opportunities to do this kind of stuff.
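
A tiny sketch of the idea (not from any particular library): four byte lanes packed into one 32-bit integer can be added in parallel, with masking to keep a carry in one lane from spilling into the next.

    #include <stdint.h>

    /* lane-wise add of four unsigned bytes in one 32-bit operation */
    uint32_t add_bytes_swar(uint32_t a, uint32_t b)
    {
        uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* low 7 bits of each lane */
        return low ^ ((a ^ b) & 0x80808080u);                 /* fold the top bits back in */
    }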

Try to think low-level ... maybe one can exploit the binary representation of values, or maybe it makes more sense to do things in integer instead of float or vice versa. Or maybe one doesn't really need to call sin() and cos(), because a crude approximation with a 3rd order polynomial will be good enough.
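
For instance, even the truncated Taylor series can be 'good enough' in the right context (just a sketch; the error grows quickly as |x| approaches pi/2, so a fitted polynomial would do better):

    /* crude 3rd order stand-in for sin(x), usable for small |x| */
    static float sin_approx(float x)
    {
        return x - (x * x * x) * (1.0f / 6.0f);
    }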

Opportunities like those make for good general scalar tuning that is widely portable.

quote:I suggest doing algorithmic and general optimizations for the scalar code only. For example avoid divides as much as possible, they are costly on every machine. Avoid conversions between int and float; AltiVec is the only hardware where these aren't terribly slow.

If you are serious about tuning scalar code, look into tricks like 'SIMD in a register' ... sometimes one can do several char operations in parallel with long int variables. But due to lack of hardware primitives, there are few opportunities to do this kind of stuff.

Try to think low-level ... maybe one can exploit the binary representation of values, or maybe it makes more sense to do things in integer instead of float or vice versa. Or maybe one doesn't really need to call sin() and cos(), because a crude approximation with a 3rd order polynomial will be good enough.

Opportunities like those make for good general scalar tuning that is widely portable.

The great thing about having Hobold around is that now you get to see some point/counterpoint.

You know, like those SNL routines that started with the line "Jane, you ignorant slut"

Abso-frickin'-lutely always look at algorithm improvements first (algebra & calculus always beats GHz). After that the major issue is _data_flow_ ... organizing the computation to use the caches and minimize FSB traffic.

These parts are generic and help all processors.

But when I write code for speed I _always_ register loop unroll & load hoist. My reason for this is very simple: my code runs across PPC (not POWER4 yet), SPARC, Athlon, some Pentium4. PPC and SPARC lap it up... big improvements. Athlon is often modestly helped by the loop unrolling, and Pentium4 doesn't seem hurt by it. Thus this code is the _best_across_the_board.

Remember that x86 processors, particularly the P4, have very fast bi-directional scalar-cache interfaces ... this is by design, to help the CPUs cope with the more limited ISA register count. The loop unrolling and load hoisting helps _every_ processor, to varying degrees.

The C standard says that explicit register allocations are obeyed in order until they run out, so if you simply allocate with that in mind there is no harm in having explicit register allocations that exceed any particular ISA.

If you take my scalar unrolled matrix 3x3 code and test it on Athlon and P4 ... you will find it helps and does no harm respectively, compared to the "dumbo" version.
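
To make the terminology concrete, here is roughly the shape meant here (a sketch, not the actual matrix 3x3 routine): the nine matrix elements are hoisted into locals once, and the three rows are written out explicitly instead of looping over them.

    /* transform an array of 3-vectors by one 3x3 matrix */
    void xform3x3(const float m[9], const float *in, float *out, int n)
    {
        /* load hoisting: read the matrix once, let the compiler keep it in registers */
        const float m00 = m[0], m01 = m[1], m02 = m[2];
        const float m10 = m[3], m11 = m[4], m12 = m[5];
        const float m20 = m[6], m21 = m[7], m22 = m[8];
        int i;
        for (i = 0; i < n; i++) {
            /* unrolled over the three rows; x, y, z are loaded once per point */
            const float x = in[3*i + 0], y = in[3*i + 1], z = in[3*i + 2];
            out[3*i + 0] = m00*x + m01*y + m02*z;
            out[3*i + 1] = m10*x + m11*y + m12*z;
            out[3*i + 2] = m20*x + m21*y + m22*z;
        }
    }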

quote:Originally posted by BadAndy:The great thing about having Hobold around is that now you get to see some point/counterpoint.

:-)

quote:But when I write code for speed I _always_ register loop unroll & load hoist. My reason for this is very simple: my code runs across PPC (not POWER4 yet), SPARC, Athlon, some Pentium4. PPC and SPARC lap it up... big improvements. Athlon is often modestly helped by the loop unrolling, and Pentium4 doesn't seem hurt by it. Thus this code is the _best_across_the_board.

My multiplatform testing was more limited to a few RISC platforms (PowerPC, MIPS, SPARC) and Pentium III. I saw the Pentium performance drop very slightly ... the more powerful OOOE of Athlon and P4 probably do better.

I was hesitant to recommend optimizations that would put a Pentium III at an unfair disadvantage. On the other hand, P3 is mostly phased out by now.

I was doing some more profiling using Shikari and was generally pleased with the results of the altivec functions. So I looked into the other ops that were eating a lot of CPU time. Not surprisingly, the top one was a QuickTime call. The strange thing was it wasn't the DV decompression code or some IDCT op, it was this YUV422_UC_To2VUY function. It's hard to tell what it does, but based on other functions named YUV422_UC_ToARGB and the like, it's doing a color-space conversion from YUV to YUV. This probably doesn't even need to be done. I checked out the assembler output in Shikari: scalar!! Not only is it a somewhat questionable conversion, it's also slow. Same for the software YUV->RGBA conversion. The kicker there is that the vector code for that is on the damn Motorola website as the video Altivec example!! Pretty funny. Unless you're actually trying to use QuickTime for DV.

Looks like this weekend might find me writing an altivec DV decompression component for my app. Otherwise I have to live with a 1GHz G4 burning 50-60% of its CPU cycles getting DV into a buffer. And this is after QT 6.1 claims to have sped up DV!

Regarding x86 optimizations, loop unrolling is much less important because with robust OOOE you will have multiple loop iterations in flight at once. Thus it's probably only beneficial when you have really tight loops so as to reduce the branch penalty.

Playing around with software prefetch will probably be more beneficial.
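
A minimal sketch of what software prefetch looks like on the x86 side (assuming SSE is available; the prefetch distance and stride here are made up and would need tuning):

    #include <xmmintrin.h>

    void scale(float *dst, const float *src, float k, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            if ((i & 15) == 0)   /* one hint per 64-byte cache line */
                _mm_prefetch((const char *)(src + i + 64), _MM_HINT_T0);
            dst[i] = k * src[i];
        }
    }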

quote:Originally posted by gcc:Not surprisingly, the top one was a QuickTime call. The strange thing was it wasn't the DV decompression code or some IDCT op, it was this YUV422_UC_To2VUY function. It's hard to tell what it does, but based on other functions named YUV422_UC_ToARGB and the like, it's doing a color-space conversion from YUV to YUV. This probably doesn't even need to be done. I checked out the assembler output in Shikari: scalar!! Not only is it a somewhat questionable conversion, it's also slow. Same for the software YUV->RGBA conversion. The kicker there is that the vector code for that is on the damn Motorola website as the video Altivec example!! Pretty funny. Unless you're actually trying to use QuickTime for DV.

Are there options in the QuickTime API that allow one to select a preferred format for delivery of image data output by QuickTime?

Maybe you can accept the pixels in a different format and do conversion yourself.

quote:Originally posted by IlleglWpns:Regarding x86 optimizations, loop unrolling is much less important because with robust OOOE you will have multiple loop iterations in flight at once. Thus it's probably only beneficial when you have really tight loops so as to reduce the branch penalty.

Playing around with software prefetch will probably be more beneficial.

Thanks for the info. I'll pass that pdf on to my x86 counterparts. Is the prefetch a processor dependent function between AMD, p3, p4, etc? Also, are the most recent cpus like the p4s and Athlon XPs the only ones with really nice OOOE or is this a general x86 feature?

quote:Are there options in the QuickTime API that allow one to select a preferred format for delivery of image data output by QuickTime?

Maybe you can accept the pixels in a different format and do conversion yourself.

Yes, the format for decompression is an option in QuickTime. However, the limiting factor seems to be the old Mac GWorld screen drawing routines. These use hardware acceleration for things like YUV->RGB conversion, but GWorlds only natively deal with ARGB 16/32bit pixels. It's really designed for decompression and a quick dump to the backbuffer for display. Which is fine for watching clips and the like, but not so hot for heavy video processing in YUV format.

I suppose a hardware conversion to ARGB followed by an altivec'ed RGB->YUV is possible, but that seems so unnecessary and pointless. I think a DV specific decompressor is really the most efficient way to go, but the QuickTime coders should have already done this by now...after all that's what they are paid to do!!

quote:Thanks for the info. I'll pass that pdf on to my x86 counterparts. Is the prefetch a processor dependent function between AMD, p3, p4, etc? Also, are the most recent cpus like the p4s and Athlon XPs the only ones with really nice OOOE or is this a general x86 feature?

All the Athlons have good OOOE, as do the P4s. The P3s are less well endowed. The prefetch instruction is available on all Athlons, PIIIs, and P4s.

Then there's processor specific optimizations such as which instructions to avoid on which architectures, which addressing modes to use (Athlon likes the complex addressing modes for example), unrolling, use of FXCH, etc, etc.

Most of this stuff is irrelevant if you're not coding in assembly, and much will be taken care of by a good compiler. You might want to try Intel's compiler for Linux.

But jokes aside, the quoted 'x86 material is good and maybe it underlines my point that tuning for 'x86 is very different from tuning for PowerPC (or most other RISCs).

I hope that AMD's 'x86-64 will make RISC style tuning more worthwhile with the increased number of registers.

Gcc: tuning for PowerPC is no mistake of course. After all, this is open source and 'x86 users are free to do a few rounds of tuning on their own. So you should probably comment every specific tuning you do right there in the source code, so that programmers on other platforms know not to break your optimizations.

Another big thanks for the info. That list seems right in line with the PPC side.

quote:Most of this stuff is irrelevant if you're not coding in assembly, and much will be taken care of by a good compiler. You might want to try Intel's compiler for Linux.

Yes, a good compiler is a must, isn't it? GCC for x86 seems to come out way on top as far as optimizing code. I've been maintaining a healthy distrust of PPC compilers ever since my run-in with CodeWarrior 6 and that nasty Altivec bug (VRSAVE, I think it was).

quote:Originally posted by gcc:I suppose a hardware conversion to ARGB followed by an altivec'ed RGB->YUV is possible, but that seems so unnecessary and pointless. I think a DV specific decompressor is really the most efficient way to go, but the QuickTime coders should have already done this by now...after all that's what they are paid to do!!

I was unclear. I meant to suggest requesting the output of QuickTime in its most 'native' format, hopefully with no conversion at all. Then you process it any way you wish, and finally deliver it in the most 'native' format to the display subsystem.

Getting faster conversions is only the second best goal. Getting fewer conversions would be faster still.

Well, this might not be an option. Possibly a lot of the existing code is already hardcoded for another pixel format. But if you are still free to choose internal pixel formats, go with whatever the surrounding system handles 'voluntarily'.

Or write a detailed mail to Apple. Rumour has it they listen to specific feature requests. And in this case they should, because it would make their 'digital hub' much more powerful for video.

One of the unpleasant facts surrounding "good code" is that x86 ISA processors outnumber all other "desktop" processors by a very big margin. So x86 enthusiasts sometimes even take the "what's good for x86 is the way it ought to be ... screw the rest of you" position explicitly. But a lot more frequently this position is inadvertent and pervasive ... most code comes from the x86 world so most of it is tuned to x86 to the extent it is tuned to anything at all.

For those who don't compute on x86, it's sort of like being a lefty in a right-handed world (which I am) ... one suffers the "handicap" itself with negligible annoyance (in fact it has advantages ... I was a fencer and won several matches with my opponent muttering about lefties as they walked off) ... but it is the endless self-backpatting rectitude (think about the very meaning of being "right") of the majority which is oppressive. Lefties are gauche & sinister in the literal meanings of the words. So are PPC advocates. Get used to it.

Writing well-optimised PPC code can be like leaving a pair of left-handed scissors around ... I _actually_ had somebody who took the scissors off my desk without asking come back and complain what lousy scissors they were ... without realizing _what_ they were!

And this leads to an obvious point ... anytime I develop code the first step in any algorithm is to get one which is correct, clear, and easy to check. This is often (but not always) as short and simple as possible.

And if this doesn't dominate the execution time of anything you care about ... quit there!

But when I need to optimize beyond that, always keep the simple version around in the code... it can be buried in an #ifdef to avoid producing code when not wanted ... but it serves as both part of the documentation for the optimized version and an alternative which others are free to try. On deep OOOE machines... who knows, it might be competitive!
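
Something like this pattern, in other words (a sketch; the switch name is made up):

    #ifdef USE_REFERENCE_VERSION
    /* the correct, clear, easy-to-check version; doubles as documentation */
    void saxpy(float a, const float *x, float *y, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
    #else
    /* the unrolled/hoisted/vectorized version lives here */
    #endif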

IW's comments about x86 prefetch and SIMD need to be prefaced with the obvious caveats that:

* there isn't any equivalent for DST, only for the cacheline touches

* there isn't any SIMD equivalent for vec_perm ... this is a major limitation

* there isn't a separate full-width SIMD pipeline, so the advantages of SIMD are variable and generally less compelling than with Altivec (particularly Altivec compared to the limited scalar performance of G4, G4+)

Many aspects of all of this are in the eye of the beholder ... but writing good x86 SIMD is IMO much harder than writing altivec (and I've done some), and generally less rewarding.

Altivec is one of the really good things that Motorola (and Apple) brought to the computing world. The x86 crowd doesn't see it yet, but it is. IBM has finally knuckled under and will build processors with it, gcc has accepted it as a lexical extension to the compiler.

OK, so I'll entertain you with an anecdote about the first rule of optimization ("Profile, profile, and then profile some more").

I started writing a few central routines for my pet project (still the mysterious vectorized raytracer), but soon I got stuck over a few basic design decisions. There are a few alternatives for organizing data structures, and I had no clue what kinds of operations on those data structures would be most critical for overall performance.

If I had some prototype handy, I could simply profile it and get some 'realistic' statistics. Maybe not in terms of wall clock time spent in specific routines, but in terms of the number of calls, and with hard facts on caller/callee relationships.

Well, such a prototype would sure be nice to have ...

In fact that's what I'm doing now. It's still far from being feature complete (even for what little I have in mind) and it doesn't contain any of the planned central speed goodies yet, but it already taught me a handful of important things:

* single precision fp isn't as accurate as I thought (the 'fudge factors' are farther from 1.0 than I expected)

* my few changes (improvements?) to the 'standard' simple lighting model do not sacrifice visual quality despite a notable reduction of calculations

* a lot of time seems to be spent in places I haven't given much attention to up to now

* a basic raytracer can be thrown together quicker than you think (ok, it does help to have browsed a few good books on the topic, and to have the math worked out already)

* a carelessly written raytracer is fairly slow, even if the underlying math has been tuned for speed

The moral of the story: you can profile nonexistent programs by writing a quick and dirty prototype, and it _is_ worth the while. Just don't make the mistake of never throwing away that prototype.

quote:* single precision fp isn't as accurate as I thought (the 'fudge factors' are farther from 1.0 than I expected)

I'd agree to some degree, although I always felt BMRT managed pretty well with floats (although I never bothered to check if the CPUs were calculating doubles even though they weren't explicit)...

quote:* a basic raytracer can be thrown together quicker than you think (ok, it does help to have browsed a few good books on the topic, and to have the math worked out already)

* a carelessly written raytracer is fairly slow, even if the underlying math has been tuned for speed

You could probably say that about most commercial raytracers! Of course 3Delight's pretty fast (although I haven't done any full raytracing with it, just selective), too bad it's not open source so one could peek at their AltiVec optimizations...

quote:Originally posted by archie4oz:I'd agree to some degree, although I always felt BMRT managed pretty well with floats (although I never bothered to check if the CPUs were calculating doubles even though they weren't explicit)...

I wasn't aware BMRT's source was available. I thought there were only 'free' binaries.

quote:

quote:* a carelessly written raytracer is fairly slow, even if the underlying math has been tuned for speed

You could probably say that about most commercial raytracers! Of course 3Delight's pretty fast (although I haven't done any full raytracing with it, just selective), too bad it's not open source so one could peek at their AltiVec optimizations...

Uh, I don't think my efforts at a tuned raytracer will ever produce anything beyond 'toy' stage, feature-wise. A render back end for RenderMan scenes plays in a different league.

On the other hand, it will be very interesting to think about 'advanced animation and rendering techniques' that are much more difficult to vectorize. :-)

quote:I wasn't aware BMRT's source was available. I thought there were only 'free' binaries.

It's not, in fact good luck finding a binary these days unless you know somebody who has it... It was mostly a point with regards to the RenderMan spec and its data types.

quote:On the other hand, it will be very interesting to think about 'advanced animation and rendering techniques' that are much more difficult to vectorize. :-)

On the other hand AQSIS is open-source (although it's a REYES scan-line renderer so you're out of luck there)...

Shame the GPU thread died so unceremoniously in the Programmer's forum, I would have liked to hear more on that. I've had my own silly attempts at trying to do audio processing and running A* on an XDK...

quote:Originally posted by archie4oz:Shame the GPU thread died so unceremoniously in the Programmer's forum, I would have liked to hear more on that. I've had my own silly attempts at trying to do audio processing and running A* on an XDK...

As I alluded to farther up this thread, I've been thinking about a "computation description language" that could be used to write algorithms for the vector unit. That is just the beginning, however, especially if you include a description of the data being processed and you provide for an asynchronous calling interface. Such a model starts to look a lot like the vertex programs of modern GPUs. Given a problem description of that sort you can imagine doing all sorts of things with it -- running it on a modern GPU, running it on AltiVec, running it on SMP AltiVecs, running it on multiple GPUs, running it on a cluster of machines, etc. The first problem is how to describe the computation & data in a way that is well suited to automatic implementation on a variety of hardware. I don't think VAST is the right solution, and pre-GFFX GPUs aren't flexible enough for use as generalized computing engines. I haven't really gone looking so I don't know if anybody is doing any real research into this, but now seems like the right time if it's going to be ready when massively parallel general purpose hardware becomes available.

quote:Originally posted by archie4oz:On the other hand AQSIS is open-source (although it's a REYES scan-line renderer so you're out of luck there)...

REYES is fine for what it was intended to do, but I don't like some of the limitations it has. I am always annoyed by the faked penumbrae, for example. Shadow borders should be sharper the closer an occluding body is to the surface its shadow is cast upon. One would think that a little trickery with shadow map filtering should fake this effect well, but apparently a show-stopping devil is in the details somewhere.

That said, I can't think of anything that could replace REYES in terms of achievable realism and efficiency.

My small goal is to have just the few features I ever used in renderers like Rayshade and PoV-Ray, and make those as fast as I can, specifically targeting my own (AltiVec-) machine.

Apart from that, I want to follow the general principles "optimize for complex scenes", "keep it simple", and "let others use the source, Luke".

Hi there, my name is Seth and I have been lurking around for some time now in this AltiVec thread, and the others. Since we have all of these AltiVec powerhouses in the room, I thought I may as well ask for help here.

The Problem: We have an ecosystem simulator that we run at the dept. I am working for at the University of Wisconsin - Madison, called SAGE (Center for Sustainability and the Global Environment, http://www.sage.wisc.edu/ ). This simulator is called IBIS (Integrated BIosphere Simulator, http://www.sage.wisc.edu/pages/datamodels.html#Anchor-IBIS-49575) and is written in FORTRAN 77 (ick). A normal run will take 3-5 days on some old SGI machines that we were using. Those machines are now dying, so the scientists are having to move IBIS to their desktop G4s with (mostly) no changes. I have heard that the average run time has dropped closer to 3 days, which is not bad. As the funding runs out for the maintenance of the SGIs, we are going to be losing them and migrating completely to our desktop G4s. Given that IBIS performs calculations on arrays of floats of size normally between 1k and 10k numbers each, I think that there is much room for (AVed) improvement.

The Task: So, I have begun to use VAST to AltiVec this code. I am having problems getting this to work because VAST tends to corrupt data at some (seemingly random) loops. Therefore, I have been going through and setting compiler directives one loop at a time and then recompiling and rerunning the simulation to make sure that VAST didn't screw things up. Of course, the most time intensive loops are the ones that refuse to be AltiVeced.

In short, what VAST does is create vectorized functions in C that duplicate the behavior of a given loop. Then it places calls in the FORTRAN code to the C functions.

Currently, using VAST, I have only been able to drop the execution time on my test runs from 380 secs to 370 secs. I think that I should be able to do better than that. I looked at the intermediate files and things look like this: Original FORTRAN 77 loop: (note: in this case, int "npoi" = 912 and foo(i) is a reference to a var in an array, for those not familiar with FORTRAN)

I am not experienced in such matters (yet), but I am guessing that a monkey with a crayon could make more efficient code than this (that's where I come in). So, I thought that I would recode the time intensive parts of IBIS in C just as VAST has done. First I would do the loops (as shown here). Then I would move on to recoding a few entire routines into C with AV. Plus, writing more of the code in C will get rid of problems in allocating/deallocating vars into aligned vector format (as in I'll only need to do it once per subroutine, instead of once per loop/call as VAST does).
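
For illustration only (this is not an IBIS loop, just a made-up y(i) = a*x(i) + y(i) example; it assumes the arrays are 16-byte aligned and uses only standard AltiVec intrinsics), a hand-written replacement of the sort described would look roughly like this:

    #include <altivec.h>

    /* y[i] = a*x[i] + y[i]; assumes x and y are 16-byte aligned */
    void vsaxpy(float a, float *x, float *y, int n)
    {
        union { float f[4]; vector float v; } ua;
        vector float va;
        int i;

        ua.f[0] = ua.f[1] = ua.f[2] = ua.f[3] = a;   /* splat the scalar into a vector */
        va = ua.v;

        for (i = 0; i + 4 <= n; i += 4) {
            vector float vx = vec_ld(0, x + i);
            vector float vy = vec_ld(0, y + i);
            vy = vec_madd(va, vx, vy);               /* four multiply-adds per instruction */
            vec_st(vy, 0, y + i);
        }
        for (; i < n; i++)                           /* scalar cleanup for leftovers */
            y[i] = a * x[i] + y[i];
    }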

Does this sound like a reasonable goal? In the end, I suspect that I would be re-coding approx. 1-3k lines of FORTRAN 77 into C w/ AltiVec by the time I am done.

Would this method show substantial overall speed increases? I anticipate that the scientists really don't want to learn C in order to modify their calculations. I also think that if I show amazing speed increases they will be persuaded. (remember, doubling the speed of the program overall would result in 1.5 days of saved computing time on the average run)

Any more ideas/comments?

Me: I am an undergrad studying Comp Sci. For the classes that I have been through, we have used Java. I am not yet fluent in FORTRAN or C (ya got to start somewhere). I have done much in PHP for various jobs that I have around campus, so I have more experience than your average undergrad in constructing programs (now I just need experience in compiling them, something PHP more-or-less skips).

The reason I went through this huge intro just for a few tiny questions is I may be spending a lot of time on these boards asking silly AltiVec questions from a mostly-C-ignorant point of view. I think that this would be a good way to teach the lurkers out there AltiVec basics by starting at the least common denominator (me). (and this is a good intro to get everyone on the same page)

quote:Originally posted by mailseth:So, I have begun to use VAST to AltiVec this code. I am having problems getting this to work because VAST tends to corrupt data at some (seemingly random) loops.

Which OS version and GCC version are you running? There used to be a few bugs that could cause data corruption in AltiVec code that on its own was correct.

quote:Of course, the most time intensive loops are the ones that refuse to be AltiVeced.

Then you should start by analyzing this most time intensive code. Some loops are rather hard to vectorize, for example when there are dependencies on previous iterations or when the loop body contains too many conditionals.

(some VAST-generated C code snipped)

quote:I am not experienced in such matters (yet), but I am guessing that a monkey with a crayon could make more efficient code than this (that's where I come in).

Well, vectorizing compilers aren't easy to predict. In some cases they might be able to outdo all but the most skilled programmers, and in other cases they might refuse to vectorize code because of merely syntactical issues.

Chances are that human insight and creativity can outdo VAST in your case, but don't rely on this to be trivially easy.

quote:So, I thought that I would recode the time intensive parts of IBIS in C just as VAST has done. First I would do the loops (as shown here). Then I would move on to recoding a few entire routines into C with AV. Plus, writing more of the code in C will get rid of problems in allocating/deallocating vars into aligned vector format (as in I'll only need to do it once per subroutine, instead of once per loop/call as VAST does).

Does this sound like a reasonable goal? In the end, I suspect that I would be re-coding approx. 1-3k lines of FORTRAN 77 into C w/ AltiVec by the time I am done.

Yes, that sounds reasonable. You will want to touch only the few loops where most time is spent. And you will want to do it step by step, because the speed bottleneck will be a moving target.

If you were really serious about getting maximum speed, you might have to rewrite things on a larger scale; for example to change data structures for a better fit to SIMD.

quote:Would this method show substantial overall speed increases? I anticipate that the scientists really don't want to learn C in order to modify their calculations. I also think that if I show amazing speed increases they will be persuaded.

Be scientific about it. Profile a typical run and focus your attention on the stuff that eats most time. Try to find out why VAST didn't want to vectorize those loops.

It might be enough to restructure the original source code in ways that make it easier for VAST to detect and exploit parallelism. You wouldn't get the best possible speedup that way, but you'd also save a lot of effort.

mailseth's example raises an interesting issue with regard to the future of Altivec optimizations of large established bodies of code:

How much of what is currently perceived wisdom with regards to optimizing code for Altivec can be expected to remain so in the foreseeable future? (I would take that to mean the next generation of processors, specifically the PPC970 or similar over the next several years.) On the one hand, significantly improved FP performance coupled with increased clock speed can be argued to negate, or at least de-prioritize, the need for laborious and expensive Altivec optimizations. Yet at the same time, the thought of such a competent vector processor suddenly running wild and free, unhindered by today's restrictive memory and bus subsystems, leads to some exciting possibilities!

Based on the best guesses available today, how do people expect the "sweet point" to shift between cost of optimization and increased performance?*

CGI

*This is not just idle curiosity on my part. Behind this question lies a very large scientific code base running mostly unoptimized on x86 Linux. We have now reached the point where optimizing the code might be worth it for us, and attendant with this decision is the possible option of switching to PPC w/ Altivec. Any such shift would be in an 18-36 month timeframe, meaning we want to make the decision bearing in mind not what is now, but what can reasonably be expected to be.

quote:Originally posted by mailseth:So, I have begun to use VAST to AltiVec this code. I am having problems getting this to work because VAST tends to corrupt data at some (seemingly random) loops.

Which OS version and GCC version are you running? There used to be a few bugs that could cause data corruption in AltiVec code that on its own was correct.

I am not using GCC. I am not sure if the reason for that is it doesn't do FORTRAN, or if it isn't as good, but in the end I am just logged into someone else's machine so I don't have much control over versions. f77 is at v7.4. I am using the Absoft f77 compiler and OS X 10.1.5. Does GCC do a good job with FORTRAN code? Should I be using it instead?

quote:

quote:Of course, the most time intensive loops are the ones that refuse to be AltiVeced.

Then you should start by analyzing this most time intensive code. Some loops are rather hard to vectorize, for example when there are dependencies on previous iterations or when the loop body contains too many conditionals.

Almost all of the code is just straight loops. I really wasn't expecting it to vectorize the conditionals and dependencies.

quote:If you were really serious about getting maximum speed, you might have to rewrite things on a larger scale; for example to change data structures for a better fit to SIMD.

If I have a bunch of arrays from unknown origin (elsewhere in the program) what is the best way to make sure that they are aligned? Do I need to un-align it when my subroutines are done so the rest of the program can handle it? (By changing data structures, you mean aligning it and using fewer larger arrays (as opposed to more smaller arrays), yes?)

quote:

quote:Would this method show substantial overall speed increases? I anticipate that the scientists really don't want to learn C in order to modify their calculations. I also think that if I show amazing speed increases they will be persuaded.

Be scientific about it. Profile a typical run and focus your attention on the stuff that eats most time. Try to find out why VAST didn't want to vectorize those loops.

It might be enough to restructure the original source code in ways that make it easier for VAST to detect and exploit parallelism. You wouldn't get the most optimal speedup that way, but you'd also save a lot of effort.

As I mentioned before, the code is already in many rather simple loops. I have already spent quite a few hours over the past few weeks trying various directives and reorganizations to speed the program up. So far the best I can do is about 2-3%. I don't see why I can't do better than that. When I turn on vectorization for everything I get junk output, but the speedup is closer to 20-30%.

If I have this array of 1k-10k items, is the best way to prefetch them with vec_dst or by loading the vector a few instructions before it is needed with vec_ld (as it seems VAST is doing)?

quote:I haven't really gone looking so I don't know if anybody is doing any real research into this, but now seems like the right time if it's going to be ready when massively parallel general purpose hardware becomes available.

Yep, you've hit the nail on the head. It's going to be important to start thinking outside of just the CPU because there could potentially be more than one vector unit on board (eg. GPU or others) -- each with their own set of restrictions on operations and datatypes. How to come up with a general purpose description of the task and automatic generation is going to be the challenging part.

quote:It's going to be important to start thinking outside of just the CPU because there could potentially be more than one vector unit on board (eg. GPU or others) -- each with their own set of restrictions on operations and datatypes. How to come up with a general purpose description of the task and automatic generation is going to be the challenging part.

Perhaps a resurgence in functional programming!?! w00t! I do love Lisp for graphical programming!

quote:Originally posted by mailseth:I am not using GCC. I am not sure if the reason for that is it doesn't do FORTRAN, or if it isn't as good, but in the end I am just logged into someone else's machine so I don't have much control over versions. f77 is at v7.4. I am using the Absoft f77 compiler and OS X 10.1.5. Does GCC do a good job with FORTRAN code? Should I be using it instead?

The first thing you might want to do is check out Apple's scitech mailing list, search through the archives to see if anything similar has been dealt with and post your question there. The Apple mailing lists are all at lists.apple.com, you can search/view archives without signing up for them.

I'm not sure what the comparison would be between Absoft's compiler vs the gcc equivalent, you can probably assume there would be some pain involved in the switch and it might not be faster under 10.1.5 (I don't think OS X's mathlib was fully optimized until 10.2.x).

Since you're thinking of switching to g77, you might want to take a look at NAG's compiler; they have a trial version and claim better performance than Absoft. It's F95 as opposed to F77, though.

OT Warning: Apologies as this post deals with the strategic position of altivec optimization rather than the actual pragmatic implementation of it:

quote:Originally posted by MrNSX:(snip)...because there could potentially be more than one vector unit on board (eg. GPU or others) -- each with their own set of restrictions on operations and datatypes.

Well how spooky is that. A few posts after mine, not even directly relating to it, and someone has brought up exactly one of the options we are working on!

Although I didn't elaborate last post, we do numerical simulations of engineering phenomena. We are looking at new hardware solutions beyond the conventional multiple CPU cluster brute force method and to this end, we've reconsidered our overall strategic approach starting with the actual way we solve our equations. In a very simple way, what we are now looking at would first involve transforming our data set (using wavelets... a long story) and this would be mostly straightforward FP-intensive calculations performed on a traditional CPU (and probably not very Altivec friendly). Then we have vectorized the governing equations and this is where it gets interesting.

Applying the equations to the transformed data is very conducive to vector processors, even if it is only 128 bits wide. (Normally in the past, our code required double-precision FP). The most obvious solution would be to use a PPC w/ Altivec to crunch through and apply the vectorized equations. (With the CPU required to perform generalized time-stepping in the transformed coordinates every once in a while.) But there are problems with this. For starters, the restricted bandwidth of the current G4 together with the high cost of Apple hardware means that at the end of the day, our predicted cost/iteration is not any better than what we achieve today with our brute force method. Coupled with the higher upfront development costs, this is a solution that we cannot justify pursuing. (At least with today's hardware...see my previous post inquiring about what is coming downstream and how the game is about to change.)

However, when we were looking at the maths and algorithms, one of my postgrads piped up wistfully "I wish we had more control over the GPU". Sure enough! Everything we wanted to do, one of today's advanced GPUs could do better than any general purpose CPU! Staring us in the face is what appears to be a custom-made maths engine for our needs. They have a prodigious vector processing capability on a fast and wide processor together with very impressive bandwidth performance. It *seems* like functionally everything we need is there in hardware, the only problem being that when it comes down to it, we can not get good enough access to it as things stand today. In fact, it seems we are _so_ close to a great solution that we were tempted to contact a GPU manufacturer to see if they could help us. (It seems like the hurdles are not in the silicon but in the implementation of it.)

And so now I read a post that maybe GPUs will indeed be heading towards becoming more general vector processors anyway. What an exciting promise that might be! Do you have any reliable information or references on that, MrNSX, or is this just rumor and speculation so far?

Addendum: I assume that any such development would be platform agnostic which again makes it less likely that we are heading towards an Apple solution...

quote:Originally posted by mailseth:I am not using GCC. I am not sure if the reason for that is it doesn't do FORTRAN, or if it isn't as good, but in the end I am just logged into someone else's machine so I don't have much control over versions. f77 is at v7.4. I am using the Absoft f77 compiler and OS X 10.1.5. Does GCC do a good job with FORTRAN code? Should I be using it instead?

No, GCC is not better. And if you are under 10.1.5 you should be safe from AltiVec bugs in the development environment.

quote:Almost all of the code is just straight loops. I really wasn't expecting it to vectorize the conditionals and dependencies.

There is a chance that VAST _can_ vectorize these. AltiVec provides good primitives; much better than SSE for example.

So your troubles were in the lack of correctness of VAST generated code?

quote:If I have a bunch of arrays from unknown origin (elsewhere in the program) what is the best way to make sure that they are aligned? Do I need to un-align it when my subroutines are done so the rest of the program can handle it? (By changing data structures, you mean aligning it and using fewer larger arrays (as opposed to more smaller arrays), yes?)

Alignment is one issue that usually influences code farther away. Also the merging or splitting of arrays as you mention. I think there is an issue of C vs FORTRAN with regard to the layout of multidimensional arrays. So it could be that one would want to, say, transpose 2D arrays for better vectorization.

There is never a need to un-align data, scalar code simply doesn't notice alignment.

But there can be cases where you want to convert between 'array of structures' and 'structure of array' format, and then you would need to restore the original format after calculations.

quote:As I mentioned before, the code is already in many rather simple loops. I have already spent quite a few hours over the past few weeks trying various directives and reorganizations to speed the program up. So far the best I can do is about 2-3%. I don't see why I can't do better than that. When I turn on vectorization for everything I get junk output, but the speedup is closer to 20-30%.

I see. OK, in that case it is not bold to assume that you can do a lot better. Ideal speedup for code consisting purely of calculations is four, more if there is a sizeable number of divides or square roots. A few things like min/max are also more efficient in AltiVec floating point than in scalar code.

quote:If I have this array of 1k-10k items, is the best way to prefetch them with vec_dst or by loading the vector a few instructions before it is needed with vec_ld (as it seems VAST is doing)?

Depends not on the number of items, but on the number of bytes occupied. If the typical working set size is below 256KB, the on-chip L2 cache will handle things nicely. With sufficient load hoisting, no prefetches are needed in this case.

For working set sizes between 256KB and 2MB, you work from L3 cache. You need prefetches to cover load latencies then, but you still have enough bandwidth for most things (ca. 4GB/sec).

Only for working set sizes beyond that will you be hampered by the slow front side bus.

quote:Originally posted by Cogitur Ignace:Applying the equations to the transformed data is very conducive to vector processors, even if it is only 128 bits wide. (Normally in the past, our code required double-precision FP). The most obvious solution would be to use a PPC w/ Altivec to crunch through and apply the vectorized equations. (With the CPU required to perform generalized time-stepping in the transformed coordinates every once in a while.) But there are problems with this. For starters, the restricted bandwidth of the current G4 together with the high cost of Apple hardware means that at the end of the day, our predicted cost/iteration is not any better than what we achieve today with our brute force method. Coupled with the higher upfront development costs, this is a solution that we cannot justify pursuing. (At least with today's hardware...see my previous post inquiring about what is coming downstream and how the game is about to change.)

Whether bandwidth is a limit or not depends on working set size. Two megabytes of L3 cache per processor in Apple hardware offer 4GB/sec of bandwidth each, and aggregate bandwidth scales up for dual processor systems.

When your data fits in there, Macs would actually have more effective bandwidth than typical PC hardware.

Furthermore, the learning curve for anything else is probably steeper than for AltiVec. If development time and costs are a concern, you shouldn't underestimate the comparable ease of hand coding for AltiVec vs. most other, more specialized, things.

quote:Originally posted by hobold:So your troubles were in the lack of correctness of VAST generated code?

Yep. I encountered some internal errors (which I should report to Veridian), some corrupted output, and now I am noticing when I turn auto alignment off the output doesn't seem to be changed. How can I tell for sure whether the data is being aligned on the fly from the C code above? How large of a performance difference should there be between using aligned and non-aligned data (that needs to align itself on the fly)?

quote:Alignment is one issue that usually influences code farther away.

So I just need to make sure that the beginning of the array lies on 16-byte bounds? (I see that number a lot when I am reading) Anyone know how to do that in fortran 77? I can't find it anywhere on the web or in any of the reference manuals that I have.

quote:I see. OK, in that case it is not bold to assume that you can do a lot better. Ideal speedup for code consisting purely of calculations is four, more if there is a sizeable number of divides or square roots. A few things like min/max are also more efficient in AltiVec floating point than in scalar code.

How about multiplication? The code has its fair share of div and sq roots, but most is mult.

quote:If the typical working set size is below 256KB, the on-chip L2 cache will handle things nicely. With sufficient load hoisting, no prefetches are needed in this case.

For working set sizes between 256KB and 2MB, you work from L3 cache. You need prefetches to cover load latencies then, but you still have enough bandwidth for most things (ca. 4GB/sec).

From what I figure, on average a loop will use about 10 arrays of 1k-10k floats each, or 40-400KB. What is the best way to handle that? From what I read, I can only start up 4 prefetch streams of data. When I am using >10 arrays, should I try to prefetch the last ones used so they are loaded while other calculations are going and load hoist the rest? Or am I just splitting hairs?

For the subroutines in need of optimization there are about 20-30 arrays of said size scattered through 5-27 loops, so many of them are reused, they just don't fit nicely in cache. Should I attempt to reorganize the work to use less arrays?

quote:How about multiplication? The code has its fair share of div and sq roots, but most is mult.

quote:If the typical working set size is below 256KB, the on-chip L2 cache will handle things nicely. With sufficient load hoisting, no prefetches are needed in this case.

For working set sizes between 256KB and 2MB, you work from L3 cache. You need prefetches to cover load latencies then, but you still have enough bandwidth for most things (ca. 4GB/sec).

From what I figure, on average a loop will use about 10 arrays of 1k-10k floats each, or 40-400KB. What is the best way to handle that? From what I read, I can only start up 4 prefetch streams of data. When I am using >10 arrays, should I try to prefetch the last ones used so they are loaded while other calculations are going and load hoist the rest? Or am I just splitting hairs?

For the subroutines in need of optimization there are about 20-30 arrays of said size scattered through 5-27 loops, so many of them are reused, they just don't fit nicely in cache. Should I attempt to reorganize the work to use less arrays?

SP-float multiplication will be done in parallel (4 wide) by altivec very efficiently. In most cases one sees multiplication plus addition (or subtraction), and that is a single operation, if the code is generated well. This can mean 8 flops/cycle peak.
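
In AltiVec terms (a minimal sketch): there is no separate packed float multiply, everything goes through the fused multiply-add, so a*b+c costs the same single instruction as a plain product.

    #include <altivec.h>

    vector float madd_example(vector float a, vector float b, vector float c)
    {
        return vec_madd(a, b, c);     /* four (a*b + c) results in one operation */
    }

    vector float mul_example(vector float a, vector float b)
    {
        /* a plain product is written as a multiply-add with a zero addend */
        vector float zero = (vector float) vec_splat_u32(0);
        return vec_madd(a, b, zero);
    }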

The rest of your questions are a bit too general to answer as posed. Echoing hobold, the absolute first thing to do is to profile and find out which loop(s) are really costing the time.

Then, if they aren't ridiculously long... post them. Looking at the little scrap of code you posted, and comparing to the VAST compiler generated... this code would be _stinkin_ easy in altivec and produce very few lines of code except for _one_ ugly thing....

note the code uses references to lai(i,2), sai(i,2), sai(i,1) e.g. where each time around the loop it is referencing THREE variables by non-least-incrementing memory count, and only Two variables (fu, fi) by least incrementing.

This is where all the memory thrashing is coming from, and all those vec_perms. The absolutely FIRST question to ask is can you transpose lai and sai? (I mean in terms of everything, set up your data structures that way.)

The second thing that puzzles me is what is the magical significance of the "1th" and "2th" indexes of lai, sai? Are these stored into an array in this form simply because of some programmer's idea of tidiness? How often are these updated? Simply having linear one-D arrays of these sai, lai coeffs would make the vector code very simple, and the streaming very fast.

How big is npoi typically? How much of these arrays are in cache from previous operations?

If npoi is big enough that this routine "looks like a blitter" (memory to memory), can this code be "moved upstream" in the algorithm so that it is done when either the sai, lai are being calculated, or the fu, fi are being calculated... whichever is recalculated more? If so then that will save bandwidth.

quote:As I mentioned before, the code is already in many rather simple loops. I have already spent quite a few hours over the past few weeks trying various directives and reorganizations to speed the program up. So far the best I can do is about 2-3%. I don't see why I can't do better than that. When I turn on vectorization for everything I get junk output, but the speedup is closer to 20-30%.

I see. OK, in that case it is not bold to assume that you can do a lot better. Ideal speedup for code consisting purely of calculations is four, more if there is a sizeable number of divides or square roots. A few things like min/max are also more efficient in AltiVec floating point than in scalar code.

When I started this thread I was wondering: how much gain does Altivec really provide? The hype seems too good to be true, but in most cases the speedups have been 400-725%. People dealing with the same code on the x86 side are lucky to see 40% max with MMX, and it seems much harder to write.

quote:How about multiplication? The code has its fair share of div and sq roots, but most is mult.

It's all faster basically. If you are working on floats then it's multiplying 4 floats at once, shorts 8 at a time, and chars 16 per op. Also, the more you do with each chunk of data loaded into the vector the greater the gain. If the example posted above is typical, then you will see large gains from altivec.

quote:Originally posted by BadAndy:SP-float multiplication will be done in parallel (4 wide) by altivec very efficiently. In most cases one sees multiplication plus addition (or subtraction), and that is a single operation, if the code is generated well. This can mean 8 flops/cycle peak.

Great! I just wish it was obvious from the previously mentioned VAST timings...

quote:This is where all the memory thrashing is coming from, and all those vec_perms. The absolutely FIRST question to ask is can you transpose lai and sai? (I mean in terms of everything, set up your data structures that way.)

It sounds like I should investigate creating lai1 and lai2 to replace lai. Same goes for sai. But I figured that the memory wasn't taking too much of a hit because lai(i,2) and lai(i,1) are both used and reside next to each other in memory. So shouldn't a load of one still load the other into cache? (and thus not take the memory access hit)

quote:How big is npoi typically? How much of these arrays are in cache from previous operations?

npoi should stay between 1k and 10k

quote:If npoi is big enough that this routine "looks like a blitter" (memory to memory), can this code be "moved upstream" in the algorithm so that it is done when either the sai, lai are being calculated, or the fu, fi are being calculated... whichever is recalculated more? If so then that will save bandwidth.

I've looked into that, and whenever I combine loops the performance drops unless it is stunningly obvious like:

quote:Originally posted by mailseth:(...)It sounds like I should investigate creating lai1 and lai2 to replace lai. Same goes for sai. But I figured that the memory wasn't taking too much of a hit because lai(i,2) and lai(i,1) are both used and reside next to each other in memory. So shouldn't a load of one still load the other into cache? (and thus not take the memory access hit)

The problem here is not memory latency, but the fact that AltiVec cannot load scattered data elements into one vector efficiently. It is better to have homogeneous arrays where all elements not only have the same data type but also the same 'meaning'.

For example an array of complex numbers can be processed more efficiently in AltiVec when all real parts are in one 1D array and all imaginary parts are in another 1D array (as opposed to one array of real-imaginary pairs).
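
A sketch of the two layouts (made-up names), to make the point concrete:

    #include <altivec.h>

    /* 'array of structures': real and imaginary parts interleaved in memory;
       getting four real parts into one vector needs extra loads plus vec_perm */
    struct complex_aos { float re, im; };

    /* 'structure of arrays': each component is its own contiguous 1D array,
       so four neighbouring elements load with a single vec_ld */
    struct complex_soa { float *re; float *im; };

    vector float load_four_reals(struct complex_soa *c, int i)
    {
        return vec_ld(0, c->re + i);   /* assumes c->re + i is 16-byte aligned */
    }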

quote:Originally posted by hobold:Alignment is one issue that usually influences code farther away.

What do you mean "influences code farther away"?

The 16-byte alignment is needed for easy processing with AltiVec, but the memory allocation code might be in quite a different place of the program and has to be adapted. Also, if you do alignment explicitly, your cleanup code might have to be careful about freeing memory again (because the manually aligned pointers cannot be directly passed to the OS for free()ing).
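
One common way to handle that (a sketch, not the only way): over-allocate, round the pointer up to a 16-byte boundary, and keep the original pointer around for free().

    #include <stdint.h>
    #include <stdlib.h>

    void *malloc_aligned16(size_t bytes, void **raw_out)
    {
        void *raw = malloc(bytes + 15);
        if (raw == NULL)
            return NULL;
        *raw_out = raw;   /* this, not the aligned pointer, is what gets free()d */
        return (void *) (((uintptr_t) raw + 15) & ~(uintptr_t) 15);
    }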

quote:So I just need to make sure that the beginning of the array lies on 16-byte bounds? (I see that number a lot when I am reading)

Alignment isn't strictly necessary, but it makes AltiVec code a bit more efficient and considerably easier to write.

quote:Anyone know how to do that in fortran 77? I can't find it anywhere on the web or in any of the reference manuals that I have.

Sorry, I know nothing about FORTRAN except that I should know it better than I do, nominally being an educated mathematician. :-)

quote:From what I figure, on average a loop will use about 10 arrays of 1k-10k floats each, or 40-400KB. What is the best way to handle that?

Are these max. 400KB the complete working set? In any case, stream prefetches are the way to go.

quote:From what I read, I can only start up 4 prefetch streams of data. When I am using >10 arrays, should I try to prefetch the last ones used so they are loaded while other calculations are going and load hoist the rest? Or am I just splitting hairs?

You are taking these four hardware prefetch streams too literally. In practice one fetches not complete streams but small overlapping blocks. You'd use the prefetch engines for blocks of each array in turn.

quote:For the subroutines in need of optimization there are about 20-30 arrays of said size scattered through 5-27 loops, so many of them are reused, they just don't fit nicely in cache. Should I attempt to reorganize the work to use less arrays?

Reorganizing doesn't make the working set smaller. You should try to identify streams that are read infrequently and use 'transient' prefetches and memory accesses for those. This means they will be less likely to displace more frequently needed data from the caches.
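
A sketch of what the blockwise use of the prefetch engines looks like in practice (assumes 16-byte aligned arrays; the control-word packing follows the usual size/count/stride layout from the AltiVec manuals, but double-check the field layout for your toolchain, and the block size and prefetch distance here are only starting points to tune):

    #include <altivec.h>

    /* size = vectors per block (1..32), count = number of blocks, stride in bytes */
    #define DST_CTRL(size, count, stride) \
        ((((size) & 31) << 24) | (((count) & 255) << 16) | ((stride) & 0xFFFF))

    void vmul(float *a, float *b, float *out, int n)
    {
        vector float zero = (vector float) vec_splat_u32(0);
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            if ((i & 31) == 0) {
                /* start fetching the next 128-byte block of each stream while
                   this one is processed; vec_dstt() is the 'transient' variant
                   for data that is read once and shouldn't linger in the cache */
                vec_dst(a + i + 32, DST_CTRL(8, 1, 0), 0);
                vec_dst(b + i + 32, DST_CTRL(8, 1, 0), 1);
            }
            vec_st(vec_madd(vec_ld(0, a + i), vec_ld(0, b + i), zero), 0, out + i);
        }
        for (; i < n; i++)
            out[i] = a[i] * b[i];
    }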