Altivec Optimizations and other PPC Performance Tips

quote:Originally posted by BadAndy:It is absolutely, totally, wonderful to see a critical mass of altivec-interested folks growing on MacAch.

The traditional Mac user base had a low percentage of coders, I suppose. With revision 10.2, OS X finally began to draw in unix users in sizable numbers, a much larger fraction of whom care about programming.

quote:It's also wonderful to have another altivec guru, who is, despite his modesty, at least my equal in this matter ... carry some of the water.

First of all, let me say that all of you contributing to this forum are gentlemen and scholars. I really appreciate the time you have spent answering my questions.

Mr NSX:

I'm all for GPU processing. I use the Apple YCbCr texturing extension of openGL, and the app is mainly GL based so tons of transforms and processing can go on there. Tell me, what sort of per-pixel ops do things like pixel-shaders enable on the GPU? I am going to write some blending objects using glBlend, but I suspect that the next gen cards will allow far more options for pixel processing.

Hobold:

Thanks for the info about the cache etc.

quote:

Who does the decompression? Your own code (or at least code that you have some control over)?

I'm using Quicktime for all of the decompression and media handling. The flexibility and depth far outweigh any speed concerns at the moment. It was the prime reason my friend and I ported this app to OSX.

quote: If so, consider decompressing only a part of a frame, say, a number of scanlines, and do your own processing on those scanlines while they are still residing in the cache.

You'll save one round trip to memory that way. You'll save many trips to memory if your processing consists of a customizable chain of effects applied in sequence.

(The basic idea behind tiling of images is a more general principle in optimization. I have seen it referred to as 'strip mining' or 'cache blocking'.)

I did work on an app that used 'tiling' to help fit partial images into the tiny caches of athlons and p3s. I'm not sure the benefit of this wasn't outweighed by the somewhat unwieldy code required to do it. Quicktime is giving me the data in the same format as openGL needs for texturing, so that's probably a big plus in a way.

This app does in fact have a highly customizable chain of effects. In fact, it's wide open: someone can attempt to load 18 DV clips, add a blur to each, and then mix and key them together. The result would be a ~1fps mess, but it is indeed a possibility.

I am a bit concerned that the modularity of the app might indeed hinder the altivec stuff, but I'm also thinking that it's already doing the same for the scalar bits, so altivec is still going to be a big win anyway.

The scalar code is pretty much squared away, so I'm trying to build a prototype altivec object that does everything right before moving on to the rest of the code.

quote:Originally posted by gcc:I'm all for GPU processing. I use the Apple YCbCr texturing extension of openGL, and the app is mainly GL based so tons of transforms and processing can go on there. Tell me, what sort of per-pixel ops do things like pixel-shaders enable on the GPU? I am going to write some blending objects using glBlend, but I suspect that the next gen cards will allow far more options for pixel processing.

Yep, but even beyond simple YCbCr conversion, there's the whole IDCT penalty of decoding the DV frame that can be offloaded from the CPU. As far as ops in the pixel shader, you can do all the straightforward things like fetching from different sources (it doesn't have to be an actual texture; it could be a 1D texture representing a table, etc.), ADD, SUB, MUL, MAD, linear interpolation, MOV, conditionals, dot products, exponent, log, power, cross product, reciprocals, etc. Essentially you're writing programs that tell the GPU how to process each pixel, which it then applies in real time to every pixel that goes through the render pipe. The nice thing is that the ops are all single-cycle throughput (even the complex ones), and you have massive amounts of memory bandwidth (i.e. >20 GB/s), so you don't run into the biggest issue with Altivec, which is having the execution units starved. It's really quite powerful, and set to become even more powerful when the next-gen GPUs come out.

Anyhow, this is the second time I've sidetracked this thread, so why don't you start a separate thread if you'd like to know more about how to accomplish what you need through the GPU.

quote:Originally posted by hobold:But there is no denying that I am a theoretical person. To me, thinking about and discussing efficient algorithms and efficient implementations is 70% of the fun, programming is 20%, and seeing the finished program run is only 10%.

It has been my experience that people like that make some of the best coders. I'm personally closer to 40/40/20, with the first two chunks usually lumped together as one big mathematical whorl.

quote:Originally posted by MrNSX:Anyhow, this is the second time I've sidetracked this thread, so why don't you start a separate thread if you'd like to know more about how to accomplish what you need through the GPU.

GPU processing in general is a matter of interest. I second the motion for a new thread! I'd start one myself, but I'm cowardly.

I'd third the request for a GPU thread... note that this needn't be Mac-specific at all really.

It almost goes without saying that anything computable on _specialized_ hardware designed for that purpose will be next-to-impossible to beat with more generalized hardware ... so from a purely functional perspective, what can be done in a GPU "probably should be" ... but reasons not to do it there include:

* if all the target systems have the vector unit as part of the CPU, but not all are guaranteed to have a GPU for which your code will work.

* you know how to do it in the vector unit, and it is pretty easy and will be fast enough

* the data are already flowing through the FSB for some other reason

Re that "35 cycle" comment of Ian's I didn't see, and following hobold's explanation with an addendum:

Take a 1 GHz processor on a 133 MHz SDR FSB as "today's" typical target. This is a PowerBook, or a medium tower (soon to be obsolete) running a single-thread task.

If you are writing a "blitter" (which means an algorithm which reads and writes streamable data, with very high locality of memory usage) and are writing out a byte for everyone read then there would be 30 cycles of compute time per vector read, processed, and written.

This is Ian's "35" .... 30 - 35 cycles is a LOT of Altivec computation. This is why I harp and harp at the FSB limitation. It is also why it is rare to find an altivec problem were MP really pays for itself on today's Macs _except_ for one subtle point ... MP you get twice as much L2 & L3 storage if you can take advantage of it.

Where altivec really shines is on computation where the data are already in cache anyway. This includes "fairly large" matrix and convolution problems (including FFTs) ... which are the Altivec "bread and butter."

It is _really_ uncommon to find a blitter-like (external memory to external memory) problem where the necessity for vec_perm games exceeds 30 operations per vector processed. What this means is that for blitter problems, vec_perm is generally just about "free" ... provided that the loop is unrolled enough that the code can be scheduled to bury the instruction latencies. (You can do a vec_perm and another vector operation each cycle.) This is generally pretty easy to do. The only real impact of excessive vec_perms is that you often start wasting vector registers holding permute constants ... and these are sometimes costly to regenerate on the fly.
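To make that concrete, here is a minimal sketch of the usual misaligned-load idiom, assuming dst is 16-byte aligned, n is a multiple of 32, and the source buffer is padded so the final one-vector over-read is harmless:

  #include <altivec.h>

  void copy_misaligned(unsigned char *dst, const unsigned char *src, int n)
  {
      vector unsigned char perm = vec_lvsl(0, src); /* permute constant, hoisted */
      vector unsigned char prev = vec_ld(0, src);
      int i;
      for (i = 0; i < n; i += 32) {                 /* unrolled 2x */
          vector unsigned char m0 = vec_ld(i + 16, src);
          vector unsigned char m1 = vec_ld(i + 32, src);
          vec_st(vec_perm(prev, m0, perm), i, dst);
          vec_st(vec_perm(m0, m1, perm), i + 16, dst);
          prev = m1;
      }
  }

The vec_perms here pair off against the loads and stores, so on a 7400-class machine they cost essentially nothing extra.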

But generally the only types of codes where you need to be very watchful of the cost of vec_perms are ones on cached data.

In the codes I write, my MAJOR first consideration is to structure the algorithm data-flow to avoid FSB traffic to the absolute greatest degree (i.e. best locality of data, careful cache management). My next general consideration is often vector-register "spill code". A topic I never got to in the old altivec thread is that with gcc and CW it is MUCH better for you to explicitly manage your "spill code" than to simply write an algorithm with more than 16 vector variables and let the compiler shuffle them in and out of memory. Instead, register 16 vectors ... and explicitly load/store as needed ... _thinking_ about it.
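A minimal sketch of what I mean, with hypothetical names; the point is that the store and reload happen where YOU schedule them, not where the register allocator panics:

  #include <altivec.h>

  void explicit_spill_example(float *data)
  {
      vector float spill_slot[1];             /* vector arrays are 16-byte aligned */
      vector float coeff = vec_ld(0, data);
      vec_st(coeff, 0, (float *)spill_slot);  /* an explicit, scheduled spill */
      /* ... burn all 16 registered vectors on the hot computation here ... */
      coeff = vec_ld(0, (float *)spill_slot); /* reload exactly where you want it */
      vec_st(coeff, 0, data);
  }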

(Losing a few style points for quoting myself, but I have to clear this up)

quote:Originally posted by hobold:The part about avoiding vector permutes touches rather philosophical aspects of programming ... in a sense, permutes don't _process_ data, they merely _move_ it around.

Closer inspection reveals that there are two distinct and fundamentally different ways to apply vector permute:

1. with an invariant 'control vector' and source operands coming directly from your respective input data. This is the case meant above, where bytes are merely shuffled but not otherwise modified.

2. with invariant source operands and a 'control vector' directly derived from your input data. This is very different from above, because it is essentially a table lookup and can be equivalent to a whole lot of actual processing.

Obviously, permutes of the second category _are_ doing work and need not be eliminated from the code. In fact, many of the most amazing AltiVec examples of blazing processing speed are based on very clever use of permute.
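A classic instance of the second category is a 16-entry lookup table held in a register; a small sketch, counting the set bits in each nibble of the input:

  #include <altivec.h>

  /* Each input byte (a value 0..15) picks an entry out of the table.
     vec_perm only looks at the low five bits of each control byte, and
     with the table doubled up as both sources, indices 16..31 alias
     back onto the same 16 entries. */
  vector unsigned char nibble_popcount(vector unsigned char nibbles)
  {
      const vector unsigned char table = (vector unsigned char)
          (0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
      return vec_perm(table, table, nibbles);
  }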

BTW, another source to look for a few AltiVec programming tricks is the example code available from Apple's AltiVec pages. I think their code for handling multiple input streams plus one output stream, each of arbitrary alignment, is a good example of how complicated things get when you may not dictate data layout. On the plus side, you don't have to program that yourself, just use their code. (Beware, some examples are so old that the compiler will issue warnings or errors. For example, "vector long" has since been replaced with "vector int", but these bugs are trivial to fix.)

quote:Originally posted by BadAndy:I'd third the request for a GPU thread... note that this needn't be Mac-specific at all really.

Me too, me too. I've done a fair bit of GPU work on platforms other than the Mac/PC, and I think there is going to be a great deal of potential for non-graphics work on the next generation of GPU.

quote:It is also why it is rare to find an altivec problem where MP really pays for itself on today's Macs _except_ for one subtle point ... with MP you get twice as much L2 & L3 storage if you can take advantage of it.

Another plus to the MP machines is that there is inevitably a bunch of things running on the machine and if you have two processors then one can (hopefully) handle everything else while the other grinds away on the AltiVec algorithm, which avoids disrupting its streams and polluting its caches. If the other processes are relatively light in their FSB usage then the impact on the AltiVec process is minimal (thanks in part to the other processor's large caches).

quote:Originally posted by iapole:It has been my experience that people like that make some of the best coders.

Yes, their code can be delicious to both human and machine ... if they ever finish a piece beyond the proof of concept stage. :-)

quote:There is a place for theory in the world.

You bet!

Anyone still remember my remark about tinkering with a vectorized raytracer? Instead of writing code, I have been playing with Maple V (a computer algebra system) using my university account. I already gained a potential 2.4x performance improvement before even writing a single line of code (well, for one important routine at least).

(When you are shooting rays at a general quadric surface and are allowed to do some preprocessing assuming a fixed origin of rays (i.e. a fixed camera position), you might be led into thinking that you need 24 multiplies per ray to find the three coefficients of the corresponding one-dimensional polynomial. This is the price for using a relatively high-level mathematical form, namely representing a quadric as a symmetrical 4x4 matrix.
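(For reference, the standard setup being counted here, with Q the symmetric 4x4 matrix, o the homogeneous ray origin, and d the ray direction, d_w = 0:

  quadric:  x^T Q x = 0
  ray:      p(t) = o + t d
      =>    (d^T Q d) t^2 + 2 (o^T Q d) t + (o^T Q o) = 0

With a fixed camera you can precompute q = Q o and C = o^T q once, so only A = d^T Q d and B = 2 (q . d) remain per ray.)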

When you break the quadric down into its uglier form as a polynomial in three variables, you will find that 12 multiplies are enough for the same task.

And finally, when you notice that ray direction vectors of unit length are a lot more useful in later lighting calculations, you might as well normalize them right away and consequently bring the number of multiplications down to 10 per ray-quadric-pair.

The number of adds is below that, and they can all be nicely merged into fused multiply-adds. Pipeline latencies can be taken care of by processing several rays simultaneously.

Sorry for the digression. Raytracing thoughts currently have a firm grip on my mind. If this goes on for another few weeks, I guess I'll have to call it an obsession. :-)

quote: Another plus to the MP machines is that there is inevitably a bunch of things running on the machine and if you have two processors then one can (hopefully) handle everything else while the other grinds away on the AltiVec algorithm, which avoids disrupting its streams and polluting its caches. If the other processes are relatively light in their FSB usage then the impact on the AltiVec process is minimal (thanks in part to the other processor's large caches).

You'd wish it would work this way... but it doesn't seem to on osX, even recently. The fundamental problem is once you MP your vector code then all other tasks seem to hit both CPUs semi-randomly ... trashing their caches. You're only going to have one cache untrashed in the best of all possible worlds anyway.

You can get a real benefit by making the process(es) real-time with long slices.

What I would _like_ to know how to do, but have never explored, is to go into the Mach tasker and change the processor sets (as mentioned by nibs in the old altivec thread). In effect, let osX have one of the processors and take the other as a dedicated "sequential batch job" processor for heavy numerical tasks, so that its caches are NEVER hit.

Has anyone started this thread yet? I was thinking of doing a Mac specific thread that dealt with topics like: Which chips, available on the Mac, have these new advanced features? How are these features implemented in Apple's OpenGL framework?

There could be a more generic thread in the Programming forum (A/V club?), but the differences between chips on various platforms and DirectX vs OpenGL might be too great to do much more than compare feature sets. And as we all know, mentioning Macs and PCs and performance in the same thread is often hazardous.

It takes 60 days (I think) for infopop to close a thread, so death is hardly imminent.

The last one closed for several reasons, but among them was that my mac had a dread "Maxtor hard-drive corruption" with play projects not backed up, and I lost a fair amount of work vectorizing an efficient altivec gamma-function evaluator for the RGB->CMYK ars testbench algorithm (gamma evaluation is in fact a fairly common general issue, tho increasingly GPUs deal with it for you).

A rash of personal and work problems got dumped on me at that time, and I've not gotten back to it.

One of the open questions for the MacAch community is: what is really wanted? One of the things I've _thought_ about doing was taking the last big thread, plus maybe part of this one ... abstracting it down to the good stuff, correcting the many minor errors, and putting it together as a pdf ... but this would take some time.

quote:Originally posted by BadAndy:You'd wish it would work this way... but it doesn't seem to on osX, even recently. The fundamental problem is once you MP your vector code then all other tasks seem to hit both CPUs semi-randomly ... trashing their caches. You're only going to have one cache untrashed in the best of all possible worlds anyway.

You can get a real benefit by making the process(es) real-time with long slices.

Well multiple processors should at least decrease the frequency with which tasks switch. And didn't I just recently read something about 10.2 adding processor task affinity?

quote:Originally posted by hobold:Yes, their code can be delicious to both human and machine ... if they ever finish a piece beyond the proof of concept stage. :-)

Ahh, that has often been my problem.

quote:Anyone still remember my remark about tinkering with a vectorized raytracer? Instead of writing code, I have been playing with Maple V (a computer algebra system) using my university account. I already gained a potential 2.4x performance improvement before even writing a single line of code (well, for one important routine at least).

I am intrigued, both by the idea in general and by the specific nature: my current project is going to require me to write a raytracer at some point.

Busy me.

quote:(When you are shooting rays at a general quadric surface and are allowed to do some preprocessing assuming a fixed origin of rays (i.e. a fixed camera position), you might be led into thinking that you need 24 multiplies per ray to find the three coefficients of the corresponding one-dimensional polynomial. This is the price for using a relatively high-level mathematical form, namely representing a quadric as a symmetrical 4x4 matrix.

When you break the quadric down into its uglier form as a polynomial in three variables, you will find that 12 multiplies are enough for the same task.

Polynomials are beautiful

quote:And finally, when you notice that ray direction vectors of unit length are a lot more useful in later lighting calculations, you might as well normalize them right away and consequently bring the number of multiplications down to 10 per ray-quadric-pair.

Wow. I'm going to have to do a lot of thinking about this, clearly.

Boy, I wish I knew something--anything--about AltiVec

quote:The number of adds is below that, and they can all be nicely merged into fused multiply-adds. Pipeline latencies can be taken care of by processing several rays simultaneously.

Wow... that's impressive.

quote:Sorry for the digression. Raytracing thoughts currently have a firm grip on my mind. If this goes on for another few weeks, I guess I'll have to call it an obsession. :-)

quote:And finally, when you notice that ray direction vectors of unit length are a lot more useful in later lighting calculations, you might as well normalize them right away and consequently bring the number of multiplications down to 10 per ray-quadric-pair.

quote:Originally posted by iapole:Wow. I'm going to have to do a lot of thinking about this, clearly.

Just ask me, it's not a secret. If I ever get beyond the proof-of-concept stage, the code I release will be open source. I bet someone will find ways to improve it, and I wouldn't want to miss out on their good ideas.

Once I knew that an efficient approach existed for applying modern short-vector SIMD to raytracing, I couldn't stop thinking about details ... a lot more things can be done when you start out with SIMD in mind, not just employ it later as an afterthought as those authors did.

The second central idea to my 'dream raytracer' is also an old one: vastly speed up CSG (constructive solid geometry) by using a spatial subdivision scheme that prunes the CSG trees for each voxel. This will also significantly speed up shadow ray tests, because you can flag all voxels that are fully contained in an opaque solid object. So you can often determine that a light source is occluded, without ever doing an explicit test of a ray against a primitive.

To do these CSG tricks, you must base your object description on primitives that have a well-defined inside and outside; triangles or bezier patches won't do.

I ended up with quadrics as the simplest such primitives. Another plus is that square root calculations (needed to intersect rays with quadric surfaces) can be done very efficiently with AltiVec, hardly slower than the divide needed for a triangle.
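For anyone who hasn't seen it, the standard estimate-and-refine sequence; one Newton-Raphson step takes vec_rsqrte's roughly 12-bit estimate to near full single precision:

  #include <altivec.h>

  vector float reciprocal_sqrt(vector float v)
  {
      const vector float zero = (vector float)(0.0f, 0.0f, 0.0f, 0.0f);
      const vector float half = (vector float)(0.5f, 0.5f, 0.5f, 0.5f);
      const vector float one  = (vector float)(1.0f, 1.0f, 1.0f, 1.0f);
      vector float est  = vec_rsqrte(v);             /* ~12-bit estimate */
      vector float est2 = vec_madd(est, est, zero);  /* est^2 */
      vector float esth = vec_madd(est, half, zero); /* est/2 */
      /* est + (1 - v*est^2) * est/2 : one Newton-Raphson iteration */
      return vec_madd(vec_nmsub(v, est2, one), esth, est);
  }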

The only difficult part remaining is an algorithm for intersecting a general quadric with a voxel (an axis-aligned bounding box in my case). I didn't easily find a solution for this; general quadrics can be unwieldy: infinite, non-convex, consisting of two disjoint parts.

But eventually I realized that 'polynomials are beautiful', :-) and that good old linear optimization would almost do it ... I merely had to fix up the non-linearity of a parabola (which quadrics are based on) by explicitly handling its single extremal point. Each 'half' of a parabola is a monotonically increasing (or decreasing) function, so for my purposes it was similar enough to a straight line to make things work.

Before I stray any further from the topic of AltiVec (if that is at all possible), I'll close by noting that I believe all of the above calculations can be arranged in a very SIMD-friendly way ... most conditionals can be hidden in min/max, and a few other edge cases can be handled by exploiting IEEE floating point tricks based on 'infinity' values or NaNs.
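A tiny example of that style, with names of my own invention: keep t only where it is positive, without a branch.

  #include <altivec.h>

  /* returns t where t > 0, and 'big' (e.g. a splatted huge value
     standing in for +infinity) everywhere else */
  vector float pick_positive(vector float t, vector float big)
  {
      vector bool int is_pos = vec_cmpgt(t, (vector float)(0.0f, 0.0f, 0.0f, 0.0f));
      return vec_sel(big, t, is_pos);
  }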

I just got my first altivec object working!! The change in CPU load reflected by Shikari is pretty profound: from 28% to 6% in one case, and in another down to 1.2%! I'm liking this already.

It took about 4-5 hours to convert a fairly simple scalar function to a working altivec one. I need to get the Altivec Cross-reference tattooed on my arm or something.

quote:Boy, I wish I knew something--anything--about AltiVec

Hey, it's not really that hard. If I can do it for a real-time video app then anyone can do it. I only got back into C last May and video programming in Sept, so I'm hardly an old pro. If you know C then you can do altivec. Pick a weekend, grab the docs and examples out there, and get to it!

I also have dreamed up a new killer app for Altivec programming: Altivec Assistant. It takes an equation like a = (a*b) + c and spits out the altivec calls with profiling comments. A great way for the intro programmer to learn and a time-saver for the pro. Anyone want to take a crack at it? (all eyes turn to hobold, just kidding)
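For what it's worth, the float-vector case of that exact example already collapses to a single intrinsic, so the assistant's output could literally be one line:

  /* a = (a*b) + c, four floats at a time, one fused multiply-add */
  a = vec_madd(a, b, c);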

For folks who just want to dip their toes into altivec, and have a little fun and bragging rights ... there are a bunch of routines in the Ars Testbench which are really easy "low hanging fruit" for altivec ... the single-precision Vandermonde Eqn one comes to mind, and there are others. I deliberately left these undone, hoping others would take them on.

The whole BF catfight largely died from exhaustion ... but if you want to hold ars/MacAch bragging rights for something then you can pretty easily fiddle with some very tiny code fragments (some of these routines are no more than 10 lines long), and get huge factors of speedup, and have your name on the Mac version of the ars-testbench as a contributor.

If you look in the old altivec thread at my matrix3x3 example ... there are things where just adding explicit register declarations, loop unrolling, and adding a dst will provide factors of 3 for _scalar_ code ... you don't even need to touch altivec.
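For lurkers: dst is the altivec software-prefetch instruction. A hedged sketch, with the control word encoded the way I remember it from the Motorola PIM (block size in vectors, block count, byte stride):

  #include <altivec.h>

  #define DST_CTRL(size, count, stride) \
      (((size) << 24) | ((count) << 16) | (stride))

  void start_prefetch(const float *src)
  {
      /* touch 8 blocks of 4 vectors each, 64 bytes apart, on stream tag 0 */
      vec_dst(src, DST_CTRL(4, 8, 64), 0);
  }

  /* ... compute ... and vec_dss(0) to shut the stream down when done */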

Like hobold, I am often attracted to rather outre problems for the fun of it... and I'll mention one you wouldn't think of probably ... an altivec chess engine.

This couldn't be "pure" altivec all the way without it being a bizarre but not useful tour-de-force ... but there are portions of typical chess engines which dominate the total execution, and are altivec-able.

quote:Originally posted by gcc:I also have dreamed up a new killer app for Altivec programming: Altivec Assistant. It takes an equation like a = (a*b) + c and spits out the altivec calls with profiling comments. A great way for the intro programmer to learn and a time-saver for the pro. Anyone want to take a crack at it? (all eyes turn to hobold, just kidding)

quote:Originally posted by gcc:Hey, it's not really that hard. If I can do it for a real-time video app then anyone can do it. I only got back into C last May and video programming in Sept, so I'm hardly an old pro. If you know C then you can do altivec. Pick a weekend, grab the docs and examples out there, and get to it!

Ahh, for my lack of weekends and things to vectorize! At the moment I'm scrambling to get some code done in a desperate attempt to get some income--trying to get on my feet, as it were. So I'm not sure I can afford the time to learn AltiVec: not because it'd take a weekend, but because it would give me other distracting ideas, and I'm not sure I'm disciplined enough to just write them down and keep working at my main project. I am easily distracted by ideas.

I think part of my problem is that I'm only a programmer to support my interests as an artist. So I bring the two together; I write tools, I write OpenGL demos, et cetera, because nothing like what I want exists.

I think I can relax a bit once I've got something useful made out of the program I'm working on now. And I'm certain I'll use that opportunity to learn more, and AltiVec is a good candidate

quote:Originally posted by BadAndy:For folks who just want to dip their toes into altivec, and have a little fun and bragging rights ... there are a bunch of routines in the Ars Testbench which are really easy "low hanging fruit" for altivec ... the single-precision Vandermonde Eqn one comes to mind, and there are others. I deliberately left these undone, hoping others would take them on.

Well, it sounds like something I could give a shot once I have a G4 and a bit of a relaxed schedule.

quote:The whole BF catfight largely died from exhaustion ... but if you want to hold ars/MacAch bragging rights for something then you can pretty easily fiddle with some very tiny code fragments (some of these routines are no more than 10 lines long), and get huge factors of speedup, and have your name on the Mac version of the ars-testbench as a contributor.

w00t. That would be cool.

quote:Like hobold, I am often attracted to rather outre problems for the fun of it... and I'll mention one you wouldn't think of probably ... an altivec chess engine.

This couldn't be "pure" altivec all the way without it being a bizarre but not useful tour-de-force ... but there are portions of typical chess engines which dominate the total execution, and are altivec-able.

No. This is a pain. I know I could learn AltiVec wicked fast if I had incentive to do so. I just don't want the incentive when I'm trying to focus on other things.

Bad BadAndy! Bad!

I think I'm going to go clone myself a few times. I will have an "AltiVec" clone and a "electronic engineer" clone and a "superstar DJ" clone and hopefully not a "fascist" clone (there are logistical problems with cloning, you see).

Gcc, I guess you are busy reading, but if you have any questions concerning vectorization of unwieldy algorithms, feel free to post them.

The same goes for anyone else willing to give AltiVec a try.

I'm trying to write an Altivec version of the collision detection function in my program, but I don't really know enough to phrase the questions I have (it's my first Obj-C/Cocoa program, so I'm focusing on that before focusing on Altivec). Basically it works by keeping a 2-bit mask of each image and doing a bitwise AND on each overlapping pixel. If anything non-zero shows up, it means they're overlapping (I'm checking bounding boxes first, and only using this for a precise check if they overlap).

This code is very straightforward, and as far as the direct algorithm flow is concerned I see no way to improve it.

You don't show loop or load/store motion tho. Note that the vperm (vec_merge? etc) instructions have 2 cycle latency and the vec_mul? more... and you should unroll the loop at least 2x so the instructions can be interleaved to bury these latencies.

Similarly, this will allow good load hoisting.

These issues won't matter much if this operation alone is DRAM to DRAM, but my assumption is that this operation is part of much more computation, on cached data.

quote:Originally posted by gcc:I'm trying to write an Altivec version of the collision detection function in my program, but I don't really know enough to phrase the questions I have (it's my first Obj-C/Cocoa program, so I'm focusing on that before focusing on Altivec). Basically it works by keeping a 2-bit mask of each image and doing a bitwise AND on each overlapping pixel. If anything non-zero shows up, it means they're overlapping (I'm checking bounding boxes first, and only using this for a precise check if they overlap).

Starting with algorithmic optimizations is always a good idea.

AltiVec will probably be of help for this specific problem, but there will be some coding work necessary to align the bit masks of different objects with each other. Also, if most objects are rather small, it might not be worth the trouble writing vectorized code. One vector register can hold 64 two-bit masks, but if objects tend to be smaller, much of this potential computing capability is wasted.

BTW, for this specific problem, you can try to write vector code using the scalar registers (bit-wise boolean operations don't care about operand size anyway). Each 32 bit integer can hold 16 pixel masks, but you have to align everything by hand with shifts.
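A minimal scalar sketch of that inner test, assuming the two mask rows have already been shifted into mutual alignment and padded to whole words:

  /* nonzero iff any overlapping pixel has mask bits set in both images */
  int rows_collide(const unsigned int *a, const unsigned int *b, int nwords)
  {
      int i;
      for (i = 0; i < nwords; i++)
          if (a[i] & b[i])
              return 1;
      return 0;
  }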

quote:Originally posted by gcc:I did work on an app that used 'tiling' to help fit partial images into the tiny caches of athlons and p3s. I'm not sure the benefit of this wasn't outweighed by the somewhat unwieldy code required to do it.

Before I forget it, there is a less effective trick to improve cache hit rates which is far easier to code than full-fledged tiling. I call it the 'zig zag rule'.

When you process a large chunk of data in several subsequent passes, each pass being a more or less sequential sweep over all the data, you can hit a worst case scenario when that data is just a little bit larger than the capacity of the available cache. Say, you process 480 rows of pixels from top to bottom, and L3 can hold circa 400 rows. When you start the next pass from top to bottom, the topmost 80 rows have to be fetched from main memory again. However, this newly fetched data will displace another 80 rows from the cache, namely rows 80 to 160 ... so you end up fetching the data from RAM in every single pass.

This can be avoided simply by alternating the direction of processing between top down and bottom up. In case your routines are programmed using an explicit Y-axis stride (common for code that handles variable image width), it's just a matter of switching the sign of the stride and initializing a starting pointer. The inner loop body will remain unaffected.
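In code, the whole trick is a sign flip; a sketch, where process_row stands in for whatever the pass actually does:

  /* hypothetical per-row effect */
  void process_row(unsigned char *row, int rowbytes);

  void process_pass(unsigned char *img, int height, int rowbytes, int pass)
  {
      /* even passes sweep top-down, odd ones bottom-up; the rows still
         warm in the cache from the end of the last pass get touched first */
      int down = ((pass & 1) == 0);
      unsigned char *row = down ? img : img + (long)(height - 1) * rowbytes;
      int stride = down ? rowbytes : -rowbytes;
      int y;
      for (y = 0; y < height; y++, row += stride)
          process_row(row, rowbytes);
  }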

quote:Originally posted by BadAndy:You don't show loop or load/store motion tho. Note that the vperm (vec_merge? etc) instructions have 2 cycle latency and the vec_mul? more... and you should unroll the loop at least 2x so the instructions can be interleaved to bury these latencies.

The vector complex unit (handling integer multiplication and sum across) has a latency of 3 cycles in the G4 (7400/7410) and 4 cycles in the G4+ (7450/7455).

Unrolling twofold should be enough. There is an invisible cost when you unroll more than absolutely needed: you occupy more vector registers, which then need to be saved/restored on subroutine calls and interrupts. Things like 'loop pipelining' (also called 'modulo scheduling') might dictate rather more unrolling at times, though.

Gcc, if you are not clear on how load hoisting can be done, give us a nod.

quote:Gcc, if you are not clear on how load hoisting can be done, give us a nod.

I beat this one to death with the matrix multiply 3x3 in the old altivec thread (which is the altivec "YIQ->PAL" algorithm in ars testbenchX now ... a terrible name for what is really a floating-point bandwidth test in disguise).

And hobold's got a point about saving/restoring registers ... BUT

* I haven't seen a compiler yet that really does anything except set VRSAVE to all ones ... so unless you hand-tune that, this is a non-issue ... and

* the benefits of load-hoisting enough to cover L2 cache latency are huge on osX because the L1 gets trashed so often.

Often the biggest reason not to unroll is the impact on instruction cache usage ... but I haven't found a problem yet where further unrolling (within the limit of free registers) actually cost me speed.

quote:Originally posted by BadAndy:* the benefits of load-hoisting enough to cover L2 cache latency are huge on osX because the L1 gets trashed so often.

Being able to cover L2 latency is a very valuable performance feature, but I have been lazier in this regard.

My 'ideal model' for tuned code always involves prefetching to get data into L1, with later loads only scheduled to cover L1 latency.

(I'm probably a bit biased concerning register pressure, because I once implemented a handful of wavelet transforms (only scalar code) that all ended up using just short of 30 registers even without aggressive load hoisting. Keeping register values live for two or three more cycles would have killed performance.)

I'm a bit dense with a lot of this terminology, so I might be way off here. Also, there might be a fair number of lurkers interested in getting this straight as well. At any rate, the scalar code needs some optimizations in addition to adding altivec for the G4s.

quote:Originally posted by gcc:I have to admit that I'm not familiar with the term 'load hoisting', but I checked out BadAndy's post about it: http://episteme.arstechnica.com/eve/forums?a=tpc&s=50009562&f=8300945231&m=8790959504&r=5450976084#5450976084

You're talking about using more local temp variables in place of the all-on-one-line equations, right? So for the scalar code above:

This will hide the latencies of memory reads. Load instructions will have some cycles of latency, even if they hit in cache.

quote:I'm a bit dense with a lot of this terminology, so I might be way off here. Also, there might be a fair number of lurkers interested in getting this straight as well. At any rate, the scalar code needs some optimizations in addition to adding altivec for the G4s.

Just ask! Especially when I use obscure terminology ...

I'll try a tiny but illustrative example to show how loop unrolling and load hoisting can go together.

Original loop:

for (i = 0; i < N; i++) { dest[i] = WorkOn(source[i]); }

(You would not want to do a subroutine call in the loop body, this is just an illustration.)
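Unrolled twofold, with the loads hoisted to the top of the body (a sketch of the transformed version):

  for (i = 0; i + 1 < N; i += 2) {
      s0 = source[i];      /* both loads issued up front, so their   */
      s1 = source[i + 1];  /* latencies overlap with the work below  */
      dest[i]     = WorkOn(s0);
      dest[i + 1] = WorkOn(s1);
  }
  /* plus one cleanup iteration if N is odd */

The two loads no longer sit immediately in front of their uses, so the processor has real work to do while they complete.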

Something I've been thinking about for a while is a "vector algorithm description language". The idea is to have a simple expression language which describes a series of operations on scalar values which need to be applied along vectors. The translator then converts this into C/C++ code with the AV intrinsics. Since my problems usually involve conversion from structure -> planar -> structure, this would be included in the generated code. With the output C/C++ code you can hand-tweak it, plus the compiler gets a crack at optimizing it (if that's a good thing).

The VAST vectorizing compiler does things along these lines, but it requires you to express your algorithm as a loop in a variation of C/C++ and then compiles it straight to machine language. This means you can't get in and optimize, and you have to be careful to write your original code in a way that vectorizes nicely.

This approach is much like GPU vertex programming -- you specify the algorithm for one vertex, and the system parallelizes it for you.

quote:Originally posted by programmer:The VAST vectorizing compiler does things along these lines, but it requires you to express your algorithm as a loop in a variation of C/C++ and then compiles it straight to machine language.

It's been a while since I checked out VAST/AltiVec, but back then it was just a preprocessor that parsed ordinary C/C++ code. Its output was also C/C++, but with vector intrinsics as specified in the Programming Interface Manual.

quote:Originally posted by gcc:That's nearly an 8x (7.227) improvement in this function! This is without any of hobold and BadAndy's load hoisting and loop unrolling tips.

Don't let us pressure you into an optimizing death match! :-) If some routine is fast enough after a first level of tuning, direct your attention to wherever Shikari tells you.

Using a profiler is the first and most important step in any tuning effort anyway. It would probably be a mistake to keep tuning a piece of code that accounts for a negligible percentage of total runtime - unless you have tuned everything else already and need that last bit of speed.