Altivec Optimizations and other PPC Performance Tips

This thread is an extension of something brought up in the now epic 'Future Apple CPU' topic. BadAndy posted about how Altivec is really only part of the PPC/G4 performance equation, and that prompted a response from me, which follows:

quote:This post is probably straying off topic but I couldn't resist. I should probably start another thread or revive the old Altivec thread.

quote:IMO 90% of the manual code optimization effort is in streaming (or other cache-hints) to bury load latency... the rest is in instruction scheduling, and the compilers are getting better at the latter (tho gcc 3.1 is _still_ sucktard when optimization is turned on at trying to reorder instructions to "improve" dispatch IPC, in the process destroying load-hoists which are sufficient to bury L2 latency, which is often the real issue. Since the L2 latency is inside the 16 cycle limit, load hoisting to bury L2 latency can bring _dramatic_ performance gains for many numerically-intensive codes, particularly altivec).

BadAndy

I'm currently 'Altivec'ing' a real-time video processing app. Everyone I talk to seems to mention things like streaming and cache-hints and the like, but I'm at a loss as to how this is actually implemented. I'm using gcc 3.1 and it's a bit troublesome to hear that it 'is _still_ sucktard when optimization is turned on...', so obviously I want to learn what's going on and how to improve the situation.

quote:

What is unambiguously true is that without OOOE, standard C-ish tight loops with load/store motion and no cache hints are badly penalized on G3/G4/G4+ compared to the newer x86 ... they require loop unrolling, often explicit register allocation (because the compilers are still stupid about this), load hoisting, and dispatch scheduling. Compilers still don't do this anywhere near so well as I can.

Currently the app has 'standard C-ish tight loops' for the scalar code, and I think the altivec code might fall into the 'load/store motion and no cache hints' category. All of the Apple stuff and other source I've seen seems to be like this too, so what changes need to be made for maximum efficiency and speed? Apple seems to imply that writing altivec code will magically make your apps slay, and that it's really a paint-by-numbers type thing. But when the topic of altivec is brought up by those in the know, these other issues are often the first thing mentioned.

You and others like hobold know this stuff inside out so if you could point me to some good example code for dealing with this I would be very thankful.

gcc

So, rather than derail that thread any further, I started this one to discuss some of the tricks of the trade concerning getting the maximum performance out of the current crop of Macs.

quote: gcc ... see if you can "revive the old Altivec thread" by posting a _manageable_ scrap of what you need to do ... and likely I and hobold and others will chime in. Infopop may not let you do that, an appeal to the mods might make sense because it seems to me smarter to restart it than start a new one.

The macros for dst and cache hints (dcbz etc... you need em!) are working for 3.1 ... contributed by Ian Ollman and others.

I provided a bunch of examples in the old altivec thread... my general advice is so basic it is obvious:

1. always loop unroll to the limit of the registers

1b. always use explicit "register" keyword allocations ... gcc does some stupid register allocations at times if you don't

2. load-hoist to allow 11 cycle latency ... or as much as you can get. This will swallow L2 load latency ... and in a multi-tasking OS your L1 is getting stepped on so often that lots of loads are L2
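To make that advice concrete, here is a rough sketch (mine, not BadAndy's code) of what an unrolled, explicitly registered, load-hoisted inner loop looks like; src, dst, n and bias are placeholders (vector pointers, vector count, and some constant vector) and the vec_adds is just a stand-in for the real work:

    /* sketch: 4-way unrolled, all loads hoisted to the top of the iteration,
       explicit "register" allocations so gcc doesn't do anything stupid */
    register vector unsigned char s0, s1, s2, s3;
    register vector unsigned char d0, d1, d2, d3;
    long i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 = vec_ld( 0, src);          /* loads issued early so their  */
        s1 = vec_ld(16, src);          /* latency is buried behind the */
        s2 = vec_ld(32, src);          /* computation below            */
        s3 = vec_ld(48, src);

        d0 = vec_adds(s0, bias);
        d1 = vec_adds(s1, bias);
        d2 = vec_adds(s2, bias);
        d3 = vec_adds(s3, bias);

        vec_st(d0,  0, dst);
        vec_st(d1, 16, dst);
        vec_st(d2, 32, dst);
        vec_st(d3, 48, dst);

        src += 4; dst += 4;            /* vector pointers: 4 vectors = 64 bytes */
    }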

I have a real-time video app that's a natural fit for altivec, so after coding up some scalar processing code, it's time to altivec. In this thread I will present a simple function for adding two images together and clamping the result.

dstImage = clamp(srcImage1 + srcImage2);

clamp is a macro that clamps values between 16 and 240 because I'm using YCbCr/YUV. Pixels are represented using 16 bits and two pixels share U and V values (U Y V Y).
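For reference, the clamp macro itself is nothing fancy; something like this (a sketch of the scalar version, the real macro may differ in detail):

    /* clamp a value to the legal YCbCr range 16..240 */
    #define clamp(x) ((x) < 16 ? 16 : ((x) > 240 ? 240 : (x)))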

So the scalar goes a bit like this:

    for (h = 0; h < image.ysize; h++) {
        for (w = 0; w < image.xsize; w++) {
            dstImage[U]  = clamp(srcImage1[U]  + srcImage2[U]);
            dstImage[Y1] = clamp(srcImage1[Y1] + srcImage2[Y1]);
            dstImage[V]  = clamp(srcImage1[V]  + srcImage2[V]);
            dstImage[Y2] = clamp(srcImage1[Y2] + srcImage2[Y2]);
            dstImage++; srcImage1++; srcImage2++;
        }
    }

The basic altivec version replaces the scalar adds with

dstImage[0] = vec_adds(srcImage1[0],srcImage2[0]);

and that's about it apart from casting the buffer to a vector type and changing the loop structure slightly.
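Spelled out, the straightforward vector version looks roughly like this (a sketch rather than my exact code; the clamp step, which the one-liner above leaves out, is done here with vec_min/vec_max against splatted constants):

    /* sketch: saturating add then clamp to 16..240, 16 bytes (8 pixels) at a time;
       buffers assumed 16-byte aligned and a whole number of vectors long */
    static void addClampVec(unsigned char *dstImage, const unsigned char *srcImage1,
                            const unsigned char *srcImage2, long bytes)
    {
        vector unsigned char *d        = (vector unsigned char *) dstImage;
        const vector unsigned char *s1 = (const vector unsigned char *) srcImage1;
        const vector unsigned char *s2 = (const vector unsigned char *) srcImage2;
        /* vec_splat_u8 only takes literals in -16..15, so 16 is built as 8+8
           and 240 comes from the fact that -16 is 0xF0 == 240 unsigned */
        vector unsigned char lo = vec_add(vec_splat_u8(8), vec_splat_u8(8));
        vector unsigned char hi = vec_splat_u8(-16);
        long i, n = bytes / 16;

        for (i = 0; i < n; i++) {
            vector unsigned char sum = vec_adds(s1[i], s2[i]);  /* saturating byte add */
            d[i] = vec_max(vec_min(sum, hi), lo);               /* clamp to 16..240    */
        }
    }

No unrolling, no load hoisting, no cache hints yet.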

So assuming my example is not too trivial, where does all the stuff BadAndy mentioned come in?? I'm particularly interested in points 2, 4 and 5, as I'm using -O3 on gcc 3.1, and praying for the best.

quote:clamp is a macro that clamps values between 16 and 240 because I'm using YCbCr/YUV. Pixels are represented using 16 bits and two pixels share U and V values (U Y V Y).

Ok, you need to go back and start at the beginning of the old altivec thread and wade through it, get Ian's tutorial, browse it... see a lot of the discussion in the altivec thread. BUT .. some _real basics for starters....

This clamp function is itself so trivial that it is _too_little_ computation for anything like optimal altivec code.

Assuming I understand your data storage-form right you have two pixels packed into 32 bits, as U Y V Y ... and all you want to do is clamp all 8-bit values to the range 16 ... 240 ?

since these are VSIU operations there is even no real scheduling issue (meaning that the latency is 1 cycle) ... the only issue is load-hoisting and dst usage. I'm not going to write out the whole loop (you gotta do some of your own homework ... see the matrix3x3 example in the altivec thread for similar issues) but basically

you get a big repeated version of an unrolled loop where there are 8 repeats (dat1 .. 8) of the same load / clamp / store body.
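Roughly, the repeats look something like this (a sketch; lo16 and hi240 stand for the splatted 16 and 240 constant vectors):

    /* two repeats shown; the real loop has dat1..dat8 at byte offsets 0..112,
       with min+max being the two VSIU ops per vector */
    dat1 = vec_ld( 0, src);
    dat2 = vec_ld(16, src);
    /* ... dat3..dat8 loaded the same way ... */
    dat1 = vec_max(vec_min(dat1, hi240), lo16);
    dat2 = vec_max(vec_min(dat2, hi240), lo16);
    /* ... same clamp for dat3..dat8 ... */
    vec_st(dat1,  0, dst);
    vec_st(dat2, 16, dst);
    /* ... stores for dat3..dat8, then advance src/dst by 8 vectors ... */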

by itself (since there are 16 VSIU operations in the loop and they can't run faster than one per cycle) this loop will bury L2 latencies ... add a dst in the loop (you gotta do that, read the altivec thread for how) and it will trivially run at the bandwidth limit. Follow nibs' description of how to set up real-time slice priority in Darwin/osX and this will be as good as it gets.

The computation is so trivial there really is nothing else to fiddle with if this example is taken at face-value ... it is the dumbest possible "blitter."

Note also that as I've shown it the computation is write-over (what you imply) ... if not, then two dcba or dcbz in the loop (because each loop does 16 * 4 bytes, or two cachelines) will increase performance by 30% or so by eliminating unneeded reads for data you'll overwrite anyway (given that write-combining isn't working ... which you shouldn't assume it is).

However the REAL point is not this internal fragment ... it is that you are doing some bigger set of manipulations, perhaps some sort of color-coordinate transformation (from RGB?), possibly including gamma corrections? (lookups will slay you here ... there is an incomplete example I was driving at and will finish if need be about how to _compute_ gammas rather than look them up) and then the storage form is being changed, right? and then somewhere along the line you clamp and write them out.

The whole frickin' main issue of altivec is to do absolutely as much computation as you can with the data registered ... don't read and write more than once to main memory if at _ALL_ possible.


quote:Originally posted by BadAndy:The whole frickin' main issue of altivec is to do absolutely as much computation as you can with the data registered ... don't read and write more than once to main memory if at _ALL_ possible.

Doesn't this sort of follow with just about any assembler? I have only minimal experience here (I wrote a heavy test app for some PPC asm code a friend of mine wrote; it was a producer-consumer solution designed for multithreading so there's a lot of concern about syncing and the like), and none at all with AltiVec specifically, but don't you generally want to make things as atomic as computerly possible anyhow?

Always dangerous to put an "afterthought" on something like this as you are wrapping it up ... and particularly an afterthought addressing a mildly esoteric issue, dcba/dcbz usage:

quote:Note also that as I've shown it the computation is write-over (what you imply) ... if not then two dcba or dcbz in the loop (because each loop does 16 * 4 bytes, or two cachelines) will increase performance by 30% or so by eliminating unneeded reads for data you'll overwrite anyway (given write-combining isn't working ... which you shouldn't assume is)

Estupido mio ... this loop as described writes 8 _vectors_ ... two vectors per cacheline (at present in G4, G4+ ) so there would need to be _FOUR_ dcba or dcbz per loop ...

Adding these cache pre-write hints is the absolutely last step of program tweaking and they are used only if you are NOT writing the new data over the old data (otherwise the cacheline is already there).

In principle write-combining "should" eliminate the need for these (in other words the memory interface can in principle detect that you have writes that completely overwrite a cacheline (presuming they arrive closely enough together in time), so there is no need to read the cacheline in from main memory before writing it back out) -- but write-combining has been a murky issue for G4, G4+ ... and for reasons unknown to me Apple has disabled write-combining at various times ... likely to protect against subtle errata (the early MPC7450 had a big pile of errata).

What this means is that you may test adding the dcba (or dcbz) and get no improvement on your system, but it shouldn't hurt you, and if/when the code runs on something which isn't doing write-combining it will get you 30% or so for a bandwidth-limited task.
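If you don't have Ian's macros handy, one way to issue dcbz from C is a tiny inline-asm wrapper (a sketch; the constraint choices here are mine):

    /* sketch: establish (and zero) the 32-byte cache block containing
       addr+offset in the data cache without reading it from main memory */
    static inline void my_dcbz(void *addr, long offset)
    {
        __asm__ __volatile__ ("dcbz %0,%1" : : "b" (addr), "r" (offset) : "memory");
    }

With the loop as corrected above (8 vectors = 4 cachelines per iteration) that means four of these per iteration, at offsets 0, 32, 64 and 96 from the destination pointer, issued before the stores.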

And so if the whole issue of dcba usage is too big a headache to fool with... then just forget it, even if the code isn't writing the data over the input. But, carrying on.....

Another nuisance point is suppose you are writing a "general purpose" blitter where the algorithm is passed read/write pointers and it is intended to allow either overwrite (passed pointers are equal) or not at the user's discretion? In this case there isn't a truly tidy solution ... the simplest thing is to omit the dcba.

Other obvious "solutions" are to have two copies of the loop, with and without the dcba and the pointer equality is tested once on entry and the correct loop entered (ugly and wasteful of code store), or to have an internal test and branch inside the loop (ugly ditto). These solutions bother me a lot.... particularly when one hopes that write-combining eliminates the need for all of this anyway.

This leads to one slightly skanky trick I have used which is to do the following... your routine declares 3 local "junk vectors" (which can't be registered, duh) these will end up on the stack-frame (and hence automatically discarded when your routine exits):

Thank you for your more than generous replies. I have been soaking up a lot of info the past 24 hours.

You are right that my example is very trivial, and most of what I'm doing is a bit more complex, but it still probably falls into the 'simple blitter' category. For example, blending two images together would be more work, but still would not keep the vector unit busy enough to hide the wait for memory. I think once I move into doing 3x3 matrix ops and the like, altivec will pay much larger dividends.

I do have a few specific questions at this point:

- how do I figure out the best block size for vec_dst() to prefetch? Ollman's tutorial has his own function to determine this but he doesn't go into too much detail about why he chose the size to load. He basically says to try 64, 128 or 256 bytes, and that the streams should overlap (p 57). He also cautions about the stream potentially outpacing your code, which would lead to errors. Do I need to factor in the length of my processing loop or how many vectors I need to load for each processing iteration, etc? Or is this process one of trial and error based on profiler results? It's a bit fuzzy whether Ollman is using the profiler results as illustration or as part of his own process (I suspect it's both).

Another thing about the prefetch streams: if I have two image buffers in memory that I'm processing, should I explicitly do a vec_dst() for each buffer, and should I use a different stream id for each?

In general how efficient is it to do comparisons on the vector unit? Some of the processing I do is keying, which is based on per-pixel comparisons of two images. If all I'm doing is making a comparison and moving memory based on that, are the vector comparison ops more efficient for this or is it a waste of resources? I would think that this is maybe too trivial to bother with but branching inside a huge for loop is something that tends to make people cringe (in this case it's unavoidable).

Again, thanks for all your input; there is way more to learn about this than just modifying my scalar ops to be vector ones. I have a lot more homework to do now.

About vec_dst: don't get your knickers in a twist ... _try_it_and_see_. Any half-way decent dst pays huge dividends for blitter-like codes... and you won't be able to get a perfect "optimization" across all Macs out there anyway.

Generally I find setting up the read dst(s) to get about 4 to 8 vectors ... and about 300 -500 cycles "ahead" (assuming that there wasn't any MP) seems about right. This fits fairly well with just restarting the dst once a loop in a fairly unrolled loop.
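As a concrete sketch of that (the DST_CTRL macro and the stream ids are illustrative, not anything official), restarting one dst per input buffer at the top of each unrolled iteration looks about like this:

    /* dst control word: block size (in vectors), block count, and stride
       (in bytes), packed per the AltiVec spec */
    #define DST_CTRL(size, count, stride) \
        (((size) << 24) | ((count) << 16) | ((stride) & 0xFFFF))

    /* start/refresh a prefetch of 8 vectors a little ahead of the current
       read position, one stream per source buffer */
    vec_dst(s1 + i + 8, DST_CTRL(8, 1, 0), 0);
    vec_dst(s2 + i + 8, DST_CTRL(8, 1, 0), 1);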

With Darwin/osX each time the tasker interrupts your task it kills all dst. (and when your task is swapped out generally L1 is toast too) This has good and bad implications .. the only good part is it lets you be piggy about using all the streams if you want to.

The bad parts are that:

* it really encourages you to make tasks real-time with fairly long time-slices ... on a non MP system these might get derogated tho ... I've no experience with this. More guidance is needed from Apple on this subject. I program for MP machines almost exclusively.

* if L1 is getting trashed then using the dstt (the transient one) isn't such a good idea... you want the loaded data to make it to the L2, and L2 latency load hoisting _IS_IMPORTANT_. This is another "you need to be piggy, if you want performance in the multi-tasking environment." In an MP environment with real-time slices for _real_ serious code then dstt may make sense for some algorithms (where you are reading local data that _won't_ be reused, and you have data in L2 that will be).

* I'm not sure how successful the new "affinity scheduling" is. On MP machines if the task was running on one CPU and then is "pogo'ed" to the other CPU all the cache stores are nearly wasted. Apple now has implemented affinity scheduling, but this remains another strong argument for setting up real-time priority on MP machines... where you can more safely assume they won't be derogated. (nibs ... any commentary on this?)

* since your dst are subject to frequent death if not real-time you need to take (i.e. restart) your dst in small chunks ... this favors one dst per loop anyway.

quote:Originally posted by gcc:In general how efficient is it to do comparisons on the vector unit? Some of the processing I do is keying, which is based on per-pixel comparisons of two images. If all I'm doing is making a comparison and moving memory based on that, are the vector comparison ops more efficient for this or is it a waste of resources? I would think that this is maybe too trivial to bother with but branching inside a huge for loop is something that tends to make people cringe (in this case it's unavoidable).

"comparisons" is too vague. I don't understand your "keying." You want to avoid conditional branching inside the vector loops if at all possible. The whole game is to do these functions using saturation arithmetic and/or vec_sel ... look at vec_sel and then the comparison operators with which you can generate the vec_sel selector mask. These allow you to implement most conditional selections of data _without_ branching.

Also, once you get your mind around it, vec_perm can be used to do impressively tricky things in many cases. It will take you awhile to "rethink" how to implement this kind of stuff. When you run up against a "can't do it without branching" bottleneck ... post it here or on altivec.org and see what replies you get.

The other big, big, big issue is that "SIMD don't do lookups" ... using vec_perm you can do 8bit (256 element) table lookups fairly efficiently ... beyond that you get in trouble.

Faced with an algorithm that appears to do a lookup the absolutely first thing to consider is can it be converted to a computable function? _THAT_ is the recipe for vector speed....

By keying I mean luma and chroma keying. Basically if a pixel in imageA falls into a certain range of user defined values then it is replaced with the corresponding pixel in imageB. This of course requires the comparison to be made on a per pixel basis inside the processing for loop. Hence my question about doing vector comparisons. I think that adding something like a blend function after the compare would better utilize the vector unit. I guess I need to think in terms of 'doing more per pass is actually more efficient' in a lot of cases in vector-land.

Thanks for the vec_dst() advice. I'll look into the info about the tasker as well. I was thinking that if the video processing is hammering the CPU pretty hard then the cache would be filled with the right data. That's probably not the case though; some silly UI event or daemon might be clobbering it.

By keying I mean luma and chroma keying. Basically if a pixel in imageA falls into a certain range of user defined values then it is replaced with the corresponding pixel in imageB.

_EASY_ use the vector compares to test the range of pixels in imageA to get a mask for vec_sel. That's it ... two vector compares, a vec_and (or vec_or), and a vec_sel ... no branches ... vector wide, fast as can be.
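A sketch of what that looks like for one vector of pixels (keyLo, keyHi, srcA, srcB and dstP are placeholders; the key range would be the user-defined limits splatted into vectors):

    /* branch-free keying: where a byte of imageA is inside (keyLo, keyHi),
       take the corresponding byte of imageB instead */
    vector unsigned char a = vec_ld(0, srcA);
    vector unsigned char b = vec_ld(0, srcB);
    vector bool char inKey = vec_and(vec_cmpgt(a, keyLo),   /* a > keyLo */
                                     vec_cmplt(a, keyHi));  /* a < keyHi */
    vec_st(vec_sel(a, b, inKey), 0, dstP);                  /* 1-bits select b */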

It's time to get cracking on a brand new TestBench. The old one had way too many flaws, too little effort put into it, and was never designed to end up where it did. Unfortunately I have zero time to contribute to something like this these days, but it's going to be an important tool to have. Anyone up for starting it? BadAndy? Hobold? Nibs? IlleglWpns?

quote:Originally posted by MrNSX:It's time to get cracking on a brand new TestBench.... Anyone up for starting it? BadAndy? Hobold? Nibs? IlleglWpns?

Regards,

MrNSX

Hmmmmm.... interesting question. I think the _first_ question(s) should be what are we trying to measure? And then the next questions are "what is 'fair' ... what does that mean?"

I find the BF _so_ distressing I don't really want to build another acrimonious battle in the BF... it just doesn't make sense to me.

Things I would say, having experience playing with the existing ArsBench:

1. It is a grab-bag of code snippets, but many of them are redundant, and many of them are mislabeled in terms of what they really exercise (particularly on Macs most of the FP codes are really bandwidth tests).

One thing in this regard I would recommend is that every "benchmark algorithm" should be tested over memory-ranges from smaller than L1 up through those that exercise the external bandwidth fully... often that scaling (for different machines) is the interesting part. This also makes it immediately apparent when an algorithm has become a bandwidth test in disguise.

2. Endless battles over the semantics of "algorithm" ... I personally do not consider any particular C-language representation of an algorithm "definitive" compared to any other ... yet clearly, particularly on machines with limited OOOE ... one can often make profound changes in execution speed by changing the C-source in ways that do not change the fundamental nature of the computation.

3. I think the _process_ of "gaming the benchmark" is actually more interesting than the benchmark results! The fact that a really knowledgeable programmer can come in and make factor-of-ten improvements in some cases... and then the degree to which these techniques are portable or processor specific ... that seems to me to be more interesting than the raw results or BF slanging matches.

4. I think a "good" benchmark set of algorithms would have the following properties:

a) they are _useful_ algorithms and are believed to be the most efficient forms numerically available at the time selected.

b) they are fully open source (duh)

c) they are surrogates for a wider range of computation that are relevant to our userbase

d) somehow, we have a set of agreed "rules" to cover what is "fair" in fiddling with them to improve performance on particular platforms. In contrast to the folks that play the spec games ... we _don't_ control our own compilers, so we can't play the "fudge the compiler to win the game" the way the big-boys do. This is good and bad, but what it means is that it is as much about compilers as anything else.

e) results checking must be mandatory. For some algorithms (most FP and even some integer) this means that one gets into wrangles over error-limits.

f) I would argue that all expressions which can meet the results-checking criteria for any input are by definition "the same algorithm".

One solution would be to have different categories ... including an "anything written in C" category, a vectorized category, and an "anything-goes human optimization" category (write the damn thing in assembler if you want). All of these must pass results checking on arbitrary inputs.

With these rules the algorithm expressions themselves may morph as they get more efficient. That's reality too folks!

1. It is a grab-bag of code snippets, but many of them are redundant, and many of them are mislabeled in terms of what they really exercise (particularly on Macs most of the FP codes are really bandwidth tests).

Agreed. It started off with a different motivation (trying to measure specific latencies, stalls, and IPC over different architectures) and then morphed into a pseudo CPU bench. The problem was that I had very, very little time to devote to it, and I couldn't search the net looking for ideal algorithms. I grabbed whatever I could find and just slapped it in, not caring about readability, organization, or even suitability of the algorithm necessarily. The idea was that if we had enough tests (e.g. 100 or more), there would be enough different paths exercised that it might actually result in a somewhat useful number at the end. Unfortunately we never had the time to put in anywhere near that number of algorithms, which caused the results to be badly skewed if a compiler did a superb job optimizing/cheating on a particular test.

Something started anew would have to take better, more relevant, open sourced algorithms, and the whole project would have to be more cleanly organized. It would have to be more modular/extensible, optionally multi-threaded, and as you mentioned allow for different memory ranges (i.e. from small enough to fit into L1 up to large enough to always blow out an L3).

2. Endless battles over the semantics of "algorithm" ... I personally do not consider any particular C-language representation of an algorithm "definitive" compared to any other ... yet clearly, particularly on machines with limited OOOE ... one can often make profound changes in execution speed by changing the C-source in ways that do not change the fundamental nature of the computation.

Not much you can do about that, unfortunately. The stipulation can be that we're not allowed to modify the source from the original open-sourced base. This of course brings up the fallacy of using benchmarks like these to begin with, but it's a good compromise IMO.

3. I think the _process_ of "gaming the benchmark" is actually more interesting than the benchmark results! The fact that a really knowledgeable programmer can come in and make factor-of-ten improvements in some cases... and then the degree to which these techniques are portable or processor specific ... that seems to me to be more interesting than the raw results or BF slanging matches.

Absolutely, and these can be in some optional plugin section where you can present your own alternative to an existing algorithm. We can always have it so that these scores are not used as part of the final score calculation.

e) results checking must be mandatory. For some algorithms (most FP and even some integer) this means that one gets into wrangles over error-limits

Yes, absolutely.

One solution would be to have different categories ... including an "anything written in C" category, a vectorized category, and an "anything-goes human optimization" category (write the damn thing in assembler if you want). All of these must pass results checking on arbitrary inputs.

This can always fall into the 'alternative' section under "gaming the benchmark" where you can do whatever you want to try and speed it up as long as it still passes the results CRC (or whatever) check.

quote:MrNSX: Not much you can do about that, unfortunately. The stipulation can be that we're not allowed to modify the source from the original open-sourced base. This of course brings up the fallacy of using benchmarks like these to begin with, but it's a good compromise IMO.

If you don't allow the source to be fiddled with then you have implicitly chosen a machine/compiler it favors. This may not have been done consciously... but it is inevitably true.

In the specific case of machines with large numbers of registers and limited/no OOOE (e.g. G4, G4+, Sun's Sparc) loop unrolling is a de facto necessity. Compilers generally don't do this efficiently because they try to obey strict end and pointer conditions which human programmers relax.

Presenting the algorithm in loop-unrolled form generally doesn't hurt machines with deeper OOOE... so a strong case can be made that the best (meaning most generally efficient) and most neutral cross-platform source would be loop-unrolled and explicitly registered in any case.

Not doing this merely proves that deep OOOE is needed to rescue sophomoric programmers from their lack of knowledge ... which is a given from the get-go. No reason to build a benchmark to prove that!

A huge amount of the existing "benchmark" codes demonstrate how well a particular compiler/CPU can execute stupidly implemented algorithms ... which no _competent_ programmer in search of speed would employ.... when a quick google search will get you a much better one free for the asking.

One solution to this is for Ars to simply adopt some of the standard numerical functions which are publicly maintained and optimized (e.g. BLAS, ATLAS, etc) as part of the suite. This at least means that the algorithm _and_the_implementation_ have been reasonably optimized for each CPU.

quote:BA:I think the _process_ of "gaming the benchmark" is actually more interesting than the benchmark results! The fact that a really knowledgeable programmer can come in and make factor-of-ten improvements in some cases... and then the degree to which these techniques are portable or processor specific ... that seems to me to be more interesting than the raw results or BF slanging matches.

MrNSX:Absolutely, and these can be in some optional plugin section where you can present your own alternative to an existing algorithm. We can always have it so that these scores are not used as part of the final score calculation.

It's that "final calculation" I don't like... nor the rules you suggest governing it. There isn't ONE AND ONLY ONE way to view the results. The instant you provide one formula, whatever it is, with whatever set of criteria, you are imposing a very arbitrary and endlessly-debatable value schema on the results and process.

Let the tests and categories "speak for themselves" ... your statement to the effect that "only the results of the C-algorithm in the form we like matter to _the_ final result" ... WHAT "final result?" There isn't anything "final" about a result if there are demonstrations that other ways produce faster results!

In this circumstance you are also "proving" that C expressions are generally extremely poor forms for efficient machine optimization of numerical processes, because they are too literal and low-level ... for an abstraction model which is NOT the way modern CPUs work. Fortran, PL2, Ada etc are far more efficient syntactic forms for numerical-code optimization. Why prove this over again too?

Programmers exist for many reasons... but tweaking code for speed is definitely one of them. Time is money. A good programmer can often outperform fancier silicon.

One solution to this is for Ars to simply adopt some of the standard numerical functions which are publicly maintained and optimized (e.g. BLAS, ATLAS, etc) as part of the suite. This at least means that the algorithm _and_the_implementation_ have been reasonably optimized for each CPU.

Yes, that was my suggestion BUT if you're going to be an idealist then these results aren't useful either because you could go in and hand tweak it further to be more optimal on a given CPU. We both understand that benchmarks are generalizations and it's an impossible task to try and guesstimate overall performance regardless of which subtests we throw in there. In spite of this, we still accept suites such as SPEC as reasonably good approximations of performance. Why is that?

Let the tests and categories "speak for themselves" ... your statement to the effect that "only the results of the C-algorithm in the form we like matter to _the_ final result" ... WHAT "final result?" There isn't anything "final" about a result if there are demonstrations that other ways produce faster results!

There has to be SOME kind of baseline otherwise what are we measuring? CPU A executing the algorithm using its integer pipeline whereas CPU B using its SIMD pipeline with hand optimized assembler? What good is that comparison other than to say that some individual can get a nice boost in performance by hand tweaking the code? The point is to start with some set of criteria that are relatively equivalent for both CPUs (the same source code for example) and then throw the compilers at it and see what results. This more closely approximates what happens in the real world. You and I both know that 95% of programmers do not bother to go back and hand optimize the code, even in a high level language, to take advantage of the particular architecture. Most programmers don't have the slightest clue about the underlying architecture, let alone how to optimize for it. Obviously changing the type of compiler or what optimizations we enable and such will all influence the final results, often to an alarming degree. What can we do about that? Either we accept that this is not going to be a perfect benchmark, because there is no such thing, or we throw our hands up in the air and just say why bother?

I'm not saying that we shouldn't have hand optimized versions or SIMD implementations of some of the tests. I'm saying that if we're going to use the numbers as some kind of useful comparison of relative CPU performance, then they both better be doing the same thing. It's hard enough to compare with all of the variables introduced by the compilers, let alone changing the source too and then trying to use this as some kind of CPU measure. If the purpose is NOT to use this as a CPU benchmark, but just as a "how fast can I make this algorithm on this particular CPU?" then I agree with everything you've said.

quote:Originally posted by MrNSX:One solution to this is for Ars to simply adopt some of the standard numerical functions which are publicly maintained and optimized (e.g. BLAS, ATLAS, etc) as part of the suite.

Plenty of engineering situations to pick from also. Fluids/Heat Transfer...

By picking something of that sort, the inputs don't need to be fixed numbers necessarily. You can't turn the entire benchmark into a single table lookup if the 'table' of potential inputs is essentially infinite.

If there's too wide a variation in how long a particular test takes based on inputs, it gets turned into an average of a long list of random (or pseudo-random if that helps) inputs. In that case it helps if there's some way to prevent re-writing your bench after the numbers are known.

For instance, for the SETI@Home project, there _is_ a right 'answer' for each block, but _NO_ONE_ knows what that answer is before you run the block! They re-verify blocks... but there's effectively a ridiculous number.

On the other hand, if someone comes up with an approach _that_solves_the_problem_ by doing something unusual -> more power to them. Require the source to be submitted -> if some other CPU can use the same technique -> great. If it is a universal approach -> apply for a patent, it's useful.

quote:Originally posted by KwamiMatrix:Quick question to all you altivec giants: Apple has implemented both a Velocity Engine ASM language, as well as an Objective C (or is it C, or C++) subset, right?

I'm not an AltiVec anything, but... what do you mean by subset? We've got the asm bit (whether it was from Apple or no), and in /System/Library/Frameworks on OS X you'll find veclib.framework. Veclib has a bunch of algorithm implementations coded in two versions: scalar code for Rob and his iMac, and vector code for everybody with a G4.

It's not really a subset, just a toolset that makes use of AltiVec if you've got it available to you.

Perhaps it would be helpful to simply submit fairly high level pseudo-code with large datasets and known results. Individuals can then have at it to write the most optimal code they can. This undoubtedly biases results toward "best programmer" but since nobody is going to agree who that is...

I agree that well known problems from engineering, media, etc are probably best. A judiciously chosen set would provide a decent range of results.

I also think that any open source libraries should be fair game for use (ATLAS etc.)

BA: One solution to this is for Ars to simply adopt some of the standard numerical functions which are publicly maintained and optimized (e.g. BLAS, ATLAS, etc) as part of the suite. This at least means that the algorithm _and_the_implementation_ have been reasonably optimized for each CPU.

MrNSX:Yes, that was my suggestion BUT if you're going to be an idealist then these results aren't useful either because you could go in and hand tweak it further to be more optimal on a given CPU. We both understand that benchmarks are generalizations and it's an impossible task to try and guesstimate overall performance regardless of which subtests we throw in there. In spite of this, we still accept suites such as SPEC as reasonably good approximations of performance. Why is that?

You are raising/mixing two different issues here. The publicly maintained high-performance numerical codes _ARE_ much closer to the optimum which can be achieved for each platform (with C source), precisely because competent programmers (sometimes "the factory team") _do_ work at it. I can tell you from personal experience that it is hard to get a lot of improvement over the ATLAS routines (I can get some tho, particularly on gcc, because it _still_ does stupid instruction scheduling for G4) ... and I have put effort into doing exactly that. In contrast I can blow the doors off most of the Arsbench routines on the Mac with very little trouble indeed.

It would be absurd to maintain, for instance, that the ATLAS routines for Pentium4 are somehow "inferior" in the attention they've received compared to those for the Mac or Sparc ... (in the beginning, with a very new CPU or one that is little used, they will be).

Now these numerical routines _aren't_ the be-all and end-all of computation... but they certainly are one relevant issue.

As far as SPEC goes ... if you like SPEC ... then why do it again? Because Apple/Moto _and_ IBM won't play with G3/G4 class hardware? And why won't they play?

There are many, many reasons here ... one of which is simply that they don't like the unflattering results, but these results stem from many reasons ... a very big one being again that the C-expressions _are_ optimal for single-CPU machines with fairly low register count, fast bi-directional cache I/O for scalar variables, and deep OOOE.

More generally SPEC has the following merit: it says "if you are a programmer who writes codes like THESE ... this is how fast the various machines/compilers will run them." As far as PCs go though, the vast fraction of users never write a line of code themselves, and instead use shrinkwrapped software which one _hopes_ has been written by programmers competent at optimizing for the platform.

The latter issue leads to the endless "application benchmarks" we've seen all over Ars. There is a value to application benchmarks ... but there is also endless specious cant over them and I don't want to go there as far as using them to make any statement about "machine performance." x-platform codes are almost never equally optimized across platforms.

Like it or not a _big_ fraction of the computationally demanding code that runs on "PCs" is _exactly_ the code for which the machines have SIMD short-vector extensions. It is really moronic to argue that a scalar "benchmark" means anything when there is a commonly used SIMD alternative for that "PC-class" CPU which is much faster.

Further, and planting BadAndy's evil little weed into the x86 garden, the altivec PEM has been _formally_accepted_ as a semantic extension to gcc. As far as the open-source community is concerned then PEM expressions _are_ C ... you can't run them right now on anything except g4 and g4+ CPUs ... but they are "C" to gcc, just -faltivec !!! Intel's programming form(s) for MMX/SSE/SSE2 were rejected, and apparently that issue is over and will not be revisited.

I think this is a major reason why IBM decided to add Altivec to the 970.

quote:BA: Let the tests and categories "speak for themselves" ... your statement that to the effect that "only the results of the C-qalgorithm in the form we like matter to _the_ final result" ... WHAT "final result?" There isn't anything "final" about a result if there are demonstrations that other ways produce faster results!

MrNSX:There has to be SOME kind of baseline otherwise what are we measuring? CPU A executing the algorithm using its integer pipeline whereas CPU B using its SIMD pipeline with hand optimized assembler?

Whoa... stop right there. Read above. The vast fraction of altivec code is written in a formally accepted semantic extension to C.

Further, turn it around ... in your circumstance ... _why_ is CPU A using only its scalar pipeline(s) (what you meant) ... because that's all it has which are relevant to the computation? Or, ha-ha-ha ... are the BF x86-lots going to start whingeing about what an "unfair" advantage Macs have because there just happen to be a few good programmers for the Mac ... and there aren't their equals to be found for Intel and AMD CPUs? Do you _really_ want to go there? Or do you want the benchmarks to be purely "this is how fast these CPUs can run our favorite flavor of badly written algorithms"? Which poison do you want to drink?

quote:MrNSX: The point is to start with some set of criteria that are relatively equivalent for both CPUs (the same source code for example) and then throw the compilers at it and see what results. This more closely approximates what happens in the real world. You and I both know that 95% of programmers do not bother to go back and hand optimize the code, even in a high level language, to take advantage of the particular architecture. Most programmers don't have the slightest clue about the underlying architecture, let alone how to optimize for it. Obviously changing the type of compiler or what optimizations we enable and such will all influence the final results, often to an alarming degree. What can we do about that? Either we accept that this is not going to be a perfect benchmark, because there is no such thing, or we throw our hands up in the air and just say why bother?

You've destroyed yourself with your own reductio ad absurdum. You have cornered yourself into the position "our benchmarks measure what bad programmers will achieve on these platforms."

quote:I'm not saying that we shouldn't have hand optimized versions or SIMD implementations of some of the tests. I'm saying that if we're going to use the numbers as some kind of useful comparison of relative CPU performance, then they both better be doing the same thing.

I start pulling my hair out here at the "Alice in the Looking Glass" irrational redefinition of "the same thing." This is all about semantic games about "algorithm" ... and you are playing the Red Queen, for whom words mean only what she wants them to, and who can believe many more than 5 impossible things before lunch.

Replace "algorithm" with "computable function." A computable function is a black box which delivers reproducible, and verifiable outputs for a span of input data... which may be semi-infinite in range (but where individual tests are always finite and the result is computable in finite time in a Turing sense). _Anything_ which can deliver the correct output(s) for all input data cases is an implementation of the computable function.

You are attempting to maintain that a particular expression of the computable function in C is "the algorithm." There is no justification for this whatsoever except Red Queen logic. What it does lead to is the following justification: most programmers are stupid and careless. Therefore most CPU manufacturers add lots of expensive silicon to help shield them from their foolishness. We will define a "fast" CPU & compiler on the basis of how well it executes what the worst programmers write.

You might at least consider that for many purposes the relevant issue is what can the _best_ programmers extract from hardware, for the computable function in question? Seems to me that is at least equally relevant... given that most of the software PCs run is shrinkwrap presumptively written by professional programmers nominally expected to be proficient with their tools? And this is in fact what is different about "PCs" vs SPEC?

BA, I don't want to get into a big debate on this, mainly because I'm not going to be involved in its implementation so it doesn't really matter to me either way. This should also go into its own thread so we don't hijack this one.

I think maybe you've spent so much time writing optimized code where performance is paramount, that you've forgotten that the vast majority of "professional" programmers don't spend a single iota optimizing their code. Their job is to get the functionality in and then at the end of the day they turn off debug, crank up the opt level and ship it.

No, that doesn't extract the best performance out of the machine, but that's also what's out there in the vast majority of open sourced code. Or are you planning on spending years writing your own variations and thereby limiting it to a handful of algorithms? Why not go with SPEC? Well, it's expensive and inaccessible for the average user to run on their machine.

What you're proposing is great, but it's hardly any measure of cross platform performance. Since it's just a "computable function", you can have orders of magnitude difference even within the same platform by coming up with a more intelligent algorithm or by throwing a competent programmer at the problem. If that's the goal, then great, but you also have to realize that it will be a useless measure for trying to compare relative CPU performance. Like I said, maybe that's too ideal a goal, and maybe that was never your intent.

That thing called "computer performance" is not defined precisely enough to truly measure it.

The most fair variant is to have a number of implementations for any benchmarked task, and let the benchmark user decide which implementation to use (i.e. the fastest).

That benchmark should be open source, not rely on third party libraries, and be compiled with a publicly available compiler (to prevent anyone from hiding their tuning tricks within a custom compiler that no one else may use).

This will degenerate into a 'programmer benchmark' after a few iterations of accepting new implementations for tasks. But this is the only way to ensure every machine has at least a chance to show their capabilities.

This implies someone has to maintain stuff like an official commentary on how benchmark scores must be interpreted. And it also implies contributed task implementations must come with documentation explaining the tuning done (otherwise there would be an unfair disadvantage for, say, an x86 assembly coder unable to reverse-engineer the tricks encoded in PowerPC assembly).

I think a good additional requirement would be to demand plain C(++) implementations of anything that is based on significantly differing algorithms than used so far.

Anyway, there is no such thing as a 'finished benchmark suite'. It must be constantly maintained to adapt to changes in the benchmarked hardware, to adapt to 'rule-ravers', and to show to the public that someone cares for that benchmark.

If done right, this project would not primarily be a benchmark. It would rather be a terrific resource for code tuning which also happens to show what is possible on a wide range of machines.

quote:What you're proposing is great, but it's hardly any measure of cross platform performance. Since it's just a "computable function", you can have orders of magnitude difference even within the same platform by coming up with a more intelligent algorithm or by throwing a competent programmer at the problem.

There are many numerical functions where it is provable that there is no (general) solution less than N operations, and algorithms with N operations are known and well tested. These eliminate 'a more intelligent algorithm' in the sense of numerical method. I took your "YIQ->PAL" and demonstrated the reduction to the general form; there is no _general_ form with fewer operations than the 3x3 Matrix multiply + offset.

As far as the coding is concerned, clearly each different CPU will have a differing optimal coding at the machine level.

Even with the restricted domain, say, of ANSI C expressions ... if I rewrite a C-expression in a form which improves performance on "my" machine ... how can it be "less valid" than the expression you chose, provided that my expression executes correctly on your compiler + machine (nevermind whether it is faster or not on your machine)?

Just take my _scalar_ version of the 3x3 multiplyPlusOffset ... it is plain-jane C ... beats the snot out of the original on the Mac... and if I test it on my Athlon linux box it is faster there too, but not by the same margin.
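For concreteness, the per-pixel computation being talked about is just the generic form below (my paraphrase of the general linear transform, not the tuned source):

    /* 3x3 matrix multiply plus offset: out = M*in + b, the general linear
       color-coordinate transform (RGB->YUV, YIQ->PAL, etc.) */
    out0 = m00*in0 + m01*in1 + m02*in2 + b0;
    out1 = m10*in0 + m11*in1 + m12*in2 + b1;
    out2 = m20*in0 + m21*in1 + m22*in2 + b2;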

Interesting debate. I skimmed the last bit of it, so if I missed this point my apologies:

For the vast majority of code these days we don't care how fast it runs because it runs fast enough. That's why things like Java, Python, Perl, etc are acceptable in this day and age -- they are fast enough. There really isn't much point in optimizing or even benchmarking this code.

For code whose performance we do care about, somebody is going to pay at least a little attention to its performance. Failing to do so will make the software uncompetitive and inferior. The Mac is at a disadvantage because many programmers simply don't care about it and thus don't bother to consider how to make their Mac performance better. That is a bigger issue than what the CPU is actually capable of, and really that's what SPEC reflects.

There are some amazingly trivial things that can be done to improve PowerPC performance. One I remember from years ago was a function that internally had an array of 16 variables -- the PC version was written to use an array and do pointer arithmetic. When looking at the PowerPC code generated, it was obvious that the load/store instructions were killing it, so I rewrote it to have 16 separate local variables. The result cut the PC performance in half and quadrupled the PowerPC performance. A quick pass through other parts of the codebase revealed many other similar kinds of things... almost all of which would trade off x86 performance for PowerPC performance. There were even a few which traded off Pentium IV performance against Athlon performance, so this problem goes beyond just PowerPC, but it does impact Apple more severely because their architecture is more different.
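A toy illustration of that kind of rewrite (not the actual function, just the shape of it):

    /* before: the variable array index forces a load/store for every access */
    int acc[16] = {0};
    for (i = 0; i < n; i++)
        acc[i & 15] += in[i];

    /* after: sixteen named locals the compiler can keep in registers
       (n assumed to be a multiple of 16 for brevity) */
    int a0 = 0, a1 = 0, a2 = 0, a3 = 0, a4 = 0, a5 = 0, a6 = 0, a7 = 0;
    int a8 = 0, a9 = 0, a10 = 0, a11 = 0, a12 = 0, a13 = 0, a14 = 0, a15 = 0;
    for (i = 0; i < n; i += 16) {
        a0  += in[i];      a1  += in[i + 1];   a2  += in[i + 2];   a3  += in[i + 3];
        a4  += in[i + 4];  a5  += in[i + 5];   a6  += in[i + 6];   a7  += in[i + 7];
        a8  += in[i + 8];  a9  += in[i + 9];   a10 += in[i + 10];  a11 += in[i + 11];
        a12 += in[i + 12]; a13 += in[i + 13];  a14 += in[i + 14];  a15 += in[i + 15];
    }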

On a practical note... hobold is a better general altivec programmer than I am. He _has_ been a mainstay over on altivec.org for a long time and has contributed some very useful support algorithms. Ian Ollman checks in once in a while on Ars, as does Chris Cox.

And we are in the process of "growing" a few altivec'ers here... and the 970 supporting altivec is going to bring considerably more interest.

Sooo, generally speaking I'd say the Mac crowd on Ars is pretty well "armed" for SIMD wars. Where are the x86 equivalents?

OK, as a new 'altivec'er', I have a quick question about a few things I'm not 100% clear on:

In Ollman's tutorial he states that even using an optimized load there is something like 35 cycles between loads into a vector register. If this is the case then am I correct in saying that doing several vector ops that take 18 cycles to complete is just as fast and efficient as doing two vector ops that take only 6 cycles to do?

Also, am I looking at having to do some vec_perm on my data to isolate the Y, U and V data and deal with them differently? Ollman and a few other sources mention that lots of permutes should be avoided. Does it make sense to permute my interleaved UYVY UYVY ... vectors into planar UUUU YYYY VVVV vectors, process them, and then permute back to interleaved? Or am I missing a better solution?

Also, a lot of my computation involves moving the UInt8 channels into shorts and then back into unsigned char for storage. This seems like it goes hand in hand with my permute problem. I guess this would involve a vec_perm followed by a vec_merge for each channel. Is this really wasteful? Would I be better off arranging the data before it hits the vector unit?

quote:Originally posted by gcc:OK, as a new 'altivec'er', I have a quick question about a few things I'm not 100% clear on:

In Ollman's tutorial he states that even using an optimized load there is something like 35 cycles between loads into a vector register. If this is the case then am I correct in saying that doing several vector ops that take 18 cycles to complete is just as fast and efficient as doing two vector ops that take only 6 cycles to do?

I'm not following this. What is "between"? Where in the tutorial is this suggested? It must be something special about that code, not some general limitation.

If you make a load out to main memory (meaning the data isn't in cache) the _latency_ can be much more than 35 cycles.

If the data are in L1 or L2 the latency is much less.

_IN_PRINCIPLE_ (and in fact in practice) your code can do one load or store every cycle to/from L1

quote:Also, am I looking at having to do some vec_perm on my data to isolate the Y, U and V data and deal with them differently? Ollman and a few other sources mention that lots of permutes should be avoided. Does it make sense to permute my interleaved UYVY UYVY ... vectors into planar UUUU YYYY VVVV vectors, process them, and then permute back to interleaved? Or am I missing a better solution?

Depends on your problem! But often the answer is _YES_ you do want to do those permutes... if you need to for the algorithm. Remember that the CPUs in question here can dispatch & retire 2 vector instructions to different units, so the permutes can be done "free" as long as there are not more permute instructions than other instructions in the chain, and as long as you manage scheduling dependencies. See my rather endless discussion of various permute schemes associated with the RGB->CMYK algorithms in the old altivec thread.

quote:Also a lot of my computation involves moving the UInt8 channels into shorts and then back into unsigned char for storage.

Why? You don't want to do this unless really necessary... but see vec_merge, vec_pack, vec_unpack etc... browse your way through these. They use the permute unit but avoid the need on your part to create permute constants.
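For the unsigned-char-to-short round trip specifically, the usual idiom is something like the following (a sketch; vec_unpackh/vec_unpackl sign-extend, so for unsigned data you merge with a zero vector instead -- srcP is a placeholder source pointer):

    /* widen 16 unsigned chars to 2 x 8 unsigned shorts, then pack back down */
    vector unsigned char  v    = vec_ld(0, srcP);
    vector unsigned char  zero = vec_splat_u8(0);
    vector unsigned short top  = (vector unsigned short) vec_mergeh(zero, v);  /* bytes 0..7  */
    vector unsigned short bot  = (vector unsigned short) vec_mergel(zero, v);  /* bytes 8..15 */

    /* ... do the 16-bit arithmetic on top and bot here ... */

    vector unsigned char result = vec_packsu(top, bot);  /* saturating pack back to bytes */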

quote:This seems like it goes hand in hand with my permute problem. I guess this would involve a vec_perm followed by a vec_merge for each channel. Is this really wasteful? Would I be better off arranging the data before it hits the vector unit?

Only if you can do that by changing the fundamental data structures... don't use scalar code to do this re-arrangement! It will be far slower.

If however you _have_ the choice to arrange your data-storage to favor vector processing then by all means do so... in other words 16 byte vector of Y, vector of U, vector of V ... repeat. If your algorithms will ever work on _single_ color planes then you want to go to 32 bytes of each (cacheline) ... and if they will be more than transient fetched (i.e. some will hang around in the L2) then you want to go to 64, because the L2 is organized that way.

quote:Originally posted by BadAndy:I'm not following this. What is "between"? Where in the tutorial is this suggested? It must be something special about that code, not some general limitation.

My grasp of the terminology is not too good, so maybe I'm not following this fully, but here's where I got that from:

Starting on page 34 of the Altivec Tutorial 1.2, in the section 'Memory Speed Is Often the Problem', Ollman starts discussing this: 'Loading a cacheline from RAM to L1 takes about 35-40 cycles on my G4/400..' He then goes on to say that if all one does is '...add those two vectors together (as little as 1 cycle), then during the other 39 cycles your code will do nothing.' He then repeatedly brings up the 35-40 cycles as the time frame you have to work on the data before the next chunk comes in from RAM.

Since my image buffers will be coming from RAM as I go through the loop, I interpreted this to mean that each successive iteration of the loop would incur the 35-40 cycle stall waiting for data. I'm not sure if I am mis-reading his illustration here, but I think I get his point about doing as much work as possible on each vector that is loaded in order to get maximum benefit (i.e. more computationally intensive algos will see better speed increases than simple cases). My functions use fairly large buffers (720*480*2) in RAM that won't fit into L1 or L2, so it seems that this is a somewhat critical point, which I want to get straight.

Here's the scenario I'm looking at right now: My goal is to be able to process DV in realtime. After the decompression of the data into my buffer and the display to screen is done there's not much cpu time left (~50% on a 1ghz G4), so my processing code needs to be pretty efficient. If I'm wasting too much time waiting for the image data to be loaded from RAM then that efficiency is gone. Scalar code will barely do simple processing, but it seems like proper Altivec code will enable this to happen on current G4 machines, based on apps like Final Cut Pro. I don't want my app to suffer from the syndrome that Mr NSX describes above:

quote: vast majority of "professional" programmers don't spend a single iota optimizing their code. Their job is to get the functionality in and then at the end of the day they turn off debug, crank up the opt level and ship it.

This is not meant to replace BadAndy's comments; I merely try to put Ian's explanations into a bit of context.

quote:Originally posted by gcc:In Ollman's tutorial he states that even using an optimized load there is something like 35 cycles between loads into a vector register. If this is the case then am I correct in saying that doing several vector ops that take 18 cycles to complete is just as fast and efficient as doing two vector ops that take only 6 cycles?

Yes.

The specific scenario Ian is talking about is this: your data is not in any cache, but you do use stream prefetching. So the memory bus is going full speed, non-stop. Nevertheless, subsequent cache blocks (i.e. aligned 32-byte blocks) will trickle in at a rate of one block per 35 clock cycles.

One cache block is just two vectors' worth of data, so it makes no difference whether processing time per vector is 17 cycles or 5 cycles. Anything below about 18 cycles of work per vector is limited by the memory bus.

BTW, this figure of 35 clocks was true for something in the ballpark of a 450MHz G4 on a 100MHz bus ... any modern GHz machine will see relatively slower speeds, around 50 clock cycles per cache block.
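
For completeness, the stream prefetch in question is just a vec_dst() issued ahead of the work. Here's a sketch; the block size / count / stride values are purely illustrative, not tuned numbers:

#include <altivec.h>

/* Sketch: invert a buffer while stream-prefetching the chunk needed next.
   DST control word: block size (in 16-byte vectors) in bits 24-28, block
   count in bits 16-23, stride in bytes in bits 0-15. 8 blocks of 2 vectors
   with a 32-byte stride = 256 bytes per touch, purely as an example. */
#define DST_CTRL(size, count, stride) \
    ((((size) & 0x1F) << 24) | (((count) & 0xFF) << 16) | ((stride) & 0xFFFF))

static void invert_buffer(unsigned char *buf, int bytes)  /* bytes: multiple of 256 */
{
    const int chunk = 8 * 32;
    int i, j;
    for (i = 0; i < bytes; i += chunk) {
        if (i + chunk < bytes)                    /* prefetch the *next* chunk */
            vec_dst(buf + i + chunk, DST_CTRL(2, 8, 32), 0);
        for (j = 0; j < chunk; j += 16) {
            vector unsigned char v = vec_ld(i + j, buf);
            vec_st(vec_nor(v, v), i + j, buf);    /* the "work": complement    */
        }
    }
    vec_dss(0);                                   /* stop the stream when done */
}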

quote:Also, am I looking at having to do some vec_perm on my data to isolate the Y U and V data and deal with them differently. Ollman and a few other sources mention that lots of permutes should be avoided. Does it make sense to permute my interleaved UYVYUYVU... vectors into planar UUUU YYYY VVVV vectors, then process then permute back to interleaved? Or am I missing a better solution.

The part about avoiding vector permutes touches rather philosophical aspects of programming ... in a sense, permutes don't _process_ data, they merely _move_ it around.

So if you find your routine needs a lot of permutes, this is a sign that this specific algorithm does not fit the SIMD model of parallelism.

Long talk, little sense: Ian's statement is not an order to reduce the number of vector permutes at all costs (that would not generally be a good idea); rather it is a gem of ancient wisdom that cannot be fully understood by merely looking at the superficial meaning of the words. :-)

In your specific case, if you do have the freedom to lay out data as you wish, by all means arrange it as vector-friendly as you can. However, don't inflate data items unnecessarily; this wastes both cache space (larger data items mean fewer items in cache) and bandwidth (larger data items mean each load instruction brings in fewer items).

quote:Also a lot of my computation involves moving the UInt8 channels into shorts and then back into unsigned char for storage. This seems like it goes hand in hand with my permute problem. I guess this would involve a vec_perm followed by a vec_merge for each channel. Is this really wasteful? Would I be better off arranging the data before it hits the vector unit?

It's not a good idea to make a special preparation pass merely to change the layout of data. As mentioned above, the processor will just sit there stalling for over 30 clock cycles, then do a few cycles' worth of permuting, store the data back, and stall again (assuming source data is not in cache, but destination data is).

If you can, merge all re-arrangements and (un)packing into as few permutes as possible. Sequences of permutes involving no more than two source vectors can always be collapsed into a single permute.
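
One way to convince yourself of that: as long as everything ultimately draws on the same two source vectors, the control vectors compose through vec_perm themselves. A sketch of the identity:

#include <altivec.h>

/* Equivalent to vec_perm(vec_perm(a, b, c1), vec_perm(a, b, c2), c3), but only
   one permute touches the data; when c1, c2, c3 are constants the combined
   control vector can be precomputed entirely, collapsing the chain to one
   run-time permute. */
static vector unsigned char collapse(vector unsigned char a,  vector unsigned char b,
                                     vector unsigned char c1, vector unsigned char c2,
                                     vector unsigned char c3)
{
    vector unsigned char combined = vec_perm(c1, c2, c3);
    return vec_perm(a, b, combined);
}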

Unsigned unpack (for which there is no 'canned' instruction) can be made slightly more elegant with a little trick. Usually one would supply a vector of zeros as the source of the zero high bytes that get interleaved with the data bytes. But it's slightly more efficient to use a permute control vector that starts with a zero and doubles as the left source vector. One such permute control vector could look like this:
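
(A sketch, with 'src' standing for the vector of 16 unsigned bytes being unpacked; this particular constant unpacks the high eight of them, assuming big-endian byte numbering, and the brace-style vector literal is just one possible spelling.)

/* The control vector starts with a zero and doubles as the left source:
   index 0 selects byte 0 of the left source, which is this very vector,
   whose first byte is 0. So the zero high bytes come for free. */
vector unsigned char zeroPerm =
    { 0, 16, 0, 17, 0, 18, 0, 19, 0, 20, 0, 21, 0, 22, 0, 23 };

/* unpackedHigh holds src bytes 0..7, zero-extended to unsigned shorts */
vector unsigned short unpackedHigh =
    (vector unsigned short) vec_perm(zeroPerm, src, zeroPerm);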

quote:Originally posted by BadAndy:On a practical note... hobold is a better general altivec programmer than I am.

Sorry to burst the bubble, but I bet I have written fewer lines of AltiVec code than you did. :-)

I can admit, though, that I thought long and hard about every single line of AltiVec code I ever wrote. One of the privileges of doing unpaid programming is that one can dwell in excessive perfectionism ...

quote:He _has_ been a mainstay over on altivec.org for a long time and contributed some very useful support algorithms.

... excessive perfectionism like, for example, program generators that find snippets of 'perfect' code for tiny, but elementary and common programming idioms present in many AltiVec routines.

You could say it is my main hobby to try to fully understand the expressiveness of SIMD in general and AltiVec in particular. Generating those tiny fragments of 'optimal' code gave me a first impression of how AltiVec instructions could be (ab)used.

But there is no denying that I am a theoretical person. To me, thinking about and discussing efficient algorithms and efficient implementations is 70% of the fun, programming is 20%, and seeing the finished program run is only 10%.

quote:My goal is to be able to process DV in realtime. After the decompression of the data into my buffer and the display to screen is done there's not much cpu time left (~50% on a 1ghz G4), so my processing code needs to be pretty efficient.

Hehe, well if you were using a modern GPU, you could take a completely different approach to the problem and simply trigger a DMA of the DCT coefficients and have the GPU perform the IDCT and color space conversion. What would be even better would be to then perform your actual image manipulation as a shader that you've uploaded to the GPU. This way you'd barely tax the CPU while still having it perform in realtime.

I'm just poking fun with this example because I know we're talking about learning Altivec here, but I think as GPUs start having longer instruction lengths and more robust shading ISAs, it might not be so unreasonable to start firing off concurrent operations to the graphics chip in the future.

quote:Originally posted by gcc:My goal is to be able to process DV in realtime. After the decompression of the data into my buffer and the display to screen is done there's not much cpu time left (~50% on a 1ghz G4), so my processing code needs to be pretty efficient.

Who does the decompression? Your own code (or at least code that you have some control over)?

If so, consider decompressing only a part of a frame, say, a number of scanlines, and do your own processing on those scanlines while they are still residing in the cache.

You'll save one round trip to memory that way. You'll save many trips to memory if your processing consists of a customizable chain of effects applied in sequence.

(The basic idea behind tiling of images is a more general principle in optimization. I have seen it referred to as 'strip mining' or 'cache blocking'.)
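
In code, the band-at-a-time idea might look something like this; the band height, the function names and the DVStream type are all placeholders for whatever your decompressor and effect chain actually provide:

/* Sketch: decompress and process a 720x480 UYVY frame in bands of scanlines,
   so each band is still warm in the caches when the effect chain touches it. */
enum { WIDTH_BYTES = 720 * 2, HEIGHT = 480, BAND_ROWS = 16 };

void render_frame(DVStream *stream, unsigned char *frame)       /* hypothetical */
{
    int row;
    for (row = 0; row < HEIGHT; row += BAND_ROWS) {
        unsigned char *band = frame + row * WIDTH_BYTES;
        decompress_band(stream, band, row, BAND_ROWS);           /* hypothetical */
        apply_effect_chain(band, BAND_ROWS, WIDTH_BYTES);        /* hypothetical */
    }
}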