Does anybody know how to get "direct read" and "direct write" on the 970FX? It's clearly possible, and it ought to be in the IOKit manual, but I can't find a description of it.

A related problem (and a necessary issue for drivers, which of course are running with supervisor status enabled so they can access the MMU) is physical address access... which is often exactly what one is doing with direct-read or direct-write. It appears that there is no way to do a physical address operation WITHOUT toggling off address translation for all transactions... right? This means that drivers reading data from a physical address must do something like:

do {
    toggle off address translation;
    read one or more physical addresses;
    toggle address translation back on;
    write something to the user address space being serviced;
} while (whatever);

And those transactions may or may not need to be EIEIO wrt the physical address reads.
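On PowerPC the toggling itself comes down to clearing and restoring MSR[DR]. A minimal sketch, where mfmsr()/mtmsr()/isync()/eieio() are hypothetical C wrappers for the corresponding instructions, and MSR_DR is a hypothetical mask constant; this is privileged, interrupts-off code, not runnable from user space:

```c
/* Hypothetical sketch -- privileged PPC code.  Interrupts must be off,
 * and the code itself must remain addressable with translation disabled. */
uint32_t read_phys(uintptr_t phys)
{
    uint32_t msr = mfmsr();           /* save current MSR */
    mtmsr(msr & ~MSR_DR);             /* clear data-relocate: translation off */
    isync();                          /* context-synchronize the MSR change */
    uint32_t val = *(volatile uint32_t *)phys;
    eieio();                          /* order the uncached access if required */
    mtmsr(msr);                       /* translation back on */
    isync();
    return val;
}
```

In practice one would amortize the two MSR round trips by reading several physical addresses per toggle, exactly as the loop above does.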

Clearly if the purpose is streaming data from a device which is "set up" then a lot of the work can (hopefully) be offloaded to a DMA engine... but the device control transactions (at least) will look this way, yes?

Now the point of all of this is the question of how to write computational routines which might escape two unpleasant burdens: TLB misses, and cache/cacheline thrashing. There are LOTS of computational routines ( LUdecomp is a classic ... and virtually anything which 'pivots') which do a lot of read/write with local granularities much smaller than a cacheline, when working on large problems.
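One standard way to dodge both burdens is to block (tile) the traversal so that every cacheline and TLB page is fully consumed while it is resident, even when the per-element granularity is tiny. A minimal scalar sketch (the tile size is an illustrative guess, not a tuned value):

```c
#include <stddef.h>

/* Blocked (tiled) transpose: each BLK x BLK tile occupies only a handful
 * of cachelines and pages, so every line fetched is fully used before
 * eviction -- the usual remedy when the natural access granularity is
 * much smaller than a cacheline. */
#define BLK 32

void transpose_blocked(const float *src, float *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t jj = 0; jj < n; jj += BLK)
            for (size_t i = ii; i < ii + BLK && i < n; i++)
                for (size_t j = jj; j < jj + BLK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The same idea applies to pivoting codes like LU decomposition, which is why blocked (panel) variants dominate the optimized linear algebra libraries.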


Originally posted by BadAndy:Does anybody know how to get "direct read" and "direct write" on 970FX? It's clearly possible, and it ought to be in the IOkit manual, but I can't find a description of this.

A related problem (and a necessary issue for drivers, which of course are running with supervisory status enabled so they can access the MMU) is physical address accesses... which is often exactly what one is doing with direct-read or direct-write. It appears that there is no way to do a physical address operation WITHOUT toggling off address translation for all transactions... right? This means that drivers reading data from a physical address must do something like:

[...]

Dunno exactly the mechanism on OS X, but on Linux this is the way to do it. (I assume that you want raw accesses without any caching.)

1) Use an ioremap call that, given a physical address and a byte range, returns a virtual address pointer mapped to the given physical area.

2) Use an IO block mapping where you can map 1:1 physical/virtual memory. Generally this is frowned upon.
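For completeness, option 1 in a Linux driver looks roughly like the following. This is a sketch, not a complete driver; PHYS_BASE and REG_SIZE are placeholder names, and the 0x10 register offset is invented for illustration:

```c
#include <linux/io.h>

static void __iomem *regs;

static int map_device(void)
{
    /* ioremap returns an uncached kernel-virtual mapping of the physical
     * range -- no MSR fiddling needed for subsequent accesses. */
    regs = ioremap(PHYS_BASE, REG_SIZE);
    if (!regs)
        return -ENOMEM;

    u32 status = readl(regs + 0x10);  /* readl keeps MMIO reads ordered */
    (void)status;
    return 0;
}

/* ... and iounmap(regs) on teardown. */
```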

I have no idea how many pages down this thread is (I get daily emails about it), but here's a bump.

The other OSX GEM programmer has done some work on Altivec conversions to and from various color spaces so we can interface with the billion and one other plugin APIs. It's a pretty good start and could potentially save others from some QT bullshit. The code is GPL and here's a link to the current CVS:

I have to do a 'yuvs' to '2vuy' convertor since the new Apple DVCPRO-HD codecs use the reverse of the pixel packing used by all of the existing GEM processing code, and the internal QT routine is scalar. It was basically a coin toss back in mid-2002 and 2vuy looked pretty decent then! Oh well.
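The scalar core of that converter is tiny: 'yuvs' packs two pixels as Y0 Cb Y1 Cr while '2vuy' packs them as Cb Y0 Cr Y1, so the conversion is just a byte swap within each 16-bit pair. A sketch (the function name is mine); this is exactly the kind of shuffle one vec_perm per 16 bytes would handle in Altivec:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar 'yuvs' -> '2vuy': swap the bytes of each 16-bit Y/C pair.
 * npairs is the buffer length in bytes divided by two. */
void yuvs_to_2vuy(const uint8_t *src, uint8_t *dst, size_t npairs)
{
    for (size_t i = 0; i < npairs; i++) {
        dst[2 * i]     = src[2 * i + 1];
        dst[2 * i + 1] = src[2 * i];
    }
}
```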

^^^ Those routines are pretty straightforward. They could get close to another factor of two or so faster on G4+ with another 2x unrolling (would relieve latencies) and vec_dst().

Gains on 970s would need to be seen by test, but by eye I think the loop sizes are large enough that OOOE can't keep the loops executing over more than one loop generation, and so 970s might benefit as much from the 2x unrolling .. and maybe DCBTL.

Originally posted by BadAndy:^^^ Those routines are pretty straightforward. They could get close to another factor of two or so faster on G4+ with another 2x unrolling (would relieve latencies) and vec_dst().

I agree, but keep in mind that's the first Altivec code my co-developer has written and I did nothing but give a few tips to help. At some point I will probably make the requisite second pass and do the unrolling and cache hints.

quote:

Gains on 970s would need to be seen by test, but by eye I think the loop sizes are large enough that OOOE can't keep the loops executing over more than one loop generation, and so 970s might benefit as much from the 2x unrolling .. and maybe DCBTL.

I have been thinking about targeting Altivec code for the 970 primarily since the G4s will run that code as well or better. One question I have is about DCBTL and the like. I have always used vec_dst() so I don't really know the proper use of the 'old' PPC cache instructions. Any tips or links to docs about using them to best result on the 970?

If you mean that you would fully schedule (if possible) for 970 instruction latencies... I agree that is a good goal, and it will keep G4+ very happy. Both processors have the same L2 latency (11 cycles), and generally that is the latency to work on in terms of serious optimization, via SERIOUS load hoisting, often requiring loop inversion to get it.
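The hoisting itself looks trivial in scalar form; the point is that the load for iteration i+1 is already in flight while iteration i computes, so an L2 hit overlaps with useful work instead of stalling it. A minimal scalar illustration (real code hoists across far more iterations to span the full 11-cycle latency):

```c
#include <stddef.h>

/* Sum of squares with the next iteration's load hoisted ahead of the
 * current iteration's arithmetic.  Requires n >= 1. */
float sum_squares_hoisted(const float *a, size_t n)
{
    float next = a[0], cur, sum = 0.0f;
    for (size_t i = 0; i + 1 < n; i++) {
        cur  = next;
        next = a[i + 1];   /* hoisted load for iteration i+1 */
        sum += cur * cur;  /* work on the already-loaded value */
    }
    sum += next * next;    /* epilogue: last element */
    return sum;
}
```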

Spanning L2 latencies is really critical to real-world performance on Macs, particularly so on single-CPU machines... because the machines are constantly recovering from L1 trashes (and sometimes L2 trashes) from task swapouts.

In principle you can overcome a bunch of this using real-time scheduling, but a good algorithm should run well at user priority.

DCBTL is really easy to use and rather similar to vec_dst really... the only problem is that it can only do sequences of contiguous cachelines. On the other hand, compared to G4+ ... you have load/store instruction issue slots to burn, so even hinting each cacheline loaded may not cause any issue-slot loss, depending on how densely optimized the code is otherwise.

These codes do more than one vector operation per load+store ... and so the codes would be instruction latency limited on 970 if there were no memory latencies (hah!) But the reality of these codes is that they will be memory latency limited when everything else is tweaked up ... and cache hints will help.

It is pretty easy to use #if ... #else ... #endif blocks to allow defined substitution of vec_dst() vs DCBTL.
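One shape for that substitution is a PREFETCH macro: vec_dst() when Altivec streaming is selected, a dcbt-style builtin otherwise, and a no-op on other compilers. The vec_dst control word and the hint distance below are illustrative examples, not tuned values:

```c
#include <stddef.h>

#if defined(__ALTIVEC__) && defined(USE_VEC_DST)
  #include <altivec.h>
  /* example control word: block size/count/stride packed per the PEM */
  #define PREFETCH(p) vec_dst((p), 0x10010100, 0)
#elif defined(__GNUC__)
  #define PREFETCH(p) __builtin_prefetch((p), 0)  /* emits dcbt on PPC */
#else
  #define PREFETCH(p) ((void)0)
#endif

/* Example consumer: hint roughly one 128-byte line (16 doubles) ahead. */
double sum_prefetched(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if ((i & 15) == 0)
            PREFETCH(a + i + 16);
        s += a[i];
    }
    return s;
}
```

Since the hints are semantically no-ops, the same source builds and runs identically either way; only the memory latency changes.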

The other OSX GEM programmer has done some work on Altivec conversions to and from various color spaces so we can interface with the billion and one other plugin APIs. It's a pretty good start and could potentially save others from some QT bullshit.

I'm curious... What are you *still* doing that involves slow scalar color-space conversion(s) in the QTComponents API? Hell I lost ALL of the slow scalar YUV->YUV color-space conversion(s) w/ QT 7.0 under Tiger...

quote:

I have to do a 'yuvs' to '2vuy' convertor since the new Apple DVCPRO-HD codecs use the reverse pixel packing as all of the existing GEM processing code and the internal QT routine is scalar.

Ah I see...

quote:

It was basically a coin toss back in mid-2002 and 2vuy looked pretty decent then! Oh well.

Well going by Apple's own (yeah right) documentation on the mailing lists 'yuvs' was always the preferred format... Hell just sample any Apple (GL) based application under Panther et al.

Originally posted by feelgood:I'm curious... What are you *still* doing that involves slow scalar color-space conversion(s) in the QTComponents API? Hell I lost ALL of the slow scalar YUV->YUV color-space conversion(s) w/ QT 7.0 under Tiger...

There is an attempt to get multiple pixel processing libs working together and they don't all use the same color-space. In GEM, we use ARGB and '2vuy', and we have to insulate ourselves against the slow-ass code the other coders (<cough>x86 Linux</cough>) write. The first step is to just get the translation from RGB or y420 or whatever working quickly.

The second part is the avoidance of the internal QT calls for conversion as that will happen from time to time.

quote:

Well going by Apple's own (yeah right) documentation on the mailing lists 'yuvs' was always the preferred format... Hell just sample any Apple (GL) based application under Panther et al.

Actually, Kevin Marks advised me to use 2vuy based on the stated design goals of having DV and SD Photo-Jpeg run on 1 Ghz G4s. The fast path for those was 2vuy at that point and DVC100 was not really around then. This was over 3 years ago and we just had pre-release 10.2 then.

A bit more digging found this from Apple, but I also see a few SIMDtech mailing list mentions that dcbtl is really dcbt with a third argument, TH. I have not found what TH means because the IBM and Moto PPC PIMs only have the two-arg dcbt. And then there are the dcbt(l)128 instructions, which I guess refer to the 128-byte cachelines of the 970.

A sidenote: c't magazine reports that IBM is to present a Cell based Workstation at the LinuxTag 2005 conference. The description of the talk only mentions the debut of hardware in passing, but c't made it the item's headline. I might be there ...

Originally posted by hobold: The description of the talk only mentions the debut of hardware in passing, but c't made it the item's headline. I might be there ...

Be sure to get a pricelist for me. I'm trying to work my way through the various layers of bureaucracy in order to find out the terms for getting an XBox 360 dev kit. I have a new job doing art programming and the price of a console is very appealing compared to the desktop PCs and Mac with similar graphics power.

As to how well this fits into the thread, I'm beginning to think that a number of people will be very interested in alternative platforms to run vector code on. At this point it looks like Cell and MS's chip are the only VMX alternatives with availability in the future.

Besides, I'm becoming more and more interested in just how an OS like Linux will perform on Cell. Maybe not thread related (outside of VMX), but I suspect that some of the performance allegations coming from Apple's direction, with respect to Cell, are a bit overblown.

Thanks,
Dave

quote:

Originally posted by hobold:A sidenote: c't magazine reports that IBM is to present a Cell based Workstation at the LinuxTag 2005 conference. The description of the talk only mentions the debut of hardware in passing, but c't made it the item's headline. I might be there ...