The html pages I made available were only a starter slideshow for my presentation.
Here are a few words about what I showed, and what I said.
Hi Gentlemen,

The html pages available at my site are just an introduction slideshow that I presented as a starter; they do not cover what I showed and said afterwards. Please find below a couple of notes about my presentation at the Alchimie show.

Here is a quick summary of the presentation I gave :

- I booted with boot delays set to 1 second for UBoot and for SLB (second level booter), so less than 20 seconds after power on, we had a fully loaded Workbench with Amidock.

- I started with the small html slideshow, presented on IBrowse. You can find those pages at the URL above. IBrowse loaded in 2 seconds with its About: page fully displayed. Browsing through the pages of the slideshow was instantaneous.

- While we were at it, I browsed the OS4 install guide, also as fast as can be. I must say I find this responsiveness impressive myself :)

Then I demoed as many things as I had time for during the 2 hours I had. Everything worked, fast & stable, and was smooth and impressive. I showed mainly:

- USB. I plugged a Microsoft IntelliMouse Optical into my USB hub, and we had 2 mice to control the system

- Crisot's slach5 winning demo (it got the applause it deserved)

- chip's rayrace realtime raytracer demo. When the juggler appeared, the audience was impressed, but they only really took the measure of what they were watching when I moved the mouse. Wait for the Altivec version !

- FPSE, AmiDog's PS1 emulator, with an oldskool game which ran nicely ('Cotton')

- some other visual toys I had at hand

The demos only grimreaped twice, and I anticipated the grims before they popped up :

- One was native glsokoban / w3d: when launched, it does a base page access (a "null pointer" bug in glsokoban). I didactically showed the disassembly which is available in the grimreaper window: it was a store to r4, r4 was null, etc. I clicked on continue, and it all went fine & fast.

- One was frying pan 0.3.1. I showed the app, and at one point I said 'now it should grim' and it did. It still loaded fine though. I quit the app, clicked on reboot, and less than 4 secs later, wb was up with amidock. That was the only reboot of the show.

I forgot to show (for lack of time):

- Petunia... Almos, sorry, I had prepared something for that (side by side windowed jit & nonjit runs of voxelspace), but I both forgot, and was asked to stop at that point by the party organizers because it was already 5:30 pm while I was scheduled until 4 pm.

- ArtEffect

- USB with MassStorage (ie USB key or digital camera)

At the end, I had many fair questions, which I answered; my feeling is that the audience really appreciated the effort behind what I showed, and was aware that we are not far away from a releasable 4.0.

Then came the expected question, 'and why doesn’t DMA work ?'

I said 'Everything you saw was DMA, since the 1st boot'.

I copied a few 100 MB files in a snap, with zero CPU load (thanks to Pete Gordon for the clock/CPU docky, it helps a lot). Then I switched to PIO, and they saw the same copy run 4 x slower at 80% CPU.

The audience understood that it was indeed DMA, that it was fast, and that it was part of the overall smoothness of what I showed.

Then I explained the things below (this is the reference for my statements, please don’t rephrase or extrapolate or invent or whatever):

- the Ethernet chip only triggers the problem, but it is not at all related to it (a test using a PCI Ethernet shows the same behaviour)

- We have made a driver for a Silicon Image 680 PCI IDE UDMA133 controller chip, this does UDMA 133 nicely, including when Ethernet is used at full speed.

- The fact that a PCI IDE controller solution works, shows that the problem is *not* related to Articia, since PCI DMA is *also* handled by the Articia, and that works.

- The full Alchimie show demo was done using UDMA, both from the VIA and from the Si680, without problem (but with Ethernet off; had the Ethernet been brought online, I would have had to switch the VIA back to PIO first).

- Things are currently under more investigation

In the meantime there are 2 options for existing A1 board owners:

- Use the VIA IDE controller in PIO mode when using Ethernet, and UDMA at other times,

In reply to Comment 209 (Nicolas DET):

Uhoh. ns called him for help... Who is next ?

>So if I understood well:

Obviously not. Or is it me ? Your code must be better than your English... Anyway I'm in a good mood this morning so let's go for a lecture, you seem to need a serious one (yes I'm pedantic, but I know why).

>Linux places BD in cacheable area, and then you need to flush then to
>be sure they are really in memory and not in the cache.

I don't know what 'BD' is. And I never wrote a Linux driver, so I don't need anything.

Obviously anyway, in the case of a write to drive, data must be in ram at the moment when the DMA controller fetches it and sends it to the drive. This is common sense. Did you note the word 'write' just here ? Ok let's go on, I'll explain for 'read' further on.

This 'data must be in ram' condition (otherwise known as 'cache is flushed') should be guaranteed by the hardware, no doubt. This is the cache snoop feature of both the northbridge and the CPU : when a busmaster (let's say a DMA controller, but it could be anything else, like another CPU) fetches data from memory, if this data is in cache and has not been flushed (ie is 'dirty'), the busmaster memory access cycle is held, the CPU flushes the dirty cacheline (32 bytes for the PPC), and then the busmaster cycle is restarted.

This way, no code-driven cache flush is needed, it is provided in real time when needed. Note, 'when needed' means that only dirty cache lines are flushed, not the cachelines which contain data in sync with memory.

It works like that on almost all modern environments, except on some embedded platforms which don't provide the snooping feature (be it a shortcoming of the CPU or of the northbridge, by the way), as a renowned BSD coder pointed out some time ago. On those lower end environments, drivers have to explicitly flush the cache before starting a third party bus actor like a DMA controller. The driver will flush the cache for the whole buffer area, but note again, even with what we will call here a 'manual' flush, only dirty cache lines are physically flushed, not the cachelines that contain data which is in sync with memory, as flushing cachelines which are not dirty does nothing and makes no sense.

A quick digression here, to kill a myth : we thus have 2 situations :
- one where the hardware takes care of cache snooping by holding the DMA transfer each time a cache flush is needed,
- and another where the software flushes the cache and then starts the DMA transfer which will be going uninterrupted.

For a given buffer to write to disk, with a given set of dirty cachelines in it (say the buffer is 512 bytes, and say three of its 32-byte cachelines have to be flushed), according to what I explained, both methods will end up doing exactly the same cache flushes on the bus : 3 cacheline flushes happening in the middle of the DMA transfer with hardware coherency, and 3 cacheline flushes happening before the DMA transfer starts with manual cache flushing, or software coherency. Guess which is faster ? Tell us a good tale here please.
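To make the count concrete, here is a minimal sketch in Python of the example above. It is a toy model, not driver code: the function names, the buffer address 0x1000 and the positions of the three dirty lines are all made up for illustration, and it only counts flushes, it says nothing about timing.

```python
CACHELINE = 32  # bytes per PPC cacheline, as stated in the text


def lines_of(buffer_start, buffer_len):
    """Return the cacheline-aligned addresses that cover a buffer."""
    first = buffer_start - buffer_start % CACHELINE
    return list(range(first, buffer_start + buffer_len, CACHELINE))


def hardware_coherent_write(buffer_start, buffer_len, dirty):
    """Snooping: a dirty line is flushed at the moment the busmaster reaches it."""
    dirty = set(dirty)
    flushes = 0
    for line in lines_of(buffer_start, buffer_len):
        if line in dirty:        # snoop hit: cycle held, line flushed, cycle restarted
            dirty.discard(line)
            flushes += 1
        # the DMA controller now fetches this line from RAM
    return flushes


def software_coherent_write(buffer_start, buffer_len, dirty):
    """Manual flush: the driver walks the buffer first, then DMA runs uninterrupted."""
    dirty = set(dirty)
    flushes = 0
    for line in lines_of(buffer_start, buffer_len):
        if line in dirty:        # clean lines are left alone
            dirty.discard(line)
            flushes += 1
    # the DMA transfer would start here, with nothing left to snoop
    return flushes


# Hypothetical 512-byte buffer at 0x1000 with three dirty cachelines in it.
dirty_lines = {0x1000 + 32, 0x1000 + 160, 0x1000 + 448}
hw = hardware_coherent_write(0x1000, 512, dirty_lines)
sw = software_coherent_write(0x1000, 512, dirty_lines)
print(hw, sw)  # 3 3
```

Both paths flush the same three lines out of sixteen; the only difference in this model is when the flushes happen relative to the transfer.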

Ok, now back on track. Hardware coherency should work with the Articia also, it provides signals for that. Unfortunately, as has long been known, there is a problem in its implementation on current A1 machines, so it does *not work*.

Thus, for a DMA driver to work on current A1's, it *has* to manually flush the buffer caches before asking the DMA controller to write to device.

This is a problem for Linux, since Linux drivers assume hardware coherency. There is a very quick and dirty workaround that needs no change in Linux drivers: mark all memory noncacheable. This is obviously evil and stupid; it will work but it will crawl forever. The only advantage is that you can run untouched Linux drivers. The only other way on current machines is to modify the Linux driver code, adding a manual cache flush before starting the DMA write to device. Those are facts.

This is not a problem for OS4 drivers, as we have written those from scratch anyway, following the AmigaOS device driver writing guidelines... The Amiga has never provided hardware coherency, and all the DMA drivers so far have been doing manual cache flushes. Eat it (I know, it's hard, especially when you don't understand anything). Ever heard of CachePreDMA() and CachePostDMA() ? Ever thought about the fact that your beloved Ariadne, FastLane, etc & your blue dog's DMA-capable Classic Amiga boards have been doing that since day 1 ?

OK. So we implemented our OS4 drivers relying on the OS4 equivalents of CachePreDMA() and CachePostDMA(), which are StartDMA() and EndDMA(). Don't start jumping up and down, those functions are mandatory anyway, otherwise OS4 could NOT run on classic hardware, where there is no hardware coherency. Oh, this reminds me. Your beloved blue OS also does that when running on a Classic machine.

By the way, those new OS4 functions also provide for things which are needed in a virtualized address space (like building scatter / gather lists), this is why we needed new calls in OS4.

Now, a word on the read transfers. For reads, no cache flush is needed at all, neither with hardware snooping nor with software cache handling. Why flush cache to ram when you will be overwriting ram with data read from the device ?

What is needed for read is cache invalidation. That is, telling the CPU that it has to go to real ram to fetch data, because its cache contains invalid data. And as Bernie Meyer pointed out earlier, this cache invalidation has to be done *before* the read and not after, contrary to intuition. Why ? If for whatever reason the CPU wants to cache other areas, completely unrelated to the DMA read being done, it might decide to flush cached areas of the read buffer in order to get free cache lines. If this flush, which happens 'behind your back', happens after the DMA controller wrote data to ram, you end up with trash. Simple way to avoid that : make sure that no area of your read buffer is cached, ie invalidate the cachelines that might cover your buffer. This was one bug of mine a long time ago. You see, I'm not even reluctant to admit my own bugs.

So for reads, no flush, an invalidation (of course bound to your buffer address range, you're not going to invalidate the whole cache), and before the read.
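The ordering problem can be shown with a toy write-back cache in Python. This is only a sketch under simplifying assumptions: one cacheline, string values standing in for data, and a `Machine` class with method names invented for the illustration. It is not how any real driver or CPU is programmed.

```python
class Machine:
    """Toy model: RAM plus a write-back cache, both keyed by cacheline address."""

    def __init__(self):
        self.ram = {}
        self.cache = {}  # line address -> (value, dirty flag)

    def cpu_write(self, line, value):
        self.cache[line] = (value, True)

    def cpu_read(self, line):
        if line in self.cache:
            return self.cache[line][0]
        return self.ram[line]

    def invalidate(self, line):
        """Drop the line from the cache without writing it back."""
        self.cache.pop(line, None)

    def evict(self, line):
        """Cache replacement: a dirty line is written back to RAM."""
        if line in self.cache:
            value, dirty = self.cache.pop(line)
            if dirty:
                self.ram[line] = value

    def dma_read_from_device(self, line, value):
        """The DMA controller stores device data straight into RAM."""
        self.ram[line] = value


def run(invalidate_first):
    m = Machine()
    buf = 0x2000                      # hypothetical read buffer address
    m.cpu_write(buf, "stale")         # old buffer contents, dirty in cache
    if invalidate_first:
        m.invalidate(buf)             # before the DMA read
    m.dma_read_from_device(buf, "disk data")
    m.evict(buf)                      # unrelated cache pressure evicts the line
    if not invalidate_first:
        m.invalidate(buf)             # after the DMA read: too late
    return m.cpu_read(buf)


print(run(invalidate_first=True))    # disk data
print(run(invalidate_first=False))   # stale
```

In the second run the dirty line is written back on top of what the DMA controller had just stored, which is the 'trash' case described above.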

>But if you do need to do that in a cache coherent environement -> the
>hardware is buggy -> it *has* to do it by itself as *documented* and
>has *every one else* does.

I said above, in truth and faith, that this indeed does not work on current A1's. This has long been known, and it is indeed a hardware bug of the current A1's named 'lack of hardware coherency', nothing new here. No need to shout either as if you were the victim of that thing, as far as I know you don't own a current A1 and never will. So relax, have a beer, and we'll continue here.

Did you note 'current' ? The µA1 'C' alias MK3 (and further machines) has proper snoop signaling implemented. Additionally, this machine also solves the VIA / Ethernet interaction (and all other A1 oddities, like the wrong implementation of the AC97 link between the VIA and the Sigmatel audio codec).

What will you say when you'll see untouched linux drivers work on that machine ? And VIA IDE DMA work on it too ? This with the very same Articia chip we have on current machines ? We might have some fun then.

>Is it this the ArticiaS 'feature' ? you have to flush the cache
>manually.

Read above. The articia is not at fault. It is its integration with the CPU on current machines which is, in the 'snooping' department. It is the same kind of integration problem as for the VIA IDE clashing with the Ethernet.

Proof to come (yet to be seen by your own eyes, I admit) :
- for the articia being capable of hardware coherency : µA1 C and above does it with the current articia chip.
- for the articia not being at fault in the VIA IDE / Ethernet DMA trouble : µA1 C and above does it with the current articia chip, and additionally, si680 PCI IDE DMA works fine on current A1's.

How's that magic possible ?

>And by the way, if you are in a cache coherent way, flushing the cache
>manually, after the hardware had (maybe) done in hardware is quiet
>useless/time consuming.

No cache flushing has to be done *after* a DMA transfer, makes no sense at all.

Additionally, even if you did a cache flush before a DMA transfer with working hardware coherency, it would be useless (you're right) but not time consuming (you're wrong) because there would be no cache snoop flushes happening during the DMA transfer, since you flushed the dirty cachelines before. So at the end, zero perf impact. Get it ? If not, reread from start. Counter reached 100 ? Time to stop.
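A small Python sketch of that bookkeeping, for a write to device on a machine where snooping works. The function name, the 512-byte buffer and the dirty line offsets are assumptions picked for the illustration. The model counts only flushes that reach the bus and ignores the CPU time spent walking clean lines.

```python
def count_flushes(buffer_lines, dirty_lines, preflush):
    """Return (manual_flushes, snoop_flushes) for one DMA write to device."""
    dirty = set(dirty_lines)

    manual = 0
    if preflush:                       # driver flushes before starting the DMA
        for line in buffer_lines:
            if line in dirty:
                dirty.discard(line)
                manual += 1

    snooped = 0
    for line in buffer_lines:          # DMA transfer with hardware snooping
        if line in dirty:              # only lines still dirty cause a snoop flush
            dirty.discard(line)
            snooped += 1

    return manual, snooped


buffer_lines = range(0, 512, 32)       # sixteen 32-byte cachelines
dirty_lines = {32, 160, 448}           # hypothetical dirty lines

print(count_flushes(buffer_lines, dirty_lines, preflush=False))  # (0, 3)
print(count_flushes(buffer_lines, dirty_lines, preflush=True))   # (3, 0)
```

Either way three lines get flushed in total; the manual pass just moves them ahead of the transfer and leaves nothing for the snoop logic to do.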

>If I believe this patch MAI use flush/invalidate_dcache_range. This
>call will always flush the cache, whereas if they would have real
>hardware they would use DMA_wback (or something like that) which is
>the same as previously if you compile your kernel in a cache coherent
>way or void if not.

Sorry, I cannot comment on that statement, as I don't understand it.

>Do you really believe that your are speaking ?

I'm not speaking, I'm writing now. So I don't believe I'm speaking. But as to what I'm writing, I do believe in it.

> Because from here it
>looks like a load of bullshit, as usually...

Sure, as usual, what I write (or say, or show) looks like loads of bullshit to you, because you & your mates simply cannot understand things that are beyond your reasoning capabilities, and can only keep repeating the same 'articia does not work' kind of nonsense ad nauseam, even after seeing that pci ide DMA controller working with the same articia.

If I sounded a bit irritating and insulting to this small group of obscurantists who will recognize themselves... Well, then I succeeded.