It's interesting to see that even in cases where we don't hit the cache, AltiVec is still almost 2x faster.

This is probably due to the store miss merging feature of the hardware. I guess that glibc uses the old trick with floating point registers to store 8 bytes at a time, but that still requires four store instructions to fill a single cache line. Apparently those cannot be collapsed into a single write burst transaction, but two vector stores can.

Scalar code would have to use explicit cache hints (dcbz) to prevent the read transaction caused by cache line allocation.

Would it be safe in generic code like glibc to use Data Streams?

I forget what the behaviour is but if you set up a new stream doesn't it kill the first one set up if there are already the maximum in use? Or does it simply NOP? I'm not sure if that behaviour would be friendly in a system library or not.

One of the tests I want to make is running two AltiVec benchmarks in parallel, to see what effect heavy AltiVec use in two processes has on each one. I don't expect 50% performance of course, but I'm not expecting tremendous drops either. Still, the numbers will speak the truth.

Probably not, because the dst instruction (and its companions) are part of AltiVec.

Quote:

I forget what the behaviour is but if you set up a new stream doesn't it kill the first one set up if there are already the maximum in use? Or does it simply NOP? I'm not sure if that behaviour would be friendly in a system library or not.

You always pass a stream identifier with each prefetch instruction. There are four streams that you must manage yourself. The convention is that the OS uses stream numbers from 3 downward, while user code uses stream number 0 upward. This minimizes collisions.

There are also scalar prefetch instructions that are present on most PPC models (there is a subset that is architected to be present on any PPC model). But using those has its own set of pitfalls, because they depend on cache line size. In the most extreme case programs can break when cache lines are of unexpected size. This is particularly true for dcbz which is architected to have the visible side effect of zeroing out a cache line.
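To make the dcbz pitfall concrete, here is a minimal sketch of a zero-fill routine that takes the cache line size as a parameter instead of hard-coding it. The names `zero_line` and `zero_buffer` are made up for illustration, the caller is assumed to know the real line size of the CPU it runs on, and on non-PowerPC targets the sketch falls back to memset so it still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Zero one naturally aligned cache line at p. On PowerPC this issues dcbz;
   elsewhere it falls back to memset so the sketch still builds and runs. */
static void zero_line(unsigned char *p, size_t line_size)
{
#if defined(__powerpc__) || defined(__ppc__) || defined(__PPC__)
    (void)line_size;  /* dcbz clears whatever the hardware line size is */
    __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory");
#else
    memset(p, 0, line_size);
#endif
}

/* Hypothetical helper: zero len bytes at buf. line_size must be the real
   cache line size of the running CPU (a power of two). If it is smaller
   than the real size, dcbz will clear bytes outside the buffer. */
static void zero_buffer(unsigned char *buf, size_t len, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    unsigned char *p = buf;
    unsigned char *end = buf + len;

    /* Head: plain byte stores up to the first line boundary. */
    while (p < end && ((uintptr_t)p & (line_size - 1)) != 0)
        *p++ = 0;

    /* Body: whole lines only, so a line zero never leaves the buffer. */
    while ((size_t)(end - p) >= line_size) {
        zero_line(p, line_size);
        p += line_size;
    }

    /* Tail: remaining bytes that do not fill a whole line. */
    while (p < end)
        *p++ = 0;
}
```

A routine written for 32-byte lines that runs on a CPU with 128-byte lines would pass the wrong `line_size` here, which is exactly the kind of breakage described above.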

Quote:

Probably not, because the dst instruction (and its companions) are part of AltiVec.

Well the whole point of these benchmarks is to roll the AV code in. For a large buffer where the data isn't going to fit in the cache, data streams are going to help here specifically.

Quote:

convention is that the OS uses stream numbers from 3 downward, while user code uses stream number 0 upward. This minimizes collisions.

So two each in the worst case. And if you reuse an already existing stream, it's re-appropriated? That's what I would define as unfriendly, unless there's some context to it.

Quote:

There are also scalar prefetch instructions that are present on most PPC models (there is a subset that is architected to be present on any PPC model).

I always wondered what the big difference between data streams and the cache handling functionality was. Is it just a concession to define some cache prefetch stuff beyond the PowerPC subset defined as the minimum, or is there an advantage over using them?

Apple and IBM are discouraging the use of data streams on the G5 in favour of the "old" cache handling. Probably because IBM botched it, in my opinion.

Quote:

And if you reuse an already existing stream, it's re-appropriated? That's what I would define as unfriendly, unless there's some context to it.

Yep, new orders for one of the prefetch engines override the previous stream. And there is no architected context. BUT.

The rule for using stream prefetch is to prefetch small overlapping blocks. This has several effects:

- the stream is kept in synch with the computation
- the stream is restarted quickly after an interruption
- the stream does not pollute too much cache ahead of time
=> effectively, the running program _is_ the stream context
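A rough sketch of the "small overlapping blocks" rule, assuming a simple byte-summing loop as the computation. The function names and the tuning values (128-byte chunks, 256-byte prefetch window) are invented for this example. With AltiVec enabled it uses `vec_dst` on stream 0; otherwise it falls back to `__builtin_prefetch` on GCC-compatible compilers, or does nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

static const size_t CHUNK_BYTES = 128;     /* work done per iteration (assumed) */
static const size_t PREFETCH_BYTES = 256;  /* window requested each time (assumed) */

/* Request n bytes (1..256) starting at p. */
static void prefetch_window(const unsigned char *p, size_t n)
{
    assert(n >= 1 && n <= 256);
#if defined(__ALTIVEC__)
    int vectors = (int)((n + 15) / 16);   /* block size in 16-byte vectors */
    /* one block of 'vectors' vectors; the stride is unused with one block */
    vec_dst(p, (vectors << 24) | (1 << 16) | 16, 0);
#elif defined(__GNUC__)
    for (size_t off = 0; off < n; off += 32)
        __builtin_prefetch(p + off);
#else
    (void)p;
#endif
}

static uint32_t sum_bytes(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t pos = 0; pos < len; pos += CHUNK_BYTES) {
        size_t left = len - pos;
        /* Re-issue the request every chunk. Each window overlaps the next
           one, so an interrupted stream is restarted within one chunk. */
        prefetch_window(buf + pos, left < PREFETCH_BYTES ? left : PREFETCH_BYTES);

        size_t n = left < CHUNK_BYTES ? left : CHUNK_BYTES;
        for (size_t i = 0; i < n; i++)
            sum += buf[pos + i];
    }
#if defined(__ALTIVEC__)
    vec_dss(0);  /* stop stream 0 once the loop is done */
#endif
    return sum;
}
```

The loop never relies on a prefetch request surviving: if another piece of code takes over stream 0, the next iteration simply issues a fresh request.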

I should add that a data stream can be stopped at any time for no particular reason. For example all streams will stop at page borders, because it would require a trip to the MMU to determine the subsequent physical address.
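Since a stream stops at a page border anyway, one option is to size each request so it ends there. This is only an arithmetic sketch; the helper names are hypothetical and the page size is assumed to be a power of two passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes from addr up to (not including) the next page boundary. */
static size_t bytes_left_in_page(uintptr_t addr, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return page_size - (size_t)(addr & (page_size - 1));
}

/* Trim a prefetch request of 'want' bytes so it does not cross a page. */
static size_t clamp_to_page(uintptr_t addr, size_t want, size_t page_size)
{
    size_t left = bytes_left_in_page(addr, page_size);
    return want < left ? want : left;
}
```

For example, with 4 KB pages a 256-byte request starting 16 bytes before a page boundary would be trimmed to 16 bytes, and the next request would start on the new page.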

The stream prefetch engines are a fairly low level feature, so the software will have to make up for some limitations of the hardware. Consider the stream prefetch to be a combination of hardware and software. The prefetch instructions must be used in a certain way to make the best use of this feature. We are not on a CISC processor here.

Quote:

I always wondered what the big difference between data streams and the cache handling functionality was. Is it just a concession to define some cache prefetch stuff beyond the PowerPC subset defined as the minimum, or is there an advantage over using them?

The main difference between traditional prefetch instructions and AltiVec data streams is that the streams are even more asynchronous than the 'data cache block ???' instructions. There is no guarantee how quickly a stream prefetch instruction can fulfill its task. You do have the advantage of saving instruction bandwidth as compared to issuing a cache block touch for every cache line, which is particularly important on CPU models with a single LSU. But the stream prefetch hardware cannot do address translation, so it is limited to a physical page when it increments or decrements its internal pointer. The scalar prefetch instructions always run their address through the MMU.

Quote:

Apple and IBM are discouraging the use of data streams on the G5 in favour of the "old" cache handling. Probably because IBM botched it, in my opinion.

Well, there are still cases when data streams are beneficial on a G5, but they are implemented in a fairly limited way. At the same time, PPC970 has automatic stream detection for sequential accesses. Then there are its huge cache lines of 128 bytes, which pretty much fit the description of "small block of data". So there is less of a need for software directed data streams on G5, because each scalar prefetch instruction can pull in a lot more data.

I guess one could conclude that data streams were too low level, because they are not equally useful over a wide range of processor models. OTOH, I like the ability to specify block sizes and strides; pulling in a single cache line is just not as powerful.
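To illustrate the block size and stride parameters, here is a sketch of how the control word for a dst request can be packed. The field layout (block size in 16-byte vectors in the top byte, block count in the next byte, signed byte stride in the low 16 bits) is as I recall it from the AltiVec programming documentation, so check it against the manual before relying on it. `dst_control` is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a data stream control word.
   block_vectors: 1..32 vectors of 16 bytes per block (32 is encoded as 0)
   block_count:   1..256 blocks (256 is encoded as 0)
   stride_bytes:  signed distance between block starts, nonzero here */
static uint32_t dst_control(unsigned block_vectors, unsigned block_count,
                            int stride_bytes)
{
    assert(block_vectors >= 1 && block_vectors <= 32);
    assert(block_count >= 1 && block_count <= 256);
    assert(stride_bytes >= -32768 && stride_bytes <= 32767 && stride_bytes != 0);

    return ((uint32_t)(block_vectors & 0x1Fu) << 24)
         | ((uint32_t)(block_count & 0xFFu) << 16)
         | ((uint32_t)stride_bytes & 0xFFFFu);
}
```

For instance, `dst_control(2, 4, 512)` would describe four 32-byte blocks spaced 512 bytes apart, such as the left edge of four rows in an image with a 512-byte row pitch. A single cache block touch cannot express that pattern in one instruction.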