In a previous blog/forum entry I provided an example of using a prefetch bif with an array to improve performance on a z10 or z196 machine.
These entries build on this blog entry, which discusses how to use such bifs in binaries that will run on any machine, even machines that are not z10s or z196s.

The basic principle of using prefetch instructions is to issue them early enough that the requested memory can actually be brought into cache, and to issue them only where data cache misses are actually a problem. If d-cache misses are not occurring, adding the bif simply increases the overhead of that piece of code for no benefit. If d-cache misses are a problem but the data is requested too soon after the prefetch is issued, there will again be no net benefit.

However, the cost of a d-cache miss is so high that even if only a small fraction of prefetch calls are successful it can be a net win, so it is worth investigating their use.

This entry deals with using prefetches with a queue; the previous entry used an array. An array is an attractive target for prefetching because we know a contiguous piece of memory is being used, and valid prefetch addresses can be calculated from the current reference. If the array is being accessed in a predictable manner, calculating an effective prefetch address can be straightforward. An effective prefetch address is one that will not be requested until the prefetch instruction has had enough time to bring it into cache.
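For the array case, a sketch of the idea looks like the following; the `AHEAD` distance and the no-op `__dcbt` fallback are assumptions for illustration, not values from the earlier entry:
<pre class="jive-pre">
#include &lt;stddef.h&gt;

/* __dcbt is the XL C prefetch built-in; this no-op fallback is an
   assumption so the sketch also compiles with other compilers. */
#ifndef __IBMC__
#define __dcbt(p) ((void)(p))
#endif

/* Prefetch a fixed distance ahead while streaming through the array.
   AHEAD is a tuning knob: large enough to cover the prefetch latency,
   small enough that the line is not evicted before it is used. */
enum { AHEAD = 64 };

long sum_array(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i &lt; n; i++) {
        if (i + AHEAD &lt; n)
            __dcbt((void *)&amp;a[i + AHEAD]); /* address computed from current reference */
        total += a[i];
    }
    return total;
}
</pre>
The key property is that the prefetch address, &amp;a[i + AHEAD], can be computed directly from the current reference.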

Other data structures such as trees, linked lists and queues that are implemented with pointers can be more challenging to use with prefetch instructions. There is no guarantee that the 'next' pointer references memory that is already in the cache, or even memory in the same page. More importantly, there is generally no way to predict which memory locations will be needed in the future, so it can be difficult to issue the prefetch instruction far enough in advance to minimize d-cache misses. If, for example, a linked list is being traversed and we are simply checking the value of a single field, issuing a prefetch will probably not help, because there is not enough time for the prefetch to bring the needed data into cache before it is actually needed. For example:

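Here is a minimal sketch of such a traversal, assuming a simple `struct node` layout; the no-op `__dcbt` fallback is only there so the sketch compiles with compilers other than XL C:
<pre class="jive-pre">
#include &lt;stdio.h&gt;

/* Assumption: no-op fallback so the sketch compiles where the
   XL C __dcbt prefetch built-in is unavailable. */
#ifndef __IBMC__
#define __dcbt(p) ((void)(p))
#endif

struct node {
    int value;
    struct node *next;
};

/* Walk the list, prefetching the next node while checking a single
   field.  Returns the number of matching entries. */
int walk(struct node *ptr)
{
    int hits = 0;
    while (ptr != NULL) {
        __dcbt(ptr-&gt;next);        /* touch the next node's cache line */
        if (ptr-&gt;value == 100) {
            printf("hi\n");
            hits++;
        }
        ptr = ptr-&gt;next;
    }
    return hits;
}
</pre>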
The prefetch will not help much unless there are a lot of entries with the value 100.
However, if even a little work is being done between memory dereferences, a prefetch can be advantageous.

The example program, prefetchExample2.c, is compiled with
<pre class="jive-pre">
xlc -oprefetch -qarch=8 prefetchExample2.c
</pre>
or
<pre class="jive-pre">
xlc -onop -DNO_PREFETCH -qarch=8 prefetchExample2.c
</pre>
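The pattern being measured, tunable work between memory dereferences with the prefetch compiled out under NO_PREFETCH, might be sketched like this; the node layout, the WORK default, and the busy-work loop are assumptions for illustration, not the original prefetchExample2.c:
<pre class="jive-pre">
/* Assumption: no-op fallback so the sketch compiles off XL C. */
#ifndef __IBMC__
#define __dcbt(p) ((void)(p))
#endif

#ifndef WORK
#define WORK 50               /* busy-work iterations per node (tuning knob) */
#endif

struct qnode {
    long value;
    struct qnode *next;
};

long traverse(struct qnode *ptr)
{
    long total = 0;
    while (ptr != NULL) {
#ifndef NO_PREFETCH
        __dcbt(ptr-&gt;next);    /* start fetching the next node now... */
#endif
        for (int i = 0; i &lt; WORK; i++)   /* ...while working on this one */
            total += ptr-&gt;value + i;
        ptr = ptr-&gt;next;
    }
    return total;
}
</pre>
Building with -DNO_PREFETCH compiles the __dcbt call away, giving the nop version.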

I ran a prefetch-enabled version (prefetch) and a non-enabled version (nop) with varying amounts of work on a z10. The more work, the more time the prefetch instruction had to work its magic; the less work, the less effective the prefetch was.

As the results show, even with a small amount of work, prefetch instructions can help, although their benefit does taper off.


Re: Using the zArchitecture prefetch bif with queues (z10 and newer)

‏2011-06-22T06:05:23Z

This is the accepted answer.

Great stuff Chris! It's invaluable information and I'm sure I speak for all of us when I say a big thank you.

It would be interesting to run your prefetch example using multiple threads with a shared queue, to see just how effective the shared cache architecture on modern hardware is. FWIW, I don't have access to a machine with prefetch!
