To surpass the overhead reduction offered by block prefetches, the final
step would be to eliminate the ``per-iteration'' instruction overhead
altogether. Two previous proposals that accomplish this are (i)
issuing large block prefetches for the entire data structure outside the
loop (as illustrated in Figure ), or (ii) using
strictly hardware-based prefetching. However, both of these approaches have
drawbacks, which we briefly discuss below.

The large block prefetch approach was studied by Gornish et
al. [32][31] in the context of a software-coherent
shared-memory multiprocessor with binding prefetches. A potential problem with
large block prefetches is that data may arrive in the cache too early,
thereby exposing it to replacement before it can be used. In
addition, the bursts in network traffic caused by large block requests can lead
to queueing delays. We observed these negative effects when using large block
prefetches in LU during an earlier study [61]. While
hardware-based prefetching schemes can stream data into the cache more evenly,
they suffer from a number of disadvantages, which we describe in
detail in Section .

To improve upon both of these techniques, we introduce the notion of programmable streams. With programmable streams, the software decides what
to prefetch and when the prefetches should be issued. As in the
``large block prefetch'' approach, this information is provided only once
outside the loop, so there is no per-iteration instruction overhead.
However, in contrast to the ``large block prefetch'' approach, where all
data is fetched at once, programmable streams provide a mechanism for
``flow control'' so that data can be streamed into the cache at a rate
matching the computation. This flow control is accomplished by associating
a ``trigger instruction'' address with each stream. A trigger instruction
can be any normal instruction in the loop, provided that it is executed
once per loop iteration. Once a programmable stream has been initialized by
the software, the hardware issues the next prefetch in the stream each time
the trigger instruction is executed. The data therefore arrives in the
cache in the same manner as with software pipelining, but with essentially no
instruction overhead.

To initialize a programmable stream, the software must specify at least the
following: (i) a starting prefetch data address, (ii) the difference
between consecutive prefetch data addresses, and (iii) a trigger
instruction address. These values are stored in a hardware structure
representing the stream. The only value which changes is the ``current
prefetch data address'', which is initially the starting prefetch address,
but is updated after each new prefetch is issued upon a matching trigger
instruction. Once the loop completes, the software may choose to deallocate
the stream to free the corresponding hardware entry for later use.
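The stream lifetime described above can be sketched as a small software model. The sketch below is illustrative only: the class and method names (`StreamEngine`, `on_instruction`, and so on) are our own invention for exposition, not part of any proposed hardware interface.

```python
# Illustrative software model of a programmable-stream table.
# All names are hypothetical; this is a sketch of the mechanism, not a
# definitive hardware specification.

class ProgrammableStream:
    """One stream entry: starting address, stride, and trigger PC."""
    def __init__(self, start_addr, stride, trigger_pc):
        self.stride = stride
        self.trigger_pc = trigger_pc
        self.current_addr = start_addr  # the only field that changes

class StreamEngine:
    """Issues the next prefetch in a stream each time its trigger executes."""
    def __init__(self):
        self.streams = {}   # stream id -> ProgrammableStream
        self.issued = []    # log of prefetch addresses, for illustration

    def allocate(self, sid, start_addr, stride, trigger_pc):
        self.streams[sid] = ProgrammableStream(start_addr, stride, trigger_pc)

    def deallocate(self, sid):
        del self.streams[sid]  # frees the hardware entry for later reuse

    def on_instruction(self, pc):
        # Called once per executed instruction; a matching trigger PC
        # causes that stream to issue a prefetch and advance its address.
        for s in self.streams.values():
            if s.trigger_pc == pc:
                self.issued.append(s.current_addr)
                s.current_addr += s.stride

engine = StreamEngine()
engine.allocate(sid=0, start_addr=0x1000, stride=64, trigger_pc=0x400)
for _ in range(4):                # four loop iterations
    engine.on_instruction(0x400)  # trigger executes once per iteration
engine.deallocate(0)
print([hex(a) for a in engine.issued])
# -> ['0x1000', '0x1040', '0x1080', '0x10c0']
```

Note that no per-iteration prefetch instruction appears in the ``loop''; the engine issues one prefetch per trigger execution, matching the computation rate.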

Further refinements of this basic approach may be achieved by having the
software provide more information to the hardware. For example, the
prefetch block size may be specified to make the software more portable
across different cache configurations. Also, to optimize for cases with
spatial locality, the software could specify the number of iterations
between prefetches, and the hardware could account for this by maintaining
a modulo counter with each stream. Finally, given that the hardware knows
the memory latency, it could compute the number of iterations to pipeline
ahead and automatically issue the prolog prefetches if the software
specifies the number of instructions in the loop body.
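As a sketch of these refinements, the following illustrative model (all names hypothetical, and a uniform cycles-per-instruction assumed for simplicity) adds a modulo counter that spaces prefetches several iterations apart, plus a helper that derives the software-pipelining depth from the memory latency and the loop-body length.

```python
# Hypothetical sketch of the two refinements: spacing prefetches with a
# modulo counter, and computing the prolog depth from the memory latency.
import math

def prolog_depth(memory_latency_cycles, loop_body_instructions,
                 cycles_per_instruction=1):
    # Iterations to pipeline ahead: enough loop bodies to cover the latency.
    loop_body_cycles = loop_body_instructions * cycles_per_instruction
    return math.ceil(memory_latency_cycles / loop_body_cycles)

class SpacedStream:
    """Stream that fires once every `iters_per_prefetch` trigger executions."""
    def __init__(self, start_addr, stride, iters_per_prefetch):
        self.current_addr = start_addr
        self.stride = stride
        self.iters_per_prefetch = iters_per_prefetch
        self.counter = 0
        self.issued = []

    def on_trigger(self):
        # Modulo counter: issue a prefetch only when the counter wraps.
        self.counter = (self.counter + 1) % self.iters_per_prefetch
        if self.counter == 0:
            self.issued.append(self.current_addr)
            self.current_addr += self.stride

# E.g. 8-byte elements in a 64-byte line -> one prefetch per 8 iterations.
s = SpacedStream(start_addr=0x2000, stride=64, iters_per_prefetch=8)
for _ in range(24):
    s.on_trigger()
print(len(s.issued), prolog_depth(200, 50))
# -> 3 4  (3 prefetches over 24 iterations; pipeline 4 iterations ahead)
```

With spatial locality exploited this way, the stream consumes one hardware event per line rather than per element, while the prolog computation relieves the software of latency-dependent scheduling.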