The goal of software pipelining is to issue prefetches the proper amount of
time in advance, such that the data will be found in the cache when it is
actually needed. The number of iterations to prefetch ahead must be
carefully chosen (see equation ()), since too few
iterations will not provide enough time to hide the latency, but too many
iterations may cause the data item to be replaced from the cache before it
can be referenced.
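
As a rough illustration of this scheduling idea, the following C sketch computes a prefetch distance and applies it to a simple loop split into a prolog, a steady state, and an epilog. It assumes the distance is the prefetch latency divided by the estimated cycles per iteration, rounded up. The names `prefetch_distance` and `sum_with_prefetch` are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin, not the mechanism used in the experiments reported here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of iterations to prefetch ahead.
 * Assumption: distance = ceil(latency / cycles per iteration). */
static size_t prefetch_distance(size_t latency_cycles, size_t cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    return (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

/* Sum an array while prefetching `dist` iterations ahead.
 * For simplicity this issues one prefetch per element; a real
 * compiler would issue only one prefetch per cache line. */
static double sum_with_prefetch(const double *a, size_t n, size_t dist)
{
    double sum = 0.0;
    size_t i;

    /* Prolog: prefetch the data needed by the first `dist` iterations. */
    for (i = 0; i < dist && i < n; i++)
        __builtin_prefetch(&a[i], 0, 3);

    /* Steady state: prefetch for iteration i + dist, compute iteration i. */
    for (i = 0; i + dist < n; i++) {
        __builtin_prefetch(&a[i + dist], 0, 3);
        sum += a[i];
    }

    /* Epilog: the remaining iterations have already been prefetched. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

For example, with a 300-cycle latency and a loop body estimated (hypothetically) at 36 cycles, `prefetch_distance(300, 36)` yields 9 iterations.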

To evaluate the effectiveness of our software pipelining algorithm, we
show in Figure a breakdown of the impact of prefetching
on the original primary cache misses. This breakdown contains three
categories: (i) those that are prefetched and subsequently hit in the
primary cache (pf-hit), (ii) those that are prefetched but remain
primary misses (pf-miss), and (iii) those that are not prefetched
(nopf-miss). The effectiveness of the software pipelining algorithm
is reflected by the size of the pf-miss category: the smaller this
category, the better the scheduling. A large pf-miss category means
that the prefetches are either not issued early enough, in which case the
line does not return to the primary cache by the time it is referenced, or
are issued too early, in which case the line has already been replaced in
the cache before it is referenced.
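
The three-way breakdown above can be sketched as a small classification routine. This is only an illustration: the `ref_outcome` record and the function names are hypothetical, and the sketch assumes that each record describes one reference that was a primary cache miss in the original (non-prefetching) run.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { PF_HIT, PF_MISS, NOPF_MISS } miss_class;

/* Hypothetical record for one original primary cache miss. */
typedef struct {
    bool prefetched;   /* was a prefetch issued for this reference?   */
    bool hit_primary;  /* did the reference then hit in the primary?  */
} ref_outcome;

static miss_class classify(ref_outcome r)
{
    if (!r.prefetched)
        return NOPF_MISS;                 /* (iii) not prefetched        */
    return r.hit_primary ? PF_HIT         /* (i) prefetched, then hit    */
                         : PF_MISS;       /* (ii) prefetched, still miss */
}

/* Count how many references fall into each category. */
static void tally(const ref_outcome *refs, size_t n, size_t counts[3])
{
    counts[PF_HIT] = counts[PF_MISS] = counts[NOPF_MISS] = 0;
    for (size_t i = 0; i < n; i++)
        counts[classify(refs[i])]++;
}

/* Fraction of prefetched references that still miss in the primary cache;
 * returns 0 when nothing was prefetched. */
static double pf_miss_fraction(const size_t counts[3])
{
    size_t prefetched = counts[PF_HIT] + counts[PF_MISS];
    if (prefetched == 0)
        return 0.0;
    return (double)counts[PF_MISS] / (double)prefetched;
}
```

A small `pf_miss_fraction` suggests that prefetches are arriving neither too late nor too early.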

The results in Figure indicate that the scheduling
algorithm is generally effective. The exceptions are CHOLSKY and TOMCATV,
where over a third of the prefetched references are not found in the cache.
The problem in these cases is that cache conflicts remove prefetched data
from the primary cache before it can be referenced. To adjust for this,
one might consider decreasing the prefetch latency compile-time
parameter (i.e. parameter in equation ()), which
was set to 300 cycles for these experiments. We will evaluate this
possibility later in Section . However, we observe
that when cache conflicts are the problem, they often occur frequently
enough that they cannot be avoided by simply adjusting the software
pipelining parameters. Later, in
Section , we examine these cases in more
detail and evaluate whether increasing the cache associativity can help.
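
To make the conflict problem concrete, the following toy simulation prefetches a line, touches a second line that maps to the same cache set, and then references the first line. The cache geometry (32-byte lines, 4 sets), the LRU replacement policy, and all names are assumptions chosen for illustration; they are not the configuration used in these experiments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u  /* assumed line size                 */
#define NUM_SETS   4u   /* assumed, tiny for illustration    */
#define MAX_WAYS   4u

typedef struct {
    unsigned ways;                          /* 1 = direct-mapped */
    bool     valid[NUM_SETS][MAX_WAYS];
    uint64_t tag[NUM_SETS][MAX_WAYS];
    uint64_t last_use[NUM_SETS][MAX_WAYS];  /* for LRU replacement */
    uint64_t clock;
} cache;

static void cache_init(cache *c, unsigned ways)
{
    assert(ways >= 1 && ways <= MAX_WAYS);
    memset(c, 0, sizeof *c);
    c->ways = ways;
}

/* Bring the line containing addr into the cache; returns true on a hit.
 * Prefetches and demand references are modeled identically here. */
static bool cache_access(cache *c, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned set = (unsigned)(line % NUM_SETS);
    uint64_t tag = line / NUM_SETS;
    unsigned victim = 0;

    c->clock++;
    for (unsigned w = 0; w < c->ways; w++) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->last_use[set][w] = c->clock;
            return true;
        }
    }
    /* Miss: fill an invalid way if there is one, else evict the LRU way. */
    for (unsigned w = 0; w < c->ways; w++) {
        if (!c->valid[set][w]) { victim = w; break; }
        if (c->last_use[set][w] < c->last_use[set][victim]) victim = w;
    }
    c->valid[set][victim] = true;
    c->tag[set][victim] = tag;
    c->last_use[set][victim] = c->clock;
    return false;
}

/* Prefetch line A, touch conflicting line B, then reference A.
 * Returns true if A is still cached (a pf-hit), false if not (a pf-miss). */
static bool prefetched_line_survives(unsigned ways)
{
    cache c;
    uint64_t a = 0;
    uint64_t b = a + (uint64_t)LINE_BYTES * NUM_SETS; /* same set, new tag */

    cache_init(&c, ways);
    cache_access(&c, a);        /* the prefetch            */
    cache_access(&c, b);        /* the conflicting access  */
    return cache_access(&c, a); /* the actual reference    */
}
```

In this sketch the prefetched line is lost when the cache is direct-mapped (`ways == 1`) and survives when it is two-way set-associative. Issuing the prefetch earlier or later does not change the outcome as long as the conflicting access falls between the prefetch and the reference.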

Even in cases where prefetched data is replaced from the primary cache
before it can be referenced, there is still a performance advantage since
the data tends to remain in the secondary cache. Therefore, although the
miss latency is not eliminated, it is often reduced from a main memory
access to a secondary cache access. This was shown earlier in
Table , where selective prefetching reduced
the average miss penalty from 24.8 to 12.3 cycles for CHOLSKY, and from
36.6 to 12.5 cycles for TOMCATV.
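
As a quick check on these figures, the relative reduction in average miss penalty follows directly from the two pairs of numbers quoted above:

```latex
\[
\text{CHOLSKY: } \frac{24.8 - 12.3}{24.8} \approx 0.50,
\qquad
\text{TOMCATV: } \frac{36.6 - 12.5}{36.6} \approx 0.66
\]
```

That is, the average miss penalty falls by roughly half for CHOLSKY and by roughly two thirds for TOMCATV.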