The selective prefetching algorithm uses several compile-time
parameters to model the behavior of the memory subsystem. Specifically,
these parameters are: (1) the cache line size,
(2) whether unknown loop bounds are assumed to be large or small,
(3) the effective cache size, and (4) the prefetch latency. The
most concrete of these parameters is the cache line size, which
can be set precisely. The other parameters, however, are more
heuristic in nature. To evaluate the robustness of our algorithm,
we measured the effects of varying these less obvious parameters.
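These four parameters can be pictured as a single compile-time configuration. The sketch below is purely illustrative (the field names are hypothetical and do not correspond to SUIF's actual internal interface); the heuristic values are those used in our experiments, while the cache line size is target-dependent:

```python
# Hypothetical configuration record for the selective prefetching
# algorithm; field names and the line-size value are illustrative only.
from dataclasses import dataclass

@dataclass
class PrefetchParams:
    line_size: int            # cache line size in bytes (can be set exactly)
    assume_large_loops: bool  # policy for unknown loop bounds (heuristic)
    effective_cache: int      # effective cache size in bytes (heuristic)
    prefetch_latency: int     # assumed prefetch latency in cycles (heuristic)

# Settings used in the experiments: small-loop policy, 500-byte
# effective cache, 300-cycle latency; line size shown is a placeholder.
params = PrefetchParams(
    line_size=32,             # target-dependent; illustrative value
    assume_large_loops=False,
    effective_cache=500,
    prefetch_latency=300,
)
```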

The SUIF compiler performs aggressive interprocedural constant propagation to
statically determine as many loop bounds as possible. When this fails, our
current algorithm assumes unknown loop counts to be small, which tends to
overestimate what remains in the cache. When the compiler assumes unknown
iteration counts to be large, it produces identical code for 11 of the 13
benchmarks; the two benchmarks that change are OCEAN and MG. For OCEAN, the
difference in performance is negligible. However, MG performs 4% worse with
the large-loop policy, as shown in Figure 6(a). In this
case, the benefit of the extra prefetches is more than offset by increased
instruction overhead. Although assuming small loop counts happens to be better
in this case, the opposite could easily be true in programs where unknown
iteration counts are actually large. One solution may be to resolve loop
counts through profiling feedback. However, the more interesting result of
this experiment is how rarely loop bound uncertainty affects performance. We
observe that while nearly half the benchmarks contain loops with unknown
bounds, in most cases this has no impact on locality, due to the patterns
of data reuse within those loops.
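The policy itself can be sketched as follows. This is a simplification of the actual locality analysis, and the two guess constants are illustrative stand-ins, not values taken from SUIF:

```python
# Sketch: how the unknown-loop-bound policy feeds the locality estimate.
# SMALL_GUESS and LARGE_GUESS are illustrative, not SUIF constants.
SMALL_GUESS = 8        # iterations assumed under the small-loop policy
LARGE_GUESS = 10_000   # iterations assumed under the large-loop policy

def assumed_trip_count(known_bound, assume_large):
    """Iteration count used by the locality analysis."""
    if known_bound is not None:   # constant propagation succeeded
        return known_bound
    return LARGE_GUESS if assume_large else SMALL_GUESS

def loop_footprint(bytes_per_iter, known_bound, assume_large):
    """Data touched by the loop, as the compiler models it."""
    return bytes_per_iter * assumed_trip_count(known_bound, assume_large)
```

Under the small-loop policy, an unknown loop's footprint looks small enough to fit in the effective cache, so temporal reuse is credited and fewer prefetches are issued; the large-loop policy reverses that decision.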

For our experiments so far, the effective cache size has been set to 500 bytes,
which is only a small fraction of the actual cache size (8 Kbytes). When the
effective cache size is set to the full 8 Kbytes, our compiler generates
identical code for 7 of the 13 benchmarks. For 5 of the 6 benchmarks that do
change, the difference in performance is negligible. The one case that changes
significantly is CFFT2D, as shown in Figure 6(b). In this
case, fewer prefetches are issued with the larger effective cache size.
However, the prefetches that are eliminated happen to be useful, since they
fetch data that is replaced due to cache conflicts. As a result, the
performance suffers, as we see in Figure 6(b).
(Note that this is in contrast with the effect we see in Figure
6(a), where issuing more prefetches hurts performance.) In
the case of CFFT2D, many critical loops reference 2 Kbytes of data, and these
loops happen to suffer from cache conflicts. An effective cache size of 500
bytes produces the desired result in this case. Overall, however,
the results are robust with respect to effective cache size.
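The role the effective cache size plays can be sketched as a simple threshold test (a simplification of the actual analysis; the function name is ours):

```python
# Sketch: reuse turns into locality only if the data touched between
# uses fits within the (deliberately conservative) effective cache size.
EFFECTIVE_CACHE = 500  # bytes; far below the actual 8-Kbyte cache

def reuse_becomes_locality(reuse_footprint_bytes,
                           effective_cache=EFFECTIVE_CACHE):
    """Decide whether reused data is expected to still be in the cache."""
    return reuse_footprint_bytes <= effective_cache
```

A loop re-touching 2 Kbytes of data, as in CFFT2D, fails this test at 500 bytes, so its references are prefetched; at the full 8 Kbytes the test passes, those prefetches are dropped, and the conflict misses reappear.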

Finally, for our experiments in Section 4.1, we set the
prefetch latency to 300 cycles. We chose a value greater than 75 cycles to
account for bandwidth-related delays. To evaluate whether this was a good
choice, we compiled each benchmark again using prefetch latencies of 100 and
1000 cycles. In nearly all the cases, the impact on performance is small. In
many cases, the 100-cycle case is slightly worse than the 300-cycle case due to
bandwidth-related delays. The most interesting case is CHOLSKY, as shown in
Figure 6(c). In this case, prefetched data tends to be
replaced from the cache shortly after it arrives, so ideally it should arrive
``just in time''. Therefore, the lowest prefetch latency (100 cycles) offers
the best performance, as we see in Figure 6(c).
However, in such cases the best approach may be to eliminate the
cache conflicts that cause this behavior [19].
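The assumed latency matters because it determines how many iterations ahead prefetches are scheduled, via the standard software-pipelining formula. The loop-body cost below is illustrative, not a measurement from our benchmarks:

```python
import math

def prefetch_distance(latency_cycles, cycles_per_iteration):
    """Iterations ahead to issue a prefetch so the line arrives
    before its use (standard software-pipelining formula)."""
    return math.ceil(latency_cycles / cycles_per_iteration)

# For a short loop body of, say, 30 cycles (illustrative), the three
# latencies we compiled with yield very different distances:
assert prefetch_distance(100, 30) == 4    # 100-cycle latency
assert prefetch_distance(300, 30) == 10   # 300-cycle latency
assert prefetch_distance(1000, 30) == 34  # 1000-cycle latency
```

For a benchmark like CHOLSKY, where prefetched data is evicted soon after it arrives, the smallest distance keeps arrival closest to ``just in time''.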

In summary, the performance of our selective algorithm was affected
noticeably in only one of the 13 benchmarks for each parameter we varied.
Overall, the algorithm appears to be quite robust.