Sophisticated branch prediction and compiler optimization technologies make instruction references more predictable, which makes the branch target cache and prefetch buffer (BTC+PB) design appealing. Surprisingly, however, this BTC+PB design actually performs worse than a non-partitioned instruction cache. Further investigation shows that the degradation is mainly due to the limited bus bandwidth available for prefetching. To remedy this, we propose two load-balancing mechanisms for the BTC+PB design: the multi-blocks target (MBT) and dynamic prefetched instruction placement (DIP) techniques. The basic idea of both techniques is to trade off cache space for bus bandwidth once the bus is found to be overloaded by prefetching. The resulting cache, called the LB+PB design, is found to outperform current non-partitioned instruction cache designs. On the SPEC95 benchmarks, the memory latency due to instruction references is reduced by an average of 5% to 15%, with improvements of over 50% on some benchmarks.

Source Title: Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors