As technology continues to scale, the memory hierarchy in processors is predicted to be a major component of the overall system energy budget. This has led many researchers to focus on techniques that minimize both the amount of data moved and the distance it is moved. While many techniques have been shown to reduce on-chip network traffic, no study has shown how close a combined approach would come to eliminating all unnecessary data traffic, nor has any study provided insight into where the remaining challenges lie.
To answer these questions, this thesis systematically analyzes the on-chip traffic of six applications. For this study, we use an alternative hardware-software coherence protocol, DeNovo, and apply several optimizations to it in order to quantify the traffic inefficiencies of a directory-based MESI protocol. With a fully optimized DeNovo protocol, we can remove most of the traffic inefficiencies caused by poor spatial locality, a fetch-on-write write policy, poor L2 reuse, and MESI protocol overheads. In discussing these improvements, we highlight the software causes of these traffic overheads in order to generalize the results.
The final DeNovo protocol, with all optimizations applied, reduces network traffic by an average of 39.5% and execution time by an average of 10.5% relative to a baseline MESI implementation. On average, 8.8% of the remaining traffic is spent fetching non-useful data. Because of irregular access patterns in the applications, our optimizations alone cannot reduce this non-useful data movement further without losing performance.