J. W. Haskins and K. Skadron
In Proceedings of the 2003 IEEE International Symposium on
Performance Analysis of Systems and Software, Mar. 2003.

Abstract
This paper proposes to speed up sampled microprocessor
simulations by reducing warmup times without sacrificing
simulation accuracy. It exploits the observation that, of the
memory references that precede a sample cluster, references that
occur nearest to the cluster are more likely to be germane to the
execution of the cluster itself. Hence, while modeling all cache and
branch predictor interactions that precede a sample cluster would
reliably establish their state, this is overkill and leads to long-running
simulations. Instead, accurately establishing simulated
cache and branch predictor state can be accomplished quickly
by only modeling a subset of the memory references and control-
flow instructions immediately preceding a sample cluster.

Our technique measures memory reference reuse latencies
(MRRLs)--the number of completed instructions between consecutive
references to each unique memory location--and uses
these data to choose a point prior to each cluster to engage cache
hierarchy and branch predictor modeling. By starting cache and
branch predictor modeling late in the pre-cluster instruction
stream, we achieved an average of 90.62% of the maximum potential
speedup in overall simulation running time (the speedup obtained
by performing no pre-cluster warmup at all), with an average error
in IPC of less than 1%, both measured relative to simulations that
warm up all pre-cluster cache and branch predictor interactions.
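
The idea of measuring reuse latencies and using them to pick a warmup start point can be illustrated with a minimal sketch. It is not the implementation used in this work. The trace format (pairs of completed-instruction count and address), the function names, and the `coverage` parameter (the fraction of observed reuse latencies that the warmup window should span) are all assumptions introduced here for illustration; the abstract does not say how the start point is derived from the MRRL data.

```python
import math
from typing import Dict, List, Sequence, Tuple


def reuse_latencies(trace: Sequence[Tuple[int, int]]) -> List[int]:
    """Return the reuse latency of every re-reference in the trace.

    `trace` holds (completed_instruction_count, address) pairs in
    program order (a hypothetical format). The latency of a reference
    is the number of instructions completed since the previous
    reference to the same address. First-time references are skipped.
    """
    last_seen: Dict[int, int] = {}
    latencies: List[int] = []
    for icount, addr in trace:
        if addr in last_seen:
            latencies.append(icount - last_seen[addr])
        last_seen[addr] = icount
    return latencies


def warmup_length(latencies: Sequence[int], coverage: float) -> int:
    """Smallest latency that is >= at least `coverage` of all latencies.

    `coverage` is an assumed tuning knob in (0, 1], not a value taken
    from the paper.
    """
    if not latencies:
        return 0
    ordered = sorted(latencies)
    k = max(0, math.ceil(coverage * len(ordered)) - 1)
    return ordered[k]


def warmup_start(cluster_start: int, latencies: Sequence[int],
                 coverage: float) -> int:
    """Instruction count at which to engage cache/predictor modeling."""
    return max(0, cluster_start - warmup_length(latencies, coverage))


if __name__ == "__main__":
    # Toy trace: three addresses referenced before a cluster at 1000.
    trace = [(0, 0xA), (10, 0xB), (20, 0xA), (25, 0xB), (30, 0xA),
             (1000, 0xC), (1005, 0xC)]
    lats = reuse_latencies(trace)
    print(lats, warmup_start(1000, lats, 0.75))
```

In this toy trace the reuse latencies are 20, 15, 10, and 5 instructions, so a coverage of 0.75 yields a 15-instruction warmup window ending at the cluster boundary. A higher coverage lengthens the window and should lower the error, at the cost of more detailed simulation before each cluster.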