Due: December 8, 2000

In this lab, you will work together with a partner to gain experience
in how hypotheses about memory system performance can be tested using
trace-driven cache simulation.

Because there is an odd number of students, exactly one group should
be of one or three rather than of two. Please try to find your own
partners; if you want to be the group of one or three (with two
others), or if you can't find a partner, see me. Each group should
submit a single jointly authored lab report.

Important: It is important that you complete the
Pre-Lab portion of this assignment before you start the In Lab
portion.

Pre-Lab

Rather than waiting for the first access to a block to start fetching
that block from main memory to the cache, it is possible to
prefetch a block likely to be accessed soon. Our textbook
mentions one form of prefetching, compiler-directed
prefetching, on page 620. In
this lab we will study a different form of prefetching, hardware
prefetching or auto-prefetching, in which the cache
hardware automatically chooses when to prefetch a block rather than
relying on ``hints'' from the compiler. Generally this involves a
very simple rule such as prefetching the next block after the one
currently being accessed.

Auto-prefetch can decrease the miss rate if the prefetched blocks are
used, but may also increase the amount of traffic between the main
memory and the cache if the blocks go unused. In fact, if the
prefetched blocks go unused, and bump other blocks out of the cache
that would have been used, the miss rate can actually go up rather
than down. A prefetch strategy is highly effective if it
causes a substantial decrease in the miss rate with only a minor
increase in memory traffic. For example, the miss rate might go down
by 50% while the memory traffic goes up by 1%. In the other
scenarios, where the miss rate doesn't go down much and/or the traffic
goes up a lot, we can term the prefetch strategy ineffective
(if the miss rate doesn't go down by much), costly (if the
miss rate is substantially reduced, but at a high price in memory
traffic), or counterproductive (if miss rate and traffic both
go up).

The question you are to answer is this: what is the relationship
between cache block size and total cache size and the cost and
effectiveness of auto-prefetch? That is, if you consider a
two-dimensional space of possible caches, where one dimension is
block size and the other is total cache size, and you were to map out
for each point in that two-dimensional space the percent increase in
memory traffic that auto-prefetching would cause, what would it look
like? Now suppose you mapped out for each point in the
two-dimensional space the percent reduction (or increase!) in misses
that auto-prefetching would cause, what does that look like? Putting
these together, in what regions of the space is auto-prefetching cheap
and highly effective? In which regions is it ineffective to the
extent of being pointless? In which is it out and out harmful?

Before starting the lab, you should decide and write down the
following two things:

What is your hypothesis regarding the answer to the above questions?
What is the rationale behind your hypothesis?

What is your experimental plan for testing your hypothesis? In
particular, what values of block size and total cache size are you
going to use? (Note: all the block sizes must be multiples
of four bytes, since all accesses are of four-byte words.
Additionally, the block sizes and cache sizes should be powers of
two.) What values are you going to use for the other cache parameters
that you hold constant? Which of the various auto-prefetch variants
supported by the dineroIII simulator are you going to use? Which
address trace(s) are you going to use? In order to answer these
questions, you will need to read ahead into the In Lab section below.

In Lab

You will use the dineroIII trace-driven cache simulator. A
trace-driven cache simulator reads in a trace, that is, a record of
the memory references from a particular execution (with each address
tagged with whether it was a read, a write, or an instruction fetch).
The simulator is parameterizable so that the same trace can be tried
with a variety of different parameters for block size, associativity,
cache size, prefetch strategy, etc. At the end of the simulation, a
variety of statistics are produced, including the miss rate for the
cache and the amount of memory traffic. DineroIII is a good quality,
flexible simulator often used in such work.

We have available on-line traces from three benchmarks (tex, gcc, and
spice). Each of the traces contains about a million memory
references. The traces reside in the directory ~max/MC48, and
have filenames tex.din.gz, cc1.din.gz, and
spice.din.gz respectively. The .gz suffix indicates that they
are compressed using the gzip program (to save disk space); to
decompress them at the same time you run dineroIII, you can use a
command that decompresses the file and "pipes" the decompressed
version directly into dineroIII. On one of our Linux PCs, the command
would be

zcat ~max/MC48/name.din.gz | ~max/MC48/dineroIII options

where name is the name of the trace you want to use and options is a
list of dineroIII parameters as specified in the accompanying manual page for dineroIII. Note that at a
bare minimum you need to specify the -b option and either
the -u option or the -i and -d
options. The -f option will also be particularly
relevant to your investigation. Remember that the block size for the
-b option is measured in bytes, so needs to be a multiple
of four.

You should observe the improvement prefetching makes in the number of
misses, specifically the so-called ``demand'' misses, i.e., the misses
on those blocks actually requested by the CPU, rather than by the
prefetching. You should also observe the increase in memory traffic
caused by unnecessary prefetches. Further, it would be interesting to
see whether instruction and data references are equally suited to
auto-prefetching. (DineroIII prints out separate statistics in each
category as well as totals.)

You may find it convenient to automate the data collection using a
feature of the unix shell known as ``foreach.'' In particular, you
can set up a shell script that uses nested foreach loops to run
systematically through a collection of cache sizes, and for each of
them through a collection of block sizes, and for each of them through
a collection of traces, and for each of them through two prefetching
strategies (don't prefetch, or some particular form of auto-prefetch).
The exact nature of this will depend on what you are trying to do.
The below is just an example, you should work from your own
experimental design choices (from the pre-lab) regarding what ranges
of values to iterate over. Keep in mind that you will need to analyze
whatever data you generate, and it may be easier to
generate lots of data than to analyze it. This example shell script,
for instance, will do 120 experimental runs, creating an output file
for each one. Each of those files will contain at least 4 interesting
numbers. That would leave you with 480 numbers to make sense of.

Note that although you could literally type something like the above
into a shell window it is probably more sensible to store it in a file
and then run it from the file. If you have it in a file called
script, you could run it using the command

csh script

Post-Lab

Your lab report should not only clearly summarize the pre-lab and in
lab portions of your work, it should also draw the two together by
comparing your hypothesis with your observations. Write in clear
technical English for an audience familiar with the material in our
textbook but unfamiliar with this lab assignment. Be sure to present
your experimental plan carefully enough that someone else could repeat
your experiments and observe the same output numbers.