Due December 5, 2008

In this lab, you will again be experimenting with the performance
of a C++ program, measuring the number of clock cycles needed to execute
the central portion of the program. Unlike in Lab 3, the program
is for a real purpose (simulating
Conway's Game
of Life), rather than just cooked up to test computers'
performance. Another major difference from Lab 3 is that the focus
this time is on the memory system. You may do this lab
with a partner so that you can talk about what you are doing and hopefully come do a clearer understanding.

You will need to write a clear report that explains what you did
(so that someone else could replicate your experiment), provides the
factual results of your experiment, and discusses what interpretation
you place on those results. Your report should be aimed at an
audience with background similar to our course, but without specific
knowledge of this lab assignment.

You and your partner should do all experimental runs on a single machine, which should be one of the following: oh326ll03, oh326ll04, oh326ll07, oh326ll11, oh326ll12, oh326ll13, oh326ll15, oh326ll16, oh326ll17, or oh326ll18. Eight of these are the eight machines on the side of the room with the glass windows and the art installation. I can point out the other two listed machines if you want, and can explain why I don't recommend the six machines that aren't on this list.

Be sure to specify which kind of computer you used; by
looking at the file /proc/cpuinfo, you can find both the
name of the processor and the amount of cache it has. The cache size
listed here is for the L2 cache, which is likely to be the relevant
one for the experiments you are conducting. In order to provide a
really satisfactory interpretation of your results, you will also find
it useful to know such attributes of the cache as its associativity
and block size. As I demonstrated in class, this information is
available within the /sys/devices/system/cpu/cpu0/
directory. The L1 cache information is also available there, if you
want to go the extra mile and document that for the sake of
completeness.

Compiling and running the program

You will need to download two files: life.cpp, which is the actual C++ source
code of the program, and cyccounter.h,
which is a helper file that contains the necessary definitions for
using the processor's cycle counter to get exact timings in units of
clock cycles. To compile the program, use the following command:

g++ -o life -O life.cpp

You can run the program using a command like

./life 1000 4000 0 1

The meaning of most of these numbers will become apparent from the
program's output, but to find the full story, you can read a comment
at the top of the source file.

Questions your experimentation should address

If you read the program, you'll see that the main portion, which is
timed, loops through the generations one by one, and for each
generation, loops over all the columns and rows of cells. Each position in the grid is called a "cell", whether it is live or dead; so the number of cells remains fixed, even as the number of living cells changes.

For each
cell in each generation, the work done is the same: adding up how many
neighbors are live and then (in a second pass over the cells) setting
the cell to be alive or dead based on the number of living neighbors
and the cell's own prior status. Because this same work is done for
each cell in each generation, one might think that the number of clock
cycles would scale up approximately proportionately to the number of
cells and generations. See if this seems to be the case when you double the number of cells, either by switching from 1000 rows to 2000 or from 4000 columns to 8000.

Some widths of cell grids may be particularly interesting, for
reasons that have to do with their particular numerical values rather
than their general size. For example, 4094 isn't much larger than
4000, but it has a special property: when you add in two border cells,
each row winds up occupying 4096 bytes, which is exactly 2 to the 12th
power. Does this make the program perform differently? Can you
improve its performance in this case by adding a little bit of padding
to each row? (The program has a built-in feature for adding a
specifiable amount of padding to each row.) You ought to be able to
connect your results to the structural organization of the L2
cache.

Each time the program loops over all the cells, it does it with two
nested loops that start with these lines:

In particular, these two lines appear in the two places
within the timed portion of the program, which is to say, between the two calls to read_cyccounter(). (They also appear outside the the timed portion, but that is less relevant.) If you interchange the order of these two lines in the two places they appear within the timed portion, the program will still
produce the same results, but it will process the cells in a different
order. How does this affect its performance? You should recompile
the modified program the same way as shown earlier in this lab handout
and should repeat the timing experiments you've previously done. In
accounting for your results, you may want to know that both the
Pentium 4 and the Core 2 can prefetch a cache block in advance of any
access to that block, if the preceding block is being read
sequentially.