CMPT 886: Special Topics in Operating
Systems and Computer Architecture

Assignment 1

In this assignment you will learn how to analyze performance of
applications using hardware counters. This is a skill that you could
use in many research projects. Your goal will be to analyze how
applications are affected by contention for the processor cache.

You will choose several experimental applications. Of these
applications, one will be the principal application; the others will be
the interfering applications. You will run the principal application
with each interfering application, ensuring that the two share the same
cache. (More on how to accomplish this below.) You will measure the
cache miss rate (in misses per instruction) and the instructions per
cycle (IPC) of the principal application as it runs with each
interfering application. Based on these data you will conclude how the
principal application’s cache miss rate and IPC are affected by the
interfering applications. Try to provide an explanation for the effects
you observe by learning something about the nature of the applications
(you can do this by reading the code or by finding information about the
benchmarks online).

Experimental platforms:

You will perform experiments on one (not all) of these machines.

coolthreads.cs.sfu.ca – this is a Solaris/SPARC
(Niagara) system with eight cores running Solaris 11. On each core
there are four hardware threads, or virtual CPUs. The threads running
on the same core share an L1 instruction cache and an L1 data cache. In
addition, all cores share an L2 cache. The L2 cache is unified: it
contains both instructions and data. On Niagara you can measure
interference at two different cache levels: the L1 cache and the L2 cache.

To measure interference in both the L1 cache and
the L2 cache you will need to run your benchmarks on the same core.
There are eight cores on Niagara, and four virtual CPUs per core. In
total, there are 32 virtual CPUs numbered 0-31. Virtual CPUs on core 0
are numbered 0-3, virtual CPUs on core 1 are numbered 4-7, etc. To run
your benchmarks on core 1, for example, you could bind your benchmarks
to virtual processors 4 and 6. (More on binding later.)

To measure the interference in the L2 cache, but not in
the L1 cache, you will need to run your benchmarks on two different
cores. For example, you could bind them to virtual processors 1 and 5.

quad.cs.sfu.ca – this is a Solaris/x86 quad-core
system running Solaris 11. On quad’s motherboard there are two
physical chips (or CPU packages), each with two cores. The two cores on
a chip share an L2 cache. On quad you can only test interference in the L2
cache, since the L1 cache is not shared. Virtual processors in quad are
numbered 0, 1, 2, and 3. L2 cache is shared among processors 0 and 1,
and among processors 2 and 3. To test L2 cache interference, you would
bind your benchmarks to processors 0 and 1, or to processors 2 and 3.

octavia.cs.sfu.ca – this is a Solaris/x86
system running Solaris 11. Octavia is like quad's "big sister" -- it has
twice the CPU packages that quad has, so it has eight cores in
total. To create interference for the cache resources, you
need to bind threads to the cores that share the L2
cache. Note that on octavia, a cache is not necessarily
shared by sequentially numbered cores. It could be that
cores 0 and 2 share the L2 cache. Before running your
experiments, you need to figure out the IDs of the cores
sharing the cache, as shown below.
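
Solaris can report the processor topology for you. A minimal sketch,
assuming the standard Solaris psrinfo and kstat utilities (output
fields vary by release, so verify on octavia itself):

psrinfo -pv
kstat -m cpu_info | egrep 'chip_id|core_id'

The first command groups virtual processors by physical chip; the
second reports each virtual CPU's chip and core IDs. Combined with the
processor's documentation, this lets you infer which cores share an L2
cache.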

Using hardware counters:

To evaluate cache interference of applications, you will need to
measure various performance statistics. For example, to evaluate L1
cache interference, you will measure L1 data cache misses, L1
instruction cache misses, and instructions per cycle (IPC). The cache
miss rate should increase if there is high cache interference between
the benchmarks. IPC, on the other hand, should decrease: the higher the
cache miss rate, the fewer instructions per cycle the program executes.

To measure these performance statistics you will use a tool called
cputrack. To find out what kinds of hardware counters are available on
your experimental machine, run “cputrack -h”. Notice the difference in
hardware counters between coolthreads and quad!
Read the manual page for cputrack (“man cputrack”) to learn how it works.

Here are some examples of using cputrack.

To measure L1 instruction cache misses and IPC on coolthreads,
run your benchmark like this:
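
Given the options described below, a plausible invocation would be (a
sketch; verify the exact event-specification syntax with “cputrack -h”
and “man cputrack” on coolthreads):

cputrack -t -T0.1 -evf -o output.txt -c pic0=IC_miss,pic1=Instr_cnt,sys <benchmark>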

option -t asks cputrack to print processor cycles; you will use this
for the calculation of the IPC.

option -c tells cputrack which events to count in the two
counters available on coolthreads. In this example, we count
instruction cache misses (IC_miss) in counter 0 and retired
instructions (Instr_cnt) in counter 1.

“sys” tells cputrack to count events that occur both at
user level and in system calls.

option -T0.1 tells cputrack to sample counters ten times
per second. Using larger intervals is not recommended due to a bug in
cputrack (that my students and I have found): it may result in cputrack
reporting wrong values.

option -o tells cputrack to write output to the file called
“output.txt”.

options -evf are very important to use. Read about them in
the man page.

Performance counters on quad

Quad has an Intel processor, often referred to as the x86 architecture.
This is a CISC processor, and it has a complex structure and many
hardware counters. It may be quite challenging to figure out which
counter to use to count the events you want. Here are some hints for
this assignment:

To count the number of completed instructions use inst_retired

To count the number of instruction cache misses use ifu_ifetch_miss

To count the number of data cache misses use dcu_lines_in

To count the number of L2 cache misses use l2_lines_in

Also note that on quad only the pic0 counter works, so you cannot use
pic1. This means you can count only one event per run and must measure
each event in a separate run.
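
Since each event needs its own run, a plausible set of invocations (a
sketch mirroring the coolthreads example above; the output file names
are hypothetical) is:

cputrack -t -T0.1 -evf -o inst.txt -c pic0=inst_retired,sys <benchmark>
cputrack -t -T0.1 -evf -o icmiss.txt -c pic0=ifu_ifetch_miss,sys <benchmark>

The first run gives you cycles (via -t) and retired instructions for
the IPC; the second gives you instruction cache misses for the miss
rate.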

Interpreting the output of cputrack.

Once you have run your benchmark via cputrack, cputrack will produce an
output file (output.txt in the example above). This file will have many
lines reporting hardware counter values – one for each sampling
interval of running time. You probably care about the aggregate results
for the entire run, so you should look for a line that looks something
like this:

1229.003 12841 1 fini_lwp 157486144176 3040518 54988926444

There are seven columns in this line:

Column #1 is the wallclock time (in seconds) since the
beginning of the program.

Column #2 is the PID of the process running your benchmark.

Column #3 is the light-weight process (LWP) ID; there would be
multiple of these if you ran a multithreaded application.

Column #4 (fini_lwp) tells you that these are the aggregate
statistics, recorded when the process exits.

Column #5 is the number of CPU cycles that elapsed since
your program started (this measures the cycles that your program spent
on CPU, not counting the time it blocked on I/O or was descheduled).

Column #6 counts the number of instruction cache misses.

Column #7 counts the number of retired instructions.

To calculate the IPC, you would divide column #7 by column #5. To
calculate the instruction cache miss rate (in misses per instruction),
you would divide column #6 by column #7.
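
For example, plugging in the sample line above: IPC = 54988926444 /
157486144176 ≈ 0.35, and the instruction cache miss rate = 3040518 /
54988926444 ≈ 0.000055 misses per instruction, or roughly one miss per
18,000 instructions.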

Experimental applications

You will use programs from the SPEC CPU2000 benchmark suite. You can
find more information about these benchmarks here. There is a variety
of applications in this benchmark suite – from them choose two main
benchmarks and two interfering benchmarks. (So you will have four pairs
of benchmarks in total.) Make sure that you pick both memory-intensive
and CPU-intensive applications: one main application should be
memory-intensive and the other CPU-intensive, and the same for the
interfering applications. Note that you will need to run your
benchmarks “by hand” as opposed to using the runspec utility. You can
read what it’s all about here.

When you run the main application with an interfering one, you
need to ensure that the interfering application keeps running while the
main application is running. So if the interfering application is
shorter than the main application, you'll need to restart the
interfering application while the main application runs; a simple
wrapper loop like the one below will do.
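
A minimal sketch of such a wrapper, assuming a POSIX shell (the CPU
number and <interfering-benchmark> are placeholders, and runbind is
described in the next section):

#!/bin/sh
# Hypothetical restart loop: rerun the interfering benchmark on
# virtual CPU 5 until this script is killed.
while true; do
    runbind -p 5 <interfering-benchmark>
done

Start the loop before the main application, and kill it once the main
application has finished.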

Binding applications to the same core

Use the runbind utility described here. To bind
your benchmark to a particular virtual CPU (CPU 7, for example), you
will run it like this:

runbind -p 7 <benchmark>

Recall that you also have to run the whole thing with cputrack to
measure performance. So you will do it like this:

cputrack runbind -p 7 <benchmark>

Note that when you do this, your cputrack output file will contain
measurements for both the runbind command and your benchmark.
Be sure to get the output for the right PID. You can tell which PID
corresponds to your benchmark by examining the output file (hint: look
in the very beginning)!

To bind multiple benchmarks, for instance one to CPU 7 and another to CPU 8, use runbind like this:
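
Judging from how runbind parses "-p" arguments (described next), the
invocation is likely of this form (a sketch; check the runbind
documentation):

runbind -p 7 <benchmark1> -p 8 <benchmark2>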

If your program takes arguments of the form "-p", runbind will get
confused. For instance, in the following example:

benchmark -p -p

it will assume that the "-p" argument denotes the specification of the
next benchmark to launch. To get around this problem, use the
"runbind-one-command" program, and launch multiple instances with
cputrack as follows:
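
A plausible form, assuming runbind-one-command takes a single -p flag
and treats everything after the CPU number as one command (the output
file main.txt and <eventspec> are placeholders):

cputrack -o main.txt -c <eventspec> runbind-one-command -p 7 benchmark -p -p &
runbind-one-command -p 8 <other-benchmark> &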

Tips and tricks

To ensure that your results are statistically significant, you will
need to run each experiment more than once and measure the mean and
standard deviation of the measurements. The standard deviation should
be small; otherwise you cannot trust the numbers. Run each pair of
benchmarks three times. If the standard deviation is small (below 2% of
the mean), don't run any more experiments. If not, you will need to
repeat each experiment more times.
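
If you collect the per-run numbers in a text file, one value per line,
a quick way to get the mean and standard deviation is an awk one-liner
(a sketch; results.txt is a hypothetical file name):

awk '{ s += $1; ss += $1*$1; n++ } END { m = s/n; printf "mean=%g sd=%g\n", m, sqrt(ss/n - m*m) }' results.txt

(This computes the population standard deviation; with only three runs,
keep in mind that it slightly understates the sample standard
deviation.)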

To ensure that you get sound data, you will need to run
experiments while no one else is using the machine. Therefore, you will
need to reserve time on coolthreads or quad. The reservation protocol is
described here. You are asked to reserve time judiciously, as many of
your classmates will need to use the machines as well. Note that you
will only need to reserve time exclusively when you run your final
experiments. For the time used on learning how to run the benchmarks
and use cputrack, you will only need to reserve the machine in a
non-exclusive mode.

When you are learning how to run SPEC benchmarks, you can
use the machine dogwood.css.sfu.ca. This is a Solaris/SPARC
machine whose system interface is very similar to that of coolthreads.
While hardware performance counters work differently on dogwood than on
coolthreads, running SPEC benchmarks works just the same. You do
not need a reservation for dogwood. You will need an FAS account to
log on to dogwood, which you should have if you are a graduate student.
If you are an undergraduate student, you can apply for an FAS account.
You will need to fill out this
form and bring it to me for signature.

As machine time may become scarce, you are encouraged to
start the assignment early!

What to submit

You must prepare a well-written report of your analysis. Please spell
check your report before submitting! Pick your favourite paper that we
read so far, and model your report after the experimental section in
that paper. Your report must not exceed 5 pages in 10-point Times New
Roman font with 1-inch margins (so make your figures small and pretty
-- but not too small, so they are still readable). I will deduct points
if your report does not comply with the formatting specifications. Do
not play with line spacing or other formatting tricks to fit more text.

You must perform the analysis of the interference between the
benchmarks in any one type of cache (either L1 I-cache, L1 D-cache, or
L2 cache).

Discuss the following in the report:

The goal of the study

Experimental platform

Benchmarks

Methodology for running the experiments (i.e., how you
set up the experiment to measure what you want to measure, how you
ensured that your results are statistically significant)

Graphs and charts showing the results

Analysis of the results, including discussion of any
anomalies in the data