For this first assignment, we will have a little programming
contest. Although the assignment isn't about "parallelism" per se,
it involves some of the most crucial issues in high-performance
algorithm design, namely dealing with the memory hierarchy.

We've all coded matrix multiply at least once in our lives (and soon
at least twice). Here is the simplest algorithm to compute C = C + A*B,
where all matrices are n-by-n:

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    end for
  end for
end for
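In C (with 0-based indexing and, here, row-major storage), the same naive algorithm might look like the sketch below; the function name and the storage convention are our own illustrative choices, not requirements of the assignment:

```c
#include <stddef.h>

/* Naive C = C + A*B for n-by-n matrices stored in row-major order.
   This is the unblocked triple loop: it touches memory with no regard
   for the cache, which is exactly the problem this assignment is about. */
void naive_matmul(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```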

Unfortunately, this algorithm is oblivious to the computer's memory hierarchy,
and optimizing compilers provide limited solace. Consider the
IBM RS6000/590
which has a peak speed of 266 Mflops. Running the above
code in Fortran, compiled using the f77 compiler without optimization,
yields the rather low speed of 4.5 Mflops when multiplying 64-by-64 matrices.
Using f77 -O2 (optimization level 2) raises the speed to 65 Mflops, and
f77 -O3 -qhot further raises the speed to 92 Mflops, still a far cry from
IBM's own proprietary
ESSL library
routine, which runs at about 240 Mflops (ESSL is written entirely in Fortran).

If you didn't want to pay for ESSL, you might look around on the WWW in the
Netlib Repository of public-domain
software, and discover a public-domain matrix multiply optimized for the RS6000
(dmr).
Using f77 -O2 on this routine yields 152 Mflops, and using f77 -O3 -qhot
yields 186 Mflops, still a ways from ESSL's 240 Mflops. Not only that, since
the careful programmer always tests to see if the answer is correct, he or she
will soon discover that the answer using dmr with f77 -O3 -qhot is completely
wrong for n>64 matrices (well, the compiler did issue a warning message about
possibly changing the semantics of the program).

For this assignment, we will be coding an optimized
memory-hierarchy-cognizant matrix-matrix multiply routine in C.
For simplicity, we will only require square
matrices (worrying about the non-square case, essential for a good
library code, can be a bit time-consuming). The goal of the assignment
is to:

1. write a matrix multiply routine C = C + A*B for square matrices,

2. get as close to peak performance as possible, while
   still getting the right answer, and

3. get to know each other.
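One standard memory-hierarchy technique a team might start from is blocking (tiling): work on sub-blocks small enough to stay in cache, so each matrix element is reused many times before being evicted. Here is a minimal sketch, assuming row-major storage and a made-up tile size that you would tune for the target machine:

```c
#include <stddef.h>

#define BLOCK 32  /* illustrative tile size; tune for the target cache */

/* Blocked C = C + A*B for n-by-n row-major matrices. The outer loops
   walk over BLOCK-by-BLOCK tiles; the inner loops update one tile of C
   while the corresponding tiles of A and B are still cache-resident.
   The min() tests handle n not divisible by BLOCK. */
void blocked_matmul(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
                size_t k_end = (kk + BLOCK < n) ? kk + BLOCK : n;
                size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i*n + k];  /* reuse across the j loop */
                        for (size_t j = jj; j < j_end; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
            }
}
```

This is only a starting point; a competitive routine would also consider loop order, register blocking, and the machine's cache and TLB sizes.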

For extra credit, you may try to optimize your routine for an
architecture other than the IBM RS6000, or implement one more BLAS3 routine.

We will be using an IBM RS6000/590, accounts to be made available
soon. Note that the IBM compiler supports a wide variety of compiler
optimizations
which you should read about
(see
http://www.austin.ibm.com/tech/options.html, or type 'xlc'
once you are logged on to the RS6000).

To judge the relative qualities of your routines, we will have a
contest with a non-negligible 1st prize and a ..., well, let's just
say a 2nd prize. We will work in mixed teams of three people. Each team
will create a single routine and submit it to the matrix
appropriations committee (MAC) consisting of the Professor and
TA. We'll all gather on the Tuesday following the due date during the 6:00PM
time-slot, with each team making a short 2-minute presentation about their
routine and why it should run fast. Then we will time the routine on a
dedicated RS6000, as well as make sure you get the right answer.
The test program and method for measuring performance will be posted later.
The winning team will then be
applauded gratuitously, and presented with the non-negligible prize.
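Until the official test program appears, you can get rough numbers yourself. A simple way to time a routine and convert to Mflops, counting 2*n^3 floating-point operations for C = C + A*B, is sketched below; this is only an illustration, not the official measurement method:

```c
#include <stdlib.h>
#include <time.h>

/* Any routine with the signature (n, A, B, C) computing C = C + A*B. */
typedef void (*matmul_fn)(size_t n, const double *A,
                          const double *B, double *C);

/* Rough timing sketch: run `matmul` once on a random n-by-n problem
   and return the speed in Mflops, counting 2*n^3 flops. A real harness
   would repeat the run and use a higher-resolution timer. */
double time_mflops(matmul_fn matmul, size_t n)
{
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = malloc(n * n * sizeof *C);
    for (size_t i = 0; i < n * n; i++) {
        A[i] = (double)rand() / RAND_MAX;
        B[i] = (double)rand() / RAND_MAX;
        C[i] = 0.0;
    }
    clock_t start = clock();
    matmul(n, A, B, C);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    free(A); free(B); free(C);
    return (2.0 * n * n * n / secs) / 1e6;  /* Mflops */
}
```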

You are welcome to discuss your optimization strategies with other
teams. You should, however, keep in mind that this is a competition --
any hints might distort the contest results.