Overview

This is an introductory graduate course covering several aspects of
parallel and high-performance computing.
Upon completion, you should

be able to design and analyze parallel algorithms for a variety of
problems and computational models,

be familiar with the hardware and software organization of
high-performance parallel computing systems, and

have experience with the implementation of parallel applications
on high-performance computing systems, and be able to measure,
tune, and report on their performance.

Additional information including the course syllabus can be found in the
course overview.

All parallel programming models discussed in this class are supported on
Bass or Phaedra, both of which are available for use in this class.

Announcements

The midterm will be held in class Tue Oct 18.

The scope of the exam is material in lectures 1-12.

You have 75 minutes to complete the exam.

You may consult all course notes and other course materials during the exam.
You may use an electronic device to access these materials or as a calculator,
but the device cannot be used to access other materials or to communicate in any fashion.

(For Tue Sep 13) Look through
OpenMP Tutorial sections 1-6. Most examples are
shown in C/C++ and in Fortran. We will be using C/C++.
Ignore WORKSHARE and TASK directives, and discussion of
nested parallel constructs.

(Nov 15)
Written Assignment WA3 is available.
Due date is Tue Dec 6 at start of class. Sample solutions are available for WA3.

Platforms and Programming Models

Platforms

The Bass system supports the OpenMP and MPI programming
models.
The general instructions for
getting started on bass
are supplemented below with specific instructions for each programming model.
When you log in to bass.cs.unc.edu you are connected to a specific node on Bass dedicated
to interactive program development. You can compile programs on this node.
Shared-memory programs run within an individual node on Bass; distributed-memory programs run
across multiple nodes. The login node should not be used to run your programs,
although a short debug run lasting a few seconds and using no more than 4 cores is acceptable.
In general, programs that need multiple nodes, dedicated nodes, or GPUs should be
submitted to queues managed by the Grid Engine job scheduler.

The Phaedra system is a compute server supporting OpenMP, Cilk, and Xeon Phi accelerator
programming models.

gamma-x51-1 is a server supporting the CUDA accelerator programming model.

OpenMP

To get accurate performance information, run your programs
on a dedicated node, either as a batch job with a shell script myjob via qsub -pe smp 16 myjob,
or interactively via qlogin -pe smp 16.
Do not park yourself on this node, as everyone else in the class will be held up.

Phaedra is a 20-core Xeon E5-2650 server with eight attached
Intel Xeon Phi 5110P accelerators.
The server hosts the Intel Parallel Studio XE 2017
compilers and performance analysis tools to access the accelerators.
Students in COMP 633 have a login on phaedra, and OpenMP programs run directly
on the server.

MPI

Set up your environment on Bass
as directed here. Select the openmpi-x86_64 MPI implementation
(this is not in the list shown in the instructions).
Note that the instructions presuppose that
you are running the bash shell. If your default shell is not bash, make sure you get into a bash shell
by executing "bash -l"; if you do not, the next step will fail.

You can compile your program on the Bass front end as follows (use mpiCC for C++ programs, and
remember to #include mpi.h in your programs):

mpicc -o myprog myprog.c

To execute your program you need to submit it to the Grid Engine for scheduling. Prepare
a shell script that starts your program with the appropriate command-line arguments
and includes directives to the scheduler for the resources required.

Here is an example job submission script
runprog.sh. See the instructions in lecture 18 for additional details.
Note that if myprog above requires an argument, you need to include the argument in
the script after $WDIR/myprog.

The shell script specifies (through directives in the comments)
that the script should be run in a bash shell, that all output should be placed
in the current working directory, and that the job has a run time limit of 5 minutes.
The subsequent shell commands specify the invocation of myprog
on each processor assigned to the job.
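The actual script for the course is the runprog.sh linked above; the fragment below only sketches what such a Grid Engine script typically looks like, and the exact directive set and variable names on Bass may differ:

```shell
#!/bin/bash
#$ -S /bin/bash        # run the script under the bash shell
#$ -cwd                # place output files in the current working directory
#$ -l h_rt=00:05:00    # 5-minute run time limit
# WDIR is illustrative; NSLOTS is set by Grid Engine to the number of
# slots granted by the -pe request.
WDIR=$(pwd)
mpirun -np $NSLOTS $WDIR/myprog
```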

To submit a job to run using 8 processors, use: qsub -pe MPI 8 runprog.sh

To make a larger scaling run, use qsub -pe MPI 32 runprog.sh (successive MPI processes fill one node before being placed on the next)
or qsub -pe rrMPI 32 runprog.sh (successive MPI processes are placed on different nodes according to availability).
Note that your job and your MPI processes may end up sharing nodes with other requests,
although this is not very likely given the small number of users of Bass at the moment.

Once the job is submitted, you can check its status using qstat.

It can take several minutes longer than the expected runtime of your job for it to complete
and for the output files to show up in your directory.

Bibliography

This list will evolve throughout the semester. Specific reading
assignments are listed above.