CS 267: Lecture 6, Feb 1, 1996

Shared Memory Programming with Multithreading

Multithreading is a programming style suitable for shared-memory
MIMD machines, like the SGI Power Challenge or the Sun SPARCcenter 2000.
These machines have a single address space, so that two processors loading
location 37, say, from memory are both guaranteed to load the same value.
Initially, a program consists of a single "thread" of control, as the main
routine begins. The user controls parallelism by creating other "threads" of
control, which are told to execute a subroutine of the user's choice. These
threads can be thought of as (UNIX) processes, which are executed by the
available parallel processors. If the user creates more threads than there
are physical processors, the threads are scheduled for execution much as
UNIX schedules multiple independent processes. Threads share the same
address space, and so share the same code and most of the variables of the
program. They may also synchronize with and wait for one another, so they
can cooperate in parallel programs.

We begin by examining the multithreaded
solution of the first Sharks and Fish
problem, fish swimming in a current. The solutions are written in
C with calls to the Solaris multithreading library. You should start a new window
and click
here
to see the code as we discuss it.

After definitions of various macros and types, we see a set of variables (fishes through g_dsum)
which are declared outside the main routine; these are global variables that will be
visible in all routines on all processors.

Now skip down to the main{} routine. After mallocing some space (in fishes) to hold the
global array of data for NFISH fish, and some space to hold a little data for each
of the threads that will be created (in thread_ptr), a mutex variable mul_lock is
initialized by calling mutex_init. Mutex stands for mutual exclusion, and will later be
used to guarantee that at most one thread can execute a particular sequence of code
(a critical section) at a time. The second argument to mutex_init, sync_type = USYNC_PROCESS,
indicates that other processes can access this mutex variable too; the third argument is
ignored.

Similarly, barrier_init initializes a barrier variable ba. A barrier is a synchronization
point in the code that all threads need to reach before any can continue. Threads reaching the
barrier early wait until all have arrived. The second argument to barrier_init, NTHREADS, is
the number of threads that will synchronize at the barrier. The third and fourth arguments
are as above.

Now thread_ptr is initialized. It has one entry for each of the NTHREADS threads to be
created. thread_ptr[i].chunk is set to i for i=0 to NTHREADS-1. thread_ptr[i].tid is used
to store the system-assigned thread-id-number later.

The loop for (i = 1; i < NTHREADS; i++) {} actually creates the other NTHREADS-1
parallel threads (besides the main one executing main{}). This is done by the system call
to thr_create. The first two arguments describe where the stack for the new thread is to
be located and how large it is (0 indicates defaults). The third argument, move_fish,
is the name of a procedure that the newly created thread will begin executing when it
starts up; this is how parallelism is created. move_fish will be called with its argument
equal to the fourth argument of thr_create, in this case i. In other words, the
i-th thread created will be passed the value i; this will be used to divide up the work.
The fifth argument indicates when the thread will start up, and on which processor it will
run; 0 is a default. Finally, the system-assigned thread-id-number is returned in the
last argument; this is needed below for synchronization and termination purposes.

Then the main thread also begins moving fish by calling move_fish itself.
Finally, the program terminates by having the main thread call thr_join to
wait for all the other threads to return.

Now we examine the move_fish{} routine, which is right above main{}.
The argument is stored in a local variable called mychunk. Thus, thread i
has mychunk assigned to i, so the threads can divide up the work.
Mychunk is first used in all_init_fish{}, where it indicates that
thread i should initialize num_fish = NFISH/NTHREADS fish positions and
velocities starting at mychunk*num_fish. The argument fishes, which is
the array of all the fish data, is global and so visible to all threads.
The all_move_fish{} routine is similar. No synchronization is needed here;
each thread can move the fish it is assigned independently of the other threads.

The next three parallel operations, computing max_acc, max_speed and
sum_speed_sq, do require synchronization, and are performed by the routines
all_reduce_to_all_dmax and all_reduce_to_all_dadd. These routines compute
the global max (respectively sum) of their local arguments.

It suffices to
examine all_reduce_to_all_dadd{}. The first statement is barrier_wait(&ba),
which causes all threads to wait until all have reached this statement.
Then thread 0 initializes the global sum g_dsum.accum to zero and sets
g_dsum.zeroed to 1 to indicate to other threads that g_dsum.accum has
indeed been initialized. The other threads wait at the line
"while ( !g_dsum.zeroed )" for this to occur.
(myID contains the thread-id number of the calling thread.)

Then the pair of calls
mutex_lock(&mul_lock) and mutex_unlock(&mul_lock)
permits only one thread at a time to be executing the code between them (a so-called
critical section). Here the global sum is actually incremented.

If we do not use mutual exclusion, we may have a race condition as two
processors try to execute the critical section simultaneously, with the result
that g_dsum can be computed incorrectly. For example, suppose for simplicity
that we only have two threads, threads 1 and 2, that thread i wants to add dmax=i
to g_dsum, and that g_dsum has been initialized to zero. Thus, the correct result
is g_dsum = 1+2 = 3. Now suppose that threads 1 and 2 simultaneously enter the
critical section. We claim that when both threads finish, g_dsum could equal
1, 2 or 3. To see why, consider the following sequence of events:

thread 1 fetches g_dsum.accum = 0 from memory into a register
thread 2 fetches g_dsum.accum = 0 from memory into a register
thread 1 increments its register by 1
thread 2 increments its register by 2
thread 1 writes its register back to memory, setting g_dsum.accum to 1
thread 2 writes its register back to memory, setting g_dsum.accum to 2

Clearly, by reversing the last two events, g_dsum.accum could have been set to 1 as well.
The use of critical sections protected by mutex locks prevents these sorts of
bugs, which are otherwise nondeterministic and hard to find.

There are actually two "levels" of threads available in SunOS
(Sun Operating System); this is not true of all thread systems.
The threads used above, which are the only
ones the user has to know about, are scheduled entirely at the user level,
so that the OS kernel does not have to get involved. This means they are
relatively inexpensive to create, start, stop and synchronize, since anything
involving the OS kernel is more expensive.
The second level of threads are called LWPs, or "light weight processes".
These are known to the kernel, and are consequently more expensive to
create and use (despite the name). SunOS supports both because some applications
are more effectively
programmed with one than the other. For our purposes we will
only consider the case where there is one thread per LWP, and one LWP per
physical processor, though more general situations are possible.

It is important to remember that a thread shares all the
instructions of the program that created it, and all the
data visible in its scope at the point of creation.
Once created, a thread gets its own ID (called tid), and its own
registers and stack for local variables; this allows it to execute
independently of the program that created it, but also to share data.

The first argument
*stack_addr indicates where the thread's stack is to be based, and
stack_size indicates how large it may grow; defaults are
available for both. When the thread starts running, it
calls func(arg), which may be any routine the user wants.
When func(arg) returns, the thread terminates. The flags
describe how the thread will execute. For example,
the thread may be suspended on creation, until a
later thr_continue() call starts it.
The thread may also either be allowed to "float", i.e.
be executed by any available physical processor, or be tied
down to execute on one of them; we will typically use the
latter, creating only as many threads as physical
processors, since it offers more control over parallelism.
Finally, thr_create returns a pointer to a thread identifier
uniquely identifying the thread it creates; this can be used
for synchronization purposes by other routines.

The thr_join call allows one thread to synchronize with
another, by waiting for the thread specified by
thread_id to exit:

Threads can synchronize using mutual exclusion (mutex),
condition variables, semaphores, or reader/writers locks,
all of which are traditional OS constructs.
We have already seen the use of mutex variables
in the
Sharks and Fish code.
Here we illustrate the use of condition variables,
which are used to wait until a particular condition is true.
For example, consider the code sequence

If the condition is false, cv_wait will be called, and the thread will release
the lock and suspend until reawakened by a signal. The signal is
generated by another thread calling either cv_signal or cv_broadcast (to wake
up all threads). If, when reawakened, another thread has entered the critical
section, the awakened thread must wait to reacquire the lock.