8.1.4 Multiprocessor Scheduling

On a uniprocessor, scheduling is one dimensional. The only question that must
be answered (repeatedly) is: "Which process should be run
next?" On a multiprocessor, scheduling is two dimensional. The
scheduler has to decide which process to run and which CPU to run it on. This
extra dimension greatly complicates scheduling on multiprocessors.

Another complicating factor is that in some systems, all the processes are
unrelated whereas in others they come in groups. An example of the former
situation is a timesharing system in which independent users start up
independent processes. The processes are unrelated and each one can be scheduled
without regard to the other ones.

An example of the latter situation occurs regularly in program development
environments. Large systems often consist of some number of header files
containing macros, type definitions, and variable declarations that are used by
the actual code files. When a header file is changed, all the code files that
include it must be recompiled. The program make is commonly used to
manage development. When make is invoked, it starts the compilation of
only those code files that must be recompiled on account of changes to the
header or code files. Object files that are still valid are not regenerated.

The original version of make did its work sequentially, but newer
versions designed for multiprocessors can start up all the compilations at once.
If 10 compilations are needed, it does not make sense to schedule 9 of them
quickly and leave the last one until much later since the user will not perceive
the work as completed until the last one finishes. In this case it makes sense
to regard the processes as a group and to take that into account when scheduling
them.

Timesharing

Let us first address the case of scheduling independent processes; later we
will consider how to schedule related processes. The simplest scheduling
algorithm for dealing with unrelated processes (or threads) is to have a single
systemwide data structure for ready processes, possibly just a list, but more
likely a set of lists for processes at different priorities as depicted in Fig.
8-11(a). Here the 16 CPUs are all currently busy, and a prioritized set of 14
processes is waiting to run. The first CPU to finish its current work (or have
its process block) is CPU 4, which then locks the scheduling queues and selects
the highest priority process, A, as shown in Fig. 8-11(b). Next, CPU 12
goes idle and chooses process B, as illustrated in Fig. 8-11(c). As long
as the processes are completely unrelated, doing scheduling this way is a
reasonable choice.

Figure 8-11 Using a single data structure for scheduling a multiprocessor.

Having a single scheduling data structure used by all CPUs timeshares the
CPUs, much as they would be in a uniprocessor system. It also provides automatic
load balancing because it can never happen that one CPU is idle while others are
overloaded. Two disadvantages of this approach are the potential contention for
the scheduling data structure as the number of CPUs grows and the usual
overhead in doing a context switch when a process blocks for I/O.
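To make the mechanism concrete, the following is a minimal sketch in C of such a single systemwide ready structure: an array of priority lists protected by one lock, which an idle CPU takes in order to dequeue the highest-priority process. The data structure, the names, and the use of a pthreads mutex are illustrative assumptions, not code from any real kernel.

#include <pthread.h>
#include <stddef.h>

#define NUM_PRIORITIES 32

struct process {
    int pid;
    struct process *next;                 /* link in its priority list */
};

static struct process *ready[NUM_PRIORITIES];   /* index 0 = highest priority */
static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by any CPU that goes idle: lock the queues, take the
   highest-priority ready process, unlock. */
struct process *pick_next(void)
{
    struct process *p = NULL;

    pthread_mutex_lock(&sched_lock);      /* the systemwide contention point */
    for (int prio = 0; prio < NUM_PRIORITIES; prio++) {
        if (ready[prio] != NULL) {
            p = ready[prio];
            ready[prio] = p->next;        /* dequeue it */
            break;
        }
    }
    pthread_mutex_unlock(&sched_lock);
    return p;                             /* NULL means nothing is ready */
}

Because every CPU must acquire sched_lock on every scheduling decision, this single lock is precisely where the contention mentioned above shows up as the number of CPUs grows.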

It is also possible that a context switch happens when a process'
quantum expires. On a multiprocessor, that has certain properties not present on
a uniprocessor. Suppose that the process holds a spin lock, not unusual on
multiprocessors, as discussed above. Other CPUs waiting on the spin lock just
waste their time spinning until that process is scheduled again and releases the
lock. On a uniprocessor, spin locks are rarely used, so if a process is
suspended while it holds a mutex and another process starts and tries to
acquire the mutex, it will be blocked immediately, and little time is wasted.

To get around this anomaly, some systems use smart scheduling, in
which a process acquiring a spin lock sets a process-wide flag to show that it
currently has a spin lock (Zahorjan et al., 1991). When it releases the lock, it
clears the flag. The scheduler then does not stop a process holding a spin lock,
but instead gives it a little more time to complete its critical region and
release the lock.
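A sketch of how this might look in C, using C11 atomics, is given below. The counter spinlocks_held (a counter rather than a flag, so that nested locks also work) and the test the scheduler applies are invented for illustration and are not taken from Zahorjan et al. (1991).

#include <stdatomic.h>
#include <stdbool.h>

struct process {
    atomic_int spinlocks_held;     /* processwide "I hold a spin lock" count */
    /* ... other scheduler state ... */
};

void spin_acquire(struct process *self, atomic_flag *lock)
{
    atomic_fetch_add(&self->spinlocks_held, 1);   /* raise flag first */
    while (atomic_flag_test_and_set(lock))
        ;                                         /* spin until free */
}

void spin_release(struct process *self, atomic_flag *lock)
{
    atomic_flag_clear(lock);
    atomic_fetch_sub(&self->spinlocks_held, 1);   /* lower flag after */
}

/* Scheduler side: on quantum expiry, a lock holder gets a short grace
   period to finish its critical region instead of being descheduled. */
bool may_preempt(const struct process *p)
{
    return atomic_load(&p->spinlocks_held) == 0;
}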

Another issue that plays a role in scheduling is the fact that while all CPUs
are equal, some CPUs are more equal. In particular, when process A has
run for a long time on CPU k, CPU k's cache will be full of
A's blocks. If A gets to run again soon, it may perform
better if it is run on CPU k, because k's cache may still contain some of A's blocks. Having cache blocks preloaded will
increase the cache hit rate and thus the process' speed. In addition, the
TLB may also contain the right pages, reducing TLB faults.

Some multiprocessors take this effect into account and use what is called
affinity scheduling (Vaswani and Zahorjan, 1991). The basic idea here is
to make a serious effort to have a process run on the same CPU it ran on last
time. One way to create this affinity is to use a two-level scheduling
algorithm. When a process is created, it is assigned to a CPU, for example
based on which one has the smallest load at that moment. This assignment of
processes to CPUs is the top level of the algorithm. As a result, each CPU
acquires its own collection of processes.

The actual scheduling of the processes is the bottom level of the algorithm.
It is done by each CPU separately, using priorities or some other means. By
trying to keep a process on the same CPU, cache affinity is maximized. However,
if a CPU has no processes to run, it takes one from another CPU rather than go
idle.

Two-level scheduling has three benefits. First, it distributes the load
roughly evenly over the available CPUs. Second, advantage is taken of cache
affinity where possible. Third, by giving each CPU its own ready list,
contention for the ready lists is minimized because attempts to use another
CPU's ready list are relatively infrequent.
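The following C sketch puts the two levels together: per-CPU ready lists with a top-level placement rule (least-loaded CPU) and a bottom level in which a CPU prefers its own list and steals from a neighbor only when that list is empty. Everything here, from the structure layout to the round-robin stealing order, is an illustrative assumption rather than the design of any particular system.

#include <pthread.h>
#include <stddef.h>

#define NCPU 16

struct process { int pid; struct process *next; };

struct cpu_queue {
    pthread_mutex_t lock;          /* per-CPU lock: little global contention */
    struct process *head;
    int load;                      /* number of queued processes */
} cpus[NCPU];

void queues_init(void)
{
    for (int i = 0; i < NCPU; i++)
        pthread_mutex_init(&cpus[i].lock, NULL);
}

/* Top level: place a newly created process on the least-loaded CPU. */
void assign_new(struct process *p)
{
    int best = 0;
    for (int i = 1; i < NCPU; i++)
        if (cpus[i].load < cpus[best].load)
            best = i;
    pthread_mutex_lock(&cpus[best].lock);
    p->next = cpus[best].head;
    cpus[best].head = p;
    cpus[best].load++;
    pthread_mutex_unlock(&cpus[best].lock);
}

/* Bottom level: CPU me takes from its own list, stealing only when empty. */
struct process *pick_next(int me)
{
    for (int k = 0; k < NCPU; k++) {
        int i = (me + k) % NCPU;             /* k == 0 is our own queue */
        pthread_mutex_lock(&cpus[i].lock);
        struct process *p = cpus[i].head;
        if (p != NULL) {
            cpus[i].head = p->next;
            cpus[i].load--;
        }
        pthread_mutex_unlock(&cpus[i].lock);
        if (p != NULL)
            return p;                        /* cache-warm when i == me */
    }
    return NULL;                             /* nothing runnable anywhere */
}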

Space Sharing

The other general approach to multiprocessor scheduling can be used when
processes are related to one another in some way. Earlier we mentioned the
example of parallel make as one case. It also often occurs that a single
process creates multiple threads that work together. For our purposes, a job
consisting of multiple related processes and a process consisting of multiple
kernel threads are essentially the same thing. We will refer to the schedulable
entities as threads here, but the material holds for processes as well.
Scheduling multiple threads at the same time across multiple CPUs is called
space sharing.

The simplest space sharing algorithm works like this. Assume that an entire
group of related threads is created at once. At the time it is created, the
scheduler checks to see if there are as many free CPUs as there are threads. If
there are, each thread is given its own dedicated (i.e., nonmultiprogrammed) CPU
and they all start. If there are not enough CPUs, none of the threads are
started until enough CPUs are available. Each thread holds onto its CPU until it
terminates, at which time the CPU is put back into the pool of available CPUs.
If a thread blocks on I/O, it continues to hold the CPU, which is simply idle
until the thread wakes up. When the next batch of threads appears, the same
algorithm is applied.
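A compact sketch of this all-or-nothing policy in C, using a counting pool guarded by a condition variable, appears below. The names acquire_partition and release_cpu are hypothetical, and a real scheduler would also have to record which CPUs went to which gang.

#include <pthread.h>

#define NCPU 32

static int free_cpus = NCPU;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pool_cond = PTHREAD_COND_INITIALIZER;

/* Called when a group of nthreads related threads is created: block
   until the whole group can be given one dedicated CPU apiece. */
void acquire_partition(int nthreads)
{
    pthread_mutex_lock(&pool_lock);
    while (free_cpus < nthreads)             /* all or nothing */
        pthread_cond_wait(&pool_cond, &pool_lock);
    free_cpus -= nthreads;
    pthread_mutex_unlock(&pool_lock);
}

/* Called once per thread as it terminates: its CPU rejoins the pool.
   A thread that merely blocks on I/O keeps its CPU, which sits idle. */
void release_cpu(void)
{
    pthread_mutex_lock(&pool_lock);
    free_cpus++;
    pthread_cond_broadcast(&pool_cond);      /* recheck waiting gangs */
    pthread_mutex_unlock(&pool_lock);
}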

At any instant of time, the set of CPUs is statically partitioned into some
number of partitions, each one running the threads of one process. In Fig. 8-12,
we have partitions of sizes 4, 6, 8, and 12 CPUs, with 2 CPUs unassigned, for
example. As time goes on, the number and size of the partitions will change as
processes come and go.

Figure 8-12 A set of 32 CPUs split into four partitions, with two CPUs
available.

Periodically, scheduling decisions have to be made. In uniprocessor systems,
shortest job first is a well-known algorithm for batch scheduling. The analogous
algorithm for a multiprocessor is to choose the process needing the smallest
number of CPU cycles, that is, the process whose CPU-count × run-time product is
the smallest of the candidates. However, in practice, this information is rarely
available, so the algorithm is hard to carry out. In fact, studies have shown
that, in practice, beating first-come, first-served is hard to do (Krueger et
al., 1994).
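If those estimates were somehow available, the selection rule itself would be simple, as this sketch shows; the job fields cpus_needed and est_runtime are hypothetical stand-ins for information no real system usually has.

struct job {
    int    cpus_needed;        /* CPUs the job asks for */
    double est_runtime;        /* estimated run time in seconds */
};

/* Return the index of the waiting job with the smallest
   CPU-count × run-time product, the multiprocessor analog of
   shortest job first. */
int pick_smallest(const struct job *jobs, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (jobs[i].cpus_needed * jobs[i].est_runtime <
            jobs[best].cpus_needed * jobs[best].est_runtime)
            best = i;
    return best;
}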

In this simple partitioning model, a process just asks for some number of
CPUs and either gets them all or has to wait until they are available. A
different approach is for processes to actively manage the degree of
parallelism. One way to manage the parallelism is to have a central server
that keeps track of which processes are running and want to run and what their
minimum and maximum CPU requirements are (Tucker and Gupta, 1989). Periodically,
each application polls the central server to ask how many CPUs it may use. It then
adjusts the number of processes or threads up or down to match what is
available. For example, a Web server can have 1, 2, 5, 10, 20, or any other
number of threads running in parallel. If it currently has 10 threads and there
is suddenly more demand for CPUs and it is told to drop to 5, when the next 5
threads finish their current work, they are told to exit instead of being given
new work. This scheme allows the partition sizes to vary dynamically to match
the current workload better than the fixed system of Fig. 8-12.
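One plausible way to structure the client side of this scheme is sketched below in C: a manager thread periodically polls the server, and each worker checks between work items whether it should exit. The function cpu_server_poll stands in for whatever protocol the real server uses (the one here just returns a fixed number), and the shrink-by-exiting logic follows the Web-server example above.

#include <stdatomic.h>

/* Hypothetical stand-in for asking the central server for our allotment. */
static int cpu_server_poll(void) { return 5; }

static atomic_int allowed = 10;    /* CPUs we may currently use */
static atomic_int running = 10;    /* worker threads now alive */

/* Called by a manager thread every few seconds. */
void refresh_allotment(void)
{
    atomic_store(&allowed, cpu_server_poll());
}

/* Each worker calls this between work items; 0 means "exit now". */
int worker_should_continue(void)
{
    if (atomic_load(&running) > atomic_load(&allowed)) {
        atomic_fetch_sub(&running, 1);    /* shrink toward the allotment */
        return 0;                         /* told to exit, not given new work */
    }
    return 1;
}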

Gang Scheduling

A clear advantage of space sharing is the elimination of multiprogramming,
which eliminates the context switching overhead. However, an equally clear
disadvantage is the time wasted when a CPU blocks and has nothing at all to do
until it becomes ready again. Consequently, people have looked for algorithms
that attempt to schedule in both time and space together, especially for
processes that create multiple threads, which usually need to communicate with
one another.

To see the kind of problem that can occur when the threads of a process (or
processes of a job) are independently scheduled, consider a system with threads
A0 and A1 belonging to process A and threads B0 and B1 belonging to process B.
Threads A0 and B0 are timeshared on CPU 0; threads A1 and B1 are timeshared on
CPU 1. Threads A0 and A1 need to communicate often. The communication pattern
is that A0 sends A1 a message, with A1 then sending back a reply to A0,
followed by another such sequence. Suppose that luck has it that A0 and B1
start first, as shown in Fig. 8-13.

Figure 8-13 Communication between two threads belonging to process A that
are running out of phase.

In time slice 0, A0 sends A1 a request, but A1 does not get it until it runs
in time slice 1 starting at 100 msec. It sends the reply immediately, but A0
does not get the reply until it runs again at 200 msec. The net result is one
request-reply sequence every 200 msec. Not very good.

The solution to this problem is gang scheduling, which is an outgrowth
of co-scheduling (Ousterhout, 1982). Gang scheduling has three parts:

1. Groups of related threads are scheduled as a unit, a gang.

2. All members of a gang run simultaneously, on different timeshared CPUs.

3. All gang members start and end their time slices together.

The trick that makes gang scheduling work is that all CPUs are scheduled
synchronously. This means that time is divided into discrete quanta as we had in
Fig. 8-13. At the start of each new quantum, all the CPUs are
rescheduled, with a new thread being started on each one. At the start of the
following quantum, another scheduling event happens. In between, no scheduling
is done. If a thread blocks, its CPU stays idle until the end of the quantum.

An example of how gang scheduling works is given in Fig. 8-14. Here we have a
multiprocessor with six CPUs being used by five processes, A through E, with a
total of 24 ready threads. During time slot 0, threads A0 through A5 are
scheduled and run. During time slot 1, threads B0, B1, B2, C0, C1, and C2 are
scheduled and run. During time slot 2, D's five threads and E0 get to run. The
remaining six threads belonging to process E run in time slot 3. Then the
cycle repeats, with slot 4 being the same as slot 0 and so on.
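Since gang scheduling is really just a fixed table of (time slot, CPU) thread assignments that all CPUs step through in lockstep, it can be illustrated with a few lines of C. The table below reproduces the slot assignments just described for Fig. 8-14; the program merely replays the schedule.

#include <stdio.h>

#define NCPU   6
#define NSLOTS 4

/* gang_table[slot][cpu] names the thread run on that CPU in that slot,
   mirroring the Fig. 8-14 example. */
static const char *gang_table[NSLOTS][NCPU] = {
    { "A0", "A1", "A2", "A3", "A4", "A5" },
    { "B0", "B1", "B2", "C0", "C1", "C2" },
    { "D0", "D1", "D2", "D3", "D4", "E0" },
    { "E1", "E2", "E3", "E4", "E5", "E6" },
};

int main(void)
{
    /* All CPUs switch together at each quantum boundary; a thread that
       blocks mid-quantum would leave its CPU idle until the boundary. */
    for (int q = 0; q < 8; q++) {          /* 8 quanta: the cycle repeats */
        int slot = q % NSLOTS;
        printf("quantum %d:", q);
        for (int cpu = 0; cpu < NCPU; cpu++)
            printf("  CPU%d=%s", cpu, gang_table[slot][cpu]);
        printf("\n");
    }
    return 0;
}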

The idea of gang scheduling is to have all the threads of a process run
together, so that if one of them sends a request to another one, it will get the
message almost immediately and be able to reply almost immediately. In Fig.
8-14, since all the A threads are running together, they may send and receive
a very large number of messages during one quantum, thus eliminating the
problem of Fig. 8-13.