This chapter is from the book

This chapter is from the book

Chapter 3: Scheduling

The scheduler is the component of the kernel that selects which process to
run next. The scheduler (or process scheduler, as it is sometimes called) can
be viewed as the code that divides the finite resource of processor time between
the runnable processes on a system. The scheduler is the basis of a multitasking
operating system such as Linux. By deciding what process can run, the scheduler
is responsible for best utilizing the system and giving the impression that
multiple processes are simultaneously executing.

The idea behind the scheduler is simple. To best utilize processor time, assuming
there are runnable processes, a process should always be running. If there are
more processes than processors in a system, some processes will not always be
running. These processes are waiting to run. Deciding what process runs next,
given a set of runnable processes, is a fundamental decision the scheduler must
make.

Multitasking operating systems come in two flavors: cooperative multitasking
and preemptive multitasking. Linux, like all Unix variants and most modern operating
systems, provides preemptive multitasking. In preemptive multitasking, the scheduler
decides when a process is to cease running and a new process is to resume running.
The act of involuntarily suspending a running process is called preemption.
The time a process runs before it is preempted is predetermined, and is called
the timeslice of the process. The timeslice, in effect, gives each process a
slice of the processor's time. Managing the timeslice enables the scheduler
to make global scheduling decisions for the system. It also prevents any one
process from monopolizing the system. As we will see, this timeslice is dynamically
calculated in the Linux scheduler to provide some interesting benefits.

Conversely, in cooperative multitasking, a process does not stop running until
it voluntary decides to do so. The act of a process voluntarily suspending itself
is called yielding. The shortcomings of this approach are numerous: The scheduler
cannot make global decisions regarding how long processes run, processes can
monopolize the processor for longer than the user desires, and a hung process
that never yields can potentially bring down the entire system. Thankfully,
most operating systems designed in the last decade have provided preemptive
multitasking, with Mac OS 9 and earlier being the most notable exceptions. Of
course, Unix has been preemptively multitasked since the beginning.

During the 2.5 kernel series, the Linux kernel received a scheduler overhaul.
A new scheduler, commonly called the O(1) scheduler because of its algorithmic
behavior1, solved the shortcomings of the previous Linux scheduler
and introduced powerful new features and performance characteristics. In this
section, we will discuss the fundamentals of scheduler design and how they apply
to the new O(1) scheduler and its goals, design, implementation, algorithms,
and related system calls.

Policy

Policy is the behavior of the scheduler that determines what runs when. A
scheduler's policy often determines the overall feel of a system and is
responsible for optimally utilizing processor time. Therefore, it is very
important.

I/O-Bound Versus Processor-Bound Processes

Processes can be classified as either I/O-bound or processor-bound. The
former is characterized as a process that spends much of its time submitting and
waiting on I/O requests. Consequently, such a process is often runnable, but
only for short periods, because it will eventually block waiting on more I/O
(this is any type of I/O, such as keyboard activity, and not just disk I/O).
Conversely, processor-bound processes spend much of their time executing code.
They tend to run until they are preempted because they do not block on I/O
requests very often. Because they are not I/O-driven, however, system response
does not dictate that the scheduler run them often. The scheduler policy for
processor-bound processes, therefore, tends to run such processes less
frequently but for longer periods. Of course, these classifications are not
mutually exclusive. The scheduler policy in Unix variants tends to explicitly
favor I/O-bound processes.

The scheduling policy in a system must attempt to satisfy two conflicting
goals: fast process response time (low latency) and high process throughput. To
satisfy these requirements, schedulers often employ complex algorithms to
determine the most worthwhile process to run, while not compromising fairness to
other, lower priority, processes. Favoring I/O-bound processes provides improved
process response time, because interactive processes are I/O-bound. Linux, to
provide good interactive response, optimizes for process response (low latency),
thus favoring I/O-bound processes over processor-bound processors. As you will
see, this is done in a way that does not neglect processor-bound processes.

Process Priority

A common type of scheduling algorithm is priority-based scheduling. The idea
is to rank processes based on their worth and need for processor time. Processes
with a higher priority will run before those with a lower priority, while
processes with the same priority are scheduled round-robin (one after the next,
repeating). On some systems, Linux included, processes with a higher priority
also receive a longer timeslice. The runnable process with timeslice remaining
and the highest priority always runs. Both the user and the system may set a
processes priority to influence the scheduling behavior of the system.

Linux builds on this idea and provides dynamic priority-based scheduling.
This concept begins with the initial base priority, and then enables the
scheduler to increase or decrease the priority dynamically to fulfill scheduling
objectives. For example, a process that is spending more time waiting on I/O
than running is clearly I/O bound. Under Linux, it receives an elevated dynamic
priority. As a counterexample, a process that continually uses up its entire
timeslice is processor boundit would receive a lowered dynamic
priority.

The Linux kernel implements two separate priority ranges. The first is the
nice value, a number from 20 to 19 with a default of zero. Larger nice
values correspond to a lower priorityyou are being nice to the other
processes on the system. Processes with a lower nice value (higher priority) run
before processes with a higher nice value (lower priority). The nice value also
helps determine how long a timeslice the process receives. A process with a nice
value of 20 receives the maximum timeslice, whereas a process with a nice
value of 19 receives the minimum timeslice. Nice values are the standard
priority range used in all Unix systems.

The second range is the real-time priority, which will be discussed later. By
default, it ranges from zero to 99. All real-time processes are at a higher
priority than normal processes. Linux implements real-time priorities in
accordance with POSIX. Most modern Unix systems implement a similar scheme.

1O(1) is an example of big-o notation. Basically, it
means the scheduler can do its thing in constant time, regardless of the size of
the input. A full explanation of big-o notation is in Appendix D, for the
curious.

Timeslice

The timeslice2 is the numeric value that represents how long a
task can run until it is preempted. The scheduler policy must dictate a default
timeslice, which is not simple. A timeslice that is too long will cause the
system to have poor interactive performance; the system will no longer feel as
if applications are being concurrently executed. A timeslice that is too short
will cause significant amounts of processor time to be wasted on the overhead of
switching processes, as a significant percentage of the system's time will
be spent switching from one process with a short timeslice to the next.
Furthermore, the conflicting goals of I/O-bound versus processor-bound processes
again arise; I/O-bound processes do not need longer timeslices, whereas
processor-bound processes crave long timeslices (to keep their caches hot, for
example).

With this argument, it would seem that any long timeslice would result in
poor interactive performance. In many operating systems, this observation is
taken to heart, and the default timeslice is rather lowfor example, 20ms.
Linux, however, takes advantage of the fact that the highest priority process
always runs. The Linux scheduler bumps the priority of interactive tasks,
enabling them to run more frequently. Consequently, the Linux scheduler offers a
relatively high default timeslice (see Table 3.1). Furthermore, the Linux
scheduler dynamically determines the timeslice of a process based on priority.
This enables higher priority, allegedly more important, processes to run longer
and more often. Implementing dynamic timeslices and priorities provides robust
scheduling performance.

Note that a process does not have to use all its timeslice at once. For
example, a process with a 100 millisecond timeslice does not have to run for 100
milliseconds in one go or risk losing the remaining timeslice. Instead, the
process can run on five different reschedules for 20 milliseconds each. Thus, a
large timeslice also benefits interactive taskswhile they do not need such
a large timeslice all at once, it ensures they remain runnable for as long as
possible.

When a process's timeslice runs out, the process is considered expired.
A process with no timeslice is not eligible to run until all other processes
have exhausted their timeslice (that is, they all have zero timeslice
remaining). At that point, the timeslices for all processes are recalculated.
The Linux scheduler employs an interesting algorithm for handling timeslice
exhaustion that is discussed in a later section.

2 Timeslice is sometimes called quantum or processor slice in
other systems. Linux calls it timeslice.

Process Preemption

As mentioned, the Linux operating system is preemptive. When a process enters
the TASK_RUNNING state, the kernel checks whether its priority is
higher than the priority of the currently executing process. If it is, the
scheduler is invoked to pick a new process to run (presumably the process that
just became runnable). Additionally, when a process's timeslice reaches
zero, it is preempted, and the scheduler is invoked to select a new process.

The Scheduling Policy in Action

Consider a system with two runnable tasks: a text editor and a video encoder.
The text editor is I/O-bound because it spends nearly all its time waiting for
user key presses (no matter how fast the user types, it is not that fast).
Despite this, when it does receive a key press, the user expects the editor to
respond immediately. Conversely, the video encoder is processor-bound. Aside
from reading the raw data stream from the disk and later writing the resulting
video, the encoder spends all its time applying the video codec to the raw data.
It does not have any strong time constraints on when it runsif it started
running now or in half a second, the user could not tell. Of course, the sooner
it finishes the better.

In this system, the scheduler gives the text editor a higher priority and
larger timeslice than the video encoder, because the text editor is interactive.
The text editor has plenty of timeslice available. Furthermore, because the text
editor has a higher priority, it is capable of preempting the video encoder when
needed. This ensure the text editor is capable of responding to user key presses
immediately. This is to the detriment of the video encoder, but because the text
editor runs only intermittently, the video encoder can monopolize the remaining
time. This optimizes the performance of both applications.