I'm writing a CPU scheduler based on BFS by Con Kolivas, with support
for multiple run-queues. BFS itself uses a single run-queue for all
CPUs. This avoids load-balancing overhead, but does not scale well.
One run-queue per CPU scales well, but then the scheduler has
load-balancing overhead. The scheduler I'm developing supports every
possible run-queue configuration: a single run-queue as in BFS, one
run-queue per CPU, or something in between, such as one run-queue for
every two CPUs. In theory, this allows the scheduler to be fine-tuned
to the hardware and the workload.

What state is it in?
Currently it is very unstable: CPU hotplug is broken, scheduling
statistics are broken, and support for real-time tasks is broken. Load
balancing with more than one run-queue works, but does nothing more
than keep the load on all run-queues equal. Associating a CPU with a
run-queue is currently done with a system call, and there is no
access-rights checking. The source is in a very bad state.
The uniprocessor build is broken.
It lacks proper documentation.

Why allow the user to change the run-queue layout?
To optimize the scheduler to specific hardware and workloads.
You could use one run-queue for all CPUs if you want low latency and
low scheduling overhead.
You could use one run-queue per CPU if you want high scalability.
You could use one run-queue per n CPUs if these n CPUs share a cache
and there is not much benefit in load balancing between them.
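
To make the three layouts above concrete, here is a minimal userspace
sketch of a CPU-to-run-queue mapping. It is not code from the patch:
NR_CPUS, cpu_to_rq and set_layout() are hypothetical names chosen for
the illustration, and the real patch associates CPUs and run-queues via
a system call rather than a helper like this.

```c
#include <assert.h>

#define NR_CPUS 8   /* assumed CPU count, for illustration only */

/* cpu_to_rq[c] is the index of the run-queue that CPU c draws from. */
static int cpu_to_rq[NR_CPUS];

/*
 * Put every `cpus_per_rq` consecutive CPUs on one shared run-queue.
 * Returns the number of run-queues in the resulting layout.
 */
static int set_layout(int cpus_per_rq)
{
    assert(cpus_per_rq >= 1 && cpus_per_rq <= NR_CPUS);
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        cpu_to_rq[cpu] = cpu / cpus_per_rq;
    /* Round up so a partial last group still gets its own queue. */
    return (NR_CPUS + cpus_per_rq - 1) / cpus_per_rq;
}
```

With this helper, set_layout(NR_CPUS) gives the BFS-style single
run-queue, set_layout(1) gives one run-queue per CPU, and set_layout(2)
gives one run-queue for every two CPUs.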

Benchmarks?
None. It is not stable enough to benchmark, and the load-balancing
algorithm currently in use delivers very bad performance.

What advantages does it have when compared to other schedulers?
It is more scalable than BFS.
It could in the future have all the features of BFS and of CFS,
especially throughput and low latency.
It has far fewer lines of code than CFS.

What disadvantages does it have when compared to other schedulers?
It is not stable.
It has not been tested on anything other than KVM or with more than 4
CPUs.
Many features are not yet working or not implemented at all (for
example, good load balancing).

Implementation details:
All tasks that are runnable but not currently executing on a CPU are
queued on one of the global run-queues. Every global run-queue has its
own spin-lock, which must be taken whenever a task is queued or
dequeued. All global run-queues are additionally protected by one
global read-write lock. During normal scheduling this lock must be
read-locked. Any change to the layout of the global run-queues, such
as adding or removing a run-queue, requires the global read-write lock
to be write-locked.
Fair time distribution among tasks is done via the deadline mechanism of
BFS.
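
For readers unfamiliar with that mechanism, the BFS documentation
included in the patch below gives the deadline as
jiffies + (prio_ratio * rr_interval), with prio_ratio growing by 10%
per nice level from a baseline at nice -20. A minimal sketch of that
calculation, assuming a fixed-point baseline of 128 and microsecond
time units (both are assumptions of this example, as are the names
RATIO_BASE, init_prio_ratios() and virtual_deadline()):

```c
#include <assert.h>
#include <stdint.h>

#define NICE_WIDTH 40      /* nice levels -20 .. 19 */
#define RATIO_BASE 128     /* assumed fixed-point value meaning 1.0 */

static uint64_t prio_ratios[NICE_WIDTH];

/* Each nice level gets a ratio 10% larger than the previous one. */
static void init_prio_ratios(void)
{
    prio_ratios[0] = RATIO_BASE;
    for (int i = 1; i < NICE_WIDTH; i++)
        prio_ratios[i] = prio_ratios[i - 1] * 11 / 10;
}

/*
 * deadline = now + prio_ratio * rr_interval, computed in microseconds.
 * A lower (earlier) deadline means the task is picked sooner.
 */
static uint64_t virtual_deadline(uint64_t now_us, int nice,
                                 unsigned int rr_interval_ms)
{
    assert(nice >= -20 && nice <= 19);
    uint64_t slice_us = (uint64_t)rr_interval_ms * 1000;

    return now_us + prio_ratios[nice + 20] * slice_us / RATIO_BASE;
}
```

After init_prio_ratios(), a nice -20 task with the default 6 ms
rr_interval gets a deadline exactly 6 ms in the future, and every
higher nice level gets a later one.
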

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff -uprN linux-3.6.2/Documentation/scheduler/sched-BFS.txt linux-3.6.2-bfs-multi-runqueue/Documentation/scheduler/sched-BFS.txt
--- linux-3.6.2/Documentation/scheduler/sched-BFS.txt 1970-01-01 01:00:00.000000000 +0100
+++ linux-3.6.2-bfs-multi-runqueue/Documentation/scheduler/sched-BFS.txt 2012-10-25 17:13:12.579060779 +0200
@@ -0,0 +1,347 @@
+BFS - The Brain Fuck Scheduler by Con Kolivas.
+
+Goals.
+
+The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to
+completely do away with the complex designs of the past for the cpu process
+scheduler and instead implement one that is very simple in basic design.
+The main focus of BFS is to achieve excellent desktop interactivity and
+responsiveness without heuristics and tuning knobs that are difficult to
+understand, impossible to model and predict the effect of, and when tuned to
+one workload cause massive detriment to another.
+
+
+Design summary.
+
+BFS is best described as a single runqueue, O(n) lookup, earliest effective
+virtual deadline first design, loosely based on EEVDF (earliest eligible virtual
+deadline first) and my previous Staircase Deadline scheduler. Each component
+shall be described in order to understand the significance of, and reasoning for
+it. The codebase when the first stable version was released was approximately
+9000 lines less code than the existing mainline linux kernel scheduler (in
+2.6.31). This does not even take into account the removal of documentation and
+the cgroups code that is not used.
+
+Design reasoning.
+
+The single runqueue refers to the queued but not running processes for the
+entire system, regardless of the number of CPUs. The reason for going back to
+a single runqueue design is that once multiple runqueues are introduced,
+per-CPU or otherwise, there will be complex interactions as each runqueue will
+be responsible for the scheduling latency and fairness of the tasks only on its
+own runqueue, and to achieve fairness and low latency across multiple CPUs, any
+advantage in throughput of having CPU local tasks causes other disadvantages.
+This is due to requiring a very complex balancing system to at best achieve some
+semblance of fairness across CPUs and can only maintain relatively low latency
+for tasks bound to the same CPUs, not across them. To increase said fairness
+and latency across CPUs, the advantage of local runqueue locking, which makes
+for better scalability, is lost due to having to grab multiple locks.
+
+A significant feature of BFS is that all accounting is done purely based on CPU
+used and nowhere is sleep time used in any way to determine entitlement or
+interactivity. Interactivity "estimators" that use some kind of sleep/run
+algorithm are doomed to fail to detect all interactive tasks, and to falsely tag
+tasks that aren't interactive as being so. The reason for this is that it is
+close to impossible to determine that when a task is sleeping, whether it is
+doing it voluntarily, as in a userspace application waiting for input in the
+form of a mouse click or otherwise, or involuntarily, because it is waiting for
+another thread, process, I/O, kernel activity or whatever. Thus, such an
+estimator will introduce corner cases, and more heuristics will be required to
+cope with those corner cases, introducing more corner cases and failed
+interactivity detection and so on. Interactivity in BFS is built into the design
+by virtue of the fact that tasks that are waking up have not used up their quota
+of CPU time, and have earlier effective deadlines, thereby making it very likely
+they will preempt any CPU bound task of equivalent nice level. See below for
+more information on the virtual deadline mechanism. Even if they do not preempt
+a running task, because the rr interval is guaranteed to have a bound upper
+limit on how long a task will wait for, it will be scheduled within a timeframe
+that will not cause visible interface jitter.
+
+
+Design details.
+
+Task insertion.
+
+BFS inserts tasks into each relevant queue as an O(1) insertion into a double
+linked list. On insertion, *every* running queue is checked to see if the newly
+queued task can run on any idle queue, or preempt the lowest running task on the
+system. This is how the cross-CPU scheduling of BFS achieves significantly lower
+latency per extra CPU the system has. In this case the lookup is, in the worst
+case scenario, O(n) where n is the number of CPUs on the system.
+
+Data protection.
+
+BFS has one single lock protecting the process local data of every task in the
+global queue. Thus every insertion, removal and modification of task data in the
+global runqueue needs to grab the global lock. However, once a task is taken by
+a CPU, the CPU has its own local data copy of the running process' accounting
+information which only that CPU accesses and modifies (such as during a
+timer tick) thus allowing the accounting data to be updated lockless. Once a
+CPU has taken a task to run, it removes it from the global queue. Thus the
+global queue only ever has, at most,
+
+ (number of tasks requesting cpu time) - (number of logical CPUs) + 1
+
+tasks in the global queue. This value is relevant for the time taken to look up
+tasks during scheduling. This will increase if many tasks with CPU affinity set
+in their policy to limit which CPUs they're allowed to run on if they outnumber
+the number of CPUs. The +1 is because when rescheduling a task, the CPU's
+currently running task is put back on the queue. Lookup will be described after
+the virtual deadline mechanism is explained.
+
+Virtual deadline.
+
+The key to achieving low latency, scheduling fairness, and "nice level"
+distribution in BFS is entirely in the virtual deadline mechanism. The one
+tunable in BFS is the rr_interval, or "round robin interval". This is the
+maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
+tasks of the same nice level will be running for, or looking at it the other
+way around, the longest duration two tasks of the same nice level will be
+delayed for. When a task requests cpu time, it is given a quota (time_slice)
+equal to the rr_interval and a virtual deadline. The virtual deadline is
+offset from the current time in jiffies by this equation:
+
+ jiffies + (prio_ratio * rr_interval)
+
+The prio_ratio is determined as a ratio compared to the baseline of nice -20
+and increases by 10% per nice level. The deadline is a virtual one only in that
+no guarantee is placed that a task will actually be scheduled by this time, but
+it is used to compare which task should go next. There are three components to
+how a task is next chosen. First is time_slice expiration. If a task runs out
+of its time_slice, it is descheduled, the time_slice is refilled, and the
+deadline reset to that formula above. Second is sleep, where a task no longer
+is requesting CPU for whatever reason. The time_slice and deadline are _not_
+adjusted in this case and are just carried over for when the task is next
+scheduled. Third is preemption, and that is when a newly waking task is deemed
+higher priority than a currently running task on any cpu by virtue of the fact
+that it has an earlier virtual deadline than the currently running task. The
+earlier deadline is the key to which task is next chosen for the first and
+second cases. Once a task is descheduled, it is put back on the queue, and an
+O(n) lookup of all queued-but-not-running tasks is done to determine which has
+the earliest deadline and that task is chosen to receive CPU next.
+
+The CPU proportion of different nice tasks works out to be approximately the
+
+ (prio_ratio difference)^2
+
+The reason it is squared is that a task's deadline does not change while it is
+running unless it runs out of time_slice. Thus, even if the time actually
+passes the deadline of another task that is queued, it will not get CPU time
+unless the current running task deschedules, and the time "base" (jiffies) is
+constantly moving.
+
+Task lookup.
+
+BFS has 103 priority queues. 100 of these are dedicated to the static priority
+of realtime tasks, and the remaining 3 are, in order of best to worst priority,
+SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority
+scheduling). When a task of these priorities is queued, a bitmap of running
+priorities is set showing which of these priorities has tasks waiting for CPU
+time. When a CPU is made to reschedule, the lookup for the next task to get
+CPU time is performed in the following way:
+
+First the bitmap is checked to see what static priority tasks are queued. If
+any realtime priorities are found, the corresponding queue is checked and the
+first task listed there is taken (provided CPU affinity is suitable) and lookup
+is complete. If the priority corresponds to a SCHED_ISO task, they are also
+taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds
+to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this
+stage, every task in the runlist that corresponds to that priority is checked
+to see which has the earliest set deadline, and (provided it has suitable CPU
+affinity) it is taken off the runqueue and given the CPU. If a task has an
+expired deadline, it is taken and the rest of the lookup aborted (as they are
+chosen in FIFO order).
+
+Thus, the lookup is O(n) in the worst case only, where n is as described
+earlier, as tasks may be chosen before the whole task list is looked over.
+
+
+Scalability.
+
+The major limitations of BFS will be that of scalability, as the separate
+runqueue designs will have less lock contention as the number of CPUs rises.
+However they do not scale linearly even with separate runqueues as multiple
+runqueues will need to be locked concurrently on such designs to be able to
+achieve fair CPU balancing, to try and achieve some sort of nice-level fairness
+across CPUs, and to achieve low enough latency for tasks on a busy CPU when
+other CPUs would be more suited. BFS has the advantage that it requires no
+balancing algorithm whatsoever, as balancing occurs by proxy simply because
+all CPUs draw off the global runqueue, in priority and deadline order. Despite
+the fact that scalability is _not_ the prime concern of BFS, it both shows very
+good scalability to smaller numbers of CPUs and is likely a more scalable design
+at these numbers of CPUs.
+
+It also has some very low overhead scalability features built into the design
+when it has been deemed their overhead is so marginal that they're worth adding.
+The first is the local copy of the running process' data to the CPU it's running
+on to allow that data to be updated lockless where possible. Then there is
+deference paid to the last CPU a task was running on, by trying that CPU first
+when looking for an idle CPU to use the next time it's scheduled. Finally there
+is the notion of "sticky" tasks that are flagged when they are involuntarily
+descheduled, meaning they still want further CPU time. This sticky flag is
+used to bias heavily against those tasks being scheduled on a different CPU
+unless that CPU would be otherwise idle. When a cpu frequency governor is used
+that scales with CPU load, such as ondemand, sticky tasks are not scheduled
+on a different CPU at all, preferring instead to go idle. This means the CPU
+they were bound to is more likely to increase its speed while the other CPU
+will go idle, thus speeding up total task execution time and likely decreasing
+power usage. This is the only scenario where BFS will allow a CPU to go idle
+in preference to scheduling a task on the earliest available spare CPU.
+
+The real cost of migrating a task from one CPU to another is entirely dependant
+on the cache footprint of the task, how cache intensive the task is, how long
+it's been running on that CPU to take up the bulk of its cache, how big the CPU
+cache is, how fast and how layered the CPU cache is, how fast a context switch
+is... and so on. In other words, it's close to random in the real world where we
+do more than just one sole workload. The only thing we can be sure of is that
+it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and
+utilising idle CPUs is more important than cache locality, and cache locality
+only plays a part after that.
+
+When choosing an idle CPU for a waking task, the cache locality is determined
+according to where the task last ran and then idle CPUs are ranked from best
+to worst to choose the most suitable idle CPU based on cache locality, NUMA
+node locality and hyperthread sibling business. They are chosen in the
+following preference (if idle):
+
+* Same core, idle or busy cache, idle threads
+* Other core, same cache, idle or busy cache, idle threads.
+* Same node, other CPU, idle cache, idle threads.
+* Same node, other CPU, busy cache, idle threads.
+* Same core, busy threads.
+* Other core, same cache, busy threads.
+* Same node, other CPU, busy threads.
+* Other node, other CPU, idle cache, idle threads.
+* Other node, other CPU, busy cache, idle threads.
+* Other node, other CPU, busy threads.
+
+This shows the SMT or "hyperthread" awareness in the design as well which will
+choose a real idle core first before a logical SMT sibling which already has
+tasks on the physical CPU.
+
+Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark.
+However this benchmarking was performed on an earlier design that was far less
+scalable than the current one so it's hard to know how scalable it is in terms
+of both CPUs (due to the global runqueue) and heavily loaded machines (due to
+O(n) lookup) at this stage. Note that in terms of scalability, the number of
+_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x)
+quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark
+results are very promising indeed, without needing to tweak any knobs, features
+or options. Benchmark contributions are most welcome.
+
+
+Features
+
+As the initial prime target audience for BFS was the average desktop user, it
+was designed to not need tweaking, tuning or have features set to obtain benefit
+from it. Thus the number of knobs and features has been kept to an absolute
+minimum and should not require extra user input for the vast majority of cases.
+There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval
+and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition
+to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is
+support for CGROUPS. The average user should neither need to know what these
+are, nor should they need to be using them to have good desktop behaviour.
+
+rr_interval
+
+There is only one "scheduler" tunable, the round robin interval. This can be
+accessed in
+
+ /proc/sys/kernel/rr_interval
+
+The value is in milliseconds, and the default value is set to 6ms. Valid values
+are from 1 to 1000. Decreasing the value will decrease latencies at the cost of
+decreasing throughput, while increasing it will improve throughput, but at the
+cost of worsening latencies. The accuracy of the rr interval is limited by HZ
+resolution of the kernel configuration. Thus, the worst case latencies are
+usually slightly higher than this actual value. BFS uses "dithering" to try and
+minimise the effect the Hz limitation has. The default value of 6 is not an
+arbitrary one. It is based on the fact that humans can detect jitter at
+approximately 7ms, so aiming for much lower latencies is pointless under most
+circumstances. It is worth noting this fact when comparing the latency
+performance of BFS to other schedulers. Worst case latencies being higher than
+7ms are far worse than average latencies not being in the microsecond range.
+Experimentation has shown that rr intervals being increased up to 300 can
+improve throughput but beyond that, scheduling noise from elsewhere prevents
+further demonstrable throughput.
+
+Isochronous scheduling.
+
+Isochronous scheduling is a unique scheduling policy designed to provide
+near-real-time performance to unprivileged (ie non-root) users without the
+ability to starve the machine indefinitely. Isochronous tasks (which means
+"same time") are set using, for example, the schedtool application like so:
+
+ schedtool -I -e amarok
+
+This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works
+is that it has a priority level between true realtime tasks and SCHED_NORMAL
+which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie,
+if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval
+rate). However if ISO tasks run for more than a tunable finite amount of time,
+they are then demoted back to SCHED_NORMAL scheduling. This finite amount of
+time is the percentage of _total CPU_ available across the machine, configurable
+as a percentage in the following "resource handling" tunable (as opposed to a
+scheduler tunable):
+
+ /proc/sys/kernel/iso_cpu
+
+and is set to 70% by default. It is calculated over a rolling 5 second average
+Because it is the total CPU available, it means that on a multi CPU machine, it
+is possible to have an ISO task running as realtime scheduling indefinitely on
+just one CPU, as the other CPUs will be available. Setting this to 100 is the
+equivalent of giving all users SCHED_RR access and setting it to 0 removes the
+ability to run any pseudo-realtime tasks.
+
+A feature of BFS is that it detects when an application tries to obtain a
+realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the
+appropriate privileges to use those policies. When it detects this, it will
+give the task SCHED_ISO policy instead. Thus it is transparent to the user.
+Because some applications constantly set their policy as well as their nice
+level, there is potential for them to undo the override specified by the user
+on the command line of setting the policy to SCHED_ISO. To counter this, once
+a task has been set to SCHED_ISO policy, it needs superuser privileges to set
+it back to SCHED_NORMAL. This will ensure the task remains ISO and all child
+processes and threads will also inherit the ISO policy.
+
+Idleprio scheduling.
+
+Idleprio scheduling is a scheduling policy designed to give out CPU to a task
+_only_ when the CPU would be otherwise idle. The idea behind this is to allow
+ultra low priority tasks to be run in the background that have virtually no
+effect on the foreground tasks. This is ideally suited to distributed computing
+clients (like setiathome, folding, mprime etc) but can also be used to start
+a video encode or so on without any slowdown of other tasks. To avoid this
+policy from grabbing shared resources and holding them indefinitely, if it
+detects a state where the task is waiting on I/O, the machine is about to
+suspend to ram and so on, it will transiently schedule them as SCHED_NORMAL. As
+per the Isochronous task management, once a task has been scheduled as IDLEPRIO,
+it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can
+be set to start as SCHED_IDLEPRIO with the schedtool command like so:
+
+ schedtool -D -e ./mprime
+
+Subtick accounting.
+
+It is surprisingly difficult to get accurate CPU accounting, and in many cases,
+the accounting is done by simply determining what is happening at the precise
+moment a timer tick fires off. This becomes increasingly inaccurate as the
+timer tick frequency (HZ) is lowered. It is possible to create an application
+which uses almost 100% CPU, yet by being descheduled at the right time, records
+zero CPU usage. While the main problem with this is that there are possible
+security implications, it is also difficult to determine how much CPU a task
+really does use. BFS tries to use the sub-tick accounting from the TSC clock,
+where possible, to determine real CPU usage. This is not entirely reliable, but
+is far more likely to produce accurate CPU usage data than the existing designs
+and will not show tasks as consuming no CPU usage when they actually are. Thus,
+the amount of CPU reported as being used by BFS will more accurately represent
+how much CPU the task itself is using (as is shown for example by the 'time'
+application), so the reported values may be quite different to other schedulers.
+Values reported as the 'load' are more prone to problems with this design, but
+per process values are closer to real usage. When comparing throughput of BFS
+to other designs, it is important to compare the actual completed work in terms
+of total wall clock time taken and total work done, rather than the reported
+"cpu usage".
+
+
+Con Kolivas <kernel@xxxxxxxxxxx> Tue, 5 Apr 2011
diff -uprN linux-3.6.2/Documentation/sysctl/kernel.txt linux-3.6.2-bfs-multi-runqueue/Documentation/sysctl/kernel.txt
--- linux-3.6.2/Documentation/sysctl/kernel.txt 2012-10-12 22:50:59.000000000 +0200
+++ linux-3.6.2-bfs-multi-runqueue/Documentation/sysctl/kernel.txt 2012-10-25 17:13:12.584060777 +0200
@@ -33,6 +33,7 @@ show up in /proc/sys/kernel:
 - domainname
 - hostname
 - hotplug
+- iso_cpu
 - kptr_restrict
 - kstack_depth_to_print [ X86 only ]
 - l2cr [ PPC only ]
@@ -59,6 +60,7 @@ show up in /proc/sys/kernel:
 - randomize_va_space
 - real-root-dev ==> Documentation/initrd.txt
 - reboot-cmd [ SPARC only ]
+- rr_interval
 - rtsig-max
 - rtsig-nr
 - sem
@@ -301,6 +303,16 @@ kernel stack.

+rr_interval: (BFS CPU scheduler only)
+
+This is the smallest duration that any cpu process scheduling unit
+will run for. Increasing this value can increase throughput of cpu
+bound tasks substantially but at the expense of increased latencies
+overall. Conversely decreasing it will decrease average and maximum
+latencies but at the expense of throughput. This value is in
+milliseconds and the default value chosen depends on the number of
+cpus available at scheduler initialisation with a minimum of 6.
+
+Valid values are from 1-1000.
+
+==============================================================
+
rtsig-max & rtsig-nr:

-choice
- prompt "Choose SLAB allocator"
- default SLUB
- help
- This option allows to select a slab allocator.
-
-config SLAB
- bool "SLAB"
- help
- The regular slab allocator that is established and known to work
- well in all environments. It organizes cache hot objects in
- per cpu and per node queues.
-
config SLUB
- bool "SLUB (Unqueued Allocator)"
- help
- SLUB is a slab allocator that minimizes cache line usage
- instead of managing queues of cached objects (SLAB approach).
- Per cpu caching is realized using slabs of objects instead
- of queues of objects. SLUB can use memory efficiently
- and has enhanced diagnostics. SLUB is the default choice for
- a slab allocator.
-
-config SLOB
- depends on EXPERT
- bool "SLOB (Simple Allocator)"
- help
- SLOB replaces the stock allocator with a drastically simpler
- allocator. SLOB is generally more space efficient but
- does not perform as well on large systems.
-
-endchoice
+ def_bool y