2.6.32-longterm review patch. If anyone has any objections, please let us know.

------------------

Commit: 43fa5460fe60dea5c610490a1d263415419c60f6 upstream

When first working on the RT scheduler design, we concentrated on keeping all CPUs running RT tasks instead of having multiple RT tasks on a single CPU waiting for the migration thread to move them. Instead we take a more proactive stance and push or pull RT tasks from one CPU to another on wakeup or scheduling.

When an RT task wakes up on a CPU that is running another RT task, instead of preempting it and killing the cache of the running RT task, we look to see if we can migrate the RT task that is waking up, even if the RT task waking up is of higher priority.

This may sound a bit odd, but RT tasks should be limited in migration by the user anyway. But in practice, people do not do this, which causes high prio RT tasks to bounce around the CPUs. This becomes even worse when we have priority inheritance, because a high prio task can block on a lower prio task and boost its priority. When the lower prio task wakes up the high prio task, if it happens to be on the same CPU it will migrate off of it.

But in reality, the above does not happen much either, because when the lower prio task, which has already been boosted, wakes up on the same CPU as the higher prio task, it would then migrate off of it. But anyway, we do not want to migrate them either.

To examine the scheduling, I created a test program and examined it under kernelshark. The test program created CPU * 2 threads, where each thread had a different priority. The program takes different options. The options used in this change log were to have priority inheritance mutexes or not.

The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes to the ftrace buffer to help analyze via ftrace.

The higher the id, the higher the prio, the shorter it does the busy loop, but the longer it sleeps. This is usually the case with RT tasks: the lower priority tasks usually run longer than higher priority tasks.

At the end of the test, it records the number of loops each thread took, as well as the number of voluntary preemptions, non-voluntary preemptions, and number of migrations each thread took, taking the information from /proc/$$/sched and /proc/$$/status.

Running this on a 4 CPU processor, the results without changes to the kernel looked like this:

The total # of migrations did not change (several runs showed the difference all within the noise). But we now see a dramatic improvement to the higher priority tasks. (kernelshark showed that the watchdog timer bumped the highest priority task to give it the 2 count. This was actually consistent with every run).

Notice that the # of iterations did not change either.

The above was with priority inheritance mutexes. That is, when the higher priority task blocked on a lower priority task, the lower priority task would inherit the priority of the higher priority task (which shows why task 6 was bumped so many times). When not using priority inheritance mutexes, the current kernel shows this:

Which shows an even bigger change. The big difference between task 3 and task 4 is because we have only 4 CPUs on the machine, causing the 4 highest prio tasks to always have preference.

Although I did not measure cache misses, and I'm sure there would be little to measure since the test was not data intensive, I could imagine large improvements for higher priority tasks when dealing with lower priority tasks. Thus, I'm satisfied with making the change and agreeing with what Gregory Haskins argued a few years ago when we first had this discussion.

One final note. All tasks in the above tests were RT tasks. Any RT task will always preempt a non RT task that is running on the CPU the RT task wants to run on.

--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -954,18 +954,18 @@ select_task_rq_rt(struct rq *rq, struct
 	 * runqueue. Otherwise simply start this RT task
 	 * on its current runqueue.
 	 *
-	 * We want to avoid overloading runqueues. Even if
-	 * the RT task is of higher priority than the current RT task.
-	 * RT tasks behave differently than other tasks. If
-	 * one gets preempted, we try to push it off to another queue.
-	 * So trying to keep a preempting RT task on the same
-	 * cache hot CPU will force the running RT task to
-	 * a cold CPU. So we waste all the cache for the lower
-	 * RT task in hopes of saving some of a RT task
-	 * that is just being woken and probably will have
-	 * cold cache anyway.
+	 * We want to avoid overloading runqueues. If the woken
+	 * task is a higher priority, then it will stay on this CPU
+	 * and the lower prio task should be moved to another CPU.
+	 * Even though this will probably make the lower prio task
+	 * lose its cache, we do not want to bounce a higher task
+	 * around just because it gave up its CPU, perhaps for a
+	 * lock?
+	 *
+	 * For equal prio tasks, we just let the scheduler sort it out.
 	 */
 	if (unlikely(rt_task(rq->curr)) &&
+	    rq->curr->prio < p->prio &&
 	    (p->rt.nr_cpus_allowed > 1)) {
 		int cpu = find_lowest_rq(p);