Currently in mainline the balancing of multiple RT threads is quite broken. That is to say, a high priority thread scheduled on a CPU with an even higher priority thread may wait unnecessarily when it could easily run on another CPU that is running a lower priority thread.

Balancing (or migrating) tasks in general is an art. Lots of considerations must be taken into account: cache lines, NUMA, and more. This is true of general processes, which expect high throughput, and for them migration can be done in batch. But when it comes to RT tasks, we really need to move them off to a CPU that they can run on as soon as possible, even if it means a bit of cache line flushing.

Right now an RT task can wait several milliseconds before it gets scheduled to run, and perhaps even longer. The migration thread is not fast enough to take care of RT tasks.

This test takes a parameter specifying the number of threads to create. If you add the '-c' option (check), it will terminate if the test fails one of the iterations. If you use it, pass in one more thread than the number of CPUs.

For example, on a 4-way box, I used

rt-migrate-test -c 5

What this test does is create the number of threads specified (in this case 5). Each thread is set as an RT FIFO task, starting at a specified priority (default 2), with each thread one priority higher than the last. So in this example the 5 threads created are at priorities 2, 3, 4, 5, and 6.

The parent thread sets its priority to one higher than the highest of the children (in this example, 7). It uses pthread_barrier_wait to synchronize the threads. Then it takes a time stamp and starts all the threads. When the threads wake up, each takes a time stamp and compares it to the parent's to see how long it took to be awoken. It then runs in a busy loop for an interval (20ms by default). The busy loop ends when it reaches the interval delta from the start time stamp, so if the thread is preempted, it may not actually run for the full interval. This is expected behavior of the test.

The numbers recorded are: the delta from the thread's time stamp to the parent's time stamp, the number of iterations the busy loop ran, and the delta from a time stamp taken at the end of the loop to the parent's time stamp.

Sometimes a lower priority task might wake up before a higher priority one, but this is OK, as long as the higher priority process gets the CPU when it is awoken.

At the end of the test, the iteration data is printed to stdout. If a higher priority task had to wait for a lower one to finish running, this is considered a failure. Here's an example of the output from a run against git commit 4fa4d23fa20de67df919030c1216295664866ad7.

On iteration 1 (iterations start at 0) the third task started 20ms after the parent woke it up. We can see here that the first two tasks ran to completion before the higher priority task was even able to start. That is a 20ms latency for the higher priority task!

So people who think that their audio would avoid most latencies by upping its priority may be in for a surprise, since some kernel threads (like the migration thread itself) may cause this latency.

To solve this issue, I've changed the RT task balancing from a passive method (the migration thread) to an active method. This new method is to actively push or pull RT tasks when they are woken up or scheduled.

On wakeup of an RT task, if there is already an RT task of higher priority running on its runqueue, we initiate the push_rt_tasks algorithm. This algorithm looks at the highest priority non-running RT task and tries to find a CPU where it can run. It only migrates the RT task if it finds a CPU (of lowest priority) where the task can run and can preempt the currently running task on that CPU. We continue pushing RT tasks until we can't push any more.

If an RT task fails to be migrated, we stop pushing. This is possible because we are always looking at the highest priority RT task on the run queue, and if it can't migrate, then most likely the lower priority RT tasks cannot either.

There is one case not covered by this patch set: when the highest priority non-running RT task has its CPU affinity set such that it cannot preempt any tasks running on the CPUs in its affinity, but a lower priority task has a larger affinity and could run on other CPUs. In this case the lower priority task will not be migrated to those CPUs (although those CPUs may pull it). Currently this patch set ignores this scenario.

Another case where we push RT tasks is in finish_task_switch. This is done since a running RT task cannot be migrated while it is running. So if an RT task is preempted by a higher priority RT task, we can migrate the preempted RT task at that moment.

We also actively pull RT tasks. Whenever a runqueue is about to lower its priority (schedule a lower priority task), a check is done to see if that runqueue can pull RT tasks to run instead. A search is made of all overloaded runqueues (runqueues with more than one RT task scheduled on them) to see if they have an RT task that can run on this runqueue (its affinity matches) and is of higher priority than the task the runqueue is about to schedule. The pull algorithm pulls all RT tasks that match this case.

With this patch set, I ran rt-migrate-test overnight in a while loop with the -c option (which terminates upon failure), and it passed over 6500 runs (each doing 50 wakeups).

Here's an example of the output from the patched kernel. This is just to explain it a bit more.

The first iteration (really the 2nd, since we start at zero) is a typical run. The lowest priority task didn't start executing until the other 4 tasks finished and gave up the CPU.

The second iteration seems at first like a failure, but this is actually fine. The lowest priority task just happened to be scheduled onto a CPU before the higher priority tasks were woken up. But as you can see from this example, the higher priority tasks were still able to be scheduled right away, and in doing so they preempted the lower priority task. This can be seen from the number of loops the lower priority task was able to complete: only 28. This is because the busy loop terminates when the time stamp reaches the time interval (20ms here) past the start time stamp. Since the lower priority task was able to sneak in and start, its start time stamp was early. So after it was preempted and rescheduled, it was already past the run interval, and it simply ended the loop.

Finally, the CFS RT balancing had to be removed in order for this to work. Testing showed that the CFS RT balancing would actually pull RT tasks from runqueues they were already properly assigned to, and again cause latencies. With this new approach, the CFS RT balancing is not needed, and I suggest that these patches replace the current CFS RT balancing.

Also, let me stress that I made a great effort to have this add as little overhead as possible (practically none) to the non-RT cases. Most of these algorithms only take place when more than one RT task is scheduled on the same runqueue.

Special thanks go to Gregory Haskins for his advice and back and forth on IRC with ideas. Although I didn't use his actual patches (his were against -rt), they did help me clean up some of my code. Also, thanks go to Ingo Molnar himself, as I took some ideas from his RT balance code in the -rt patch.