Scheduler

From Linux-VServer

The scheduler in the 2.6 kernel is radically different from the one in the 2.4.x series. It is an O(1) scheduler, which means that the cost of a scheduling decision stays constant instead of growing as the number of running or inactive processes on the machine increases. This is important for SMP scalability, and for the case where there are lots of processes running. In addition, the newer scheduler has better real-time and interactive performance.

Basically, the way it works is this (based on my analysis of the code; read a kernel book and it might say something different :-): each CPU has two linked-list run-queues, each holding a list of runnable processes (those waiting for CPU time). The next process to run is taken from the front of one of the queues, and when the process has finished its slice, it is either inserted back on the queue actively being processed or moved to the second one, depending on factors such as the priority of the process and how long it actually ran. What this means is that processes that have not been using much CPU time (i.e., interactive processes) get an increased priority, because they are inserted on the end of the running list. When the first process on the second list starts to become "starved" of CPU time, this re-insertion is canceled so that the first list can deplete. Once it is depleted, the lists are swapped around and the waiting list becomes the running list.
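
The two-list behaviour described above can be sketched in a few lines of Python. This is a toy model, not kernel code: the names RunQueue, pick, requeue and starvation_limit are made up for illustration, priorities are ignored, and the "starved" test is reduced to a simple counter of slices handed out while the waiting list is non-empty.

```python
from collections import deque


class RunQueue:
    """Toy model of one CPU's pair of run-queues (illustrative only)."""

    def __init__(self, starvation_limit=3):
        self.active = deque()    # the list currently being processed
        self.expired = deque()   # the waiting list
        self.starvation_limit = starvation_limit  # assumed threshold
        self.expired_waits = 0   # slices run while someone was waiting

    def add(self, name):
        self.active.append(name)

    def pick(self):
        """Take the next process; swap the lists once the active one is empty."""
        if not self.active:
            self.active, self.expired = self.expired, self.active
            self.expired_waits = 0
        return self.active.popleft() if self.active else None

    def requeue(self, name, used_full_slice):
        """Put a process back on one of the lists after its slice."""
        if self.expired:
            self.expired_waits += 1
        starving = self.expired_waits >= self.starvation_limit
        if used_full_slice or starving:
            self.expired.append(name)   # must wait for the next swap
        else:
            self.active.append(name)    # interactive: runs again this round


def simulate(steps=5):
    """Run a CPU hog and an interactive process and record who runs."""
    rq = RunQueue(starvation_limit=3)
    rq.add("hog")
    rq.add("editor")
    order = []
    for _ in range(steps):
        name = rq.pick()
        order.append(name)
        rq.requeue(name, used_full_slice=(name == "hog"))
    return order
```

Every operation is an append or a pop at one end of a deque, so the cost of a scheduling decision does not depend on how many processes are queued. In simulate(), the interactive "editor" keeps going back on the active list until the waiting "hog" has been passed over starvation_limit times, at which point the active list is allowed to drain and the lists swap.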

There have been two implementations of the optional per-s_context scheduling for the O(1) scheduler. The approach taken by SamV is to use a "token bucket" for each s_context that assigns a number (N) of CPU "tokens" to each vserver every M cycles. If a process is caught running when the timer tick (which happens 100 times a second on the i386 port) happens, then a token is taken from its context's bucket. The process is then given a priority penalty or advantage based on the number of tokens left in the bucket. The penalty seems to work best when it is roughly quadratic, i.e., when the maximum penalty for exceeding your quota of tokens is larger (3-4 times) than the maximum boost for using nothing. This gives the following characteristics:

s_contexts that have been hogging the CPU get a very low priority, but they are still not starved of CPU time, and if there are no other running processes on the system, the only disadvantage is more frequent context switches.

s_contexts that do not use up their quota get very good interactive and general performance.

It is still possible for a vserver assigned to a context to use more tokens than it should if it starts a very large number of processes, as each process is always assigned at least one CPU slice.

However, the token bucket approach is perfectly suitable for most use cases.
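
A minimal sketch of the token bucket idea follows, again in Python rather than kernel C. The class name, the parameter names (fill_rate for N, interval for M, capacity), the half-full break-even point and the exact quadratic curve are all assumptions chosen for illustration; only the general shape (one token taken per tick while running, and a maximum penalty about three times the maximum boost) comes from the description above.

```python
class TokenBucket:
    """Toy per-s_context token bucket (all names are illustrative)."""

    def __init__(self, fill_rate, interval, capacity):
        self.fill_rate = fill_rate  # N: tokens added per refill
        self.interval = interval    # M: ticks between refills
        self.capacity = capacity    # assumed upper bound on the bucket
        self.tokens = capacity
        self.ticks = 0

    def tick(self, context_is_running):
        """Call once per timer tick."""
        self.ticks += 1
        if self.ticks % self.interval == 0:
            self.tokens = min(self.capacity, self.tokens + self.fill_rate)
        if context_is_running and self.tokens > 0:
            self.tokens -= 1        # caught running: take a token

    def priority_adjustment(self, max_boost=5, max_penalty=15):
        """Nice-style adjustment: negative is a boost, positive a penalty."""
        fill = self.tokens / self.capacity  # 1.0 = untouched, 0.0 = empty
        if fill >= 0.5:
            # Linear boost for contexts that leave their quota unused.
            return -round(max_boost * (fill - 0.5) * 2)
        # Quadratic penalty that grows quickly as the bucket runs dry.
        shortfall = (0.5 - fill) * 2
        return round(max_penalty * shortfall ** 2)
```

For example, a bucket created with TokenBucket(fill_rate=1, interval=4, capacity=20) starts with the full boost of -5. If the context is caught running on every tick, it drains three tokens out of every four ticks and ends up at the full penalty of 15. Note that the sketch only adjusts priority; like the scheme described above, it never stops a context from running.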

When I last looked, the approach taken by Alex Lyashkov was more thorough but not O(1). I won't comment on its exact workings, as I haven't analysed it thoroughly.

Another approach that would work and still be O(1) would be to assign each s_context its own pair of run-queues, and schedule the entire s_context as if it were a single process as far as the host server is concerned. This would satisfy "hard" scheduling requirements, while only marginally increasing the complexity of the code.
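
To make the idea concrete, here is a rough Python sketch of such a two-level arrangement. Nothing here is taken from an existing patch: ContextQueue and TwoLevelScheduler are hypothetical names, the outer level is a plain round-robin over contexts rather than a full priority scheduler, and every context is assumed to always have at least one runnable process.

```python
from collections import deque


class ContextQueue:
    """One s_context with its own active/expired pair (toy model)."""

    def __init__(self, name, procs):
        self.name = name
        self.active = deque(procs)
        self.expired = deque()

    def next_proc(self):
        # Assumes the context has at least one runnable process.
        if not self.active:
            self.active, self.expired = self.expired, self.active
        proc = self.active.popleft()
        self.expired.append(proc)
        return proc


class TwoLevelScheduler:
    """Schedules whole contexts first, then a process inside the context."""

    def __init__(self, contexts):
        self.contexts = deque(contexts)

    def pick(self):
        ctx = self.contexts.popleft()  # the context is treated like a process
        self.contexts.append(ctx)
        return ctx.name, ctx.next_proc()


def context_shares(slices=6):
    """Count the slices each context receives over a number of picks."""
    sched = TwoLevelScheduler([
        ContextQueue("A", ["a1", "a2", "a3"]),
        ContextQueue("B", ["b1"]),
    ])
    counts = {}
    for _ in range(slices):
        ctx_name, _proc = sched.pick()
        counts[ctx_name] = counts.get(ctx_name, 0) + 1
    return counts
```

In this sketch, context A gets the same number of slices as context B even though it runs three times as many processes, which is the "hard" limit the token bucket approach cannot guarantee. Both levels still use only constant-time queue operations.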