I'm glad to announce a working prototype of the basic algorithm I already suggested last time.

As I already tried to explain previously, CFS has a considerable algorithmic and computational complexity. This patch should now make it clearer why I could so easily skip over Ingo's long explanation of all the tricks CFS uses to keep the computational overhead low - I simply don't need them. The following numbers are based on a 2.6.23-rc3-git1 UP kernel; the first 3 runs are without the patch, the last 3 runs are with it:

Besides these numbers I can also provide a mathematical foundation for it. I tried the same for CFS, but IMHO it's not really sanely possible. This model is far more accurate than CFS and doesn't accumulate an error over time, thus there are no underflows or overflows anymore within the described limits.

The small example program also demonstrates how it can easily be scaled down to millisecond resolution to completely get rid of the 64 bit math. This may be interesting for a few archs, even if they have a finer clock resolution, as most scheduling decisions are done on a millisecond basis.

The basic idea of this scheduler is somewhat different from CFS. Where the old scheduler maintains fixed time slices, CFS still maintains a dynamic per-task time slice. This model does away with it completely; instead it puts the task on a virtual (normalized) time line, where only the relative distance between any two tasks is relevant.

So here are all the mathematical details necessary to understand what the scheduler does, so anyone can judge for himself how solid this design is. First some basics:

(1) time = sum_{t}^{T}(time_{t})

(2) weight_sum = sum_{t}^{T}(weight_{t})

Here time is the total time to be distributed over the set of tasks T, time_{t} the time task t receives and weight_{t} its weight. Each task should get a share proportional to its weight:

(3) time_{t} = time * weight_{t} / weight_sum

This can also be written as:

(4) time_{t} / weight_{t} = time / weight_sum

This way we have the normalized time:

(5) time_norm = time / weight_sum

(6) time_norm_{t} = time_{t} / weight_{t}

If every task got its share, the normalized times are all the same. Using time_norm one can calculate the time tasks should get based on their weight:

(7) sum_{t}^{T}(time_{t}) = sum_{t}^{T}(round(time / weight_sum) * weight_{t})

This is basically what CFS currently does and it demonstrates the basic problem it faces. It rounds the normalized time, and the rounding error is also distributed to the time a task gets, so there is a difference between the time which is distributed to the tasks and the time consumed by them. On the upside, the error is distributed equally to all tasks relative to their weight (so it isn't immediately visible via top). On the downside, the error itself is weighted too, so a small error can become quite large; the higher the weight, the more it contributes to the error and the more likely it hits one of the limits. Once it hits a limit, the overflow/underflow time is simply thrown away and is lost for accounting, so the task doesn't get the time it's supposed to get.
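To make this concrete, here is a small standalone C example (mine, not from CFS or the patch; the numbers are made up). It first distributes a total that divides evenly, where all time_norm_{t} from (6) come out equal, and then a total that doesn't, where the rounding in (7) loses time and the absolute per-task error grows with the weight:

#include <stdio.h>

static void distribute(unsigned long total, const unsigned long *weight,
                       int ntasks, unsigned long weight_sum)
{
        unsigned long norm = total / weight_sum; /* rounded time_norm (5) */
        unsigned long distributed = 0;
        int t;

        for (t = 0; t < ntasks; t++) {
                unsigned long share = norm * weight[t]; /* (7) */

                distributed += share;
                printf("  task %d: weight=%lu time=%lu time_norm=%lu\n",
                       t, weight[t], share, share / weight[t]);
        }
        printf("  distributed %lu of %lu, lost %lu\n",
               distributed, total, total - distributed);
}

int main(void)
{
        unsigned long weight[3] = { 1, 10, 100 };
        unsigned long weight_sum = 111;

        /* divides evenly: every task ends with the same time_norm */
        distribute(1110, weight, 3, weight_sum);
        /* doesn't divide evenly: 1 unit is lost, and the exact error
         * per task scales with its weight (0.009 vs 0.9 units here) */
        distribute(1000, weight, 3, weight_sum);
        return 0;
}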

An alternative approach is to not use time_norm at all to distribute the time. Any task can be used to calculate the time any other task needs relative to it. For this the normalized time per task is maintained based on (6):

(8) time_norm_{t} * 2^16 = time_{t} * round(2^16 / weight_{t})

Using the difference between the normalized times of two tasks, one can calculate the time needed to equalize their normalized times. This has the advantage that round(2^16 / weight_{t}) is constant (unless reniced) and thus so is the error due to the rounding. The time one task gets relative to another is based on these constants. As only the delta of these times is needed, the absolute value can simply overflow, and the limit of the maximum time delta is:

(9) time_delta_max = KCLOCK_MAX / (2^16 / weight_min)
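As an illustration of (8), a minimal sketch in plain C (the field and function names are my own, not the patch's): each task carries its inverse weight as a 2^16-scaled constant, runtime is accounted in normalized units, and only signed deltas between two tasks are compared, so wraparound of the absolute value is harmless within the limit of (9):

#include <stdio.h>
#include <stdint.h>

/* hypothetical per-task state for (8) */
struct task {
        uint64_t time_norm;  /* time_{t} * round(2^16 / weight_{t}) */
        uint32_t inv_weight; /* round(2^16 / weight_{t}), constant unless reniced */
};

/* account delta_exec in normalized units per (8) */
static void account(struct task *p, uint64_t delta_exec)
{
        p->time_norm += delta_exec * p->inv_weight;
}

int main(void)
{
        struct task a = { 0, (1u << 16) / 1 }; /* weight 1 */
        struct task b = { 0, (1u << 16) / 4 }; /* weight 4 */

        account(&a, 100);
        account(&b, 400);
        /* only the signed difference matters; b got 4x the time of a,
         * matching its 4x weight, so the delta is zero */
        printf("delta = %lld\n", (long long)(a.time_norm - b.time_norm));
        return 0;
}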

The global normalized time is still needed and useful (e.g. for waking tasks) and thus this faces the same issue as CFS right now - managing the rounding error. This means one can't directly use the real weight_{t} value anymore without producing new errors, so either one uses this approximate weight:

(10) weight_app_{t} = 2^16 / round(2^16 / weight_{t})

or even better would be to get rid of it completely.
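For a concrete (made-up) value: with weight_{t} = 3 the stored constant is round(2^16 / 3) = 21845, and (10) gives weight_app_{t} = 2^16 / 21845, which is about 3.00005, so the deviation from the real weight is fixed once at setup time instead of accumulating per tick.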

Based on (5) and (6) one can calculate the global normalized time as:

(11) time_norm = sum_{t}^{T}(time_{t}) / sum_{t}^{T}(weight_app_{t}) = sum_{t}^{T}(time_norm_{t} * weight_app_{t}) / sum_{t}^{T}(weight_app_{t})

This is now a weighted average and provides the possibility to get rid of weight_app_{t} by simply replacing it:

(12) time_norm_app = sum_{t}^{T}(time_norm_{t} * weight_{t}) / sum_{t}^{T}(weight_{t})

This produces only an approximate normalized time, but if all time_norm_{t} are equal (i.e. all tasks got their share), the result is the same, thus the error is only temporary. If one approximates this value anyway, other replacements are possible too. In the previous example program I simply used 1:

(13) time_norm_app = sum_{t}^{T}(time_norm_{t}) / T

Another approximation is to use a shift:

(14) time_norm_app = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) / sum_{t}^{T}(2^weight_shift_{t})

This helps to avoid a possible full 64 bit multiply, makes other operations elsewhere simpler too, and the result should be close enough. So by maintaining these two sums one can calculate an approximate normalized time value:

time_norm_base is maintained incrementally by defining this increment:

(21) time_norm_inc = time_sum_max / weight_sum_app

Every time time_sum_off exceeds time_sum_max, time_sum_off and time_norm_base are adjusted appropriately. time_sum_max is scaled so that the required update frequency is reduced to a minimum, but also so that time_sum_off can be easily scaled down to a 32 bit value when needed.
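To tie (14) and (21) together, here is a compact standalone sketch (all names and numbers are my own, not the patch's; it assumes, as above, that per-task weights are rounded to powers of two). The runqueue accumulates normalized runtime weighted by shifts, folds whole time_norm_inc steps of time_sum_off into time_norm_base, and, if I read (21) correctly, the approximate average then works out to the base plus the small remaining offset divided by the weight sum:

#include <stdio.h>
#include <stdint.h>

/* hypothetical runqueue state behind (14) and (21) */
struct rq {
        uint64_t time_sum_off;   /* offset part of sum(time_norm << shift) */
        uint64_t time_sum_max;   /* bound: time_norm_inc * weight_sum_app */
        uint64_t time_norm_inc;  /* (21) */
        uint64_t time_norm_base; /* coarse part of the average */
        uint32_t weight_sum_app; /* sum of 2^weight_shift_{t} */
};

/* fold whole increments of the offset into the base; keeping the
 * offset below time_sum_max is what allows scaling it down to
 * 32 bits later */
static void normalize(struct rq *rq)
{
        while (rq->time_sum_off >= rq->time_sum_max) {
                rq->time_sum_off -= rq->time_sum_max;
                rq->time_norm_base += rq->time_norm_inc;
        }
}

/* account normalized runtime of one task with weight 2^shift */
static void account(struct rq *rq, uint64_t delta_norm, int shift)
{
        rq->time_sum_off += delta_norm << shift;
        normalize(rq);
}

int main(void)
{
        struct rq rq = {
                .time_sum_off = 0,
                .time_sum_max = 800, /* time_norm_inc * weight_sum_app */
                .time_norm_inc = 100,
                .time_norm_base = 0,
                .weight_sum_app = 8, /* e.g. tasks with shifts 1, 1, 2 */
        };

        account(&rq, 500, 2); /* 500 normalized units at weight 2^2 */
        /* prints base=200 off=400 time_norm_app=250, matching the
         * exact weighted average 2000 / 8 */
        printf("base=%llu off=%llu time_norm_app=%llu\n",
               (unsigned long long)rq.time_norm_base,
               (unsigned long long)rq.time_sum_off,
               (unsigned long long)(rq.time_norm_base +
                                    rq.time_sum_off / rq.weight_sum_app));
        return 0;
}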

This basically describes the static system, but to allow for sleeping and waking these sums need adjustments to preserve a proper average:

(22) weight_sum_app += 2^weight_shift_{new}

(23) time_sum_max += time_norm_inc * 2^weight_shift_{new}

(24) time_sum_off += (time_norm_{new} - time_norm_base) * 2^weight_shift_{new}

The last one is a little less obvious; it can be derived from (15) and (19):

The average from (20) can now be used to calculate the normalized time for the new task in (24). It can be given a bonus relative to the other tasks, or it might still be within a certain limit because it hasn't slept long enough. The limit (9) still applies here, so a simple generation counter may still be needed for long sleeps.

The time_sum_off value used to calculate the average can be scaled down as mentioned above. As it contains far more resolution than needed for short-term scheduling decisions, the lower bits can be thrown away to get a 32 bit value. The scaling of time_sum_max makes sure that one knows the location of the most significant bit, so that the 32 bits are used as much as possible.
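Putting (22)-(24) together, the sum updates for a newly woken task might look like this (a sketch under assumed names; the wakeup bonus and sleep-limit policy are left out):

#include <stdio.h>
#include <stdint.h>

struct rq_avg {
        uint64_t time_norm_base;
        uint64_t time_sum_off;
        uint64_t time_sum_max;
        uint64_t time_norm_inc;
        uint32_t weight_sum_app;
};

/* a task with normalized time time_norm_new and weight shift
 * weight_shift_new joins the weighted average */
static void enqueue_sums(struct rq_avg *rq, uint64_t time_norm_new,
                         int weight_shift_new)
{
        rq->weight_sum_app += 1u << weight_shift_new;              /* (22) */
        rq->time_sum_max += rq->time_norm_inc << weight_shift_new; /* (23) */
        rq->time_sum_off +=                                        /* (24) */
                (time_norm_new - rq->time_norm_base) << weight_shift_new;
}

int main(void)
{
        struct rq_avg rq = {
                .time_norm_base = 1000, .time_sum_off = 0,
                .time_sum_max = 400, .time_norm_inc = 100,
                .weight_sum_app = 4,
        };

        enqueue_sums(&rq, 1050, 1); /* task slightly ahead of the base */
        /* prints off=100 max=600 weight_sum_app=6 */
        printf("off=%llu max=%llu weight_sum_app=%u\n",
               (unsigned long long)rq.time_sum_off,
               (unsigned long long)rq.time_sum_max,
               (unsigned)rq.weight_sum_app);
        return 0;
}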

Finally, a few more notes about the patch. The current task is not kept in the tree (just this saves a lot of tree updates), so I faced a similar problem as FAIR_GROUP_SCHED in that enqueue_task/dequeue_task can be called for the current task, so maintaining the current task pointer for the class is interesting. Instead of adding a set_curr_task it would be IMO simpler to further reduce the number of indirect calls, e.g. the deactivate_task/put_prev_task sequence can be replaced with a single call (and I don't need the sleep/wake arguments anymore, so it can be reused for that).

I disabled the usage of cpu load as its calculation is also rather 64 bit heavy; I suspect it could be easily scaled down, but this way it's not my immediate concern.

Ingo, from this point on I need your help: you have to explain to me what is missing now relative to CFS. I tried to ask questions, but that wasn't very successful...