This is an RFC about the CPU hard limits feature where I have explained
the need for the feature, the proposed plan and the issues around it.
Before I come up with an implementation for hard limits, I would like to
know community's thoughts on this scheduler enhancement and any feedback
and suggestions.

1. CPU hard limit
-----------------
CFS is a proportional share scheduler which tries to divide the CPU time
proportionately between tasks or groups of tasks (task group/cgroup) depending
on the priority/weight of the task or shares assigned to groups of tasks.
In CFS, a task/task group can get more than its share of CPU if there are
enough idle CPU cycles available in the system, due to the work conserving
nature of the scheduler.

However there are scenarios (Sec 2) where giving more than the desired
CPU share to a task/task group is not acceptable. In those scenarios, the
scheduler needs to put a hard stop on the CPU resource consumption of
task/task group if it exceeds a preset limit. This is usually achieved by
throttling the task/task group when it fully consumes its allocated CPU time.

2. Need for hard limiting CPU resource
--------------------------------------
- Pay-per-use: In enterprise systems that cater to multiple clients/customers
  where a customer demands a certain share of CPU resources and pays only
  for that, CPU hard limits will be useful to hard limit the customer's job
  to consume only the specified amount of CPU resource.
- In container based virtualization environments running multiple containers,
  hard limits will be useful to ensure a container doesn't exceed its
  CPU entitlement.
- Hard limits can be used to provide guarantees.

3. Granularity of enforcing CPU hard limits
-------------------------------------------
Conceptually, hard limits can either be enforced for individual tasks or
groups of tasks. However enforcing limits per task would be too fine
grained and would be a lot of work on the part of the system administrator
in terms of setting limits for every task. Based on the current understanding
of the users of this feature, it is felt that hard limiting is more useful
at task group level than at the individual task level. Hence in the subsequent
paragraphs, the concept of hard limit as applicable to task group/cgroup
is discussed.

4. Existing solutions
---------------------
- Both Linux-VServer and OpenVZ virtualization solutions support CPU hard
  limiting.
- Per task limit can be enforced using rlimits, but it is not rate based.

5. Specifying hard limits
-------------------------
CPU time consumed by a task group is generally measured over a
time period (called bandwidth period) and the task group gets throttled
when its CPU time reaches a limit (hard limit) within a bandwidth period.
The task group remains throttled until the bandwidth period gets
renewed at which time additional CPU time becomes available
to the tasks in the system.

When a task group's hard limit is specified as a ratio X/Y, it means that
the group will get throttled if its CPU time consumption exceeds X seconds
in a bandwidth period of Y seconds.

Specifying the hard limit as X/Y requires us to specify the bandwidth
period also.

Is having a uniform/same bandwidth period for all the groups an option ?
If so, we could even specify the hard limit as a percentage, like
30% of a uniform bandwidth period.
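
To make the X/Y notation concrete, here is a minimal userspace sketch of
the accounting, not a proposed kernel implementation. The structure and
function names (hard_limit, limit_from_percent, charge_and_check,
start_new_period) are hypothetical and exist only for illustration. It
assumes runtime is tracked in nanoseconds and that the percentage form
is taken relative to a single uniform period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hard limit X/Y for one task group (illustrative names). */
struct hard_limit {
	uint64_t limit_ns;	/* X: CPU time allowed in one period */
	uint64_t period_ns;	/* Y: length of the bandwidth period */
	uint64_t used_ns;	/* CPU time consumed in the current period */
};

/* Percentage form of the limit, assuming one uniform bandwidth period. */
static uint64_t limit_from_percent(unsigned int percent, uint64_t period_ns)
{
	assert(percent <= 100);
	return period_ns * percent / 100;
}

/*
 * Charge delta_ns of CPU time to the group. Returns true when the group
 * has used up its limit and should stay throttled until the period is
 * renewed.
 */
static bool charge_and_check(struct hard_limit *hl, uint64_t delta_ns)
{
	hl->used_ns += delta_ns;
	return hl->used_ns >= hl->limit_ns;
}

/* Renew the bandwidth period: the group gets its full limit back. */
static void start_new_period(struct hard_limit *hl)
{
	hl->used_ns = 0;
}
```

With a uniform period of 1 second, limit_from_percent(30, 1000000000)
gives a limit of 300 ms, so a group is reported as throttled once the
charged runtime adds up to 300 ms within that period.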

6. Per task group vs global bandwidth period
--------------------------------------------
The bandwidth period can either be per task group or global. With global
bandwidth period, the runtimes of all the task groups need to be
replenished when the period ends. Though this appears conceptually simple,
the implementation might not scale. Instead if every task group maintains its
bandwidth period separately, the refresh cycles of each group will happen
independent of each other. Moreover different groups might prefer different
bandwidth periods. Hence the first implementation will have per task group
bandwidth period.

Timers can be used to trigger bandwidth refresh cycles. (similar to rt group
sched)
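
The independence of per group refresh cycles can be modelled roughly as
below. This is only a userspace sketch: it polls a caller supplied
timestamp instead of using kernel timers, and group_bw and
run_refresh_timers are hypothetical names. Each group carries its own
period and its own next refresh time, so one group being replenished
does not touch any other group.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per group bandwidth state (illustrative only). */
struct group_bw {
	uint64_t period_ns;		/* this group's own bandwidth period */
	uint64_t limit_ns;		/* runtime allowed per period */
	uint64_t used_ns;		/* runtime consumed in this period */
	uint64_t next_refresh_ns;	/* when this group's timer fires next */
};

/*
 * Replenish every group whose own period has expired at now_ns.
 * Returns the number of groups that were refreshed.
 */
static size_t run_refresh_timers(struct group_bw *groups, size_t n,
				 uint64_t now_ns)
{
	size_t refreshed = 0;

	for (size_t i = 0; i < n; i++) {
		struct group_bw *g = &groups[i];
		uint64_t missed;

		assert(g->period_ns > 0);
		if (now_ns < g->next_refresh_ns)
			continue;

		/* Skip over any periods that elapsed without a refresh. */
		missed = (now_ns - g->next_refresh_ns) / g->period_ns;
		g->next_refresh_ns += (missed + 1) * g->period_ns;
		g->used_ns = 0;
		refreshed++;
	}
	return refreshed;
}
```

A global period would instead reset used_ns for all groups at the same
instant, which is the part that might not scale with many groups.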

7. Configuring
--------------
- User could set the hard limit (X and/or Y) through the cgroup fs.
- When the scheduler supports hard limiting, should it be enabled for all
  task groups of the system ? Or should the user have an option to enable
  hard limiting per group ?
- When hard limiting is enabled for a group, should the limit be set to a
  default to start with ? Or should the user set the limit and the bandwidth
  period before enabling hard limiting ?
- What should be a sane default value for the bandwidth period ?

8. Throttling of tasks
----------------------
Task group can be taken off the runqueue when it hits the limit and enqueued
back when the bandwidth period is refreshed. This method would require us to
maintain the throttled tasks list separately for every group.

Under heavy throttling, there could be tasks being dequeued and enqueued
back at bandwidth refresh times leading to frequent variations in the
runqueue load. This might unduly stress the load balancer.

Note: A group (entity) can't be dequeued unless all tasks under it are
dequeued. So there can be false/failed attempts to run tasks of a throttled
group until all the tasks from the throttled group are dequeued.
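
The effect of throttling on runqueue load can be illustrated with a toy
model. This is a sketch under simplifying assumptions: a group is
removed from and returned to the runqueue in one step, and rq_model,
tg_model, throttle_group and unthrottle_group are hypothetical names,
not existing scheduler interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy runqueue: only the aggregate load and task count are modelled. */
struct rq_model {
	unsigned long load;
	unsigned int nr_running;
};

/* Toy task group: its contribution to the runqueue when not throttled. */
struct tg_model {
	unsigned long load;
	unsigned int nr_tasks;
	bool throttled;
};

/* Take the whole group off the runqueue when it hits its hard limit. */
static void throttle_group(struct rq_model *rq, struct tg_model *tg)
{
	if (tg->throttled)
		return;
	assert(rq->load >= tg->load && rq->nr_running >= tg->nr_tasks);
	rq->load -= tg->load;
	rq->nr_running -= tg->nr_tasks;
	tg->throttled = true;
}

/* Put the group back when its bandwidth period is refreshed. */
static void unthrottle_group(struct rq_model *rq, struct tg_model *tg)
{
	if (!tg->throttled)
		return;
	rq->load += tg->load;
	rq->nr_running += tg->nr_tasks;
	tg->throttled = false;
}
```

Every throttle/unthrottle pair changes rq->load by the full group load,
which is the kind of swing at refresh time that the load balancer would
have to cope with.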

9. Group scheduler hierarchy considerations
-------------------------------------------
Since the group scheduler is hierarchical in nature, should there be any
relation between the hard limit values of the parent task group
and the values of its child groups ? Should the hard limit values set for
child groups be compatible with the parent's hard limit ? For example,
consider a group A with hard limit X/Y that has two children A1 and A2.
Should the limits for A1 (X1/Y) and A2 (X2/Y) be set so that
X1/Y + X2/Y <= X/Y ?

Or should child groups set their limits independently of the parent ? In
this case, even if the child still has CPU time left before it hits the
limit, it could get throttled because its parent got throttled. I would
think that this method will lead to easier implementation.
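
If the first option (child limits constrained by the parent) were
chosen, the check at configuration time could look roughly like this.
It is a sketch that assumes, as in the example above, that parent and
children all use the same bandwidth period Y; struct limit and
children_fit_parent are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hard limit X/Y of one group. */
struct limit {
	uint64_t x_ns;	/* X: runtime allowed per period */
	uint64_t y_ns;	/* Y: bandwidth period */
};

/*
 * Returns true if X1/Y + X2/Y + ... <= X/Y for the given children.
 * With a common period Y this reduces to comparing the sums of X.
 */
static bool children_fit_parent(const struct limit *parent,
				const struct limit *children, size_t n)
{
	uint64_t sum_ns = 0;

	for (size_t i = 0; i < n; i++) {
		/* Assumption of this sketch: everyone shares period Y. */
		assert(children[i].y_ns == parent->y_ns);
		sum_ns += children[i].x_ns;
	}
	return sum_ns <= parent->x_ns;
}
```

With the second option no such check is needed at all, which is part of
why it looks easier to implement.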

AFAICS, rt group scheduler needs EDF to support different bandwidth periods
for different groups (Ref: Documentation/scheduler/sched-rt-group.txt). I
don't think the same requirement is applicable to non-rt groups. This is
because with hard limits we are not guaranteeing the CPU time for a group,
instead we are just specifying the max time which a group can run within a
bandwidth period.

10. SMP considerations
----------------------
Hard limits could be enforced for the system as a whole or for individual
CPUs.

When it is enforced per CPU, a task group on a CPU will be throttled if
it reaches its hard limit on that CPU. This can lead to unfairness if
the same task group still has unused runtime left on other CPUs.

If enforced system wide, then a task group will be throttled when the sum of
the run times of its tasks running on different CPUs reaches the limit.

Could we use a hybrid method where a task group that reaches its limit on a CPU
could draw the group bandwidth from another CPU where there are no runnable
tasks belonging to that group ?

RT group scheduling borrows runtime from other CPUs when runtimes are balanced.
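
One possible shape of the hybrid method is sketched below. It is only a
userspace model and does not reflect how the rt group scheduler
implements its borrowing; cpu_slice and borrow_runtime are hypothetical
names. It assumes the group's limit has been split into per CPU slices
and that only CPUs with no runnable tasks of the group act as lenders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per CPU slice of one task group's bandwidth. */
struct cpu_slice {
	uint64_t runtime_left_ns;	/* unused runtime on this CPU */
	unsigned int nr_runnable;	/* group's runnable tasks on this CPU */
};

/*
 * Move up to want_ns of runtime to CPU dst from CPUs where the group has
 * no runnable tasks. Returns the amount actually borrowed. The total
 * runtime of the group across all CPUs is unchanged.
 */
static uint64_t borrow_runtime(struct cpu_slice *cpus, size_t ncpus,
			       size_t dst, uint64_t want_ns)
{
	uint64_t got_ns = 0;

	assert(dst < ncpus);
	for (size_t i = 0; i < ncpus && got_ns < want_ns; i++) {
		uint64_t take_ns;

		/* Never borrow from ourselves or from a CPU that needs it. */
		if (i == dst || cpus[i].nr_runnable > 0)
			continue;

		take_ns = cpus[i].runtime_left_ns;
		if (take_ns > want_ns - got_ns)
			take_ns = want_ns - got_ns;
		cpus[i].runtime_left_ns -= take_ns;
		got_ns += take_ns;
	}
	cpus[dst].runtime_left_ns += got_ns;
	return got_ns;
}
```

Since runtime is only moved and never created, the system wide limit of
the group still holds while the per CPU unfairness is reduced.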

11. Starvation
--------------
When a task group that holds a shared resource (like a lock) is throttled,
another group which needs the same shared resource will not be able to
make progress even when the CPU has idle cycles to spare. This will lead
to starvation and unfairness. This situation could be avoided by some of
the methods like
- Disabling throttling when a group is holding a lock.
- Inheriting runtime from the group which faces starvation.

The first implementation will not address this problem of starvation.
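
For reference, the first method (no throttling while a lock is held)
could be modelled as below. This is purely illustrative: tracking the
number of locks held per task group is an assumption of this sketch,
and tg_state and try_throttle are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per group state for deferring a throttle. */
struct tg_state {
	uint64_t used_ns;		/* runtime consumed in this period */
	uint64_t limit_ns;		/* hard limit for this period */
	unsigned int locks_held;	/* assumed per group lock count */
	bool throttled;
};

/*
 * Throttle the group only if it has reached its limit and holds no
 * locks. Returns true if the group was throttled by this call.
 */
static bool try_throttle(struct tg_state *tg)
{
	assert(tg->limit_ns > 0);
	if (tg->used_ns < tg->limit_ns)
		return false;	/* limit not reached yet */
	if (tg->locks_held > 0)
		return false;	/* defer until the locks are released */
	tg->throttled = true;
	return true;
}
```

The cost is that a group can overrun its limit for as long as it holds
a lock, so the overrun would have to be bounded or charged to the next
period.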

12. Hard limits and fairness
----------------------------
Hard limits are set independent of group shares. The hard limit set
by the user may be such that it is not possible for the scheduler to
meet fairness and also enforce hard limits. In that case, hard limiting
takes precedence.