This patch removes the process grouping code from the cpusets code, instead hooking it into the generic container system. This temporarily adds cpuset-specific code in kernel/container.c, which is removed by the next patch in the series.

-/**
- * cpuset_fork - attach newly forked task to its parents cpuset.
- * @tsk: pointer to task_struct of forking parent process.
- *
- * Description: A task inherits its parent's cpuset at fork().
- *
- * A pointer to the shared cpuset was automatically copied in fork.c
- * by dup_task_struct().  However, we ignore that copy, since it was
- * not made under the protection of task_lock(), so might no longer be
- * a valid cpuset pointer.  attach_task() might have already changed
- * current->cpuset, allowing the previously referenced cpuset to
- * be removed and freed.  Instead, we task_lock(current) and copy
- * its present value of current->cpuset for our freshly forked child.
- *
- * At the point that cpuset_fork() is called, 'current' is the parent
- * task, and the passed argument 'child' points to the child task.
- **/
-
-void cpuset_fork(struct task_struct *child)
-{
-	task_lock(current);
-	child->cpuset = current->cpuset;
-	atomic_inc(&child->cpuset->count);
-	task_unlock(current);
-}
-
-/**
- * cpuset_exit - detach cpuset from exiting task
- * @tsk: pointer to task_struct of exiting process
- *
- * Description: Detach cpuset from @tsk and release it.
- *
- * Note that cpusets marked notify_on_release force every task in
- * them to take the global manage_mutex mutex when exiting.
- * This could impact scaling on very large systems.  Be reluctant to
- * use notify_on_release cpusets where very high task exit scaling
- * is required on large systems.
- *
- * Don't even think about derefencing 'cs' after the cpuset use count
- * goes to zero, except inside a critical section guarded by manage_mutex
- * or callback_mutex.  Otherwise a zero cpuset use count is a license to
- * any other task to nuke the cpuset immediately, via cpuset_rmdir().
- *
- * This routine has to take manage_mutex, not callback_mutex, because
- * it is holding that mutex while calling check_for_release(),
- * which calls kmalloc(), so can't be called holding callback_mutex().
- *
- * We don't need to task_lock() this reference to tsk->cpuset,
- * because tsk is already marked PF_EXITING, so attach_task() won't
- * mess with it, or task is a failed fork, never visible to attach_task.
- *
- * the_top_cpuset_hack:
- *
- * Set the exiting tasks cpuset to the root cpuset (top_cpuset).
- *
- * Don't leave a task unable to allocate memory, as that is an
- * accident waiting to happen should someone add a callout in
- * do_exit() after the cpuset_exit() call that might allocate.
- * If a task tries to allocate memory with an invalid cpuset,
- * it will oops in cpuset_update_task_memory_state().
- *
- * We call cpuset_exit() while the task is still competent to
- * handle notify_on_release(), then leave the task attached to
- * the root cpuset (top_cpuset) for the remainder of its exit.
- *
- * To do this properly, we would increment the reference count on
- * top_cpuset, and near the very end of the kernel/exit.c do_exit()
- * code we would add a second cpuset function call, to drop that
- * reference.  This would just create an unnecessary hot spot on
- * the top_cpuset reference count, to no avail.
- *
- * Normally, holding a reference to a cpuset without bumping its
- * count is unsafe.  The cpuset could go away, or someone could
- * attach us to a different cpuset, decrementing the count on
- * the first cpuset that we never incremented.  But in this case,
- * top_cpuset isn't going away, and either task has PF_EXITING set,
- * which wards off any attach_task() attempts, or task is a failed
- * fork, never visible to attach_task.
- *
- * Another way to do this would be to set the cpuset pointer
- * to NULL here, and check in cpuset_update_task_memory_state()
- * for a NULL pointer.  This hack avoids that NULL check, for no
- * cost (other than this way too long comment ;).
- **/
-
-void cpuset_exit(struct task_struct *tsk)
-{
-	struct cpuset *cs;
-
-	cs = tsk->cpuset;
-	tsk->cpuset = &top_cpuset;	/* the_top_cpuset_hack - see above */
-
-	if (notify_on_release(cs)) {
-		char *pathbuf = NULL;
-
-		mutex_lock(&manage_mutex);
-		if (atomic_dec_and_test(&cs->count))
-			check_for_release(cs, &pathbuf);
-		mutex_unlock(&manage_mutex);
-		cpuset_release_agent(pathbuf);
-	} else {
-		atomic_dec(&cs->count);
-	}
-}
-
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
  *
@@ -2276,11 +1335,11 @@ cpumask_t cpuset_cpus_allowed(struct tas
 {
 	cpumask_t mask;

 config CONTAINERS
-	bool "Container support"
-	help
-	  This option will let you create and manage process containers,
-	  which can be used to aggregate multiple processes, e.g. for
-	  the purposes of resource tracking.
-
-	  Say N if unsure
+	bool

 config CPUSETS
 	bool "Cpuset support"
 	depends on SMP
+	select CONTAINERS
 	help
 	  This option will let you create and manage CPUSETs which
 	  allow dynamically partitioning a system into sets of CPUs and
@@ -257,6 +252,16 @@ config CPUSETS

 CONTENTS:
 =========
@@ -16,10 +17,9 @@ CONTENTS:
   1.2 Why are cpusets needed ?
   1.3 How are cpusets implemented ?
   1.4 What are exclusive cpusets ?
-  1.5 What does notify_on_release do ?
-  1.6 What is memory_pressure ?
-  1.7 What is memory spread ?
-  1.8 How do I use cpusets ?
+  1.5 What is memory_pressure ?
+  1.6 What is memory spread ?
+  1.7 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
@@ -43,18 +43,19 @@ hierarchy visible in a virtual file syst
 hooks, beyond what is already present, required to manage dynamic
 job placement on large systems.

-Each task has a pointer to a cpuset.  Multiple tasks may reference
-the same cpuset.  Requests by a task, using the sched_setaffinity(2)
-system call to include CPUs in its CPU affinity mask, and using the
-mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
-in its memory policy, are both filtered through that tasks cpuset,
-filtering out any CPUs or Memory Nodes not in that cpuset.  The
-scheduler will not schedule a task on a CPU that is not allowed in
-its cpus_allowed vector, and the kernel page allocator will not
-allocate a page on a node that is not allowed in the requesting tasks
-mems_allowed vector.
+Cpusets use the generic container subsystem described in
+Documentation/container.txt.

-User level code may create and destroy cpusets by name in the cpuset
+Requests by a task, using the sched_setaffinity(2) system call to
+include CPUs in its CPU affinity mask, and using the mbind(2) and
+set_mempolicy(2) system calls to include Memory Nodes in its memory
+policy, are both filtered through that tasks cpuset, filtering out any
+CPUs or Memory Nodes not in that cpuset.  The scheduler will not
+schedule a task on a CPU that is not allowed in its cpus_allowed
+vector, and the kernel page allocator will not allocate a page on a
+node that is not allowed in the requesting tasks mems_allowed vector.
+
+User level code may create and destroy cpusets by name in the container
 virtual file system, manage the attributes and permissions of these
 cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
 specify and query to which cpuset a task is assigned, and list the
@@ -117,7 +118,7 @@ Cpusets extends these two mechanisms as
  - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
    kernel.
  - Each task in the system is attached to a cpuset, via a pointer
-   in the task structure to a reference counted cpuset structure.
+   in the task structure to a reference counted container structure.
  - Calls to sched_setaffinity are filtered to just those CPUs
    allowed in that tasks cpuset.
  - Calls to mbind and set_mempolicy are filtered to just
@@ -152,15 +153,10 @@ into the rest of the kernel, none in per
  - in page_alloc.c, to restrict memory to allowed nodes.
  - in vmscan.c, to restrict page recovery to the current cpuset.

-In addition a new file system, of type "cpuset" may be mounted,
-typically at /dev/cpuset, to enable browsing and modifying the cpusets
-presently known to the kernel.  No new system calls are added for
-cpusets - all support for querying and modifying cpusets is via
-this cpuset file system.
-
-Each task under /proc has an added file named 'cpuset', displaying
-the cpuset name, as the path relative to the root of the cpuset file
-system.
+You should mount the "container" filesystem type in order to enable
+browsing and modifying the cpusets presently known to the kernel.  No
+new system calls are added for cpusets - all support for querying and
+modifying cpusets is via this cpuset file system.

 The /proc/<pid>/status file for each task has two added lines,
 displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
@@ -170,16 +166,15 @@ in the format seen in the following exam
   Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
   Mems_allowed:   ffffffff,ffffffff

-Each cpuset is represented by a directory in the cpuset file system
-containing the following files describing that cpuset:
+Each cpuset is represented by a directory in the container file system
+containing (on top of the standard container files) the following
+files describing that cpuset:

 In addition, the root cpuset only has the following file:
@@ -253,21 +248,7 @@ such as requests from interrupt handlers
 outside even a mem_exclusive cpuset.

-1.5 What does notify_on_release do ?
-------------------------------------
-
-If the notify_on_release flag is enabled (1) in a cpuset, then whenever
-the last task in the cpuset leaves (exits or attaches to some other
-cpuset) and the last child cpuset of that cpuset is removed, then
-the kernel runs the command /sbin/cpuset_release_agent, supplying the
-pathname (relative to the mount point of the cpuset file system) of the
-abandoned cpuset.  This enables automatic removal of abandoned cpusets.
-The default value of notify_on_release in the root cpuset at system
-boot is disabled (0).  The default value of other cpusets at creation
-is the current value of their parents notify_on_release setting.
-
-
-1.6 What is memory_pressure ?
+1.5 What is memory_pressure ?
 -----------------------------
 The memory_pressure of a cpuset provides a simple per-cpuset metric
 of the rate that the tasks in a cpuset are attempting to free up in
@@ -324,7 +305,7 @@ the tasks in the cpuset, in units of rec
 times 1000.

-1.7 What is memory spread ?
+1.6 What is memory spread ?
 ---------------------------
 There are two boolean flag files per cpuset that control where the
 kernel allocates pages for the file system buffers and related in
@@ -395,7 +376,7 @@ data set, the memory allocation across t
 can become very uneven.

-1.8 How do I use cpusets ?
+1.7 How do I use cpusets ?
 --------------------------

 In order to minimize the impact of cpusets on critical kernel
@@ -485,7 +466,7 @@ than stress the kernel.
 To start a new job that is to be contained within a cpuset, the steps
 are:

  1) mkdir /dev/cpuset
- 2) mount -t cpuset none /dev/cpuset
+ 2) mount -t container none /dev/cpuset
  3) Create the new cpuset by doing mkdir's and write's (or echo's) in
     the /dev/cpuset virtual file system.
  4) Start a task that will be the "founding father" of the new job.
@@ -497,7 +478,7 @@ For example, the following sequence of c
 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
 and then start a subshell 'sh' in that cpuset:

 In the future, a C library interface to cpusets will likely be
 available.  For now, the only way to query or modify cpusets is
@@ -529,7 +510,7 @@ Creating, modifying, using the cpusets c
 virtual filesystem.

 Then under /dev/cpuset you can find a tree that corresponds to the
 tree of the cpusets in the system. For instance, /dev/cpuset
Index: container-2.6.19-rc2/fs/super.c
===================================================================
--- container-2.6.19-rc2.orig/fs/super.c
+++ container-2.6.19-rc2/fs/super.c
@@ -39,11 +39,6 @@
 #include <linux/mutex.h>
 #include <asm/uaccess.h>