Index: container-2.6.21-rc7-mm1/Documentation/containers.txt===================================================================--- /dev/null+++ container-2.6.21-rc7-mm1/Documentation/containers.txt@@ -0,0 +1,524 @@+ CONTAINERS+ -------++Written by Paul Menage <menage@google.com> based on Documentation/cpusets.txt++Original copyright statements from cpusets.txt:+Portions Copyright (C) 2004 BULL SA.+Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.+Modified by Paul Jackson <pj@sgi.com>+Modified by Christoph Lameter <clameter@sgi.com>++CONTENTS:+=========++1. Containers+ 1.1 What are containers ?+ 1.2 Why are containers needed ?+ 1.3 How are containers implemented ?+ 1.4 What does notify_on_release do ?+ 1.5 How do I use containers ?+2. Usage Examples and Syntax+ 2.1 Basic Usage+ 2.2 Attaching processes+3. Kernel API+ 3.1 Overview+ 3.2 Synchronization+ 3.3 Subsystem API+4. Questions++1. Containers+==========++1.1 What are containers ?+----------------------++Containers provide a mechanism for aggregating/partitioning sets of+tasks, and all their future children, into hierarchical groups with+specialized behaviour.++Definitions:++A *container* associates a set of tasks with a set of parameters for one+or more subsystems.++A *subsystem* is a module that makes use of the task grouping+facilities provided by containers to treat groups of tasks in+particular ways. A subsystem is typically a "resource controller" that+schedules a resource or applies per-container limits, but it may be+anything that wants to act on a group of processes, e.g. a+virtualization subsystem.++A *hierarchy* is a set of containers arranged in a tree, such that+every task in the system is in exactly one of the containers in the+hierarchy, and a set of subsystems; each subsystem has system-specific+state attached to each container in the hierarchy. 
Each hierarchy has+an instance of the container virtual filesystem associated with it.++At any one time there may be multiple active hierarchies of task+containers. Each hierarchy is a partition of all tasks in the system.++User level code may create and destroy containers by name in an+instance of the container virtual file system, specify and query to+which container a task is assigned, and list the task pids assigned to+a container. Those creations and assignments only affect the hierarchy+associated with that instance of the container file system.++On their own, the only use for containers is for simple job+tracking. The intention is that other subsystems hook into the generic+container support to provide new attributes for containers, such as+accounting/limiting the resources which processes in a container can+access. For example, cpusets (see Documentation/cpusets.txt) allows+you to associate a set of CPUs and a set of memory nodes with the+tasks in each container.++1.2 Why are containers needed ?+----------------------------++There are multiple efforts to provide process aggregations in the+Linux kernel, mainly for resource tracking purposes. Such efforts+include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server+namespaces. These all require the basic notion of a+grouping/partitioning of processes, with newly forked processes ending+up in the same group (container) as their parent process.++The kernel container patch provides the minimum essential kernel+mechanisms required to efficiently implement such groups. 
It has+minimal impact on the system fast paths, and provides hooks for+specific subsystems such as cpusets to provide additional behaviour as+desired.++Multiple hierarchy support is provided to allow for situations where+the division of tasks into containers is distinctly different for+different subsystems - having parallel hierarchies allows each+hierarchy to be a natural division of tasks, without having to handle+complex combinations of tasks that would be present if several+unrelated subsystems needed to be forced into the same tree of+containers.++At one extreme, each resource controller or subsystem could be in a+separate hierarchy; at the other extreme, all subsystems+would be attached to the same hierarchy.++As an example of a scenario (originally proposed by vatsa@in.ibm.com)+that can benefit from multiple hierarchies, consider a large+university server with various users - students, professors, system+tasks etc. The resource planning for this server could be along the+following lines:++ CPU : Top cpuset+ / \+ CPUSet1 CPUSet2+ | |+ (Profs) (Students)++ In addition (system tasks) are attached to topcpuset (so+ that they can run anywhere) with a limit of 20%++ Memory : Professors (50%), students (30%), system (20%)++ Disk : Prof (50%), students (30%), system (20%)++ Network : WWW browsing (20%), Network File System (60%), others (20%)+ / \+ Prof (15%) students (5%)++Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go+into NFS network class.++At the same time firefox/lynx will share an appropriate CPU/Memory class+depending on who launched it (prof/student).++With the ability to classify tasks differently for different resources+(by putting those resource subsystems in different hierarchies) then+the admin can easily set up a script which receives exec notifications+and depending on who is launching the browser he can++ # echo browser_pid > /mnt/<restype>/<userclass>/tasks++With only a single hierarchy, he now would potentially have to 
create+a separate container for every browser launched and associate it with+the appropriate network and other resource classes. This may lead to+proliferation of such containers.++Also, let's say that the administrator would like to give enhanced network+access temporarily to a student's browser (since it is night and the user+wants to do online gaming :) OR give one of the student's simulation+apps enhanced CPU power.++With the ability to write pids directly to resource classes, it's just a+matter of:++ # echo pid > /mnt/network/<new_class>/tasks+ (after some time)+ # echo pid > /mnt/network/<orig_class>/tasks++Without this ability, he would have to split the container into+multiple separate ones and then associate the new containers with the+new resource classes.++++1.3 How are containers implemented ?+---------------------------------++Containers extend the kernel as follows:++ - Each task in the system has a reference-counted pointer to a+ css_group.++ - A css_group contains a set of reference-counted pointers to+ container_subsys_state objects, one for each container subsystem+ registered in the system. There is no direct link from a task to+ the container of which it's a member in each hierarchy, but this+ can be determined by following pointers through the+ container_subsys_state objects. 
This is because accessing the+ subsystem state is something that's expected to happen frequently+ and in performance-critical code, whereas operations that require a+ task's actual container assignments (in particular, moving between+ containers) are less common.++ - A container hierarchy filesystem can be mounted for browsing and+ manipulation from user space.++ - You can list all the tasks (by pid) attached to any container.++The implementation of containers requires a few, simple hooks+into the rest of the kernel, none in performance critical paths:++ - in init/main.c, to initialize the root containers and initial+ css_group at system boot.++ - in fork and exit, to attach and detach a task from its css_group.++In addition a new file system, of type "container" may be mounted, to+enable browsing and modifying the containers presently known to the+kernel. When mounting a container hierarchy, you may specify a+comma-separated list of subsystems to mount as the filesystem mount+options. By default, mounting the container filesystem attempts to+mount a hierarchy containing all registered subsystems.++If an active hierarchy with exactly the same set of subsystems already+exists, it will be reused for the new mount. If no existing hierarchy+matches, and any of the requested subsystems are in use in an existing+hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy+is activated, associated with the requested subsystems.++It's not currently possible to bind a new subsystem to an active+container hierarchy, or to unbind a subsystem from an active container+hierarchy. 
This may be possible in future, but is fraught with nasty+error-recovery issues.++When a container filesystem is unmounted, if there are any+subcontainers created below the top-level container, that hierarchy+will remain active even though unmounted; if there are no+subcontainers then the hierarchy will be deactivated.++No new system calls are added for containers - all support for+querying and modifying containers is via this container file system.++Each task under /proc has an added file named 'container' displaying,+for each active hierarchy, the subsystem names and the container name+as the path relative to the root of the container file system.++Each container is represented by a directory in the container file system+containing the following files describing that container:++ - tasks: list of tasks (by pid) attached to that container+ - notify_on_release flag: run /sbin/container_release_agent on exit?++Other subsystems such as cpusets may add additional files in each+container directory.++New containers are created using the mkdir system call or shell+command. The properties of a container, such as its flags, are+modified by writing to the appropriate file in that container's+directory, as listed above.++The named hierarchical structure of nested containers allows partitioning+a large system into nested, dynamically changeable, "soft-partitions".++The attachment of each task, automatically inherited at fork by any+children of that task, to a container allows organizing the work load+on a system into related sets of tasks. A task may be re-attached to+any other container, if allowed by the permissions on the necessary+container file system directories.++When a task is moved from one container to another, it gets a new+css_group pointer - if there's an already existing css_group with the+desired collection of containers then that group is reused, else a new+css_group is allocated. 
Note that the current implementation uses a+linear search to locate an appropriate existing css_group, so isn't+very efficient. A future version will use a hash table for better+performance.++The use of a Linux virtual file system (vfs) to represent the+container hierarchy provides for a familiar permission and name space+for containers, with a minimum of additional kernel code.++1.4 What does notify_on_release do ?+------------------------------------++*** notify_on_release is disabled in the current patch set. It may be+*** reactivated in a future patch in a less-intrusive manner.++If the notify_on_release flag is enabled (1) in a container, then+whenever the last task in the container leaves (exits or attaches to+some other container) and the last child container of that container+is removed, then the kernel runs the command specified by the contents+of the "release_agent" file in that hierarchy's root directory,+supplying the pathname (relative to the mount point of the container+file system) of the abandoned container. This enables automatic+removal of abandoned containers. The default value of+notify_on_release in the root container at system boot is disabled+(0). The default value of other containers at creation is the current+value of their parent's notify_on_release setting. 
The default value of+a container hierarchy's release_agent path is empty.++1.5 How do I use containers ?+--------------------------++To start a new job that is to be contained within a container, using+the "cpuset" container subsystem, the steps are something like:++ 1) mkdir /dev/container+ 2) mount -t container -ocpuset cpuset /dev/container+ 3) Create the new container by doing mkdir's and write's (or echo's) in+ the /dev/container virtual file system.+ 4) Start a task that will be the "founding father" of the new job.+ 5) Attach that task to the new container by writing its pid to the+ /dev/container tasks file for that container.+ 6) fork, exec or clone the job tasks from this founding father task.++For example, the following sequence of commands will set up a container+named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,+and then start a subshell 'sh' in that container:++ mount -t container -ocpuset cpuset /dev/container+ cd /dev/container+ mkdir Charlie+ cd Charlie+ /bin/echo $$ > tasks+ sh+ # The subshell 'sh' is now running in container Charlie+ # The next line should display '/Charlie'+ cat /proc/self/container++2. 
Usage Examples and Syntax+============================++2.1 Basic Usage+---------------++Creating, modifying, and using containers can be done through the container+virtual filesystem.++To mount a container hierarchy with all available subsystems, type:+# mount -t container xxx /dev/container++The "xxx" is not interpreted by the container code, but will appear in+/proc/mounts so may be any useful identifying string that you like.++To mount a container hierarchy with just the cpuset and numtasks+subsystems, type:+# mount -t container -o cpuset,numtasks hier1 /dev/container++To change the set of subsystems bound to a mounted hierarchy, just+remount with different options:++# mount -o remount,cpuset,ns /dev/container++Note that changing the set of subsystems is currently only supported+when the hierarchy consists of a single (root) container. Supporting+the ability to arbitrarily bind/unbind subsystems from an existing+container hierarchy is intended to be implemented in the future.++Then under /dev/container you can find a tree that corresponds to the+tree of the containers in the system. For instance, /dev/container+is the container that holds the whole system.++If you want to create a new container under /dev/container:+# cd /dev/container+# mkdir my_container++Now you want to do something with this container.+# cd my_container++In this directory you can find several files:+# ls+notify_on_release release_agent tasks+(plus whatever files are added by the attached subsystems)++Now attach your shell to this container:+# /bin/echo $$ > tasks++You can also create containers inside your container by using mkdir in this+directory.+# mkdir my_sub_cs++To remove a container, just use rmdir:+# rmdir my_sub_cs++This will fail if the container is in use (has containers inside, or+has processes attached, or is held alive by another subsystem-specific+reference).++2.2 Attaching processes+-----------------------++# /bin/echo PID > tasks++Note that it is PID, not PIDs. 
You can only attach ONE task at a time.+If you have several tasks to attach, you have to do it one after another:++# /bin/echo PID1 > tasks+# /bin/echo PID2 > tasks+ ...+# /bin/echo PIDn > tasks++3. Kernel API+=============++3.1 Overview+------------++Each kernel subsystem that wants to hook into the generic container+system needs to create a container_subsys object. This contains+various methods, which are callbacks from the container system, along+with a subsystem id which will be assigned by the container system.++Other fields in the container_subsys object include:++- subsys_id: a unique array index for the subsystem, indicating which+ entry in container->subsys[] this subsystem should be+ managing. Initialized by container_register_subsys(); prior to this+ it should be initialized to -1++- hierarchy: an index indicating which hierarchy, if any, this+ subsystem is currently attached to. If this is -1, then the+ subsystem is not attached to any hierarchy, and all tasks should be+ considered to be members of the subsystem's top_container. It should+ be initialized to -1.++- name: should be initialized to a unique subsystem name prior to+ calling container_register_subsystem. Should be no longer than+ MAX_CONTAINER_TYPE_NAMELEN++Each container object created by the system has an array of pointers,+indexed by subsystem id; this pointer is entirely managed by the+subsystem; the generic container code will never touch this pointer.++3.2 Synchronization+-------------------++There is a global mutex, container_mutex, used by the container+system. This should be taken by anything that wants to modify a+container. 
It may also be taken to prevent containers from being+modified, but more specific locks may be more appropriate in that+situation.++See kernel/container.c for more details.++Subsystems can take/release the container_mutex via the functions+container_lock()/container_unlock(), and can+take/release the callback_mutex via the functions+container_lock()/container_unlock().++Accessing a task's container pointer may be done in the following ways:+- while holding container_mutex+- while holding the task's alloc_lock (via task_lock())+- inside an rcu_read_lock() section via rcu_dereference()++3.3 Subsystem API+-----------------++Each subsystem should:++- add an entry in linux/container_subsys.h+- define a container_subsys object called <name>_subsys++Each subsystem may export the following methods. The only mandatory+methods are create/destroy. Any others that are null are presumed to+be successful no-ops.++int create(struct container_subsys *ss, struct container *cont)+LL=container_mutex++Called to create a subsystem state object for a container. The+subsystem should set its subsystem pointer for the passed container,+returning 0 on success or a negative error code. On success, the+subsystem pointer should point to a structure of type+container_subsys_state (typically embedded in a larger+subsystem-specific object), which will be initialized by the container+system. 
Note that this will be called at initialization to create the+root subsystem state for this subsystem; this case can be identified+by the passed container object having a NULL parent (since it's the+root of the hierarchy) and may be an appropriate place for+initialization code.++void destroy(struct container_subsys *ss, struct container *cont)+LL=container_mutex++The container system is about to destroy the passed container; the+subsystem should do any necessary cleanup.++int can_attach(struct container_subsys *ss, struct container *cont,+ struct task_struct *task)+LL=container_mutex++Called prior to moving a task into a container; if the subsystem+returns an error, this will abort the attach operation. If a NULL+task is passed, then a successful result indicates that *any*+unspecified task can be moved into the container. Note that this isn't+called on a fork. If this method returns 0 (success) then this should+remain valid while the caller holds container_mutex.++void attach(struct container_subsys *ss, struct container *cont,+ struct container *old_cont, struct task_struct *task)+LL=container_mutex+++Called after the task has been attached to the container, to allow any+post-attachment activity that requires memory allocations or blocking.++void fork(struct container_subsys *ss, struct task_struct *task)+LL=callback_mutex, maybe read_lock(tasklist_lock)++Called when a task is forked into a container. Also called during+registration for all existing tasks.++void exit(struct container_subsys *ss, struct task_struct *task)+LL=callback_mutex++Called during task exit.++int populate(struct container_subsys *ss, struct container *cont)+LL=none++Called after creation of a container to allow a subsystem to populate+the container directory with file entries. The subsystem should make+calls to container_add_file() with objects of type cftype (see+include/linux/container.h for details). 
Note that although this+method can return an error code, the error code is currently not+always handled well.++void bind(struct container_subsys *ss, struct container *root)+LL=callback_mutex++Called when a container subsystem is rebound to a different hierarchy+and root container. Currently this will only involve movement between+the default hierarchy (which never has sub-containers) and a hierarchy+that is being created/destroyed (and hence has no sub-containers).++4. Questions+============++Q: what's up with this '/bin/echo' ?+A: bash's builtin 'echo' command does not check calls to write() against+ errors. If you use it in the container file system, you won't be+ able to tell whether a command succeeded or failed.++Q: When I attach processes, only the first of the line gets really attached !+A: We can only return one error code per call to write(). So you should also+ put only ONE pid.+Index: container-2.6.21-rc7-mm1/include/linux/container.h===================================================================--- /dev/null+++ container-2.6.21-rc7-mm1/include/linux/container.h@@ -0,0 +1,198 @@+#ifndef _LINUX_CONTAINER_H+#define _LINUX_CONTAINER_H+/*+ * container interface+ *+ * Copyright (C) 2003 BULL SA+ * Copyright (C) 2004-2006 Silicon Graphics, Inc.+ *+ */++#include <linux/sched.h>+#include <linux/kref.h>+#include <linux/cpumask.h>+#include <linux/nodemask.h>++#ifdef CONFIG_CONTAINERS++extern int container_init_early(void);+extern int container_init(void);+extern void container_init_smp(void);++extern struct file_operations proc_container_operations;++extern void container_lock(void);+extern void container_unlock(void);++struct containerfs_root;++/* Per-subsystem/per-container state maintained by the system. */+struct container_subsys_state {+ /* The container that this subsystem is attached to. 
Useful+ * for subsystems that want to know about the container+ * hierarchy structure */+ struct container *container;++ /* State maintained by the container system to allow+ * subsystems to be "busy". Should be accessed via css_get()+ * and css_put() */++ atomic_t refcnt;+};++/*+ * Call css_get() to hold a reference on the container;+ *+ */++static inline void css_get(struct container_subsys_state *css)+{+ atomic_inc(&css->refcnt);+}+/*+ * css_put() should be called to release a reference taken by+ * css_get()+ */++static inline void css_put(struct container_subsys_state *css)+{+ atomic_dec(&css->refcnt);+}++struct container {+ unsigned long flags; /* "unsigned long" so bitops work */++ /* count users of this container. >0 means busy, but doesn't+ * necessarily indicate the number of tasks in the+ * container */+ atomic_t count;++ /*+ * We link our 'sibling' struct into our parent's 'children'.+ * Our children link their 'sibling' into our 'children'.+ */+ struct list_head sibling; /* my parent's children */+ struct list_head children; /* my children */++ struct container *parent; /* my parent */+ struct dentry *dentry; /* container fs entry */++ /* Private pointers for each registered subsystem */+ struct container_subsys_state *subsys[CONTAINER_SUBSYS_COUNT];++ struct containerfs_root *root;+ struct container *top_container;+};++/* struct cftype:+ *+ * The files in the container filesystem mostly have a very simple read/write+ * handling, some common function will take care of it. 
Nevertheless some cases+ * (read tasks) are special and therefore I define this structure for every+ * kind of file.+ *+ *+ * When reading/writing to a file:+ * - the container to use in file->f_dentry->d_parent->d_fsdata+ * - the 'cftype' of the file is file->f_dentry->d_fsdata+ */++struct inode;+#define MAX_CFTYPE_NAME 64+struct cftype {+ /* By convention, the name should begin with the name of the+ * subsystem, followed by a period */+ char name[MAX_CFTYPE_NAME];+ int private;+ int (*open) (struct inode *inode, struct file *file);+ ssize_t (*read) (struct container *cont, struct cftype *cft,+ struct file *file,+ char __user *buf, size_t nbytes, loff_t *ppos);+ u64 (*read_uint) (struct container *cont, struct cftype *cft);+ ssize_t (*write) (struct container *cont, struct cftype *cft,+ struct file *file,+ const char __user *buf, size_t nbytes, loff_t *ppos);+ int (*release) (struct inode *inode, struct file *file);+};++/* Add a new file to the given container directory. Should only be+ * called by subsystems from within a populate() method */+int container_add_file(struct container *cont, const struct cftype *cft);++/* Add a set of new files to the given container directory. Should+ * only be called by subsystems from within a populate() method */+int container_add_files(struct container *cont, const struct cftype cft[],+ int count);++int container_is_removed(const struct container *cont);++int container_path(const struct container *cont, char *buf, int buflen);++/* Return true if the container is a descendant of the current container */+int container_is_descendant(const struct container *cont);++/* Container subsystem type. 
See Documentation/containers.txt for details */++struct container_subsys {+ int (*create)(struct container_subsys *ss,+ struct container *cont);+ void (*destroy)(struct container_subsys *ss, struct container *cont);+ int (*can_attach)(struct container_subsys *ss,+ struct container *cont, struct task_struct *tsk);+ void (*attach)(struct container_subsys *ss, struct container *cont,+ struct container *old_cont, struct task_struct *tsk);+ void (*fork)(struct container_subsys *ss, struct task_struct *task);+ void (*exit)(struct container_subsys *ss, struct task_struct *task);+ int (*populate)(struct container_subsys *ss,+ struct container *cont);+ void (*bind)(struct container_subsys *ss, struct container *root);+ int subsys_id;+ int active;+ int early_init;+#define MAX_CONTAINER_TYPE_NAMELEN 32+ const char *name;++ /* Protected by RCU */+ struct containerfs_root *root;++ struct list_head sibling;++ void *private;+};++#define SUBSYS(_x) extern struct container_subsys _x ## _subsys;+#include <linux/container_subsys.h>+#undef SUBSYS++static inline struct container_subsys_state *container_subsys_state(+ struct container *cont, int subsys_id)+{+ return cont->subsys[subsys_id];+}++static inline struct container_subsys_state *task_subsys_state(+ struct task_struct *task, int subsys_id)+{+ return rcu_dereference(task->containers.subsys[subsys_id]);+}++static inline struct container* task_container(struct task_struct *task,+ int subsys_id)+{+ return task_subsys_state(task, subsys_id)->container;+}++int container_path(const struct container *cont, char *buf, int buflen);++#else /* !CONFIG_CONTAINERS */++static inline int container_init_early(void) { return 0; }+static inline int container_init(void) { return 0; }+static inline void container_init_smp(void) {}++static inline void container_lock(void) {}+static inline void container_unlock(void) {}++#endif /* !CONFIG_CONTAINERS */++#endif /* _LINUX_CONTAINER_H */Index: 
container-2.6.21-rc7-mm1/include/linux/container_subsys.h===================================================================--- /dev/null+++ container-2.6.21-rc7-mm1/include/linux/container_subsys.h@@ -0,0 +1,10 @@+/* Add subsystem definitions of the form SUBSYS(<name>) in this+ * file. Surround each one by a line of comment markers so that+ * patches don't collide+ */++/* */++/* */++/* */Index: container-2.6.21-rc7-mm1/include/linux/sched.h===================================================================--- container-2.6.21-rc7-mm1.orig/include/linux/sched.h+++ container-2.6.21-rc7-mm1/include/linux/sched.h@@ -820,6 +820,34 @@ struct uts_namespace;