lse-tech

What I propose here is Process Notification (pnotify). This is derived from
PAGG. It's been re-worked to have some better documentation (below) and
variable names that better reflect what is really happening.
My hope is that people will take a fresh look at this. This has been
hashed in the community before, and was even in Andrew's tree at one time.
Here, I've made an effort to better describe what I'm trying to do in the
hopes that pnotify or something that provides similar functionality can be
made available in the kernel.
I'll also be posting one user of this: Linux Job. SGI has other opensource
projects that we haven't pushed to the community that make use of this too.
There are two Job patches - one using a jobfs interface to userland that
was suggested by the community, and one that is heavier on the kernel
side (but faster and more stable). Details on those in the Job emails
to follow later.
CSA (comprehensive system accounting) can make use of Job too.
I'm hoping we can get this, or something that provides similar functionality,
accepted into the kernel.
Information about pnotify including usage docs, justification, future
ideas, and the patch itself follow.
Process Notification (pnotify)
--------------------
pnotify provides a method (service) for kernel modules to be notified when
certain events happen in the life of a process. Events we support include
fork, exit, and exec. A special init event is also supported (see events
below). More events could be added. pnotify also provides a generic data
pointer for the modules to work with so that data can be associated per
process.
A kernel module registers a service request with pnotify_register. The
request, a pnotify_events structure, tells pnotify which notifications the
kernel module wants and carries the function pointers to be called for those
events (exit, fork, exec).
From the process point of view, each process has a kernel module subscriber
list (pnotify_subscriber_list). These kernel modules are the ones who want
notification about the life of the process. As described above, each kernel
module subscriber on the list has a generic data pointer to point to data
associated with the process.
In the case of fork, pnotify will allocate the same kernel module subscriber
list for the new child that existed for the parent. The kernel module's
function pointer for fork is also called for the child being constructed so
the kernel module can do whatever it needs to do when a parent forks this
child. Special return values apply to the fork and init events that don't
apply to the others. They are described in the fork and init examples below.
For exit, similar things happen but the exit function pointer for each
kernel module subscriber is called and the kernel module subscriber entry for
that process is deleted.
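The fork and exit behaviour described above can be sketched as a small userspace model. This is not kernel code and not the pnotify API: every toy_* name is made up for illustration, the list is a plain singly linked list, there is no locking, and the return codes only imitate PNOTIFY_ERROR, PNOTIFY_OK and PNOTIFY_NOSUB. Tearing down the child through the exit path on a failed fork follows what the fork.c hunk of the patch does in bad_fork_cleanup_namespace.

```c
#include <stdlib.h>

/* Userspace toy model; every toy_* name is hypothetical, not the pnotify API. */
enum { TOY_ERROR = -1, TOY_OK = 0, TOY_NOSUB = 1 };

struct toy_subscriber {
	const struct toy_events *events;
	void *data;			/* generic per-process data pointer */
	struct toy_subscriber *next;
};

struct toy_task {
	struct toy_subscriber *subs;	/* NULL: nobody is watching this task */
};

struct toy_events {
	int (*fork)(struct toy_task *child, struct toy_subscriber *sub, void *parent_data);
	void (*exit)(struct toy_task *task, struct toy_subscriber *sub);
};

/* Put a new subscriber at the head of the task's list; NULL if out of memory. */
struct toy_subscriber *toy_subscribe(struct toy_task *t, const struct toy_events *ev)
{
	struct toy_subscriber *s = calloc(1, sizeof(*s));
	if (!s)
		return NULL;
	s->events = ev;
	s->next = t->subs;
	t->subs = s;
	return s;
}

/* Exit: call each subscriber's exit callback, then delete its entry. */
void toy_exit(struct toy_task *t)
{
	while (t->subs) {
		struct toy_subscriber *s = t->subs;
		s->events->exit(t, s);
		t->subs = s->next;
		free(s);
	}
}

/* Fork: subscribe the child to whatever the parent has, then ask each module. */
int toy_fork(struct toy_task *child, const struct toy_task *parent)
{
	const struct toy_subscriber *ps;
	child->subs = NULL;
	for (ps = parent->subs; ps; ps = ps->next) {
		struct toy_subscriber *cs = toy_subscribe(child, ps->events);
		int rc = cs ? ps->events->fork(child, cs, ps->data) : TOY_ERROR;
		if (rc == TOY_NOSUB) {
			child->subs = cs->next;	/* cs is the list head: opt out */
			free(cs);
		} else if (rc != TOY_OK) {
			toy_exit(child);	/* fail the fork, tear the child down */
			return TOY_ERROR;
		}
	}
	return TOY_OK;
}

static int toy_exit_calls;

static int toy_demo_fork(struct toy_task *child, struct toy_subscriber *sub, void *parent_data)
{
	(void)child;
	sub->data = parent_data;	/* in this demo the child shares the parent's data */
	return TOY_OK;
}

static void toy_demo_exit(struct toy_task *task, struct toy_subscriber *sub)
{
	(void)task; (void)sub;
	toy_exit_calls++;
}

/* Returns 1 when the child inherited the subscriber and both exit hooks ran. */
int toy_demo(void)
{
	static const struct toy_events ev = { toy_demo_fork, toy_demo_exit };
	struct toy_task parent = { NULL }, child = { NULL };
	struct toy_subscriber *s = toy_subscribe(&parent, &ev);
	int tag = 42, inherited;

	toy_exit_calls = 0;
	if (!s)
		return 0;
	s->data = &tag;
	inherited = toy_fork(&child, &parent) == TOY_OK &&
		    child.subs && child.subs->data == &tag;
	toy_exit(&child);
	toy_exit(&parent);
	return inherited && toy_exit_calls == 2;
}
```

Calling toy_demo() walks one parent and one child through subscribe, fork and exit. A fork callback that returned TOY_NOSUB instead would leave the child with an empty list, so no exit callback would run for it.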
Events
------
Events are stages of a process's life that kernel modules care about. The
fork event is triggered in a certain location in copy_process when a parent
forks. The exit event happens when a process is going away. We also support
an exec event, which happens when a process execs. Finally, there is an init
event. This special event makes it so this kernel module will be associated
with all current processes in the system at the time of registration. This is
used when a kernel module wants to keep track of all current processes as
opposed to just those it associates by itself (and children that follow). The
events a kernel module cares about are set up in the pnotify_events
structure - see usage below.
When setting up a pnotify_events, you designate which events you care about
by either associating NULL (meaning you don't care about that event) or a
pointer to the function to run when the event is triggered. The fork event
and the exit event are currently required.
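As a rough illustration of the "NULL means don't care, but fork and exit are required" rule, here is a userspace sketch of the argument checks. It is modelled on the checks at the top of pnotify_register in the patch (-EINVAL for a missing or over-long name or a missing fork/exit pointer, -EBUSY for a duplicate name), but the reg_* names, the fixed-size table and the simplified callback signatures are assumptions made only for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace toy model; reg_* names are hypothetical, not the pnotify API. */
#define REG_NAMELEN 32	/* the patch limits names to 32 characters */
#define REG_MAX     8	/* arbitrary table size for this sketch */

struct reg_events {
	const char *name;
	int  (*init)(int pid);	/* optional: NULL means "don't care" */
	int  (*fork)(int pid);	/* required */
	void (*exit)(int pid);	/* required */
	void (*exec)(int pid);	/* optional: NULL means "don't care" */
};

static const struct reg_events *reg_table[REG_MAX];
static int reg_count;

/* Returns 0 on success or a negative errno value. */
int reg_register(const struct reg_events *ev)
{
	int i;

	if (!ev || !ev->name || strlen(ev->name) > REG_NAMELEN)
		return -EINVAL;
	if (!ev->fork || !ev->exit)
		return -EINVAL;		/* fork and exit are mandatory */
	for (i = 0; i < reg_count; i++)
		if (strcmp(reg_table[i]->name, ev->name) == 0)
			return -EBUSY;	/* this name is already registered */
	if (reg_count == REG_MAX)
		return -ENOMEM;		/* a limit of this sketch only */
	reg_table[reg_count++] = ev;
	return 0;
}

static int  reg_noop_fork(int pid) { (void)pid; return 0; }
static void reg_noop_exit(int pid) { (void)pid; }

/* fork + exit only: the smallest request that is accepted. */
const struct reg_events reg_minimal = { "minimal", NULL, reg_noop_fork, reg_noop_exit, NULL };
/* No exit handler: rejected with -EINVAL. */
const struct reg_events reg_broken  = { "broken",  NULL, reg_noop_fork, NULL, NULL };
```

Registering reg_minimal succeeds once and fails with -EBUSY the second time, while reg_broken never gets in because its exit pointer is NULL.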
How do processes become associated with kernel modules?
-------------------------------------------------------
Your kernel module itself can use the pnotify_subscribe function to associate
a given process with a given pnotify_events structure. This adds
your kernel module to the subscriber list of the process. In the case
of inescapable job containers making use of PAM, when PAM allows a person to
log in, PAM contacts job (via a PAM job module which uses the job userland
library) and the kernel Job code will call pnotify_subscribe to associate the
process with pnotify. From that point on, the kernel module will be notified
about events in the process's life that the module cares about (as well as
events for any children that process may later have).
Likewise, your kernel module can remove an association between it and
a given process by using pnotify_unsubscribe.
Example Usage
-------------
=== filling out the pnotify_events structure ===
A kernel module wishing to use pnotify needs to set up a pnotify_events
structure. This structure tells pnotify which events you care about and what
functions to call when those events are triggered. In addition, you supply a
name (usually the kernel module name). The entry is always filled out as
shown below. .module is usually set to THIS_MODULE. data can be optionally
used to store a pointer with the pnotify_events structure.
Example of a filled out pnotify_events:
static struct pnotify_events pnotify_events = {
.module = THIS_MODULE,
.name = "test_module",
.data = NULL,
.entry = LIST_HEAD_INIT(pnotify_events.entry),
.init = test_init,
.fork = test_attach,
.exit = test_detach,
.exec = test_exec,
};
The above pnotify_events structure says the kernel module "test_module" cares
about events fork, exit, exec, and init. In fork, call the kernel module's
test_attach function. In exec, call test_exec. In exit, call test_detach.
The init event is specified, so all processes on the system will be associated
with this kernel module during registration and the test_init function will
be run for each.
=== Registering with pnotify ===
You will likely register with pnotify in your kernel module's module_init
function. Here is an example:
static int __init test_module_init(void)
{
	int rc = pnotify_register(&pnotify_events);
	if (rc < 0)
		return rc;	/* pass the error code from pnotify_register up */
	return 0;
}
=== Example init event function ===
Since the init event is defined, it means this kernel module is added
to the subscriber list of all processes -- it will receive notification
about events it cares about for all processes and all children that
follow.
Of course, if a kernel module doesn't need to know about all current
processes, that module shouldn't implement this and '.init' in the
pnotify_events structure would be NULL.
This is as opposed to the normal method where the kernel module adds itself
to the subscriber list of a process using pnotify_subscribe.
Important:
Note: The implementation of pnotify_register causes us to evaluate some tasks
more than once in some cases. See the comments in pnotify_register for why.
Therefore, if the init function pointer returns PNOTIFY_NOSUB, which means
that it doesn't want a process association, that init function must be
prepared to possibly look at the same "skipped" task more than once.
Note that the return value here is similar to the fork function pointer
below except there is no notion of failing the fork since existing processes
aren't forking.
PNOTIFY_OK - good, adds the kernel module to the subscriber list for process
PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process
static int test_init(struct task_struct *tsk, struct pnotify_subscriber *subscriber)
{
	if (pnotify_get_subscriber(tsk, "test_module") == NULL)
		dprintk("ERROR pnotify expected \"%s\" PID = %d\n", "test_module", tsk->pid);
	dprintk("FYI pnotify init hook fired for PID = %d\n", tsk->pid);
	atomic_inc(&init_count);
	return PNOTIFY_OK;	/* keep this module on the task's subscriber list */
}
=== Example fork (test_attach) function ===
This function is executed when a process forks - this is associated
with the pnotify_fork call in copy_process. There would be a very
similar test_detach function (not shown).
pnotify will add the kernel module to the notification list for the child
process automatically and then execute this fork function pointer (test_attach
in this example). However, the kernel module can control whether the kernel
module stays on the process's subscriber list and wants notification by the
return value.
PNOTIFY_ERROR - prevent the process from continuing - failing the fork
PNOTIFY_OK - good, adds the kernel module to the subscriber list for process
PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process
static int test_attach(struct task_struct *tsk, struct pnotify_subscriber *subscriber, void *vp)
{
dprintk("pnotify attach hook fired for PID = %d\n", tsk->pid);
atomic_inc(&attach_count);
return PNOTIFY_OK;
}
=== Example exec event function ===
And here is an example function to run when a task gets to exec. So any
time a "tracked" process gets to exec, this would execute.
static void test_exec(struct task_struct *tsk, struct pnotify_subscriber *subscriber)
{
dprintk("pnotify exec hook fired for PID %d\n", tsk->pid);
atomic_inc(&exec_count);
}
=== Unregistering with pnotify ===
You will likely wish to unregister with pnotify in the kernel module's
module_exit function. Here is an example:
static void __exit test_module_cleanup(void)
{
pnotify_unregister(&pnotify_events);
printk("detach called %d times...\n", atomic_read(&detach_count));
printk("attach called %d times...\n", atomic_read(&attach_count));
printk("init called %d times...\n", atomic_read(&init_count));
printk("exec called %d times ...\n", atomic_read(&exec_count));
if (atomic_read(&attach_count) + atomic_read(&init_count) !=
atomic_read(&detach_count))
printk("pnotify PROBLEM: attach count + init count SHOULD equal detach count and doesn't\n");
else
printk("Good - attach count + init count equals detach count.\n");
}
=== Actually using data associated with the process in your module ===
The above examples show you how to create an example kernel module using
pnotify, but they didn't show what you might do with the data pointer
associated with a given process. Below, find an example of accessing
the data pointer for a given process from within a kernel module making use
of pnotify.
pnotify_get_subscriber is used to retrieve the pnotify subscriber for a given
process and kernel module. Like this:
subscriber = pnotify_get_subscriber(task, name);
Where name is your kernel module's name (as provided in the pnotify_events
structure) and task is the process you're interested
in.
Please be careful about locking. The task structure has a
pnotify_subscriber_list_sem to be used for locking. This example retrieves
a given task in a way that ensures it doesn't disappear while we try to
access it (that's why we do locking for the tasklist_lock and task). The
pnotify subscriber list is locked to ensure the list doesn't change as we
search it with pnotify_get_subscriber.
read_lock(&tasklist_lock);
get_task_struct(task); /* Ensure the task doesn't vanish on us */
read_unlock(&tasklist_lock); /* Unlock the tasklist */
down_read(&task->pnotify_subscriber_list_sem); /* readlock subscriber list */
subscriber = pnotify_get_subscriber(task, name);
if (subscriber) {
/* Get the widgitId associated with this task */
widgitId = ((widgitId_t *)subscriber->data);
}
up_read(&task->pnotify_subscriber_list_sem); /* unlock subscriber list */
put_task_struct(task); /* Done accessing the task; drop our reference last */
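To make the lookup-by-name and the data pointer concrete without kernel headers, here is a minimal userspace sketch. It only imitates what pnotify_get_subscriber does in the patch (a linear search comparing names with strcmp). The lk_* names are hypothetical, the name is stored directly in the subscriber instead of being reached through the events structure, and no locking is modelled.

```c
#include <stddef.h>
#include <string.h>

/* Userspace toy model; lk_* names are hypothetical, not the pnotify API. */
struct lk_subscriber {
	const char *name;	/* the patch reaches this via subscriber->events->name */
	void *data;		/* module-private, per-process data */
	struct lk_subscriber *next;
};

struct lk_task {
	struct lk_subscriber *subs;
};

/* Linear search of the task's subscriber list by module name. */
struct lk_subscriber *lk_get_subscriber(const struct lk_task *t, const char *key)
{
	struct lk_subscriber *s;

	for (s = t->subs; s; s = s->next)
		if (strcmp(s->name, key) == 0)
			return s;
	return NULL;
}

/* Fetch the int a module stored for this task; -1 if it is not subscribed. */
int lk_get_id(const struct lk_task *t, const char *key)
{
	struct lk_subscriber *s = lk_get_subscriber(t, key);

	if (!s || !s->data)
		return -1;
	return *(const int *)s->data;
}

/* Build one task watched by two modules and look one of them up. */
int lk_demo(const char *key)
{
	int job_id = 7, csa_id = 9;
	struct lk_subscriber csa = { "csa", &csa_id, NULL };
	struct lk_subscriber job = { "job", &job_id, &csa };
	struct lk_task task = { &job };

	return lk_get_id(&task, key);
}
```

Each module sees only its own data even though both hang off the same task: lk_demo("job") and lk_demo("csa") return different values, and a name that was never subscribed yields -1.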
Future Events
-------------
Kingsley Cheung suggested that we add events for uid and gid changes and this
may inspire broader use. Depending on how the discussion goes, I'll post a
patch to add this functionality in the next day or two.
History
-------
Process Notification used to be known as PAGG (Process Aggregates).
It was re-written to be called Process Notification because we believe this
better describes its purpose. Structures and functions were re-named to
be more clear and to reflect the new name.
Why Not Notifier Lists?
-----------------------
We investigated the use of notifier lists, available in newer kernels.
Notifier lists would not be as efficient as pnotify for kernel modules
wishing to associate data with processes. With pnotify, if the
pnotify_subscriber_list of a given task is NULL, we can instantly know
there are no kernel modules that care about the process. Further, the
callbacks happen in places where the task struct is likely to be cached.
So this is a quick operation. With notifier lists, the scope is system
wide rather than per process. As long as one kernel module wants to be
notified, we have to walk the notifier list and potentially waste cycles.
In the case of pnotify, we only walk lists if we're interested in
a specific task.
On a system where pnotify is used to track only a few processes, the
overhead of walking the notifier list is high compared to the overhead
of walking the kernel module subscriber list only when a kernel module
is interested in a given process.
I don't believe this is easily solved in notifier lists themselves as
they are meant to be global resources, not per-task resources.
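The difference can be illustrated with a deliberately crude userspace count of callback invocations. This is only a sketch of the argument above, not a benchmark: the cmp_* names are invented, a "tracked" flag stands in for a non-empty per-task subscriber list, and the global case assumes a single registered callback that must filter tasks itself.

```c
#include <stddef.h>

/* Userspace toy model; cmp_* names are hypothetical. */
#define CMP_NTASKS 1000

struct cmp_task {
	int tracked;	/* stands in for a non-empty per-task subscriber list */
};

static long cmp_callbacks;

static void cmp_callback(struct cmp_task *t)
{
	(void)t;
	cmp_callbacks++;
}

/* System-wide chain: every task event reaches the callback, which must filter. */
long cmp_global_chain(struct cmp_task *tasks, size_t n)
{
	size_t i;

	cmp_callbacks = 0;
	for (i = 0; i < n; i++)
		cmp_callback(&tasks[i]);
	return cmp_callbacks;
}

/* Per-task list: an empty list is detected up front and nothing is called. */
long cmp_per_task(struct cmp_task *tasks, size_t n)
{
	size_t i;

	cmp_callbacks = 0;
	for (i = 0; i < n; i++)
		if (tasks[i].tracked)
			cmp_callback(&tasks[i]);
	return cmp_callbacks;
}

/* One event on each of CMP_NTASKS tasks, of which only three are tracked. */
long cmp_demo(int per_task)
{
	static struct cmp_task tasks[CMP_NTASKS];
	size_t i;

	for (i = 0; i < CMP_NTASKS; i++)
		tasks[i].tracked = (i < 3);
	return per_task ? cmp_per_task(tasks, CMP_NTASKS)
			: cmp_global_chain(tasks, CMP_NTASKS);
}
```

With three tracked tasks out of a thousand, the per-task variant makes three callback calls where the system-wide variant makes a thousand. The model says nothing about real cycle counts or cache effects.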
Overlooking performance issues, notifier lists in and of themselves wouldn't
solve the problem pnotify solves anyway. Although you could argue notifier
lists can implement the callback portion of pnotify, there is no association
of data with a given process. This is needed for kernel modules to
efficiently associate a task with a data pointer without cluttering up
the task struct.
In addition to data associated with a process, we desire the ability for
kernel modules to add themselves to the subscriber list for any arbitrary
process - not just current or a child of current.
Some Justification
------------------
We feel that pnotify could be used to reduce the size of the task struct or
the number of functions in copy_process. For example, if another part of the
kernel needs to know when a process is forking or exiting, they could use
pnotify instead of adding additional code to task struct, copy_process, or
exit.
Some have argued in the past that PAGG shouldn't be used because it will
allow interesting things to be implemented outside of the kernel. While this
might be a small risk, having these in place allows customers and users to
implement kernel components that you don't want to see in the kernel anyway.
For example, a certain vendor may have an urgent need to implement kernel
functionality or special types of accounting that nobody else is interested
in. That doesn't mean the code isn't open-source, it just means it isn't
applicable to all of Linux because it satisfies a niche.
All of pnotify's functionality that needs to be exported is exported with
EXPORT_SYMBOL_GPL to discourage abuse.
The risk already exists in the kernel for people to implement modules outside
the kernel that suffer from less peer review and possibly bad programming
practice. pnotify could add more opportunities for out-of-tree kernel module
authors to make new modules. I believe this is somewhat mitigated by the
already-existing 'tainted' warnings in the kernel.
Other Ideas?
------------
There have been similar proposals to provide pieces of the pnotify
functionality. If there is a better proposal out there, let's explore it.
Here are some key functions I hope to see in any proposal:
- Ability to have notification for exec, fork, exit at minimum
- Ability to extend to other callouts
- Ability for pnotify user modules to implement code that ends up adding
a kernel module subscriber to any arbitrary process (not just current and
its children).
I believe, if the above are more or less met, we should be in good shape for
our other open source projects such as linux job.
Variable Name Changes from PAGG to pnotify
------------------------------------------
PAGG_NAMELEN -> PNOTIFY_NAMELEN
struct pagg -> pnotify_subscriber
pagg_get -> pnotify_get_subscriber
pagg_alloc -> pnotify_subscribe
pagg_free -> pnotify_unsubscribe
pagg_hook_register -> pnotify_register
pagg_hook_unregister -> pnotify_unregister
pagg_attach -> pnotify_fork
pagg_detach -> pnotify_exit
pagg_exec -> pnotify_exec
struct pagg_hook -> pnotify_events
With pnotify_events (formerly pagg_hook):
attach -> fork
detach -> exit
Return codes for the init and fork function pointers should use:
PNOTIFY_ERROR - prevent the process from continuing - failing the fork
PNOTIFY_OK - good, adds the kernel module to the subscriber list for process
PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process
Signed-off-by: Erik Jacobson <erikj@...>
--
Documentation/pnotify.txt | 388 +++++++++++++++++++++++++++++++++++
fs/exec.c | 2
include/linux/init_task.h | 2
include/linux/pnotify.h | 227 ++++++++++++++++++++
include/linux/sched.h | 5
init/Kconfig | 8
kernel/Makefile | 1
kernel/exit.c | 4
kernel/fork.c | 14 +
kernel/pnotify.c | 501 ++++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 1152 insertions(+)
Index: linux/fs/exec.c
===================================================================
--- linux.orig/fs/exec.c 2005-09-19 17:32:02.821482784 -0500
+++ linux/fs/exec.c 2005-09-19 17:32:18.483958530 -0500
@@ -48,6 +48,7 @@
#include <linux/syscalls.h>
#include <linux/rmap.h>
#include <linux/acct.h>
+#include <linux/pnotify.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1207,6 +1208,7 @@
retval = search_binary_handler(bprm,regs);
if (retval >= 0) {
free_arg_pages(bprm);
+ pnotify_exec(current);
/* execve success */
security_bprm_free(bprm);
Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h 2005-09-19 17:31:53.108599611 -0500
+++ linux/include/linux/init_task.h 2005-09-19 17:32:18.487864384 -0500
@@ -2,6 +2,7 @@
#define _LINUX__INIT_TASK_H
#include <linux/file.h>
+#include <linux/pnotify.h>
#define INIT_FILES \
{ \
@@ -111,6 +112,7 @@
.proc_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
+ INIT_TASK_PNOTIFY(tsk) \
.fs_excl = ATOMIC_INIT(0), \
}
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h 2005-09-19 17:32:03.339984942 -0500
+++ linux/include/linux/sched.h 2005-09-19 17:32:18.488840848 -0500
@@ -764,6 +764,11 @@
struct mempolicy *mempolicy;
short il_next;
#endif
+#ifdef CONFIG_PNOTIFY
+/* List of pnotify kernel module subscribers */
+ struct list_head pnotify_subscriber_list;
+ struct rw_semaphore pnotify_subscriber_list_sem;
+#endif
#ifdef CONFIG_CPUSETS
struct cpuset *cpuset;
nodemask_t mems_allowed;
Index: linux/init/Kconfig
===================================================================
--- linux.orig/init/Kconfig 2005-09-19 17:32:03.809663921 -0500
+++ linux/init/Kconfig 2005-09-20 10:22:42.258859757 -0500
@@ -146,6 +146,14 @@
for processing it. A preliminary version of these tools is available
at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.
+config PNOTIFY
+ bool "Support for Process Notification"
+ help
+ Say Y here if you will be loading modules which provide support
+ for process notification. Examples of such modules include the
+ Linux Jobs module and the Linux Array Sessions module. If you will not
+ be using such modules, say N.
+
config SYSCTL
bool "Sysctl support"
---help---
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile 2005-09-19 17:31:54.645553277 -0500
+++ linux/kernel/Makefile 2005-09-20 10:22:42.259836221 -0500
@@ -19,6 +19,7 @@
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_COMPAT) += compat.o
+obj-$(CONFIG_PNOTIFY) += pnotify.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_IKCONFIG_PROC) += configs.o
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c 2005-09-19 17:32:03.818452094 -0500
+++ linux/kernel/fork.c 2005-09-20 10:22:42.259836221 -0500
@@ -41,6 +41,7 @@
#include <linux/profile.h>
#include <linux/rmap.h>
#include <linux/acct.h>
+#include <linux/pnotify.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -150,6 +151,9 @@
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
init_task.signal->rlim[RLIMIT_NPROC];
+
+ /* Initialize the pnotify list in pid 0 before it can clone itself. */
+ INIT_PNOTIFY_LIST(current);
}
static struct task_struct *dup_task_struct(struct task_struct *orig)
@@ -1006,6 +1010,15 @@
p->exit_state = 0;
/*
+ * Call pnotify kernel module subscribers and add the same subscribers the
+ * parent has to the new process.
+ * Fail the fork on error.
+ */
+ retval = pnotify_fork(p, current);
+ if (retval)
+ goto bad_fork_cleanup_namespace;
+
+ /*
* Ok, make it visible to the rest of the system.
* We dont wake it up yet.
*/
@@ -1123,6 +1136,7 @@
return p;
bad_fork_cleanup_namespace:
+ pnotify_exit(p);
exit_namespace(p);
bad_fork_cleanup_keys:
exit_keys(p);
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c 2005-09-19 17:31:54.749058416 -0500
+++ linux/kernel/exit.c 2005-09-20 10:22:43.024407599 -0500
@@ -26,6 +26,7 @@
#include <linux/proc_fs.h>
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
+#include <linux/pnotify.h>
#include <linux/syscalls.h>
#include <linux/signal.h>
@@ -849,6 +850,9 @@
module_put(tsk->binfmt->module);
tsk->exit_code = code;
+
+ pnotify_exit(tsk);
+
exit_notify(tsk);
#ifdef CONFIG_NUMA
mpol_free(tsk->mempolicy);
Index: linux/kernel/pnotify.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/kernel/pnotify.c 2005-09-19 18:07:34.753215400 -0500
@@ -0,0 +1,501 @@
+/*
+ * Process Notification (pnotify) interface
+ *
+ *
+ * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane,
+ * Mountain View, CA 94043, or:
+ *
+ * http://www.sgi.com
+ */
+
+#include <linux/config.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <linux/pnotify.h>
+#include <asm/semaphore.h>
+
+/* list of pnotify event list entries that reference the "module"
+ * implementations */
+static LIST_HEAD(pnotify_event_list);
+static DECLARE_RWSEM(pnotify_event_list_sem);
+
+
+/**
+ * pnotify_get_subscriber - get a pnotify subscriber given a search key
+ * @task: We examine the pnotify_subscriber_list from the given task
+ * @key: Key name of kernel module subscriber we wish to retrieve
+ *
+ * Given a pnotify_subscriber_list structure, this function will return
+ * a pointer to the kernel module pnotify_subscriber struct that matches the
+ * search key. If the key is not found, the function will return NULL.
+ *
+ * The caller should hold at least a read lock on the pnotify_subscriber_list
+ * for task using down_read(&task->pnotify_subscriber_list_sem).
+ *
+ */
+struct pnotify_subscriber *
+pnotify_get_subscriber(struct task_struct *task, char *key)
+{
+ struct pnotify_subscriber *subscriber;
+
+ list_for_each_entry(subscriber, &task->pnotify_subscriber_list, entry) {
+ if (!strcmp(subscriber->events->name,key))
+ return subscriber;
+ }
+ return NULL;
+}
+
+
+/**
+ * pnotify_subscribe - Add kernel module to the subscriber list for process
+ * @task: Task that gets the new kernel module subscriber added to the list
+ * @events: pnotify_events structure to associate with kernel module
+ *
+ * Given a task and a pnotify_events structure, this function will allocate
+ * a new pnotify_subscriber, initialize the settings, and insert it into
+ * the pnotify_subscriber_list for the task.
+ *
+ * The caller for this function should hold at least a read lock on the
+ * pnotify_event_list_sem - or ensure that the pnotify_events entry cannot be
+ * removed. If this function was called from the pnotify module (usually the
+ * case), then the caller need not hold this lock. The caller should hold
+ * a write lock on the task's pnotify_subscriber_list_sem. This can be
+ * locked using down_write(&task->pnotify_subscriber_list_sem)
+ *
+ */
+struct pnotify_subscriber *
+pnotify_subscribe(struct task_struct *task, struct pnotify_events *events)
+{
+ struct pnotify_subscriber *subscriber;
+
+ subscriber = kmalloc(sizeof(struct pnotify_subscriber), GFP_KERNEL);
+ if (!subscriber)
+ return NULL;
+
+ subscriber->events = events;
+ subscriber->data = NULL;
+ atomic_inc(&events->refcnt); /* Increase hook's reference count */
+ list_add_tail(&subscriber->entry, &task->pnotify_subscriber_list);
+ return subscriber;
+}
+
+
+/**
+ * pnotify_unsubscribe - Remove kernel module association from process
+ * @subscriber: The subscriber to remove
+ *
+ * This function will ensure the subscriber is deleted from
+ * the list of subscribers for the task. Finally, the memory for the
+ * subscriber is discarded.
+ *
+ * The caller of this function should hold a write lock on the
+ * pnotify_subscriber_list_sem for the task. This can be locked using
+ * down_write(&task->pnotify_subscriber_list_sem).
+ *
+ * Prior to calling pnotify_unsubscribe, the subscriber should have been
+ * detached from any uses the kernel module may have. This is often done using
+ * p->events->exit(task, subscriber);
+ *
+ */
+void
+pnotify_unsubscribe(struct pnotify_subscriber *subscriber)
+{
+ atomic_dec(&subscriber->events->refcnt); /* decr the ref cnt on events */
+ list_del(&subscriber->entry);
+ kfree(subscriber);
+}
+
+
+/**
+ * pnotify_get_events - Get the pnotify_events struct matching requested name
+ * @key: The name of the events structure to get
+ *
+ * Given a pnotify_events struct name that represents the kernel module name,
+ * this function will return a pointer to the pnotify_events structure that
+ * matches the name.
+ *
+ * You should hold either the write or read lock for pnotify_event_list_sem
+ * before using this function. This will ensure that the pnotify_event_list
+ * does not change while iterating through the list entries.
+ *
+ */
+static struct pnotify_events *
+pnotify_get_events(char *key)
+{
+ struct pnotify_events *events;
+
+ list_for_each_entry(events, &pnotify_event_list, entry) {
+ if (!strcmp(events->name, key)) {
+ return events;
+ }
+ }
+ return NULL;
+}
+
+/**
+ * remove_subscriber_from_all_tasks - Remove subscribers for given events struct
+ * @events: pnotify_events struct for subscribers to remove
+ *
+ * Given a kernel module events struct registered with pnotify,
+ * this function will remove all subscribers matching the events struct from
+ * all tasks.
+ *
+ * If there is a exit function associated with the subscriber, it is called
+ * before the subscriber is unsubscribed/freed.
+ *
+ * This is meant to be used by pnotify_register and pnotify_unregister
+ *
+ */
+static void
+remove_subscriber_from_all_tasks(struct pnotify_events *events)
+{
+ if (events == NULL)
+ return;
+
+ /* Because of internal race conditions we can't guarantee
+ * getting every task in just one pass so we just keep going
+ * until there are no tasks with subscribers from this events struct
+ * attached. The inefficiency of this should be tempered by the fact that
+ * this happens at most once for each registered client.
+ */
+ while (atomic_read(&events->refcnt) != 0) {
+ struct task_struct *g = NULL, *p = NULL;
+
+ read_lock(&tasklist_lock);
+ do_each_thread(g, p) {
+ struct pnotify_subscriber *subscriber;
+ int task_exited;
+
+ get_task_struct(p);
+ read_unlock(&tasklist_lock);
+ down_write(&p->pnotify_subscriber_list_sem);
+ subscriber = pnotify_get_subscriber(p, events->name);
+ if (subscriber != NULL) {
+ (void)events->exit(p, subscriber);
+ pnotify_unsubscribe(subscriber);
+ }
+ up_write(&p->pnotify_subscriber_list_sem);
+ read_lock(&tasklist_lock);
+
+ /* If a subscriber got removed from the list while we're going through
+ * each process, the tasks list for the process would be empty. In
+ * that case, break out of this for_each_thread so we can do it
+ * again. */
+ task_exited = list_empty(&p->sibling);
+ put_task_struct(p);
+ if (task_exited)
+ goto endloop;
+ } while_each_thread(g, p);
+ endloop:
+ read_unlock(&tasklist_lock);
+ }
+}
+
+/**
+ * pnotify_register - Register a new module subscriber and enter it in the list
+ * @events_new: The new pnotify events structure to register.
+ *
+ * Used to register a new module subscriber pnotify_events structure and enter
+ * it into the pnotify_event_list. The service name for a pnotify_events
+ * struct is restricted to 32 characters.
+ *
+ * If an "init()" function is supplied in the events struct being registered
+ * then the kernel module will be subscribed to all existing tasks and the
+ * supplied "init()" function will be applied to it. If any call to the
+ * supplied "init()" function returns a non zero result, the registration will
+ * be aborted. As part of the abort process, all subscribers belonging to the
+ * new client will be removed from all tasks and the supplied "exit()"
+ * function will be called on them.
+ *
+ * If a memory error is encountered, the module (pnotify_events structure)
+ * is unregistered and any tasks we became subscribed to are detached.
+ *
+ */
+int
+pnotify_register(struct pnotify_events *events_new)
+{
+ struct pnotify_events *events = NULL;
+
+ /* Add new pnotify module to access list */
+ if (!events_new)
+ return -EINVAL; /* error */
+ if (!list_empty(&events_new->entry))
+ return -EINVAL; /* error */
+ if (events_new->name == NULL || strlen(events_new->name) > PNOTIFY_NAMELN)
+ return -EINVAL; /* error */
+ if (!events_new->fork || !events_new->exit)
+ return -EINVAL; /* error */
+
+ /* Try to insert new events entry into the events list */
+ down_write(&pnotify_event_list_sem);
+
+ events = pnotify_get_events(events_new->name);
+
+ if (events) {
+ up_write(&pnotify_event_list_sem);
+ printk(KERN_WARNING "Attempt to register duplicate"
+ " pnotify support (name=%s)\n", events_new->name);
+ return -EBUSY;
+ }
+
+ /* Okay, we can insert into the events list */
+ list_add_tail(&events_new->entry, &pnotify_event_list);
+ /* set the ref count to zero */
+ atomic_set(&events_new->refcnt, 0);
+
+ /* Now we can call the initializer function (if present) for each task */
+ if (events_new->init != NULL) {
+ struct task_struct *g = NULL, *p = NULL;
+ int init_result = 0;
+
+ /* Because of internal race conditions we can't guarantee
+ * getting every task in just one pass so we just keep going
+ * until we don't find any uninitialized tasks. The inefficiency
+ * of this should be tempered by the fact that this happens
+ * at most once for each registered client.
+ */
+ read_lock(&tasklist_lock);
+ repeat:
+ do_each_thread(g, p) {
+ struct pnotify_subscriber *subscriber;
+ int task_exited;
+
+ get_task_struct(p);
+ read_unlock(&tasklist_lock);
+ down_write(&p->pnotify_subscriber_list_sem);
+ subscriber = pnotify_get_subscriber(p, events_new->name);
+ if (!subscriber && !(p->flags & PF_EXITING)) {
+ subscriber = pnotify_subscribe(p, events_new);
+ if (subscriber != NULL) {
+ init_result = events_new->init(p, subscriber);
+
+ /* Success, but the init function doesn't want this module
+ * on the task's subscriber list. */
+ if (init_result > 0)
+ pnotify_unsubscribe(subscriber);
+ }
+ else
+ init_result = -ENOMEM;
+ }
+ up_write(&p->pnotify_subscriber_list_sem);
+ read_lock(&tasklist_lock);
+ /* Like in remove_subscriber_from_all_tasks, if the task
+ * disappeared on us while we were going through the
+ * do_each_thread loop, we need to start over with that loop.
+ * That's why we have the list_empty here */
+ task_exited = list_empty(&p->sibling);
+ put_task_struct(p);
+ if (init_result < 0)
+ goto endloop;
+ if (task_exited)
+ goto repeat;
+ } while_each_thread(g, p);
+ endloop:
+ read_unlock(&tasklist_lock);
+
+ /*
+ * if anything went wrong during initialisation abandon the
+ * registration process
+ */
+ if (init_result < 0) {
+ remove_subscriber_from_all_tasks(events_new);
+ list_del_init(&events_new->entry);
+ up_write(&pnotify_event_list_sem);
+
+ printk(KERN_WARNING "Registering pnotify support for"
+ " (name=%s) failed\n", events_new->name);
+
+ return init_result; /* hook init function error result */
+ }
+ }
+
+ up_write(&pnotify_event_list_sem);
+
+ printk(KERN_INFO "Registering pnotify support for (name=%s)\n",
+ events_new->name);
+
+ return 0; /* success */
+
+}
+
+/**
+ * pnotify_unregister - Unregister kernel module/pnotify_events struct
+ * @events_old: pnotify_events struct for the kernel module we're unregistering
+ *
+ * Used to unregister kernel module subscribers indicated by
+ * pnotify_events struct. Removes them from the list of kernel modules
+ * in pnotify_event_list.
+ *
+ * Once the events entry in the pnotify_event_list is found, subscribers for
+ * this kernel module have their exit functions called and will then be
+ * removed from the list.
+ *
+ */
+int
+pnotify_unregister(struct pnotify_events *events_old)
+{
+ struct pnotify_events *events;
+
+ /* Check the validity of the arguments */
+ if (!events_old)
+ return -EINVAL; /* error */
+ if (list_empty(&events_old->entry))
+ return -EINVAL; /* error */
+ if (events_old->name == NULL)
+ return -EINVAL; /* error */
+
+ down_write(&pnotify_event_list_sem);
+
+ events = pnotify_get_events(events_old->name);
+
+ if (events && events == events_old) {
+ remove_subscriber_from_all_tasks(events);
+ list_del_init(&events->entry);
+ up_write(&pnotify_event_list_sem);
+
+ printk(KERN_INFO "Unregistering pnotify support for"
+ " (name=%s)\n", events_old->name);
+
+ return 0; /* success */
+ }
+
+ up_write(&pnotify_event_list_sem);
+
+ printk(KERN_WARNING "Attempt to unregister pnotify support (name=%s)"
+ " failed - not found\n", events_old->name);
+
+ return -EINVAL; /* error */
+}
+
+
+/**
+ * __pnotify_fork - Subscribe the child to the same kernel modules as its parent
+ * @to_task: The child task that will inherit the parent's subscribers
+ * @from_task: The parent task
+ *
+ * Used to attach a new task to the same subscribers the parent has in its
+ * subscriber list.
+ *
+ * The "from" argument is the parent task. The "to" argument is the child
+ * task.
+ *
+ * See Documentation/pnotify.txt for details on
+ * how to handle return codes from the fork function pointer.
+ *
+ */
+int
+__pnotify_fork(struct task_struct *to_task, struct task_struct *from_task)
+{
+ struct pnotify_subscriber *from_subscriber;
+ int ret;
+
+ /* lock the parents subscriber list we are copying from */
+ down_read(&from_task->pnotify_subscriber_list_sem);
+
+ list_for_each_entry(from_subscriber, &from_task->pnotify_subscriber_list, entry) {
+ struct pnotify_subscriber *to_subscriber = NULL;
+
+ to_subscriber = pnotify_subscribe(to_task, from_subscriber->events);
+ if (!to_subscriber) {
+ ret = -ENOMEM;
+ goto error_return;
+ }
+ ret = to_subscriber->events->fork(to_task, to_subscriber,
+ from_subscriber->data);
+
+ if (ret < 0) {
+ /* Propagates to copy_process as a fork failure */
+ goto error_return;
+ }
+ else if (ret > 0) {
+ /* Success, but fork function pointer doesn't want to stay subscribed */
+ pnotify_unsubscribe(to_subscriber);
+ }
+ }
+
+ up_read(&from_task->pnotify_subscriber_list_sem); /* unlock the subscriber list */
+
+ return 0; /* success */
+
+ error_return:
+ /*
+ * Clean up all the subscriber attachments made on behalf of the new
+ * task.
+ */
+ up_read(&from_task->pnotify_subscriber_list_sem);
+ __pnotify_exit(to_task);
+ return ret; /* failure */
+}
+
+/**
+ * __pnotify_exit - Remove all subscribers from given task
+ * @task: Task to remove subscribers from
+ *
+ */
+void
+__pnotify_exit(struct task_struct *task)
+{
+ struct pnotify_subscriber *subscriber;
+ struct pnotify_subscriber *subscribertmp;
+
+ /* Remove ref. to subscribers from task immediately */
+ down_write(&task->pnotify_subscriber_list_sem);
+
+ list_for_each_entry_safe(subscriber, subscribertmp,
+ &task->pnotify_subscriber_list, entry) {
+ subscriber->events->exit(task, subscriber);
+ pnotify_unsubscribe(subscriber);
+ }
+
+ up_write(&task->pnotify_subscriber_list_sem);
+
+ return; /* nothing to report: the exit callbacks return void */
+}
+
+
+/**
+ * __pnotify_exec - Execute exec callback for each subscriber in this task
+ * @task: We go through the subscriber list in the given task
+ *
+ * Used when a process that has a subscriber list does an exec.
+ *
+ */
+int
+__pnotify_exec(struct task_struct *task)
+{
+ struct pnotify_subscriber *subscriber;
+
+ down_read(&task->pnotify_subscriber_list_sem);
+
+ list_for_each_entry(subscriber, &task->pnotify_subscriber_list, entry) {
+ if (subscriber->events->exec) /* conditional because it's optional */
+ subscriber->events->exec(task, subscriber);
+ }
+
+ up_read(&task->pnotify_subscriber_list_sem);
+ return 0;
+}
+
+
+EXPORT_SYMBOL_GPL(pnotify_get_subscriber);
+EXPORT_SYMBOL_GPL(pnotify_subscribe);
+EXPORT_SYMBOL_GPL(pnotify_unsubscribe);
+EXPORT_SYMBOL_GPL(pnotify_register);
+EXPORT_SYMBOL_GPL(pnotify_unregister);
Index: linux/include/linux/pnotify.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/include/linux/pnotify.h 2005-09-19 18:05:34.770234880 -0500
@@ -0,0 +1,227 @@
+/*
+ * Process Notification (pnotify) interface
+ *
+ *
+ * Copyright (c) 2000-2002, 2004-2005 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ *
+ * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane,
+ * Mountain View, CA 94043, or:
+ *
+ * http://www.sgi.com
+ *
+ * For further information regarding this notice, see:
+ *
+ * http://oss.sgi.com/projects/GenInfo/NoticeExplan
+ */
+
+/*
+ * Data structure definitions and function prototypes used to implement
+ * process notification (pnotify).
+ *
+ * pnotify provides a method (service) for kernel modules to be notified when
+ * certain events happen in the life of a process. It also provides a
+ * data pointer that is associated with a given process. See
+ * Documentation/pnotify.txt for a full description.
+ */
+
+#ifndef _LINUX_PNOTIFY_H
+#define _LINUX_PNOTIFY_H
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_PNOTIFY
+
+#define PNOTIFY_NAMELN 32 /* Max chars in PNOTIFY kernel module name */
+
+#define PNOTIFY_ERROR -1 /* Error. Fork fail for pnotify_fork */
+#define PNOTIFY_OK 0 /* All is well, stay subscribed */
+#define PNOTIFY_NOSUB 1 /* All is well but don't subscribe module
+ * to subscriber list for the process */
+
+
+/**
+ * INIT_PNOTIFY_LIST - init a pnotify subscriber list struct after declaration
+ * @_l: Task struct whose pnotify_subscriber_list and semaphore are initialized
+ *
+ */
+#define INIT_PNOTIFY_LIST(_l) \
+do { \
+ INIT_LIST_HEAD(&(_l)->pnotify_subscriber_list); \
+ init_rwsem(&(_l)->pnotify_subscriber_list_sem); \
+} while(0)
+
+/*
+ * Used by task_struct to manage list of subscriber kernel modules for the
+ * process. Each pnotify_subscriber provides the link between the process
+ * and the correct kernel module subscriber.
+ *
+ * STRUCT MEMBERS:
+ * events: Reference to pnotify_events structure, which
+ * holds the name key and function pointers.
+ * data: Opaque data pointer - defined by pnotify kernel modules.
+ * entry: List pointers
+ */
+struct pnotify_subscriber {
+ struct pnotify_events *events;
+ void *data;
+ struct list_head entry;
+};
+
+/*
+ * Used by pnotify modules to define the callback functions into the
+ * module. See Documentation/pnotify.txt for details.
+ *
+ * STRUCT MEMBERS:
+ * name: The name of the pnotify container type provided by
+ * the module. This will be set by the pnotify module.
+ * fork: Function pointer to function used when associating
+ * a forked process with a kernel module referenced by
+ * this struct. pnotify.txt will provide details on
+ * special return codes interpreted by pnotify.
+ *
+ * exit: Function pointer to function used when a process
+ * associated with the kernel module owning this struct
+ * exits.
+ *
+ * init: Function pointer to initialization function. This
+ * function is used when the module registers with pnotify
+ * to associate existing processes with the referring
+ * kernel module. This is optional and may be set to NULL
+ * if it is not needed by the pnotify kernel module.
+ *
+ * Note: The return values are managed the same way as in
+ * fork above. Except, of course, an error doesn't
+ * result in a fork failure.
+ *
+ * Note: The implementation of pnotify_register causes
+ * us to evaluate some tasks more than once in some cases.
+ * See the comments in pnotify_register for why.
+ * Therefore, if the init function pointer returns
+ * PNOTIFY_NOSUB, which means that it doesn't want this
+ * process associated with the kernel module, that init
+ * function must be prepared to possibly look at the same
+ * "skipped" task more than once.
+ *
+ * data: Opaque data pointer - defined by pnotify modules.
+ * module: Pointer to kernel module struct. Used to increment &
+ * decrement the use count for the module.
+ * entry: List pointers
+ * exec: Function pointer to function used when a process
+ * this kernel module is subscribed to execs. This
+ * is optional and may be set to NULL if it is not
+ * needed by the pnotify module.
+ * refcnt: Keep track of user count of pnotify_events
+ */
+struct pnotify_events {
+ struct module *module;
+ char *name; /* Name Key - restricted to 32 chars */
+ void *data; /* Opaque module specific data */
+ struct list_head entry; /* List pointers */
+ atomic_t refcnt; /* usage counter */
+ int (*init)(struct task_struct *, struct pnotify_subscriber *);
+ int (*fork)(struct task_struct *, struct pnotify_subscriber *, void*);
+ void (*exit)(struct task_struct *, struct pnotify_subscriber *);
+ void (*exec)(struct task_struct *, struct pnotify_subscriber *);
+};
+
+
+/* Kernel service functions for providing pnotify support */
+extern struct pnotify_subscriber *pnotify_get_subscriber(struct task_struct
+ *task, char *key);
+extern struct pnotify_subscriber *pnotify_subscribe(struct task_struct *task,
+ struct pnotify_events *pt);
+extern void pnotify_unsubscribe(struct pnotify_subscriber *subscriber);
+extern int pnotify_register(struct pnotify_events *pt_new);
+extern int pnotify_unregister(struct pnotify_events *pt_old);
+extern int __pnotify_fork(struct task_struct *to_task,
+ struct task_struct *from_task);
+extern void __pnotify_exit(struct task_struct *task);
+extern int __pnotify_exec(struct task_struct *task);
+
+/**
+ * pnotify_fork - child inherits subscriber list associations of its parent
+ * @child: child task - to inherit
+ * @parent: parent task - child inherits subscriber list from this parent
+ *
+ * Function used when a child process must inherit subscriber list associations
+ * from the parent. Return code is propagated as a fork fail.
+ *
+ */
+static inline int pnotify_fork(struct task_struct *child,
+ struct task_struct *parent)
+{
+ INIT_PNOTIFY_LIST(child);
+ if (!list_empty(&parent->pnotify_subscriber_list))
+ return __pnotify_fork(child, parent);
+
+ return 0;
+}
+
+
+/**
+ * pnotify_exit - Detach subscriber kernel modules from this process
+ * @task: The task the subscribers will be detached from
+ *
+ */
+static inline void pnotify_exit(struct task_struct *task)
+{
+ if (!list_empty(&task->pnotify_subscriber_list))
+ __pnotify_exit(task);
+}
+
+/**
+ * pnotify_exec - Used when a process exec's
+ * @task: The process doing the exec
+ *
+ */
+static inline void pnotify_exec(struct task_struct *task)
+{
+ if (!list_empty(&task->pnotify_subscriber_list))
+ __pnotify_exec(task);
+}
+
+/**
+ * INIT_TASK_PNOTIFY - Used in INIT_TASK to set head and sem of subscriber list
+ * @tsk: The task to work with
+ *
+ * Macro used in INIT_TASK to set the head and sem of pnotify_subscriber_list.
+ * If CONFIG_PNOTIFY is off, it is defined as an empty macro below.
+ *
+ */
+#define INIT_TASK_PNOTIFY(tsk) \
+ .pnotify_subscriber_list = LIST_HEAD_INIT(tsk.pnotify_subscriber_list),\
+ .pnotify_subscriber_list_sem = \
+ __RWSEM_INITIALIZER(tsk.pnotify_subscriber_list_sem),
+
+#else /* CONFIG_PNOTIFY */
+
+/*
+ * Replacement macros used when pnotify (Process Notification) support is not
+ * compiled into the kernel.
+ */
+#define INIT_TASK_PNOTIFY(tsk)
+#define INIT_PNOTIFY_LIST(l) do { } while(0)
+#define pnotify_fork(ct, pt) ({ 0; })
+#define pnotify_exit(t) do { } while(0)
+#define pnotify_exec(t) do { } while(0)
+#define pnotify_unsubscribe(t) do { } while(0)
+
+#endif /* CONFIG_PNOTIFY */
+
+#endif /* _LINUX_PNOTIFY_H */
Index: linux/Documentation/pnotify.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/Documentation/pnotify.txt 2005-09-20 10:33:07.412646211 -0500
@@ -0,0 +1,388 @@
+What I propose here is Process Notification (pnotify). This is derived from
+PAGG. It's been re-worked to have some better documentation (below) and
+variable names that better reflect what is really happening.
+
+My hope is that people will take a fresh look at this. This has been
+hashed in the community before, and was even in Andrew's tree at one time.
+
+Here, I've made an effort to better describe what I'm trying to do in the
+hopes that pnotify or something that provides similar functionality can be
+made available in the kernel.
+
+I'll also be posting one user of this: Linux Job. SGI has other opensource
+projects that we haven't pushed to the community that make use of this too.
+
+CSA (comprehensive system accounting) can make use of Job too.
+
+I'm hoping we can get this, or something that provides similar functionality,
+accepted into the kernel.
+
+
+Process Notification (pnotify)
+------------------------------
+pnotify provides a method (service) for kernel modules to be notified when
+certain events happen in the life of a process. Events we support include
+fork, exit, and exec. A special init event is also supported (see events
+below). More events could be added. pnotify also provides a generic data
+pointer for the modules to work with so that data can be associated per
+process.
+
+A kernel module registers a service request describing the events it cares
+about (a pnotify_events structure) by calling pnotify_register. The request
+tells pnotify which notifications the kernel module wants. The kernel module
+passes along function pointers to be called for these events (exit, fork, exec)
+in the pnotify_events service request.
+
+From the process point of view, each process has a kernel module subscriber
+list (pnotify_subscriber_list). These kernel modules are the ones who want
+notification about the life of the process. As described above, each kernel
+module subscriber on the list has a generic data pointer to point to data
+associated with the process.
+
+In the case of fork, pnotify will allocate the same kernel module subscriber
+list for the new child that existed for the parent. The kernel module's
+function pointer for fork is also called for the child being constructed so
+the kernel module can do whatever it needs to do when a parent forks this
+child. Special return values apply to the fork and init events that don't apply
+to others. They are described in the fork and init examples below.
+
+For exit, similar things happen but the exit function pointer for each
+kernel module subscriber is called and the kernel module subscriber entry for
+that process is deleted.
+
+
+Events
+------
+Events are stages of a process's life that kernel modules care about. The
+fork event is triggered in a certain location in copy_process when a parent
+forks. The exit event happens when a process is going away. We also support
+an exec event, which happens when a process execs. Finally, there is an init
+event. This special event makes it so this kernel module will be associated
+with all current processes in the system at the time of registration. This is
+used when a kernel module wants to keep track of all current processes as
+opposed to just those it associates by itself (and children that follow). The
+events a kernel module cares about are set up in the pnotify_events
+structure - see usage below.
+
+When setting up a pnotify_events, you designate which events you care about
+by either associating NULL (meaning you don't care about that event) or a
+pointer to the function to run when the event is triggered. The fork and exit
+events are currently required.
+
+
+How do processes become associated with kernel modules?
+-------------------------------------------------------
+Your kernel module itself can use the pnotify_subscribe function to associate
+a given process with a given pnotify_events structure. This adds
+your kernel module to the subscriber list of the process. In the case
+of inescapable job containers making use of PAM, when PAM allows a person to
+log in, PAM contacts job (via a PAM job module which uses the job userland
+library) and the kernel Job code will call pnotify_subscribe to associate the
+process with pnotify. From that point on, the kernel module will be notified
+about events in the process's life that the module cares about (as well
+as those of any children that process may later have).
+
+Likewise, your kernel module can remove an association between it and
+a given process by using pnotify_unsubscribe.
+
+
+Example Usage
+-------------
+
+=== filling out the pnotify_events structure ===
+
+A kernel module wishing to use pnotify needs to set up a pnotify_events
+structure. This structure tells pnotify which events you care about and what
+functions to call when those events are triggered. In addition, you supply a
+name (usually the kernel module name). The entry is always filled out as
+shown below. .module is usually set to THIS_MODULE. .data can optionally be
+used to store a pointer with the pnotify_events structure.
+
+Example of a filled out pnotify_events:
+
+static struct pnotify_events pnotify_events = {
+ .module = THIS_MODULE,
+ .name = "test_module",
+ .data = NULL,
+ .entry = LIST_HEAD_INIT(pnotify_events.entry),
+ .init = test_init,
+ .fork = test_attach,
+ .exit = test_detach,
+ .exec = test_exec,
+};
+
+The above pnotify_events structure says the kernel module "test_module" cares
+about events fork, exit, exec, and init. In fork, call the kernel module's
+test_attach function. In exec, call test_exec. In exit, call test_detach.
+The init event is specified, so all processes on the system will be associated
+with this kernel module during registration and the test_init function will
+be run for each.
+
+
+=== Registering with pnotify ===
+
+You will likely register with pnotify in your kernel module's module_init
+function. Here is an example:
+
+static int __init test_module_init(void)
+{
+ int rc = pnotify_register(&pnotify_events);
+ if (rc < 0) {
+ return rc; /* propagate the error from pnotify_register */
+ }
+
+ return 0;
+}
+
+
+=== Example init event function ===
+
+Since the init event is defined, it means this kernel module is added
+to the subscriber list of all processes -- it will receive notification
+about events it cares about for all processes and all children that
+follow.
+
+Of course, if a kernel module doesn't need to know about all current
+processes, that module shouldn't implement this and '.init' in the
+pnotify_events structure would be NULL.
+
+This is as opposed to the normal method where the kernel module adds itself
+to the subscriber list of a process using pnotify_subscribe.
+
+Important:
+Note: The implementation of pnotify_register causes us to evaluate some tasks
+more than once in some cases. See the comments in pnotify_register for why.
+Therefore, if the init function pointer returns PNOTIFY_NOSUB, which means
+that it doesn't want a process association, that init function must be
+prepared to possibly look at the same "skipped" task more than once.
+
+Note that the return value here is similar to the fork function pointer
+below except there is no notion of failing the fork since existing processes
+aren't forking.
+
+PNOTIFY_OK - good, adds the kernel module to the subscriber list for process
+PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process
+
+static int test_init(struct task_struct *tsk, struct pnotify_subscriber *subscriber)
+{
+ if (pnotify_get_subscriber(tsk, "test_module") == NULL)
+ dprintk("ERROR pnotify expected \"%s\" PID = %d\n", "test_module", tsk->pid);
+
+ dprintk("FYI pnotify init hook fired for PID = %d\n", tsk->pid);
+ atomic_inc(&init_count);
+ return PNOTIFY_OK;
+}
+
+
+=== Example fork (test_attach) function ===
+
+This function is executed when a process forks - this is associated
+with the pnotify_fork callout in copy_process. There would be a very
+similar test_detach function (not shown).
+
+pnotify will add the kernel module to the notification list for the child
+process automatically and then execute this fork function pointer (test_attach
+in this example). However, the kernel module can control whether the kernel
+module stays on the process's subscriber list and wants notification by the
+return value.
+
+PNOTIFY_ERROR - prevent the process from continuing - failing the fork
+PNOTIFY_OK - good, adds the kernel module to the subscriber list for process
+PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process
+
+
+static int test_attach(struct task_struct *tsk, struct pnotify_subscriber *subscriber, void *vp)
+{
+ dprintk("pnotify attach hook fired for PID = %d\n", tsk->pid);
+ atomic_inc(&attach_count);
+
+ return PNOTIFY_OK;
+}
+
+
+=== Example exec event function ===
+
+And here is an example function to run when a task gets to exec. So any
+time a "tracked" process gets to exec, this would execute.
+
+static void test_exec(struct task_struct *tsk, struct pnotify_subscriber *subscriber)
+{
+ dprintk("pnotify exec hook fired for PID %d\n", tsk->pid);
+ atomic_inc(&exec_count);
+}
+
+
+=== Unregistering with pnotify ===
+
+You will likely wish to unregister with pnotify in the kernel module's
+module_exit function. Here is an example:
+
+static void __exit test_module_cleanup(void)
+{
+ pnotify_unregister(&pnotify_events);
+ printk("detach called %d times...\n", atomic_read(&detach_count));
+ printk("attach called %d times...\n", atomic_read(&attach_count));
+ printk("init called %d times...\n", atomic_read(&init_count));
+ printk("exec called %d times ...\n", atomic_read(&exec_count));
+ if (atomic_read(&attach_count) + atomic_read(&init_count) !=
+ atomic_read(&detach_count))
+ printk("pnotify PROBLEM: attach count + init count SHOULD equal detach count and doesn't\n");
+ else
+ printk("Good - attach count + init count equals detach count.\n");
+}
+
+
+
+=== Actually using data associated with the process in your module ===
+
+The above examples show you how to create an example kernel module using
+pnotify, but they didn't show what you might do with the data pointer
+associated with a given process. Below, find an example of accessing
+the data pointer for a given process from within a kernel module making use
+of pnotify.
+
+pnotify_get_subscriber is used to retrieve the pnotify subscriber for a given
+process and kernel module. Like this:
+
+subscriber = pnotify_get_subscriber(task, name);
+
+Where name is your kernel module's name (as provided in the pnotify_events
+structure) and task is the process you're interested
+in.
+
+Please be careful about locking. The task structure has a
+pnotify_subscriber_list_sem to be used for locking. This example retrieves
+a given task in a way that ensures it doesn't disappear while we try to
+access it (that's why we do locking for the tasklist_lock and task). The
+pnotify subscriber list is locked to ensure the list doesn't change as we
+search it with pnotify_get_subscriber.
+
+ read_lock(&tasklist_lock);
+ get_task_struct(task); /* Ensure the task doesn't vanish on us */
+ read_unlock(&tasklist_lock); /* Unlock the tasklist */
+ down_read(&task->pnotify_subscriber_list_sem); /* readlock subscriber list */
+
+ subscriber = pnotify_get_subscriber(task, name);
+ if (subscriber) {
+ /* Get the widgitId associated with this task */
+ widgitId = ((widgitId_t *)subscriber->data);
+ }
+ up_read(&task->pnotify_subscriber_list_sem); /* unlock subscriber list */
+ put_task_struct(task); /* Done accessing the task */
+
+
+Future Events
+-------------
+Kingsley Cheung suggested that we add events for uid and gid changes and this
+may inspire broader use. Depending on how the discussion goes, I'll post a
+patch to add this functionality in the next day or two.
+
+History
+-------
+Process Notification used to be known as PAGG (Process Aggregates).
+It was re-written to be called Process Notification because we believe this
+better describes its purpose. Structures and functions were re-named to
+be more clear and to reflect the new name.
+
+
+Why Not Notifier Lists?
+-----------------------
+We investigated the use of notifier lists, available in newer kernels.
+
+Notifier lists would not be as efficient as pnotify for kernel modules
+wishing to associate data with processes. With pnotify, if the
+pnotify_subscriber_list of a given task is empty, we can instantly know
+there are no kernel modules that care about the process. Further, the
+callbacks happen in places where the task struct is likely to be cached.
+So this is a quick operation. With notifier lists, the scope is system
+wide rather than per process. As long as one kernel module wants to be
+notified, we have to walk the notifier list and potentially waste cycles.
+In the case of pnotify, we only walk lists if we're interested in
+a specific task.
+
+On a system where pnotify is used to track only a few processes, the
+overhead of walking the notifier list is high compared to the overhead
+of walking the kernel module subscriber list only when a kernel module
+is interested in a given process.
+
+I don't believe this is easily solved in notifier lists themselves as
+they are meant to be global resources, not per-task resources.
+
+Overlooking performance issues, notifier lists in and of themselves wouldn't
+solve the problem pnotify solves anyway. Although you could argue notifier
+lists can implement the callback portion of pnotify, there is no association
+of data with a given process. This is needed for kernel modules to
+efficiently associate a task with a data pointer without cluttering up
+the task struct.
+
+In addition to data associated with a process, we desire the ability for
+kernel modules to add themselves to the subscriber list for any arbitrary
+process - not just current or a child of current.
+
+
+Some Justification
+------------------
+We feel that pnotify could be used to reduce the size of the task struct or
+the number of functions in copy_process. For example, if another part of the
+kernel needs to know when a process is forking or exiting, they could use
+pnotify instead of adding additional code to task struct, copy_process, or
+exit.
+
+Some have argued in the past that PAGG shouldn't be used because it will
+allow interesting things to be implemented outside of the kernel. While this
+might be a small risk, having these in place allows customers and users to
+implement kernel components that you don't want to see in the kernel anyway.
+
+For example, a certain vendor may have an urgent need to implement kernel
+functionality or special types of accounting that nobody else is interested
+in. That doesn't mean the code isn't open-source, it just means it isn't
+applicable to all of Linux because it satisfies a niche.
+
+All of pnotify's functionality that needs to be exported is exported with
+EXPORT_SYMBOL_GPL to discourage abuse.
+
+The risk already exists in the kernel for people to implement modules outside
+the kernel that suffer from less peer review and possibly bad programming
+practice. pnotify could add more opportunities for out-of-tree kernel module
+authors to make new modules. I believe this is somewhat mitigated by the
+already-existing 'tainted' warnings in the kernel.
+
+Other Ideas?
+------------
+There have been similar proposals to provide pieces of the pnotify
+functionality. If there is a better proposal out there, let's explore it.
+Here are some key functions I hope to see in any proposal:
+
+ - Ability to have notification for exec, fork, exit at minimum
+ - Ability to extend to other callouts later (such as uid/gid changes as
+ I described earlier)
+ - Ability for pnotify user modules to implement code that ends up adding
+ a kernel module subscriber to any arbitrary process (not just current and
+ its children).
+
+I believe, if the above are more or less met, we should be in good shape for
+our other open source projects such as Linux Job.
+
+Variable Name Changes from PAGG to pnotify
+------------------------------------------
+PAGG_NAMELEN -> PNOTIFY_NAMELN
+struct pagg -> pnotify_subscriber
+pagg_get -> pnotify_get_subscriber
+pagg_alloc -> pnotify_subscribe
+pagg_free -> pnotify_unsubscribe
+pagg_hook_register -> pnotify_register
+pagg_hook_unregister -> pnotify_unregister
+pagg_attach -> pnotify_fork
+pagg_detach -> pnotify_exit
+pagg_exec -> pnotify_exec
+struct pagg_hook -> pnotify_events
+
+With pnotify_events (formerly pagg_hook):
+ attach -> fork
+ detach -> exit
+
+Return codes for the init and fork function pointers should use:
+PNOTIFY_ERROR - prevent the process from continuing - failing the fork
+PNOTIFY_OK - good, adds the kernel module to the subscriber list for process
+PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process
+

Erik Jacobson <erikj@...> wrote:
>
> What I propose here is Process Notification (pnotify). This is derived from
> PAGG. It's been re-worked to have some better documentation (below) and
> variable names that better reflect what is really happening.
Last time we discussed all this we'd pretty much worked out that all
accounting functions could be performed from userspace, as long as the
connector-based fork()/exit()/exec() notifications were in place.
That is of course a vastly preferable approach. Do you see something which
makes it unfeasible?

> > What I propose here is Process Notification (pnotify). This is derived from
> > PAGG. It's been re-worked to have some better documentation (below) and
> > variable names that better reflect what is really happening.
>
>
> Last time we discussed all this we'd pretty much worked out that all
> accounting functions could be performed from userspace, as long as the
> connector-based fork()/exit()/exec() notifications were in place.
>
> That is of course a vastly preferable approach. Do you see something which
> makes it unfeasible?
Unless I'm missing something (feel free to point me somewhere), my
understanding was this technique is potentially lossy, right? We have
customers who don't want to take even a chance that some accounting data
is lost for some reason.
These are the same customers who tend to use a machine to its fullest
(read beat it to death :).
The idea of pnotify is various kernel components can register to be
notified about events in the life of any process they choose. It's somewhat
similar to notifier lists, only they are per-process based and have a data
pointer associated with the task. The idea here isn't to limit it just
to accounting type applications.
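
A minimal userspace model of that idea, not the code from the patch, may help
make it concrete. It assumes the three return codes named in the pnotify
documentation (PNOTIFY_ERROR, PNOTIFY_OK, PNOTIFY_NOSUB) with made-up numeric
values, and every struct and function name below (struct task, struct events,
model_fork, and so on) is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Return codes named in the pnotify docs; the numeric values are assumed. */
enum { PNOTIFY_ERROR = -1, PNOTIFY_OK = 0, PNOTIFY_NOSUB = 1 };

struct task;
struct subscriber;

/* Models pnotify_events: the callbacks supplied by one kernel module.
 * The fork callback is assumed to be non-NULL in this sketch. */
struct events {
    int  (*fork)(struct task *child, struct subscriber *parent_sub,
                 void **child_data);
    void (*exit)(struct task *task, struct subscriber *sub);
};

/* One entry on a task's subscriber list, with its per-task data pointer. */
struct subscriber {
    const struct events *events;
    void *data;
    struct subscriber *next;
};

/* Hypothetical stand-in for the few task_struct fields the model needs. */
struct task {
    int pid;
    struct subscriber *subs;
};

/* Attach a module to a task. Returns the new entry, or NULL on allocation failure. */
struct subscriber *model_subscribe(struct task *t, const struct events *ev,
                                   void *data)
{
    struct subscriber *s = malloc(sizeof(*s));
    if (!s)
        return NULL;
    s->events = ev;
    s->data = data;
    s->next = t->subs;
    t->subs = s;
    return s;
}

/* Drop every entry on the list without running any callbacks. */
static void drop_all(struct task *t)
{
    while (t->subs) {
        struct subscriber *s = t->subs;
        t->subs = s->next;
        free(s);
    }
}

/* Fork path: offer the child to each of the parent's subscribers.
 * Returns 0 on success, -1 if a subscriber (or malloc) fails the fork. */
int model_fork(const struct task *parent, struct task *child)
{
    child->subs = NULL;
    for (struct subscriber *s = parent->subs; s; s = s->next) {
        void *data = NULL;
        int rc = s->events->fork(child, s, &data);

        if (rc == PNOTIFY_NOSUB)
            continue;               /* fine, but do not subscribe the child */
        if (rc != PNOTIFY_OK || !model_subscribe(child, s->events, data)) {
            drop_all(child);        /* PNOTIFY_ERROR: the fork fails */
            return -1;
        }
    }
    return 0;
}

/* Exit path: notify each subscriber, then release its list entry. */
void model_exit(struct task *t)
{
    while (t->subs) {
        struct subscriber *s = t->subs;
        t->subs = s->next;
        if (s->events->exit)
            s->events->exit(t, s);
        free(s);
    }
}

/* Example module: counts live tasks through the shared data pointer. */
static int count_fork(struct task *child, struct subscriber *parent_sub,
                      void **child_data)
{
    int *live = parent_sub->data;
    (void)child;
    ++*live;
    *child_data = live;
    return PNOTIFY_OK;
}

static void count_exit(struct task *task, struct subscriber *sub)
{
    (void)task;
    --*(int *)sub->data;
}

const struct events counting_module = { count_fork, count_exit };
```

A caller would subscribe counting_module to one task with a pointer to an int,
call model_fork() for each child, and model_exit() when a task goes away. The
counter then tracks how many subscribed tasks are alive. How the real patch
cleans up after a failed fork is not modeled here.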
If someone has a suggestion of something in the task struct or
copy_process path that could make use of this, I'd be happy to give
a shot at implementing it. Perhaps something that isn't super "hot"
in the task struct?
If we added process notifiers for gid/uid changes, could CKRM make use of
this as well (Kingsley asked me this, I said I'd look in to implementing
it)?

Erik Jacobson <erikj@...> wrote:
>
> > > What I propose here is Process Notification (pnotify). This is derived from
> > > PAGG. It's been re-worked to have some better documentation (below) and
> > > variable names that better reflect what is really happening.
> >
> >
> > Last time we discussed all this we'd pretty much worked out that all
> > accounting functions could be performed from userspace, as long as the
> > connector-based fork()/exit()/exec() notifications were in place.
> >
> > That is of course a vastly preferable approach. Do you see something which
> > makes it unfeasible?
>
> Unless I'm missing something (feel free to point me somewhere), my
> understanding was this technique is potentially lossy, right?
Well this comes back to requirements and I've never seen anyone attempt to
detail accounting requirements.
I don't know what you mean by "lossy", really and no, I don't know if it'll
be lossy. Guillaume may remember/explain.
> We have
> customers who don't want to take even a chance that some accounting data
> is lost for some reason.
>
> These are the same customers who tend to use a machine to its fullest
> (read beat it to death :).
>
> The idea of pnotify is various kernel components can register to be
> notified about events in the life of any process they choose. It's somewhat
> similar to notifier lists, only they are per-process based and have a data
> pointer associated with the task. The idea here isn't to limit it just
> to accounting type applications.
Erik, this is a real problem. We _really_ dislike adding fancy
infrastructure because something might come along later and use it.
I thought the main reason for adding the connector infrastructure was to
support system accounting functions, and now this!

> Well this comes back to requirements and I've never seen anyone attempt to
> detail accounting requirements.
>
> I don't know what you mean by "lossy", really and no, I don't know if it'll
> be lossy. Guillaume may remember/explain.
I mean, are we guaranteed that no information will be lost?
> Erik, this is a real problem. We _really_ dislike adding fancy
> infrastructure because something might come along later and use it.
>
> I thought the main reason for adding the connector infrastructure was to
> support system accounting functions, and now this!
I understand, and I'm hit by the other side of that problem. Linux Job is
of general interest I think. But our systems have unique customers. For
those unique customers, we have special functionality that is important to
them. They're open-source, but they're certainly not of general interest to
everybody.
So, besides Job, it's hard for me to provide a list of users that everybody
would agree are useful. I don't have a good way to hook these things we
need in, and you don't have a good way to accept them because they're not of
general interest.
Thankfully, SUSE has PAGG in SLES9, otherwise we'd be in worse shape.
But for obvious reasons, SUSE doesn't want to have non-mainline kernel pieces in
their own kernels forever.
My humble request is: Can you help me to make this interesting to you?
Perhaps by using it to reduce some fields from the task struct or
similar? Or are you saying that even if I could show these things, I
have no chance here? If you think the idea is good but just lacks users,
maybe you have some suggestions I can research?
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

Erik Jacobson <erikj@...> wrote:
>
> > Well this comes back to requirements and I've never seen anyone attempt to
> > detail accounting requirements.
> >
> > I don't know what you mean by "lossy", really and no, I don't know if it'll
> > be lossy. Guillaume may remember/explain.
>
> I mean, are we guaranteed that no information will be lost?
What information needs to be gathered?
> > Erik, this is a real problem. We _really_ dislike adding fancy
> > infrastructure because something might come along later and use it.
> >
> > I thought the main reason for adding the connector infrastructure was to
> > support system accounting functions, and now this!
>
> I understand, and I'm hit by the other side of that problem. Linux Job is
> of general interest I think. But our systems have unique customers. For
> those unique customers, we have special functionality that is important to
> them. They're open-source, but they're certainly not of general interest to
> everybody.
>
> So, besides Job, it's hard for me to provide a list of users that everybody
> would agree are useful. I don't have a good way to hook these things we
> need in, and you don't have a good way to accept them because they're not of
> general interest.
This just all seems to be coming at it from the wrong end. What are the
requirements?
> Thankfully, SUSE has PAGG in SLES9, otherwise we'd be in worse shape.
> But for obvious reasons, SUSE doesn't to have non-mainline kernel pieces in
> their own kernels forever.
>
> My humble request is: Can you help me to make this interesting to you?
I don't count. Here, I'm communicating to you what I expect will be the
consensus reaction from the main body of the kernel team. Accounting is a
specialised thing and we'd prefer to keep its footprint small, and pushing
things out to userspace where possible is always good. It should always be the
default choice and we only move things in-kernel if it's demonstrated that
they can not be performed in userspace. And then we put the minimum amount
of functionality into the kernel.
I don't know anything useful about system accounting (I rely upon you to
educate me, sorry) and really I don't want to - if I see an implementation
which all stakeholders are happy with and which has a kernel footprint
which non-stakeholders find acceptable then in it goes. Right now, I don't
think we have either.
So, please, can you take the time to put together a brief bullet-point
description of the system accounting requirements, describing where
appropriate the parts which simply cannot be performed in userspace?
On 23 Feb 2005 Guillaume said, on this mailing list:
Guillaume Thouvenin <guillaume.thouvenin@...> wrote:
>
> On Wed, 2005-02-23 at 00:51 -0800, Andrew Morton wrote:
> > ...
> > The 2.6.8.1 ELSA patch adds quite a bit of kernel code, but from what
> > you're saying it seems like most of that has become redundant, and all
> > you now need is the fork notifier. Is that correct?
>
> Yes, that's correct. All I need is the fork connector patch. It needs
> more work like, as you said, sending an on/off message down the netlink
> socket. I'm working on this (thank you very much Andrew for your
> comments).
If this is still true for ELSA, what's different about your accounting
requirements?

>
> On 23 Feb 2005 Guillaume said, on this mailing list:
>
> Guillaume Thouvenin <guillaume.thouvenin@...> wrote:
>
>>On Wed, 2005-02-23 at 00:51 -0800, Andrew Morton wrote:
>>
>>>...
>>>The 2.6.8.1 ELSA patch adds quite a bit of kernel code, but from what
>>>you're saying it seems like most of that has become redundant, and all
>>>you now need is the fork notifier. Is that correct?
>>
>> Yes, that's correct. All I need is the fork connector patch. It needs
>>more work like, as you said, sending an on/off message down the netlink
>>socket. I'm working on this (thank you very much Andrew for your
>>comments).
>
>
> If this is still true for ELSA, what's different about your accounting
> requirements?
Andrew, ELSA does not collect system accounting raw data. It relies on
BSD and CSA to provide accounting data via a pacct file in binary format.
That is why ELSA does not need an exit notifier.
ELSA allows users to view accounting data in terms of a "job" (a group of
processes). It relies on a fork notifier to put pids into one "job"
container if they were forked from the same ppid. Thus, ELSA performs
process grouping and data presentation functions.
Per-process accounting data is accumulated at task and task->mm. These
data need to be saved off somewhere before a task struct is freed.
The action is triggered by an exit event. The BSD accounting has
a function hook at do_exit() to do just that.
What you would like to see is fork/exit/exec event notification moved
to userspace through a connector. However, a connector is an
unreliable datagram socket. If we lose an exit notification when the system
is extremely busy, we lose accounting data once the task struct is freed.
That is not what we have today. Today we have a BSD function hook
in the kernel to copy off the data.
"pnotify" is not an accounting thing. It is viewed as an in-kernel
event notification. Functionality-wise it is like a connector
except that "pnotify" does event notification in the kernel while
connector sends event notification to userspace.
A fork notifier is used to do process grouping. An exit notifier
is used to save off accounting data.
System accounting needs a reliable event notifier. It is especially
true on exit events.
Does CKRM save off accounting data also? How is that done?
Thanks,
- jay
>

Jay Lan <jlan@...> wrote:
>
> >
> > On 23 Feb 2005 Guillaume said, on this mailing list:
> >
> > Guillaume Thouvenin <guillaume.thouvenin@...> wrote:
> >
> >>On Wed, 2005-02-23 at 00:51 -0800, Andrew Morton wrote:
> >>
> >>>...
> >>>The 2.6.8.1 ELSA patch adds quite a bit of kernel code, but from what
> >>>you're saying it seems like most of that has become redundant, and all
> >>>you now need is the fork notifier. Is that correct?
> >>
> >> Yes, that's correct. All I need is the fork connector patch. It needs
> >>more work like, as you said, sending an on/off message down the netlink
> >>socket. I'm working on this (thank you very much Andrew for your
> >>comments).
> >
> >
> > If this is still true for ELSA, what's different about your accounting
> > requirements?
>
> Andrew, ELSA does not collect system accounting raw data. It relies on
> BSD and CSA to provide accouting data via a pacct file in binary format.
> That is why ELSA does not need an exit notifier.
OK..
> ...
> What you like to see is to move fork/exit/exec event notification
> to userspace through a connector.
That would be nice, rather than adding new stuff.
> However, a connector is a
> unreliable datagram socket. If we lose exit notification when system
> is extremely busy, we lose accounting data once task struct is freed.
That hasn't been demonstrated, has it? connector uses GFP_KERNEL and is in
fact reliable. Stuff may get lost down in the bowels of netlink code,
although connector does pass GFP_KERNEL into netlink.
> That is not what we have today. Today we have BSD function hook
> in the kernel to copy off the data.
>
> "pnotify" is not an accounting thing. It is viewed as an in-kernel
> event notification. Functionality-wise it is like a connector
> except that "pnotify" does event notification in the kernel while
> connector sends event notification to userspace.
OK. connector could be used for in-kernel notifications, but it would be
fairly unnatural.
> A fork notifier is used to do process grouping. An exit notifier
> is used to save off accounting data.
Does that code exist yet?
> System accounting needs a reliable event notifier. It is especially
> true on exit events.
Well I doubt if connector would cause any reliability problems in this
application. However having to implement a connector client in-kernel
might be a bit silly.
> Does CKRM save off accounting data also? How is that done?
Heaven knows.

I renamed the subject line so that the accounting discussion can be
easily separated from pnotify discussion.
Andrew Morton wrote:
> Jay Lan <jlan@...> wrote:
>
>
>>...
>>What you like to see is to move fork/exit/exec event notification
>>to userspace through a connector.
>
>
> That would be nice, rather than adding new stuff.
>
>
>>However, a connector is a
>>unreliable datagram socket. If we lose exit notification when system
>>is extremely busy, we lose accounting data once task struct is freed.
>
>
> That hasn't been demonstrated, has it? connector uses GFP_KERNEL and is in
> fact reliable. Stuff may get lost down in the bowels of netlink code,
> although connector does pass GFP_KERNEL into netlink.
I ran my tests in the March time frame. I used Evgeniy's fork generation
program (fork-test) to generate fork events.
The fork-test loop looks like this:
# while 1
# ./fork-test 10000000
# sleep 1
# end
I used a user-space
program that does nothing but read the data and compare sequence numbers.
I ran the fork-test together with AIM7 and sometimes ubench.
In earlier testing, I observed duplicate messages and gaps in the
message numbers (a chunk of messages got lost).
Later on, I did not see duplicate messages, probably because of
improvements made to netlink/connector.
I let the test run overnight. In that 7-8 hour
period, I observed message loss (a sequence number gap) once
or a few times a night. Each time it happened, the loss
was always more than a few messages. Most of the time, I lost a few
tens of messages.
Statistically, it was not that bad. I did not do the calculation,
but it should be less than 0.01%. However, there WAS data loss.
I tried to dig out my test data from March without success, but
it is reproducible.
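
For readers who want to repeat this kind of check, the sequence-number
comparison can be sketched as below. This is not the listener used in the
test above. It is a minimal illustration, and the names struct seq_stats and
seq_check are hypothetical. It assumes 32-bit sequence numbers that increase
by one per message:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Running totals for one stream of sequence-numbered messages. */
struct seq_stats {
    uint64_t received;    /* messages seen so far */
    uint64_t gaps;        /* number of distinct holes in the sequence */
    uint64_t lost;        /* messages missing inside those holes */
    uint64_t duplicates;  /* duplicate or late (already passed) numbers */
    uint32_t expected;    /* next sequence number we expect */
    int started;          /* set once the first message has been seen */
};

/* Feed one received sequence number into the statistics. */
void seq_check(struct seq_stats *st, uint32_t seq)
{
    assert(st != NULL);
    st->received++;

    if (!st->started) {            /* first message defines the baseline */
        st->started = 1;
        st->expected = seq + 1;
        return;
    }
    if (seq == st->expected) {     /* in order, nothing lost */
        st->expected++;
        return;
    }

    /* Signed distance from the expected number; the unsigned subtraction
     * keeps this meaningful across 32-bit wraparound. */
    int32_t delta = (int32_t)(seq - st->expected);
    if (delta > 0) {               /* jumped ahead: delta messages missing */
        st->gaps++;
        st->lost += (uint32_t)delta;
        st->expected = seq + 1;
    } else {                       /* at or behind what we already saw */
        st->duplicates++;
    }
}
```

Zero-initialise a struct seq_stats, call seq_check() for every message read
from the socket, and report gaps, lost and duplicates at the end of the run.
A message that arrives late after a gap was recorded is counted as a
duplicate here, which overstates both numbers slightly. A stricter tool
would track the individual holes.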
The test machine was a 4-processor SGI IA64 SN2 machine.
The fork test does not involve allocating memory since it needs
to send only (ppid, pid, seq_number) down the pipe.
Accounting needs two things at exit: exit event notification for
process grouping and saving off accounting data from task and
task->mm.
It is not clear to me, Andrew, whether you expect both functions
to be moved to userspace, or only the event notification?
The accounting data cannot wait because we want do_exit() to
complete so that resources can be released. If the data cannot
be sent out, we lose it. I do not have a test for an exit notifier,
but I expect the exit data loss would be worse than for fork events.
The accounting record used in BSD is 64 bytes. That is very
condensed data. With the additional fields I added back in 2.6.11,
it will be more. Let's say 80 bytes or, to be extreme, 128 bytes.
>
>
>>That is not what we have today. Today we have BSD function hook
>>in the kernel to copy off the data.
>>
>>"pnotify" is not an accounting thing. It is viewed as an in-kernel
>>event notification. Functionality-wise it is like a connector
>>except that "pnotify" does event notification in the kernel while
>>connector sends event notification to userspace.
>
>
> OK. connector could be used for in-kernel notifications, but it would be
> fairly unnatural.
I agree. It is just not right.
>
>
>>A fork notifier is used to do process grouping. An exit notifier
>>is used to save off accounting data.
>
>
> Does that code exist yet?
An exit notifier does not exist yet.
Shailabh Nagar wrote on accounting data requirements:
> For this kind of data, we're talking of the following type of kernel
> changes
> - marking the state of a task using task flags
> - accumalating the data in fields added to the task_struct
> - sending the accumalated data periodically and/or on demand, through a
> kernel module, to userspace. A connector-based mechanism for sending
> data should be adequate since this is definitely not a high-volume data
> output situation. pnotify doesn't help us here at all afaics.
Based on Shailabh's response, CKRM would have similar needs at exit
time. CKRM is going to use a kernel module. So how is the kernel
module to be notified of those events? Although you did not say so,
you will need to save data at exit time as well.
Thanks,
- jay

Chandra Seetharaman wrote:
> There is a need for fork() and exit() notification in CKRM in kernel:
> - at fork() to initialize the task data structure
> - at exit() to relinquish the task from the class and send the
> uncollected accounting data of the task( (2) above) to the user
> space.
Not exactly the same processing, but the same requirements in kernel
for CSA. :)
So CKRM still needs an in-kernel module and event notification. You are
not moving completely to user space.
>
> I was thinking of using connector for this too, but pnotify would be
> simpler to use. But, pnotify is little heavy weight for our purposes.
> All we need is just a notification, with the task data structure.
>
> Is making pnotify leaner an option for PAGG ? something as simple as
> what is in the patch
> http://marc.theaimsgroup.com/?l=linux-kernel&m=111532025203086&w=2
>
> Note that we do not need all the events listed there anymore, only fork
> and exit.
fork and exit, same here again!
- jay
>

Jay Lan <jlan@...> wrote:
>
> I renamed the subject line so that the accounting discussion can be
> easily separated from pnotify discussion.
>
> Andrew Morton wrote:
> > Jay Lan <jlan@...> wrote:
> >
> >
> >>...
> >>What you like to see is to move fork/exit/exec event notification
> >>to userspace through a connector.
> >
> >
> > That would be nice, rather than adding new stuff.
> >
> >
> >>However, a connector is a
> >>unreliable datagram socket. If we lose exit notification when system
> >>is extremely busy, we lose accounting data once task struct is freed.
> >
> >
> > That hasn't been demonstrated, has it? connector uses GFP_KERNEL and is in
> > fact reliable. Stuff may get lost down in the bowels of netlink code,
> > although connector does pass GFP_KERNEL into netlink.
>
> I ran my testing in March time frame. I used Evgeniy's fork generation
> program (fork-test) to generate fork events.
> The fork-test is like this:
> # while 1
> # ./fork-test 10000000
> # sleep 1
> # end
>
> I used a user-space
> program doing nothing but reading data and compares sequrence numbers.
>
> I ran the fork-test together with AIM7 and sometimes ubench.
>
> In earlier testings, i observed duplicate messages and a gap of
> messages number (a chunk of messages got lost.)
>
> Later on, i have not seen duplicate messages, probably because
> improvements made to netlink/connector.
Yes, I don't think earlier versions of cbus/connector were wholly
race-free. Testing would need to be redone on the in-kernel version. If
netlink itself is not doing internal hard-coded-GFP_ATOMIC allocations I
_think_ the whole thing should be reliable.
If not, we need to work out where the messages went...
> I let the test running over nights. In that 7-8 hours time
> period, i observed messages loss (sequence number gap) once
> or a few times a night. Each time it happened, the messages loss
> count always more than a few. Most of the time, i lost a few
> 10's messaages.
>
> Statistically, it was not that bad. I did not do the calculation
> but it should be less than 0.01%. However, there WAS data loss.
> I tried to dig out my test data in March without success. But,
> it is reproducible.
>
> The test machine was an SGI IA64 4 Processors SN2 machine.
>
> The fork test does not involve allocating memory since it needs
> to send only (ppid,pid, seq_number) down the pipe.
>
> Accounting needs two things at exit: exit event notifcation for
> process grouping and saving off accounting data from task and
> task->mm.
>
> It is not clear to me, Andrew, whether you expect both functions
> be moved to userspace? Or, only the event notification?
Well obviously we want as much as makes sense to be done in userspace, and
no more.
If connector works OK then is it not possible to send a single connector
message containing the accounting information, within the context of
do_exit()?

Andrew Morton wrote:
> Jay Lan <jlan@...> wrote:
>
>>>
>>>>However, a connector is a
>>>>unreliable datagram socket. If we lose exit notification when system
>>>>is extremely busy, we lose accounting data once task struct is freed.
>>>
>>>
>>>That hasn't been demonstrated, has it? connector uses GFP_KERNEL and is in
>>>fact reliable. Stuff may get lost down in the bowels of netlink code,
>>>although connector does pass GFP_KERNEL into netlink.
>>
>>I ran my testing in March time frame. I used Evgeniy's fork generation
>>program (fork-test) to generate fork events.
>>The fork-test is like this:
>># while 1
>># ./fork-test 10000000
>># sleep 1
>># end
>>
>>I used a user-space
>>program doing nothing but reading data and compares sequrence numbers.
>>
>>I ran the fork-test together with AIM7 and sometimes ubench.
>>
>>In earlier testings, i observed duplicate messages and a gap of
>>messages number (a chunk of messages got lost.)
>>
>>Later on, i have not seen duplicate messages, probably because
>>improvements made to netlink/connector.
>
>
> Yes, I don't think earlier versions of cbus/connector were wholly
> race-free. Testing would need to be redone on the in-kernel version. If
> netlink itself is not doing internal hard-coded-GFP_ATOMIC allocations I
> _think_ the whole thing should be reliable.
>
> If not, we need to work out where the messages went...
Agreed. I will rerun my test later this week.
Thanks,
- jay

Jay Lan wrote:
> Andrew Morton wrote:
>> Jay Lan <jlan@...> wrote:
>>
>>>>
>>>>> However, a connector is a
>>>>> unreliable datagram socket. If we lose exit notification when system
>>>>> is extremely busy, we lose accounting data once task struct is freed.
>>>>
>>>>
>>>> That hasn't been demonstrated, has it? connector uses GFP_KERNEL
>>>> and is in
>>>> fact reliable. Stuff may get lost down in the bowels of netlink code,
>>>> although connector does pass GFP_KERNEL into netlink.
>>>
>>> I ran my testing in March time frame. I used Evgeniy's fork generation
>>> program (fork-test) to generate fork events.
>>> The fork-test is like this:
>>> # while 1
>>> # ./fork-test 10000000
>>> # sleep 1
>>> # end
>>>
>>> I used a user-space
>>> program doing nothing but reading data and compares sequrence numbers.
>>>
>>> I ran the fork-test together with AIM7 and sometimes ubench.
>>>
>>> In earlier testings, i observed duplicate messages and a gap of
>>> messages number (a chunk of messages got lost.)
>>>
>>> Later on, i have not seen duplicate messages, probably because
>>> improvements made to netlink/connector.
>>
>>
>> Yes, I don't think earlier versions of cbus/connector were wholly
>> race-free. Testing would need to be redone on the in-kernel
>> version. If
>> netlink itself is not doing internal hard-coded-GFP_ATOMIC allocations I
>> _think_ the whole thing should be reliable.
>>
>> If not, we need to work out where the messages went...
>
> Agreed. I will rerun my test later tis week.
Guillaume,
I could not find your fork connector patch in fork.c in 2.6.14-rc2 or
2.6.14-rc2-mm1.
My test assumes your code is there...
And my fclisten failed in bind... Something must have changed...
I will sort it out.
Thanks,
- jay
>
> Thanks,
> - jay
>

On Wed, 21 Sep 2005 17:30:10 -0700
Jay Lan <jlan@...> wrote:
> However, a connector is a
> unreliable datagram socket. If we lose exit notification when system
> is extremely busy, we lose accounting data once task struct is freed.
I ran many benchmarks concurrently with the fork connector enabled. I
used it for several weeks and never lost any data with the connector.
Netlink is not a reliable protocol, but the connector's message header
contains fields to deal with reliability.
Guillaume

On Thu, Sep 22, 2005 at 09:31:59AM +0200, Guillaume Thouvenin (guillaume.thouvenin@...) wrote:
> On Wed, 21 Sep 2005 17:30:10 -0700
> Jay Lan <jlan@...> wrote:
>
> > However, a connector is a
> > unreliable datagram socket. If we lose exit notification when system
> > is extremely busy, we lose accounting data once task struct is freed.
>
> I ran many benchmarks concurrently with the fork connector enabled, I
> used it during several weeks and I never lost any data with connector.
> Netlink is not a reliable protocol but connector's message header
> contains fields to deal with the reliability.
Yes, that is right.
Regarding the unreliability of netlink: a message can be lost
only if there is no free memory or the socket queue overflows.
In the first scenario your system is in real trouble;
in the second, one should write a proper userspace application.
> Guillaume
--
Evgeniy Polyakov

Jay Lan wrote:
>>
>> On 23 Feb 2005 Guillaume said, on this mailing list:
>>
>> Guillaume Thouvenin <guillaume.thouvenin@...> wrote:
>>
>>> On Wed, 2005-02-23 at 00:51 -0800, Andrew Morton wrote:
>>>
>>>> ...
>>>> The 2.6.8.1 ELSA patch adds quite a bit of kernel code, but from what
>>>> you're saying it seems like most of that has become redundant, and all
>>>> you now need is the fork notifier. Is that correct?
>>>
>>>
>>> Yes, that's correct. All I need is the fork connector patch. It needs
>>> more work like, as you said, sending an on/off message down the netlink
>>> socket. I'm working on this (thank you very much Andrew for your
>>> comments).
>>
>>
>>
>> If this is still true for ELSA, what's different about your accounting
>> requirements?
>
>
> Andrew, ELSA does not collect system accounting raw data. It relies on
> BSD and CSA to provide accouting data via a pacct file in binary format.
> That is why ELSA does not need an exit notifier.
>
> ELSA allows users to view accounting data in terms of "job" (a group of
> processes). It relies on fork notifier to put pid's into one "job"
> container if it was forked from the same ppid. Thus, ELSA performs
> process grouping and data presentation functions.
>
> Per-process accounting data is accumulated at task and task->mm. These
> data need to be saved off somewhere before a task struct is freed.
> The action is triggered by an exit event. The BSD accounting has
> a function hook at do_exit() to do just that.
>
> What you like to see is to move fork/exit/exec event notification
> to userspace through a connector. However, a connector is a
> unreliable datagram socket. If we lose exit notification when system
> is extremely busy, we lose accounting data once task struct is freed.
> That is not what we have today. Today we have BSD function hook
> in the kernel to copy off the data.
>
> "pnotify" is not an accounting thing. It is viewed as an in-kernel
> event notification. Functionality-wise it is like a connector
> except that "pnotify" does event notification in the kernel while
> connector sends event notification to userspace.
>
> A fork notifier is used to do process grouping. An exit notifier
> is used to save off accounting data.
>
> System accounting needs a reliable event notifier. It is especially
> true on exit events.
>
> Does CKRM save off accounting data also? How is that done?
Didn't quite understand the question but I guess this is a good time to
relist our requirements:
CKRM needs to get two types of "accounting" data
1. events like fork, exit, exec, setuid, setgid etc.
In the new CKRM implementation (where classification is being done in
user space), we don't need a callback to be invoked from a kernel module
at these events - just the notification to userspace is sufficient.
Here we can piggyback onto any notification scheme and pnotify is more
heavyweight than what we need.
Not losing any event is important to us but we don't expect this to be a
major design constraint. If a connector-based solution has buffering
problems in high-volume situations (which should be rare), a relayfs
based solution can also be explored. Our earlier design used relayfs
without any problems even with rapid event generation (however it's not a
suitable design going forward since we also need to send commands from
userspace into the kernel which relayfs cannot do).
2. per-task accounting data on
- time spent by a task waiting for I/O to complete
- time spent waiting for page faults to be serviced (subset of I/O wait
time)
- time spent waiting for CPU (runnable but not running)
- possibly some additions of the same type
For this kind of data, we're talking of the following type of kernel
changes
- marking the state of a task using task flags
- accumulating the data in fields added to the task_struct
- sending the accumulated data periodically and/or on demand, through a
kernel module, to userspace. A connector-based mechanism for sending
data should be adequate since this is definitely not a high-volume data
output situation. pnotify doesn't help us here at all afaics.
We'll need to go over the pnotify proposal in some more detail but I
think it will be very useful to follow up on Andrew's suggestion and get
some consensus between ELSA, CSA, CKRM (and other projects that I don't
know of) on the accounting requirements of the different projects.
By and large, a connector-type notification scheme is what we're leaning
towards.
-- Shailabh
>
> Thanks,
> - jay
>
>
>>
>
>
>

On Thu, 2005-09-22 at 13:44 -0400, Shailabh Nagar wrote:
> Jay Lan wrote:
<snip>
> >
> > Does CKRM save off accounting data also? How is that done?
>
> Didn't quite understand the question but I guess this is a good time to
> relist our requirements:
>
> CKRM needs to get two types of "accounting" data
>
> 1. events like fork, exit, exec, setuid, setgid etc.
> In the new CKRM implementation (where classification is being done in
> user space), we don't need a callback to be invoked from a kernel module
> at these events - just the notification to userspace is sufficient
>
> Here we can piggyback onto any notification scheme and pnotify is more
> heavyweight than what we need.
And since we moved our functionality to user space, if we were to
use pnotify for this we would have to add some additional code to send that
data to user space. So connector is preferred by CKRM in this
context.
>
> Not losing any event is important to us but we don't expect this to be a
> major design constraint. If a connector-based solution has buffering
> problems in high-volume situations (which should be rare), a relayfs
> based solution can also be explored. Our earlier design used relayfs
> without any problems even with rapid event generation (however its not a
> suitable design going forward since we also need to send commands from
> userspace into the kernel which relayfs cannot do).
This is what I referred to yesterday (as moving to user space).
>
> 2. per-task accounting data on
> - time spent by a task waiting for I/O to complete
> - time spent waiting for page faults to be serviced (subset of I/O wait
> time)
> - time spent waiting for CPU (runnable but not running)
> - possibly some additions of the same type
>
> For this kind of data, we're talking of the following type of kernel
> changes
> - marking the state of a task using task flags
> - accumulating the data in fields added to the task_struct
> - sending the accumulated data periodically and/or on demand, through a
> kernel module, to userspace. A connector-based mechanism for sending
> data should be adequate since this is definitely not a high-volume data
> output situation. pnotify doesn't help us here at all afaics.
>
>
> We'll need to go over the pnotify proposal in some more detail but I
> think it will be very useful to follow up on Andrew's suggestion and get
> some consensus between ELSA, CSA, CKRM (and other projects that I don't
> know of) on the accounting requirements of the different projects.
>
> By and large, a connector-type notification scheme is what we're leaning
> towards.
There is a need for fork() and exit() notification in the kernel for CKRM:
- at fork(), to initialize the task data structure
- at exit(), to remove the task from its class and send the task's
uncollected accounting data ((2) above) to user space.
I was thinking of using a connector for this too, but pnotify would be
simpler to use. However, pnotify is a little heavyweight for our purposes.
All we need is a notification, along with the task data structure.
Is making pnotify leaner an option for PAGG? Something as simple as
what is in this patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111532025203086&w=2
Note that we no longer need all the events listed there, only fork
and exit.
chandra
>
> -- Shailabh
>
>
>
> >
> > Thanks,
> > - jay
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@... | .......you may get it.
----------------------------------------------------------------------

On Wed, 2005-09-21 at 17:18 -0500, Erik Jacobson wrote:
> > > What I propose here is Process Notification (pnotify). This is derived from
> > > PAGG. It's been re-worked to have some better documentation (below) and
> > > variable names that better reflect what is really happening.
> >
> >
> > Last time we discussed all this we'd pretty much worked out that all
> > accounting functions could be performed from userspace, as long as the
> > connector-based fork()/exit()/exec() notifications were in place.
> >
> > That is of course a vastly preferable approach. Do you see something which
> > makes it unfeasible?
>
> Unless I'm missing something (feel free to point me somewhere), my
> understanding was this technique is potentially lossy, right? We have
> customers who don't want to take even a chance that some accounting data
> is lost for some reason.
>
> These are the same customers who tend to use a machine to its fullest
> (read beat it to death :).
>
> The idea of pnotify is various kernel components can register to be
> notified about events in the life of any process they choose. It's somewhat
> similar to notifier lists, only they are per-process based and have a data
> pointer associated with the task. The idea here isn't to limit it just
> to accounting type applications.
>
> If someone has a suggestion of something in the task struct or
> copy_process path that could make use of this, I'd be happy to give
> a shot at implementing it. Perhaps something that isn't super "hot"
> in the task struct?
>
> If we added process notifiers for gid/uid changes, could CKRM make use of
> this as well (Kingsley asked me this, I said I'd look in to implementing
> it)?
CKRM is reworking its design so that it does not need these event
notifications in the kernel.
We will be using connectors and plan to move the auto-classification
functionality to user space.
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@... | .......you may get it.
----------------------------------------------------------------------

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I apologize for jumping into this a bit late. I just wanted to
reiterate that PAGGs isn't just a hook used for accounting. I have
implemented a chroot OS (CHOS) that requires hooks similar to what PAGGs
provides. I won't bother describing CHOS, but it isn't related to
accounting. Right now I have to create those hooks myself. If there
was a method like PAGGs in the kernel, then things would be much more
simple.
PAGGs is one way to implement an extensible process attribute framework.
This could allow associating credentials, Grid certificates (PKI cert),
or, for CHOS, a user selected path with a process and guaranteeing
inheritance. Maybe this could all be accomplished with a netlink to
user-land, but I would worry that the overhead for handing things out to
user-land would be high. However, I don't have hard numbers to back
this up.
- --Shane
Erik Jacobson wrote:
>>>What I propose here is Process Notification (pnotify). This is derived from
>>> PAGG. It's been re-worked to have some better documentation (below) and
>>> variable names that better reflect what is really happening.
>>
>>
>>Last time we discussed all this we'd pretty much worked out that all
>>accounting functions could be performed from userspace, as long as the
>>connector-based fork()/exit()/exec() notifications were in place.
>>
>>That is of course a vastly preferable approach. Do you see something which
>>makes it unfeasible?
>
>
> Unless I'm missing something (feel free to point me somewhere), my
> understanding was this technique is potentially lossy, right? We have
> customers who don't want to take even a chance that some accounting data
> is lost for some reason.
>
> These are the same customers who tend to use a machine to its fullest
> (read beat it to death :).
>
> The idea of pnotify is various kernel components can register to be
> notified about events in the life of any process they choose. It's somewhat
> similar to notifier lists, only they are per-process based and have a data
> pointer associated with the task. The idea here isn't to limit it just
> to accounting type applications.
>
> If someone has a suggestion of something in the task struct or
> copy_process path that could make use of this, I'd be happy to give
> a shot at implementing it. Perhaps something that isn't super "hot"
> in the task struct?
>
> If we added process notifiers for gid/uid changes, could CKRM make use of
> this as well (Kingsley asked me this, I said I'd look in to implementing
> it)?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFDMsbPZd/2zrI5CioRApt+AJ9ZcDslYfAO+YnlBg2l/SB4eV1VYgCeIxLb
w7Sf2U5be1WZdWNH6n/IfY0=
=RjZH
-----END PGP SIGNATURE-----

Shane Canon wrote:
>
> I apologize for jumping into this a bit late. I just wanted to
> reiterate that PAGGs isn't just a hook used for accounting. I have
Yes!
We should not hijack this discussion for accounting requirements.
What Erik proposed here was a process notifier. CSA just happens
to be a potential user of this scheme.
CSA prefers to use an in-kernel event notification, be it PAGG,
pnotify, or other means, but that should be a separate thread.
- jay
> implemented a chroot OS (CHOS) that requires hooks similar to what PAGGs
> provides. I won't bother describing CHOS, but it isn't related to
> accounting. Right now I have to create those hooks myself. If there
> was a method like PAGGs in the kernel, then things would be much more
> simple.
>
> PAGGs is one way to implement an extensible process attribute framework.
> This could allow associating credentials, Grid certificates (PKI cert),
> or, for CHOS, a user selected path with a process and guaranteeing
> inheritance. Maybe this could all be accomplished with a netlink to
> user-land, but I would worry that the overhead for handing things out to
> user-land would be high. However, I don't have hard numbers to back
> this up.
>
> --Shane
>

On Wed, Sep 21, 2005 at 04:36:45PM -0500, Erik Jacobson wrote:
> What I propose here is Process Notification (pnotify). This is derived from
> PAGG. It's been re-worked to have some better documentation (below) and
> variable names that better reflect what is really happening.
This looks pretty cool, and to show how it's useful you could convert
a few of the fork/exec calls to less important subsystems (e.g. keys)
to this.
I don't think rwsem locking is appropriate for what you're doing. We need
to keep the overhead as low as possible in fork/exec/exit, and given how
infrequent modifications are this looks like a prime candidate for RCU.

> This looks pretty cool, and to show how it's useful you could convert
> a few of the fork/exec calls to less important subsystems (e.g. keys)
> to this.
>
> I don't think rwsem locking is appropriate for what you're doing. We need
> to keep the overhead as low as possible in fork/exec/exit, and given how
> infrequent modifications are this looks like a prime candidate for RCU.
Sure, I'll give it a shot. I need to come up to speed with RCU - so I
won't have an example patch instantly. I'll reply to this thread
when I have something ready.
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

On Thu, Sep 22, 2005 at 10:36:10AM -0500, Erik Jacobson wrote:
> > This looks pretty cool, and to show how it's useful you could convert
> > a few of the fork/exec calls to less important subsystems (e.g. keys)
> > to this.
> >
> > I don't think rwsem locking is appropriate for what you're doing. We need
> > to keep the overhead as low as possible in fork/exec/exit, and given how
> > infrequent modifications are this looks like a prime candidate for RCU.
>
> Sure, I'll give it a shot. I need to come up to speed with RCU - so I
> won't have an example patch instantly. I'll reply to this thread
> when I have something ready.
Although I must confess to being a bit behind, I would be happy to
review your patch from an RCU viewpoint.
Thanx, Paul

Erik,
If my understanding is correct, these callbacks are called synchronously
when the event (fork/exit/exec) happens. Is that right? If so, each callback has
to be really quick and can't sleep. Otherwise, it will adversely
affect the performance of fork/exit/exec etc.
How much overhead does this notification mechanism add? Is it noticeable?
Thanks,
Badari

> If my understanding is correct, these callbacks are called synchronously
> when the event (fork/exit/exec) happens. Is that right? So the callback has
Correct. Each task has a list of interested subscribers. By default, the
list is empty. If the subscriber list isn't empty, we run each
subscriber's function for the event (fork, for example).
By default, child tasks inherit the subscriber list of the parent.
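To make the per-task subscriber list concrete, here is a minimal userspace
sketch in C. It is only a model of the idea described above, not code from the
pnotify patch: struct task, struct subscriber, task_subscribe(), task_fork()
and count_fork() are all hypothetical names, locking is left out entirely, and
the child simply shares the parent's data pointer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names come from the patch. */
struct task;

typedef void (*fork_cb_t)(struct task *parent, struct task *child, void *data);

struct subscriber {
    fork_cb_t fork_cb;          /* called when the task forks */
    void *data;                 /* data pointer associated with the task */
    struct subscriber *next;
};

struct task {
    struct subscriber *subs;    /* NULL (empty) by default */
};

/* Attach a subscriber to one task. Returns 0 on success, -1 on failure. */
int task_subscribe(struct task *t, fork_cb_t fork_cb, void *data)
{
    struct subscriber *s = malloc(sizeof(*s));

    if (!s)
        return -1;
    s->fork_cb = fork_cb;
    s->data = data;
    s->next = t->subs;          /* prepend to the task's list */
    t->subs = s;
    return 0;
}

/*
 * Model of the fork path: the child inherits each of the parent's
 * subscribers, and each subscriber's fork callback is run. With an
 * empty list the loop body never executes, so the cost is one test.
 */
int task_fork(struct task *parent, struct task *child)
{
    struct subscriber *s;

    child->subs = NULL;
    for (s = parent->subs; s; s = s->next) {
        if (task_subscribe(child, s->fork_cb, s->data) != 0)
            return -1;
        if (s->fork_cb)
            s->fork_cb(parent, child, s->data);
    }
    return 0;
}

/* Example callback: counts the forks it sees through the data pointer. */
void count_fork(struct task *parent, struct task *child, void *data)
{
    (void)parent;
    (void)child;
    ++*(int *)data;
}
```

A caller would subscribe count_fork to a parent task with a pointer to an int
counter, call task_fork(), and then find the counter incremented and the same
subscriber on the child's list. A task that was never subscribed goes through
task_fork() without running any callback.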
> to be really quick and can't sleep. Otherwise, it will adversely
> affect the performance of fork/exit/exec etc.
I agree. When a kernel module registers with pnotify, the fork/exit/exec
functions it supplies for the tasks it is interested in
should be quick. However, the number of kernel modules
associated with a task (the number of subscribers on the task's subscriber list)
is an important factor as well. What is nice about the implementation is
that tasks not associated with a kernel module (tasks that have an empty
subscriber list) don't pay much of a cost.
The subscriber list of each task is protected by an rwsem in the
task itself. There is no global lock to be concerned about here.
Christoph suggested investigating RCU in place of this lock, which
I plan to do.
> How much overhead does this notification mechanism add? Is it noticeable?
The easy part is if there are no kernel module subscribers associated with
a task - that is quick. As more modules register with pnotify, we have
more to look at in the list.
I don't have recent performance data here, and the data would of course
vary depending on how many modules are registered and what they're doing.
I'm planning to run AIM7 and a timed fork/exit loop to try to measure
performance differences.
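A timed fork/exit loop of the kind mentioned above could look roughly like the
following userspace C sketch. The function name time_fork_exit() and the use
of CLOCK_MONOTONIC are assumptions made for illustration; this is not the test
that was actually run, and it only measures fork/exit/wait latency as seen
from user space.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits immediately, wait for it,
 * and repeat. Returns the average time per cycle in nanoseconds, or -1.0
 * on error or if iterations is not positive.
 */
double time_fork_exit(int iterations)
{
    struct timespec start, end;
    double ns;
    int i;

    if (iterations <= 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    for (i = 0; i < iterations; i++) {
        pid_t pid = fork();

        if (pid < 0)
            return -1.0;
        if (pid == 0)
            _exit(0);               /* child: exit right away */
        if (waitpid(pid, NULL, 0) < 0)
            return -1.0;            /* parent: reap the child */
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    ns = (double)(end.tv_sec - start.tv_sec) * 1e9
       + (double)(end.tv_nsec - start.tv_nsec);
    return ns / iterations;
}
```

Running the same loop with a large iteration count on a kernel with and
without the patch, and with and without subscribers attached to the forking
task, would give a first rough comparison. AIM7 would still be needed for a
broader picture.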
Assuming the keyring experiment pans out, I'll compare a kernel
without pnotify but with keyrings turned on versus a kernel with pnotify and
keyrings converted to use it. I'm not sure how valid this test will
be without keys, so I have some learning about keyrings to do before I
have a good answer.
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota