Implement the proc fs write to set the audit container ID of a process,emitting an AUDIT_CONTAINER record to document the event.

This is a write from the container orchestrator task to a proc entry ofthe form /proc/PID/containerid where PID is the process ID of the newlycreated task that is to become the first task in a container, or anadditional task added to a container.

The write expects up to a u64 value (unset: 18446744073709551615).

This will produce a record such as this:type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0

The "op" field indicates an initial set. The "pid" to "ses" fields arethe orchestrator while the "opid" field is the object's PID, the processbeing "contained". Old and new container ID values are given in the"contid" fields, while res indicates its success.

It is not permitted to self-set, unset or re-set the container ID. Achild inherits its parent's container ID, but then can be set only onceafter.

Check if a task has existing children or co-threads and refuse to setthe container ID if either are present. Failure to check this couldpermit games where a child scratches its parent's back to work aroundinheritance and double-setting policy.

Post by Richard Guy BriggsCheck if a task has existing children or co-threads and refuse to setthe container ID if either are present. Failure to check this couldpermit games where a child scratches its parent's back to work aroundinheritance and double-setting policy.---kernel/auditsc.c | 4 ++++1 file changed, 4 insertions(+)

I would just include this in patch 1/2 as I can't think of world wherewe wouldn't this check.

Post by Richard Guy BriggsImplement container ID filtering using the AUDIT_CONTAINERID field nameto send an 8-character string representing a u64 since the value fieldis only u32.Sending it as two u32 was considered, but gathering and comparing twofields was more complex.

My only worry here is that you aren't really sending a string in theASCII sense, you are sending an 8 byte buffer (that better be NULterminated) that happens to be an unsigned 64-bit integer. To beclear, I'm okay with that (it's protected by AUDIT_CONTAINERID), andthe code is okay with that, I just want us to pause for a minute andmake sure that is an okay thing to do long term.

Post by Richard Guy BriggsImplement container ID filtering using the AUDIT_CONTAINERID field nameto send an 8-character string representing a u64 since the value fieldis only u32.Sending it as two u32 was considered, but gathering and comparing twofields was more complex.

My only worry here is that you aren't really sending a string in theASCII sense, you are sending an 8 byte buffer (that better be NULterminated) that happens to be an unsigned 64-bit integer. To beclear, I'm okay with that (it's protected by AUDIT_CONTAINERID), andthe code is okay with that, I just want us to pause for a minute andmake sure that is an okay thing to do long term.

I already went through that process and warned of it 7 weeks ago. Asalready noted, That was preferable to two seperate u32 fields thatdepend on each other making comparisons more complicated. Using twoseperate fields to configure the rule could be gated for validity, thenthe result stored in a special rule field, but I wasn't keen about thatapproach.

We should be able to use current->audit_context here right? If wecan't for every caller, perhaps we pass an audit_context as anargument and only allocate a local context when the passedaudit_context is NULL.

Also, if you're not comfortable always using current, just pass theaudit_context as you do with audit_log_common_recv_msg().

Pretty sure current can be used here too. In fact I think everywherewhere we are processing commands from netlink we can use current as Ibelieve the entire netlink stack is processed in the context of thecaller.

We should be able to use current->audit_context here right? If wecan't for every caller, perhaps we pass an audit_context as anargument and only allocate a local context when the passedaudit_context is NULL.Also, if you're not comfortable always using current, just pass theaudit_context as you do with audit_log_common_recv_msg().

As mentioned in the tree/watch/mark patch, this is all obsoleted bymaking the AUDIT_CONFIG_CHANGE record a SYSCALL auxiliary record.This review would have been more helpful a month and a half ago.

Pretty sure current can be used here too. In fact I think everywherewhere we are processing commands from netlink we can use current as Ibelieve the entire netlink stack is processed in the context of thecaller.

We should be able to use current->audit_context here right? If wecan't for every caller, perhaps we pass an audit_context as anargument and only allocate a local context when the passedaudit_context is NULL.Also, if you're not comfortable always using current, just pass theaudit_context as you do with audit_log_common_recv_msg().

As mentioned in the tree/watch/mark patch, this is all obsoleted bymaking the AUDIT_CONFIG_CHANGE record a SYSCALL auxiliary record.

If you really want to sink to that level of discussion, better qualitypatches from you would have been helpful too, that is the one of themain reasons why it takes so long to review your code. Let's keep thecommentary focused on the code, discussions like this aren't likely tobe helpful to anyone.

Let's get these changes into the first patch whereaudit_log_container_info() is defined. Why? This inserts a new fieldinto the record which is a no-no. Yes, it is one single patchset, butthey are still separate patches and who knows which patches a givendistribution and/or tree may decide to backport.

Let's get these changes into the first patch whereaudit_log_container_info() is defined. Why? This inserts a new fieldinto the record which is a no-no. Yes, it is one single patchset, butthey are still separate patches and who knows which patches a givendistribution and/or tree may decide to backport.

Fair enough. That first thought went through my mind... Would it besufficient to move that field addition to the first patch and leave therest here to support trace and signals?

Let's get these changes into the first patch whereaudit_log_container_info() is defined. Why? This inserts a new fieldinto the record which is a no-no. Yes, it is one single patchset, butthey are still separate patches and who knows which patches a givendistribution and/or tree may decide to backport.

Fair enough. That first thought went through my mind... Would it besufficient to move that field addition to the first patch and leave therest here to support trace and signals?

I should have been more clear ... yes, that's what I was thinking; therecord format is the important part as it's user visible.

Standalone audit records have the timestamp and serial number generatedon the fly and as such are unique, making them standalone. This newfunction audit_alloc_local() generates a local audit context that willbe used only for a standalone record and its auxiliary record(s). Thecontext is discarded immediately after the local associated records areproduced.

I'm reserving the option to comment on this idea further as I make myway through the patchset, but audit_free_context() definitelyshouldn't be declared as an inline function.

Ok, I think I follow. When it wasn't exported, inline was fine, but nowthat it has been exported, it should no longer be inlined, or should usean intermediate function name to export so that local uses of it canremain inline.

I'm reserving the option to comment on this idea further as I make myway through the patchset, but audit_free_context() definitelyshouldn't be declared as an inline function.

Ok, I think I follow. When it wasn't exported, inline was fine, but nowthat it has been exported, it should no longer be inlined ...

Pretty much. Based on a few comments I've seen by compiler folks overthe years, my current thinking is that we shouldn't worry aboutexplicit inlining static functions in C files (header files are adifferent story). The basic idea being that the compiler almostalways does a better job than us stupid developers.

Ok, so both syscall aux records. That elimintes this patch from theset, can go in independently.

Yep. It should help shrink the audit container ID patchset andperhaps more importantly it should put some distance between theconnected-record debate and the audit container ID debate.

I understand we are going to need a "local" context for some things,the network packets are probably the best example, but wheneverpossible I would like to connect these records back to a task'scontext.

Audit events could happen in a network namespace outside of a taskcontext due to packets received from the net that trigger an auditingrule prior to being associated with a running task. The networknamespace could in use by multiple containers by association to thetasks in that network namespace. We still want a way to attributethese events to any potential containers. Keep a list per networknamespace to track these container identifiiers.

Post by Richard Guy BriggsAudit events could happen in a network namespace outside of a taskcontext due to packets received from the net that trigger an auditingrule prior to being associated with a running task. The networknamespace could in use by multiple containers by association to thetasks in that network namespace. We still want a way to attributethese events to any potential containers. Keep a list per networknamespace to track these container identifiiers.- initial setting of the container id via /proc- clone/fork call that inherits a container identifier- unshare call that inherits a container identifier- setns call that inherits a container identifier- an inherited container id dropped when child set- process exit- unshare call that drops a net namespace- setns call that drops a net namespaceSee: https://github.com/linux-audit/audit-kernel/issues/32See: https://github.com/linux-audit/audit-testsuite/issues/64---include/linux/audit.h | 7 +++++++include/net/net_namespace.h | 12 ++++++++++++kernel/auditsc.c | 9 ++++++---kernel/nsproxy.c | 6 ++++++net/core/net_namespace.c | 45 +++++++++++++++++++++++++++++++++++++++++++++5 files changed, 76 insertions(+), 3 deletions(-)

Post by Richard Guy BriggsAudit events could happen in a network namespace outside of a taskcontext due to packets received from the net that trigger an auditingrule prior to being associated with a running task. The networknamespace could in use by multiple containers by association to thetasks in that network namespace. We still want a way to attributethese events to any potential containers. Keep a list per networknamespace to track these container identifiiers.- initial setting of the container id via /proc- clone/fork call that inherits a container identifier- unshare call that inherits a container identifier- setns call that inherits a container identifier- an inherited container id dropped when child set- process exit- unshare call that drops a net namespace- setns call that drops a net namespaceSee: https://github.com/linux-audit/audit-kernel/issues/32See: https://github.com/linux-audit/audit-testsuite/issues/64---include/linux/audit.h | 7 +++++++include/net/net_namespace.h | 12 ++++++++++++kernel/auditsc.c | 9 ++++++---kernel/nsproxy.c | 6 ++++++net/core/net_namespace.c | 45 +++++++++++++++++++++++++++++++++++++++++++++5 files changed, 76 insertions(+), 3 deletions(-)

Post by Richard Guy BriggsAudit events could happen in a network namespace outside of a taskcontext due to packets received from the net that trigger an auditingrule prior to being associated with a running task. The networknamespace could in use by multiple containers by association to thetasks in that network namespace. We still want a way to attributethese events to any potential containers. Keep a list per networknamespace to track these container identifiiers.- initial setting of the container id via /proc- clone/fork call that inherits a container identifier- unshare call that inherits a container identifier- setns call that inherits a container identifier- an inherited container id dropped when child set- process exit- unshare call that drops a net namespace- setns call that drops a net namespaceSee: https://github.com/linux-audit/audit-kernel/issues/32See: https://github.com/linux-audit/audit-testsuite/issues/64---include/linux/audit.h | 7 +++++++include/net/net_namespace.h | 12 ++++++++++++kernel/auditsc.c | 9 ++++++---kernel/nsproxy.c | 6 ++++++net/core/net_namespace.c | 45 +++++++++++++++++++++++++++++++++++++++++++++5 files changed, 76 insertions(+), 3 deletions(-)

Hopefully we can handle this in audit_net_init(), we just need tofigure out where we can get the correct task_struct for the auditcontainer ID (some backpointer in the net struct?).

I don't follow. This needs to happen on every task startup.audit_net_init() is only called when a new network namespace starts up.

Yep, sorry, my mistake. I must have confused myself when I waslooking at the code.

I'm thinking out loud here, bear with me ...

Assuming we move the netns/audit-container-ID tracking to audit_net,and considering we already have an audit hook in copy_process() (itcalls audit_alloc()), would this be better handled by thecopy_process() hook? This ignores naming, audit_alloc() reuse, etc.;those can be easily fixed. I'm just thinking of ways to limit ourimpact on the core kernel and leverage our existing interactionpoints.

Post by Richard Guy BriggsAudit events could happen in a network namespace outside of a taskcontext due to packets received from the net that trigger an auditingrule prior to being associated with a running task. The networknamespace could in use by multiple containers by association to thetasks in that network namespace. We still want a way to attributethese events to any potential containers. Keep a list per networknamespace to track these container identifiiers.- initial setting of the container id via /proc- clone/fork call that inherits a container identifier- unshare call that inherits a container identifier- setns call that inherits a container identifier- an inherited container id dropped when child set- process exit- unshare call that drops a net namespace- setns call that drops a net namespaceSee: https://github.com/linux-audit/audit-kernel/issues/32See: https://github.com/linux-audit/audit-testsuite/issues/64---include/linux/audit.h | 7 +++++++include/net/net_namespace.h | 12 ++++++++++++kernel/auditsc.c | 9 ++++++---kernel/nsproxy.c | 6 ++++++net/core/net_namespace.c | 45 +++++++++++++++++++++++++++++++++++++++++++++5 files changed, 76 insertions(+), 3 deletions(-)

Hopefully we can handle this in audit_net_init(), we just need tofigure out where we can get the correct task_struct for the auditcontainer ID (some backpointer in the net struct?).

I don't follow. This needs to happen on every task startup.audit_net_init() is only called when a new network namespace starts up.

Yep, sorry, my mistake. I must have confused myself when I waslooking at the code.I'm thinking out loud here, bear with me ...Assuming we move the netns/audit-container-ID tracking to audit_net,and considering we already have an audit hook in copy_process() (itcalls audit_alloc()), would this be better handled by thecopy_process() hook? This ignores naming, audit_alloc() reuse, etc.;those can be easily fixed. I'm just thinking of ways to limit ourimpact on the core kernel and leverage our existing interactionpoints.

The new namespace hasn't been cloned yet and this is the only functionwhere we have access to both namespaces, so I don't see how that couldwork...

On April 20, 2018 4:48:34 PM Richard Guy Briggs <***@redhat.com> wrote:On 2018-04-20 16:22, Paul Moore wrote:On Fri, Apr 20, 2018 at 4:02 PM, Richard Guy Briggs <***@redhat.com> wrote:On 2018-04-18 21:46, Paul Moore wrote:On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <***@redhat.com> wrote:Audit events could happen in a network namespace outside of a taskcontext due to packets received from the net that trigger an auditingrule prior to being associated with a running task. The networknamespace could in use by multiple containers by association to thetasks in that network namespace. We still want a way to attributethese events to any potential containers. Keep a list per networknamespace to track these container identifiiers.

Hopefully we can handle this in audit_net_init(), we just need tofigure out where we can get the correct task_struct for the auditcontainer ID (some backpointer in the net struct?).

I don't follow. This needs to happen on every task startup.audit_net_init() is only called when a new network namespace starts up.

Yep, sorry, my mistake. I must have confused myself when I waslooking at the code.

I'm thinking out loud here, bear with me ...

Assuming we move the netns/audit-container-ID tracking to audit_net,and considering we already have an audit hook in copy_process() (itcalls audit_alloc()), would this be better handled by thecopy_process() hook? This ignores naming, audit_alloc() reuse, etc.;those can be easily fixed. I'm just thinking of ways to limit ourimpact on the core kernel and leverage our existing interactionpoints.

The new namespace hasn't been cloned yet and this is the only functionwhere we have access to both namespaces, so I don't see how that couldwork...

It seems like this could (should?) be hidden inside an audit function,e.g. audit_log_net_containers() or something like that.

Perhaps... It was open-coded since at this point there are no otherusers. That'll make this tidier though.

If the code was all contained within a single subsystem them I wouldgenerally agree that open coding is preferable, but since we arecrossing a subsystem boundary I think it would be preferable toabstract away the details into a separate function.

This will probably also be necessary once you change to using theaudit_net/net_generic mechanism.

I went looking for cid_valid(), but it turns out you don't add it untilpatch 5. That, I expect, will not be good for bisectability (or patchreview).

Nice catch, thanks Jon. That is very likely another victim of a gitrebase to re-order afterthoughts in the right place. I'll need to bemore careful of that class of bug, rethink my workflow, or script buildsto verify each commit is compilable.

This one line addition to the task_struct scares me the most ofanything in this patchset. Why? It's a field named "containerid" ina perhaps one of the most widely used core kernel structures; thepossibilities for abuse are endless, and it's foolish to think wewould ever be able to adequately police this.

Unfortunately, we can't add the field to audit_context as thingscurrently stand because we don't always allocate an audit_context,it's dependent on the system's configuration, and we need to track theaudit container ID for a given process, regardless of the auditconfiguration. Pretty much the same reason why loginuid and sessionidare located directly in task_struct now. As I stressed during thedesign phase, I really want to keep this as an *audit* container IDand not a general purpose kernel wide container ID. If the kernelever grows a general purpose container ID token, I'll be the first inline to convert the audit code, but I don't want audit to be thatgeneral purpose mechanism ... audit is hated enough as-is ;)

I think the right solution to this is to create another new struct,audit_task_info (or similar, the name really isn't that important),which would be stored as a pointer in task_struct and would replacethe audit_context pointer, loginuid, sessionid, and the newly proposedcontainerid. The new audit_task_info would always be allocated in theaudit_alloc() function (please use kmem_cache), and the audit_contextpointer included inside would continue to be allocated based on theexisting conditions. By keeping audit_task_info as a pointer insidetask_struct we could hide the structure definition insidekernel/audit*.c and make it much more difficult for other subsystemsto abuse it.[1]

Actually, we might even want to consider storing audit_context inaudit_task_info (no pointer), or making it a zero length array(ctx[0]) and going with a variable sized allocation of audit_task_info... but all that could be done as a follow up optimization once we getthe basic idea sorted.

[1] If for some reason allocating audit_task_info becomes too muchoverhead to bear (somewhat doubtful since we would only do it at taskcreation), we could do some ugly tricks to directly include anaudit_task_struct chunk in task_struct but I'd like to avoid that ifpossible (and I think we can).

I think we need to decide if we want to distinguish between the "host"(e.g. init ns) and "unset". Looking at this patch (I've only quicklyskimmed the others so far) it would appear that you don't think weneed to worry about this distinction; that's fine, but let's make itexplicit with a comment in the code that AUDIT_CID_UNSET means "unset"as well as "host".

If we do need to make a distinction, let's add a constant/macro for "host".

I ask because I suppose it might be possible for some containerruntime to do a fork, setup some of the environment and them exec thecontainer (before you answer the obvious "namespaces!" please rememberwe're not trying to define containers).

This one line addition to the task_struct scares me the most ofanything in this patchset. Why? It's a field named "containerid" ina perhaps one of the most widely used core kernel structures; thepossibilities for abuse are endless, and it's foolish to think wewould ever be able to adequately police this.

If we can get the LSM infrastructure managed task blobs frommodule stacking in ahead of this we could create a trivial securitymodule to manage this. It's not as if there aren't all sorts ofinteractions between security modules and the audit system already.

This one line addition to the task_struct scares me the most ofanything in this patchset. Why? It's a field named "containerid" ina perhaps one of the most widely used core kernel structures; thepossibilities for abuse are endless, and it's foolish to think wewould ever be able to adequately police this.

If we can get the LSM infrastructure managed task blobs frommodule stacking in ahead of this we could create a trivial securitymodule to manage this. It's not as if there aren't all sorts ofinteractions between security modules and the audit system already.

While yes, there are plenty of interactions between the two, it ispossible to use audit without the LSMs and I would like to preservethat. Further, I don't want to entangle two very complicated codechanges or make the audit container ID effort dependent on LSMstacking.

This one line addition to the task_struct scares me the most ofanything in this patchset. Why? It's a field named "containerid" ina perhaps one of the most widely used core kernel structures; thepossibilities for abuse are endless, and it's foolish to think wewould ever be able to adequately police this.

If we can get the LSM infrastructure managed task blobs frommodule stacking in ahead of this we could create a trivial securitymodule to manage this. It's not as if there aren't all sorts ofinteractions between security modules and the audit system already.

While yes, there are plenty of interactions between the two, it ispossible to use audit without the LSMs and I would like to preservethat.

Fair enough.

Post by Paul MooreFurther, I don't want to entangle two very complicated codechanges or make the audit container ID effort dependent on LSMstacking.

Also fair, although the use case for container audit IDs isalready pulling in audit, namespaces (yeah, I know it's notnecessary for a container to use namespaces) security modules(stacked and/or namespaced), cgroups and who knows what else.

Why can't we just use AUDIT_CID_UNSET? Is there an importantdistinction? If so, they shouldn't they have different values?

One was intended as user-facing and the other was intended for kernelinternal. As you point out, this does not appear to be necessary sincethey are both the same type. This was to mirror loginuid due to UIDnamespace practice to seperate the two to make things very clear that auserspace view of a UID needed to be translated from the user's usernamespace to the kernel's absolute view of UIDs from the init usernamespace. Since container ID meanings do not depend on any namespacecontext, I agree we can use just one and I'd go with AUDIT_CID_UNSET.

Post by Paul MooreIf we do need to keep INVALID_CID, let's rename it toAUDIT_CID_INVALID so we have some consistency to the naming patternsand we stress that it is an *audit* container ID.

This one line addition to the task_struct scares me the most ofanything in this patchset. Why? It's a field named "containerid" ina perhaps one of the most widely used core kernel structures; thepossibilities for abuse are endless, and it's foolish to think wewould ever be able to adequately police this.

Fair enough.

Post by Paul MooreUnfortunately, we can't add the field to audit_context as thingscurrently stand because we don't always allocate an audit_context,it's dependent on the system's configuration, and we need to track theaudit container ID for a given process, regardless of the auditconfiguration. Pretty much the same reason why loginuid and sessionidare located directly in task_struct now. As I stressed during thedesign phase, I really want to keep this as an *audit* container IDand not a general purpose kernel wide container ID. If the kernelever grows a general purpose container ID token, I'll be the first inline to convert the audit code, but I don't want audit to be thatgeneral purpose mechanism ... audit is hated enough as-is ;)

When would we need an audit container ID when audit is not enabledenough to have an audit_context?

If it is only used for audit, and audit is the only consumer, and auditcan only use it when it is enabled, then we can just return success toany write to the proc filehandle, or not even present it. Nothing willbe able to know that value wasn't used.

When are loginuid and sessionid used now when audit is not enabled (orshould I say, explicitly disabled)?

Post by Paul MooreI think the right solution to this is to create another new struct,audit_task_info (or similar, the name really isn't that important),which would be stored as a pointer in task_struct and would replacethe audit_context pointer, loginuid, sessionid, and the newly proposedcontainerid. The new audit_task_info would always be allocated in theaudit_alloc() function (please use kmem_cache), and the audit_contextpointer included inside would continue to be allocated based on theexisting conditions. By keeping audit_task_info as a pointer insidetask_struct we could hide the structure definition insidekernel/audit*.c and make it much more difficult for other subsystemsto abuse it.[1]struct audit_task_info {kuid_t loginuid;unsigned int sessionid;u64 containerid;struct audit_context *ctx;}

I agree this looks like a good change.

Post by Paul MooreActually, we might even want to consider storing audit_context inaudit_task_info (no pointer), or making it a zero length array(ctx[0]) and going with a variable sized allocation of audit_task_info... but all that could be done as a follow up optimization once we getthe basic idea sorted.[1] If for some reason allocating audit_task_info becomes too muchoverhead to bear (somewhat doubtful since we would only do it at taskcreation), we could do some ugly tricks to directly include anaudit_task_struct chunk in task_struct but I'd like to avoid that ifpossible (and I think we can).

I think we need to decide if we want to distinguish between the "host"(e.g. init ns) and "unset". Looking at this patch (I've only quicklyskimmed the others so far) it would appear that you don't think weneed to worry about this distinction; that's fine, but let's make itexplicit with a comment in the code that AUDIT_CID_UNSET means "unset"as well as "host".

I don't see any reason to distinguish between "host" and "unset". Sincea container doesn't have a concrete definition based in namespaces, theinitial namespace set is meaningless here.

Is there value in having a container orchestrator process have areserved container ID that has a policy distinct from any othercontainer? If so, then I could see the value in making the distinction.For example, I've heard of interest in systemd acting as a containerorchestrator, so if it took on that role as PID 1, then every process inthe system would inherit that ID and none would be unset.

I can't picture how having seperate "host" and "unset" values helps us.

Post by Paul MooreIf we do need to make a distinction, let's add a constant/macro for "host".

Currently "unset" is -1 which fits the convention used for sessionid andloginuid and a number of others, so I think it makes sense to stick withthat. If we decide we need a "host" flag, would it make sense to use 0or (u64)-2?

We then lose the distinction in the AUDIT_CONTAINER record between theinitiating PID and the target PID. This was outlined in the proposal.

Having said that, I'm still not sure we have protected sufficiently froma child turning around and setting it's parent's as yet unset orinherited audit container ID.

Post by Paul MooreI ask because I suppose it might be possible for some containerruntime to do a fork, setup some of the environment and them exec thecontainer (before you answer the obvious "namespaces!" please rememberwe're not trying to define containers).

I don't think namespaces have any bearing on this concern since none arerequired.

Why are audit_set_containerid_perm() and audit_log_containerid()separate functions?

(I assume you mean audit_log_set_containerid()?)It seemed clearer that all the permission checking was in one functionand its return code could be used to report the outcome when logging the(attempted) action. This is the same structure as audit_set_loginuid()and it made sense.

This would be the time to connect it to a syscall if that seems like agood idea and remove pid, uid, auid, tty, ses fields.

Post by Richard Guy BriggsImplement the proc fs write to set the audit container ID of a process,emitting an AUDIT_CONTAINER record to document the event.This is a write from the container orchestrator task to a proc entry ofthe form /proc/PID/containerid where PID is the process ID of the newlycreated task that is to become the first task in a container, or anadditional task added to a container.The write expects up to a u64 value (unset: 18446744073709551615).type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0The "op" field indicates an initial set. The "pid" to "ses" fields arethe orchestrator while the "opid" field is the object's PID, the processbeing "contained". Old and new container ID values are given in the"contid" fields, while res indicates its success.It is not permitted to self-set, unset or re-set the container ID. Achild inherits its parent's container ID, but then can be set only onceafter.See: https://github.com/linux-audit/audit-kernel/issues/32---fs/proc/base.c | 37 ++++++++++++++++++++include/linux/audit.h | 16 +++++++++include/linux/init_task.h | 4 ++-include/linux/sched.h | 1 +include/uapi/linux/audit.h | 2 ++kernel/auditsc.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++6 files changed, 143 insertions(+), 1 deletion(-)

This one line addition to the task_struct scares me the most ofanything in this patchset. Why? It's a field named "containerid" ina perhaps one of the most widely used core kernel structures; thepossibilities for abuse are endless, and it's foolish to think wewould ever be able to adequately police this.

Fair enough.

Post by Paul MooreUnfortunately, we can't add the field to audit_context as thingscurrently stand because we don't always allocate an audit_context,it's dependent on the system's configuration, and we need to track theaudit container ID for a given process, regardless of the auditconfiguration. Pretty much the same reason why loginuid and sessionidare located directly in task_struct now. As I stressed during thedesign phase, I really want to keep this as an *audit* container IDand not a general purpose kernel wide container ID. If the kernelever grows a general purpose container ID token, I'll be the first inline to convert the audit code, but I don't want audit to be thatgeneral purpose mechanism ... audit is hated enough as-is ;)

When would we need an audit container ID when audit is not enabledenough to have an audit_context?

I'm thinking of the audit_alloc() case where audit_filter_task()returns AUDIT_DISABLED.

I believe this is the same reason why loginuid and sessionid livedirectly in the task_struct and not in the audit_context; they need topersist for the lifetime of the task.

Post by Casey SchauflerIf it is only used for audit, and audit is the only consumer, and auditcan only use it when it is enabled, then we can just return success toany write to the proc filehandle, or not even present it. Nothing willbe able to know that value wasn't used.When are loginuid and sessionid used now when audit is not enabled (orshould I say, explicitly disabled)?

I think we need to decide if we want to distinguish between the "host"(e.g. init ns) and "unset". Looking at this patch (I've only quicklyskimmed the others so far) it would appear that you don't think weneed to worry about this distinction; that's fine, but let's make itexplicit with a comment in the code that AUDIT_CID_UNSET means "unset"as well as "host".

I don't see any reason to distinguish between "host" and "unset". Sincea container doesn't have a concrete definition based in namespaces, theinitial namespace set is meaningless here.

Okay, that sounds reasonable.

Post by Casey SchauflerIs there value in having a container orchestrator process have areserved container ID that has a policy distinct from any othercontainer?

I'm open to arguments for this idea, but I don't see a point to it right now.

Post by Casey SchauflerIf so, then I could see the value in making the distinction.For example, I've heard of interest in systemd acting as a containerorchestrator, so if it took on that role as PID 1, then every process inthe system would inherit that ID and none would be unset.I can't picture how having seperate "host" and "unset" values helps us.

I don't have a strong feeling either way, I just wanted to ask the question.

We then lose the distinction in the AUDIT_CONTAINER record between theinitiating PID and the target PID. This was outlined in the proposal.

I just went back and reread the v3 proposal and I still don't see agood explanation of this. Why is this bad? What's the securityconcern?

Post by Casey SchauflerHaving said that, I'm still not sure we have protected sufficiently froma child turning around and setting it's parent's as yet unset orinherited audit container ID.

Yes, I believe we only want to let a task set the audit container forit's children (or itself/threads if we decide to allow that, seeabove). There *has* to be a function to check to see if a task if achild of a given task ... right? ... although this is likely to be apointer traversal and locking nightmare ... hmmm.

Post by Paul MooreI ask because I suppose it might be possible for some containerruntime to do a fork, setup some of the environment and them exec thecontainer (before you answer the obvious "namespaces!" please rememberwe're not trying to define containers).

I don't think namespaces have any bearing on this concern since none arerequired.

Why are audit_set_containerid_perm() and audit_log_containerid()separate functions?

(I assume you mean audit_log_set_containerid()?)

Yep. My fingers got tired typing in that function name and decided ashortcut was necessary.

Post by Casey SchauflerIt seemed clearer that all the permission checking was in one functionand its return code could be used to report the outcome when logging the(attempted) action. This is the same structure as audit_set_loginuid()and it made sense.

When possible I really like it when the permission checks are in thesame function as the code which does the work; it's less likely to getabused that way (you have to willfully bypass the access checks). Theexceptions might be if you wanted to reuse the access control code, orinsert a modular access mechanism (e.g. LSMs).

I'm less concerned about audit_log_set_containerid(), but the usualidea of avoiding single-use function within the same scope applieshere.

Post by Casey SchauflerThis would be the time to connect it to a syscall if that seems like agood idea and remove pid, uid, auid, tty, ses fields.

Ah yes, I missed that. You know my stance on connecting records bynow (hint: yes, connect them) so I think that would be a good thing todo for the next round.

Post by Richard Guy BriggsImplement the proc fs write to set the audit container ID of a process,emitting an AUDIT_CONTAINER record to document the event.This is a write from the container orchestrator task to a proc entry ofthe form /proc/PID/containerid where PID is the process ID of the newlycreated task that is to become the first task in a container, or anadditional task added to a container.The write expects up to a u64 value (unset: 18446744073709551615).type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0The "op" field indicates an initial set. The "pid" to "ses" fields arethe orchestrator while the "opid" field is the object's PID, the processbeing "contained". Old and new container ID values are given in the"contid" fields, while res indicates its success.It is not permitted to self-set, unset or re-set the container ID. Achild inherits its parent's container ID, but then can be set only onceafter.See: https://github.com/linux-audit/audit-kernel/issues/32---fs/proc/base.c | 37 ++++++++++++++++++++include/linux/audit.h | 16 +++++++++include/linux/init_task.h | 4 ++-include/linux/sched.h | 1 +include/uapi/linux/audit.h | 2 ++kernel/auditsc.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++6 files changed, 143 insertions(+), 1 deletion(-)

This one line addition to the task_struct scares me the most ofanything in this patchset. Why? It's a field named "containerid" ina perhaps one of the most widely used core kernel structures; thepossibilities for abuse are endless, and it's foolish to think wewould ever be able to adequately police this.

Fair enough.

Post by Paul MooreUnfortunately, we can't add the field to audit_context as thingscurrently stand because we don't always allocate an audit_context,it's dependent on the system's configuration, and we need to track theaudit container ID for a given process, regardless of the auditconfiguration. Pretty much the same reason why loginuid and sessionidare located directly in task_struct now. As I stressed during thedesign phase, I really want to keep this as an *audit* container IDand not a general purpose kernel wide container ID. If the kernelever grows a general purpose container ID token, I'll be the first inline to convert the audit code, but I don't want audit to be thatgeneral purpose mechanism ... audit is hated enough as-is ;)

When would we need an audit container ID when audit is not enabledenough to have an audit_context?

I'm thinking of the audit_alloc() case where audit_filter_task()returns AUDIT_DISABLED.

Ok, so a task could be marked as filtered but its children would stillbe auditable and inheriting its parent containerid (as well at itsloginuid and sessionid)...

Post by Paul MooreI believe this is the same reason why loginuid and sessionid livedirectly in the task_struct and not in the audit_context; they need topersist for the lifetime of the task.

Post by Casey SchauflerIf it is only used for audit, and audit is the only consumer, and auditcan only use it when it is enabled, then we can just return success toany write to the proc filehandle, or not even present it. Nothing willbe able to know that value wasn't used.When are loginuid and sessionid used now when audit is not enabled (orshould I say, explicitly disabled)?

I think we need to decide if we want to distinguish between the "host"(e.g. init ns) and "unset". Looking at this patch (I've only quicklyskimmed the others so far) it would appear that you don't think weneed to worry about this distinction; that's fine, but let's make itexplicit with a comment in the code that AUDIT_CID_UNSET means "unset"as well as "host".

I don't see any reason to distinguish between "host" and "unset". Sincea container doesn't have a concrete definition based in namespaces, theinitial namespace set is meaningless here.

Okay, that sounds reasonable.

Post by Casey SchauflerIs there value in having a container orchestrator process have areserved container ID that has a policy distinct from any othercontainer?

I'm open to arguments for this idea, but I don't see a point to it right now.

Post by Casey SchauflerIf so, then I could see the value in making the distinction.For example, I've heard of interest in systemd acting as a containerorchestrator, so if it took on that role as PID 1, then every process inthe system would inherit that ID and none would be unset.I can't picture how having seperate "host" and "unset" values helps us.

I don't have a strong feeling either way, I just wanted to ask the question.

Post by Casey SchauflerHaving said that, I'm still not sure we have protected sufficiently froma child turning around and setting it's parent's as yet unset orinherited audit container ID.

Yes, I believe we only want to let a task set the audit container forit's children (or itself/threads if we decide to allow that, seeabove). There *has* to be a function to check to see if a task if achild of a given task ... right? ... although this is likely to be apointer traversal and locking nightmare ... hmmm.

Post by Paul MooreI ask because I suppose it might be possible for some containerruntime to do a fork, setup some of the environment and them exec thecontainer (before you answer the obvious "namespaces!" please rememberwe're not trying to define containers).

I don't think namespaces have any bearing on this concern since none arerequired.

Why are audit_set_containerid_perm() and audit_log_containerid()separate functions?

(I assume you mean audit_log_set_containerid()?)

Yep. My fingers got tired typing in that function name and decided ashortcut was necessary.

Post by Casey SchauflerIt seemed clearer that all the permission checking was in one functionand its return code could be used to report the outcome when logging the(attempted) action. This is the same structure as audit_set_loginuid()and it made sense.

When possible I really like it when the permission checks are in thesame function as the code which does the work; it's less likely to getabused that way (you have to willfully bypass the access checks). Theexceptions might be if you wanted to reuse the access control code, orinsert a modular access mechanism (e.g. LSMs).

I don't follow how it could be abused. The return code from the permcheck gates setting the value and is used in the success field in thelog.

Post by Paul MooreI'm less concerned about audit_log_set_containerid(), but the usualidea of avoiding single-use function within the same scope applieshere.

Post by Casey SchauflerThis would be the time to connect it to a syscall if that seems like agood idea and remove pid, uid, auid, tty, ses fields.

Ah yes, I missed that. You know my stance on connecting records bynow (hint: yes, connect them) so I think that would be a good thing todo for the next round.

Post by Richard Guy BriggsImplement the proc fs write to set the audit container ID of a process,emitting an AUDIT_CONTAINER record to document the event.This is a write from the container orchestrator task to a proc entry ofthe form /proc/PID/containerid where PID is the process ID of the newlycreated task that is to become the first task in a container, or anadditional task added to a container.The write expects up to a u64 value (unset: 18446744073709551615).type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0The "op" field indicates an initial set. The "pid" to "ses" fields arethe orchestrator while the "opid" field is the object's PID, the processbeing "contained". Old and new container ID values are given in the"contid" fields, while res indicates its success.It is not permitted to self-set, unset or re-set the container ID. Achild inherits its parent's container ID, but then can be set only onceafter.See: https://github.com/linux-audit/audit-kernel/issues/32---fs/proc/base.c | 37 ++++++++++++++++++++include/linux/audit.h | 16 +++++++++include/linux/init_task.h | 4 ++-include/linux/sched.h | 1 +include/uapi/linux/audit.h | 2 ++kernel/auditsc.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++6 files changed, 143 insertions(+), 1 deletion(-)

Post by Richard Guy BriggsHaving said that, I'm still not sure we have protected sufficiently froma child turning around and setting it's parent's as yet unset orinherited audit container ID.

Yes, I believe we only want to let a task set the audit container forit's children (or itself/threads if we decide to allow that, seeabove). There *has* to be a function to check to see if a task if achild of a given task ... right? ... although this is likely to be apointer traversal and locking nightmare ... hmmm.

Isn't that just (struct task_struct)parent == (structtask_struct)child->parent (or ->real_parent)?And now that I say that, it is covered by the following patch's childcheck, so as long as we keep that, we should be fine.

I was thinking of checking not just current's immediate children, butany of it's descendants as I believe that is what we want to limit,yes? I just worry that it isn't really practical to perform thatcheck.

Post by Paul MooreI ask because I suppose it might be possible for some containerruntime to do a fork, setup some of the environment and them exec thecontainer (before you answer the obvious "namespaces!" please rememberwe're not trying to define containers).

I don't think namespaces have any bearing on this concern since none arerequired.

I'm looking at the parent checks again and I wonder if the logic aboveis what we really want. Maybe it is, but I'm not sure.

Things I'm wondering about:

* "ccontainerid" and "containerid" are too close in name, I keptconfusing myself when looking at this code. Please change one. Bonuspoints if it is shorter.

* What if the orchestrator wants to move the task to a new container?Right now it looks like you can only do that once, then then thetask's audit container ID will no longer be the same as real_parent... or does the orchestrator change that? *Can* the orchestratorchange real_parent (I suspect the answer is "no")?

* I think the key is the relationship between current and task, notbetween task and task->real_parent. I believe what we really careabout is that task is a descendant of current. We might also want toallow current to change the audit container ID if it holdsCAP_AUDIT_CONTROL, regardless of it's relationship with task.

Why are audit_set_containerid_perm() and audit_log_containerid()separate functions?

(I assume you mean audit_log_set_containerid()?)

Yep. My fingers got tired typing in that function name and decided ashortcut was necessary.

Post by Richard Guy BriggsIt seemed clearer that all the permission checking was in one functionand its return code could be used to report the outcome when logging the(attempted) action. This is the same structure as audit_set_loginuid()and it made sense.

When possible I really like it when the permission checks are in thesame function as the code which does the work; it's less likely to getabused that way (you have to willfully bypass the access checks). Theexceptions might be if you wanted to reuse the access control code, orinsert a modular access mechanism (e.g. LSMs).

I don't follow how it could be abused. The return code from the permcheck gates setting the value and is used in the success field in thelog.

If the permission checks are in the same function body as the codewhich does the work you have to either split the function, or rewriteit, if you want to bypass the permission checks. It may be more of astyle issue than an actual safety issue, but the comments aboutsingle-use functions in the same scope is the tie breaker.

Post by Richard Guy BriggsImplement the proc fs write to set the audit container ID of a process,emitting an AUDIT_CONTAINER record to document the event.This is a write from the container orchestrator task to a proc entry ofthe form /proc/PID/containerid where PID is the process ID of the newlycreated task that is to become the first task in a container, or anadditional task added to a container.The write expects up to a u64 value (unset: 18446744073709551615).type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0The "op" field indicates an initial set. The "pid" to "ses" fields arethe orchestrator while the "opid" field is the object's PID, the processbeing "contained". Old and new container ID values are given in the"contid" fields, while res indicates its success.It is not permitted to self-set, unset or re-set the container ID. Achild inherits its parent's container ID, but then can be set only onceafter.See: https://github.com/linux-audit/audit-kernel/issues/32---fs/proc/base.c | 37 ++++++++++++++++++++include/linux/audit.h | 16 +++++++++include/linux/init_task.h | 4 ++-include/linux/sched.h | 1 +include/uapi/linux/audit.h | 2 ++kernel/auditsc.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++6 files changed, 143 insertions(+), 1 deletion(-)

Post by Richard Guy BriggsHaving said that, I'm still not sure we have protected sufficiently froma child turning around and setting it's parent's as yet unset orinherited audit container ID.

Yes, I believe we only want to let a task set the audit container forit's children (or itself/threads if we decide to allow that, seeabove). There *has* to be a function to check to see if a task if achild of a given task ... right? ... although this is likely to be apointer traversal and locking nightmare ... hmmm.

Isn't that just (struct task_struct)parent == (structtask_struct)child->parent (or ->real_parent)?And now that I say that, it is covered by the following patch's childcheck, so as long as we keep that, we should be fine.

I was thinking of checking not just current's immediate children, butany of it's descendants as I believe that is what we want to limit,yes? I just worry that it isn't really practical to perform thatcheck.

The child check I'm talking about prevents setting a task's auditcontainer ID if it *has* any children or threads, so if it has childrenit is automatically disqualified and its grandchildren are irrelevant.

Post by Paul MooreI ask because I suppose it might be possible for some containerruntime to do a fork, setup some of the environment and them exec thecontainer (before you answer the obvious "namespaces!" please rememberwe're not trying to define containers).

I don't think namespaces have any bearing on this concern since none arerequired.

I'm looking at the parent checks again and I wonder if the logic aboveis what we really want. Maybe it is, but I'm not sure.* "ccontainerid" and "containerid" are too close in name, I keptconfusing myself when looking at this code. Please change one. Bonuspoints if it is shorter.

Would c_containerid and p_containerid be ok? child_cid and parent_cid?I'd really like it to have the same root as the parameter handed in soteh code is easier to follow. It would be nice to have that acrosscaller to local, but that's challenging.

I've been tempted to use contid or even cid everywhere instead ofcontainerid. Perhaps the longer name doesn't bother me because Ilike its uniqueness and I learned touch-typing in grade 9 and I like100+ character wide terminals? ;-)

Post by Paul Moore* What if the orchestrator wants to move the task to a new container?Right now it looks like you can only do that once, then then thetask's audit container ID will no longer be the same as real_parent

A task's audit container ID can be unset or inherited, and then setonly once. After that, if you want it moved to a new container youcan't and your only option is to spawn another peer to that task or achild of it and set that new task's audit container ID.

Currently, the method of detecting if its audit container ID has beenset (rather than inherited) was to check its parent's audit containerID. The only reason to change this might be if the audit container IDwere not inheritable, but then we lose the accountability of a taskspawning another process and being able to leave its child's auditcontainer ID unset and unaccountable to any existing container. I thinkthe relationship to the parent is crucial, and if something wants tochange audit container ID it can, by spawning childrent and leaving atrail of container IDs in its parent processes. (So what if a parentdies?)

Post by Paul Moore... or does the orchestrator change that? *Can* the orchestratorchange real_parent (I suspect the answer is "no")?

I don't think the orchestrator is able to change real_parent. I'veforgotten why there is a ->parent and ->real_parent and how they canchange. One is for the wait signal. I don't remember the purpose ofthe other.

If the parent dies before the child, the child will be re-parented onits grandparent if the parent doesn't hang around zombified, if Iunderstand correctly. If anything, a parent dying would likely furtherrestrict the ability to set a task's audit container ID because a parentwith an identical ID could vanish.

Post by Paul Moore* I think the key is the relationship between current and task, notbetween task and task->real_parent. I believe what we really careabout is that task is a descendant of current. We might also want toallow current to change the audit container ID if it holdsCAP_AUDIT_CONTROL, regardless of it's relationship with task.

Currently, a process with CAP_AUDIT_CONTROL can set the audit containerID of any task that hasn't got children or threads, isn't itself, andits audit container ID is inherited or unset. This was to try toprevent games with parents and children scratching each other's backs.

I would feel more comfortable if only descendants were settable, soadding that restriction sounds like a good idea to me other than thetree-climbing excercise and overhead involved.

Why are audit_set_containerid_perm() and audit_log_containerid()separate functions?

(I assume you mean audit_log_set_containerid()?)

Yep. My fingers got tired typing in that function name and decided ashortcut was necessary.

Post by Richard Guy BriggsIt seemed clearer that all the permission checking was in one functionand its return code could be used to report the outcome when logging the(attempted) action. This is the same structure as audit_set_loginuid()and it made sense.

When possible I really like it when the permission checks are in thesame function as the code which does the work; it's less likely to getabused that way (you have to willfully bypass the access checks). Theexceptions might be if you wanted to reuse the access control code, orinsert a modular access mechanism (e.g. LSMs).

I don't follow how it could be abused. The return code from the permcheck gates setting the value and is used in the success field in thelog.

If the permission checks are in the same function body as the codewhich does the work you have to either split the function, or rewriteit, if you want to bypass the permission checks. It may be more of astyle issue than an actual safety issue, but the comments aboutsingle-use functions in the same scope is the tie breaker.

Perhaps I'm just being quite dense, but I just don't follow what theproblem is and how you suggest fixing it. A bunch of gotos to a labelsuch as "out:" to log the refused action? That seems messy andunstructured.

Post by Richard Guy BriggsImplement the proc fs write to set the audit container ID of a process,emitting an AUDIT_CONTAINER record to document the event.This is a write from the container orchestrator task to a proc entry ofthe form /proc/PID/containerid where PID is the process ID of the newlycreated task that is to become the first task in a container, or anadditional task added to a container.The write expects up to a u64 value (unset: 18446744073709551615).type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0The "op" field indicates an initial set. The "pid" to "ses" fields arethe orchestrator while the "opid" field is the object's PID, the processbeing "contained". Old and new container ID values are given in the"contid" fields, while res indicates its success.It is not permitted to self-set, unset or re-set the container ID. Achild inherits its parent's container ID, but then can be set only onceafter.See: https://github.com/linux-audit/audit-kernel/issues/32---fs/proc/base.c | 37 ++++++++++++++++++++include/linux/audit.h | 16 +++++++++include/linux/init_task.h | 4 ++-include/linux/sched.h | 1 +include/uapi/linux/audit.h | 2 ++kernel/auditsc.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++6 files changed, 143 insertions(+), 1 deletion(-)

Post by Richard Guy BriggsHaving said that, I'm still not sure we have protected sufficiently froma child turning around and setting it's parent's as yet unset orinherited audit container ID.

Yes, I believe we only want to let a task set the audit container forit's children (or itself/threads if we decide to allow that, seeabove). There *has* to be a function to check to see if a task if achild of a given task ... right? ... although this is likely to be apointer traversal and locking nightmare ... hmmm.

Isn't that just (struct task_struct)parent == (structtask_struct)child->parent (or ->real_parent)?And now that I say that, it is covered by the following patch's childcheck, so as long as we keep that, we should be fine.

I was thinking of checking not just current's immediate children, butany of it's descendants as I believe that is what we want to limit,yes? I just worry that it isn't really practical to perform thatcheck.

The child check I'm talking about prevents setting a task's auditcontainer ID if it *has* any children or threads, so if it has childrenit is automatically disqualified and its grandchildren are irrelevant.

Post by Paul MooreI ask because I suppose it might be possible for some containerruntime to do a fork, setup some of the environment and them exec thecontainer (before you answer the obvious "namespaces!" please rememberwe're not trying to define containers).

I don't think namespaces have any bearing on this concern since none arerequired.

I'm looking at the parent checks again and I wonder if the logic aboveis what we really want. Maybe it is, but I'm not sure.* "ccontainerid" and "containerid" are too close in name, I keptconfusing myself when looking at this code. Please change one. Bonuspoints if it is shorter.

Would c_containerid and p_containerid be ok? child_cid and parent_cid?

Either would be an improvement over ccontainerid/containerid. I wouldgive a slight node to child_cid/parent_cid just for length reasons.

Post by Richard Guy BriggsI'd really like it to have the same root as the parameter handed in soteh code is easier to follow. It would be nice to have that acrosscaller to local, but that's challenging.

That's fine, but you have to admit that ccontainerid/containerid isawkward and not easy to quickly differentiate :)

Post by Richard Guy BriggsI've been tempted to use contid or even cid everywhere instead ofcontainerid. Perhaps the longer name doesn't bother me because Ilike its uniqueness and I learned touch-typing in grade 9 and I like100+ character wide terminals? ;-)

I would definitely appreciate contid/cid or similar, but I don't caretoo much either way. As far as terminal width is concerned, pleasemake sure your code fits in 80 char terminals.

Post by Paul Moore* What if the orchestrator wants to move the task to a new container?Right now it looks like you can only do that once, then then thetask's audit container ID will no longer be the same as real_parent

A task's audit container ID can be unset or inherited, and then setonly once. After that, if you want it moved to a new container youcan't and your only option is to spawn another peer to that task or achild of it and set that new task's audit container ID.

Okay. We've had some many discussions about this both on and off listthat I lose track on where we stand for certain things. I thinkpreventing task movement is fine for the initial effort so long as wedon't prevent adding it in the future; I don't see anything (otherthan the permission checks under discussion, which is fine) preventingthis.

Post by Richard Guy BriggsCurrently, the method of detecting if its audit container ID has beenset (rather than inherited) was to check its parent's audit containerID.

Yeah ... those are two different things. I've been wondering if weshould introduce a set/inherited flag as simply checking the parenttask's audit container ID isn't quite the same; although it may be"close enough" that it doesn't matter in practice. However, I'mbeginning to think this parent/child relationship isn't reallyimportant beyond the inheritance issue ... more on this below.

Post by Richard Guy BriggsThe only reason to change this might be if the audit container IDwere not inheritable, but then we lose the accountability of a taskspawning another process and being able to leave its child's auditcontainer ID unset and unaccountable to any existing container. I thinkthe relationship to the parent is crucial, and if something wants tochange audit container ID it can, by spawning childrent and leaving atrail of container IDs in its parent processes. (So what if a parentdies?)

The audit container ID *must* be inherited, I don't really thinkanyone is questioning that. What I'm wondering about is what weaccomplish by comparing the child's and parent's audit container ID?

I've thought about this a bit more and I think we are making this waytoo complicated right now. We basically have three rules for theaudit container ID which we need to follow:

1. Children inherit their parent's audit container ID; this includesthe magic "unset" audit container ID.2. You can't change the audit container ID once set.3. In order to set the audit container ID of a process you must haveCAP_AUDIT_CONTROL.

With that in mind, I think the permission checks would be something like this:

[SIDE NOTE: Audit Container ID in acronym form works out to "acid" ;) ]

Post by Paul Moore... or does the orchestrator change that? *Can* the orchestratorchange real_parent (I suspect the answer is "no")?

I don't think the orchestrator is able to change real_parent.

I didn't think so either, but I didn't do an exhaustive check.

Post by Richard Guy BriggsI've forgotten why there is a ->parent and ->real_parent and how they canchange. One is for the wait signal. I don't remember the purpose ofthe other.

I know ptrace makes use of real_parent when re-parenting the processbeing ptrace'd.

Post by Richard Guy BriggsIf the parent dies before the child, the child will be re-parented onits grandparent if the parent doesn't hang around zombified, if Iunderstand correctly. If anything, a parent dying would likely furtherrestrict the ability to set a task's audit container ID because a parentwith an identical ID could vanish.

All the more reason to go with the simplified approach above. I thinkthe parent/child relationship is a bit of a distraction and acomplexity that isn't important (except for the inheritance ofcourse).

Post by Paul Moore* I think the key is the relationship between current and task, notbetween task and task->real_parent. I believe what we really careabout is that task is a descendant of current. We might also want toallow current to change the audit container ID if it holdsCAP_AUDIT_CONTROL, regardless of it's relationship with task.

Currently, a process with CAP_AUDIT_CONTROL can set the audit containerID of any task that hasn't got children or threads, isn't itself, andits audit container ID is inherited or unset. This was to try toprevent games with parents and children scratching each other's backs.I would feel more comfortable if only descendants were settable, soadding that restriction sounds like a good idea to me other than thetree-climbing excercise and overhead involved.

Why are audit_set_containerid_perm() and audit_log_containerid()separate functions?

(I assume you mean audit_log_set_containerid()?)

Yep. My fingers got tired typing in that function name and decided ashortcut was necessary.

Post by Richard Guy BriggsIt seemed clearer that all the permission checking was in one functionand its return code could be used to report the outcome when logging the(attempted) action. This is the same structure as audit_set_loginuid()and it made sense.

When possible I really like it when the permission checks are in thesame function as the code which does the work; it's less likely to getabused that way (you have to willfully bypass the access checks). Theexceptions might be if you wanted to reuse the access control code, orinsert a modular access mechanism (e.g. LSMs).

I don't follow how it could be abused. The return code from the permcheck gates setting the value and is used in the success field in thelog.

If the permission checks are in the same function body as the codewhich does the work you have to either split the function, or rewriteit, if you want to bypass the permission checks. It may be more of astyle issue than an actual safety issue, but the comments aboutsingle-use functions in the same scope is the tie breaker.

Perhaps I'm just being quite dense, but I just don't follow what theproblem is and how you suggest fixing it. A bunch of gotos to a labelsuch as "out:" to log the refused action? That seems messy andunstructured.

This one line addition to the task_struct scares me the most ofanything in this patchset. Why? It's a field named "containerid" ina perhaps one of the most widely used core kernel structures; thepossibilities for abuse are endless, and it's foolish to think wewould ever be able to adequately police this.Unfortunately, we can't add the field to audit_context as thingscurrently stand because we don't always allocate an audit_context,it's dependent on the system's configuration, and we need to track theaudit container ID for a given process, regardless of the auditconfiguration. Pretty much the same reason why loginuid and sessionidare located directly in task_struct now. As I stressed during thedesign phase, I really want to keep this as an *audit* container IDand not a general purpose kernel wide container ID. If the kernelever grows a general purpose container ID token, I'll be the first inline to convert the audit code, but I don't want audit to be thatgeneral purpose mechanism ... audit is hated enough as-is ;)I think the right solution to this is to create another new struct,audit_task_info (or similar, the name really isn't that important),which would be stored as a pointer in task_struct and would replacethe audit_context pointer, loginuid, sessionid, and the newly proposedcontainerid. The new audit_task_info would always be allocated in theaudit_alloc() function (please use kmem_cache), and the audit_contextpointer included inside would continue to be allocated based on theexisting conditions. By keeping audit_task_info as a pointer insidetask_struct we could hide the structure definition insidekernel/audit*.c and make it much more difficult for other subsystemsto abuse it.[1]struct audit_task_info {kuid_t loginuid;unsigned int sessionid;u64 containerid;struct audit_context *ctx;}Actually, we might even want to consider storing audit_context inaudit_task_info (no pointer), or making it a zero length array(ctx[0]) and going with a variable sized allocation of audit_task_info... but all that could be done as a follow up optimization once we getthe basic idea sorted.

I tried statically allocating struct audit_task_info (with a pointer tostruct audit_context) in addition to dynamically allocating structaudit_task_info due to a bug I'd introduced while dynamically allocatingaudit_task_info, so I now have proof-of-concepts for working static andalmost working dynamic allocated struct audit_task_info.

Statically allocating it required a new header file, so I'm not thatcrazy about it, but it proved it works.

Dynamically allocating it isn't quite as clean as was hoped sinceinit/init_task.c still needs initializaiton values for loginuid andsessionid which could be supplied by a statically allocated structaudit_task_info and still needs to know the internals of that struct todo so. Dynamic allocation is also more disruptive initially, but in thelong run will be more stable to the rest of the kernel.

I'm not crazy about the idea of dynamically (or even statically)allocating struct audit_task_info which includes allocated space forstruct audit_context since the latter is far larger than the former.

Post by Paul Moore[1] If for some reason allocating audit_task_info becomes too muchoverhead to bear (somewhat doubtful since we would only do it at taskcreation), we could do some ugly tricks to directly include anaudit_task_struct chunk in task_struct but I'd like to avoid that ifpossible (and I think we can).

On allocation, I don't see too much of a problem. When callingaudit_free() if there is no audit context it is pretty lightweight, butgets heavier if we eliminate the inline audit_free() and rename__audit_free() back to audit_free(). Having struct audit_task_infodirectly in struct task_struct would be faster and also allow defaultsto be set in init/init_task.c (which has recently been populated frominclude/linux/init_task.h). I'm not sure this is enough of a reason toavoid a pointer from task_struct.

(As an aside, converting allocation of audit_context could also benefitfrom kmem_cache... and maybe even struct audit_names)