The Linux audit system needs a way to be able to track the container provenance of events and actions. Audit needs the kernel's help to do this.

Since the concept of a container is entirely a userspace concept, a registration from the userspace container orchestration system initiates this. This will define a point in time and a set of resources associated with a particular container with an audit container identifier.

The registration is a u64 representing the audit container identifier written to a special file in a pseudo filesystem (proc, since PID tree already exists) representing a process that will become a parent process in that container. This write might place restrictions on mount namespaces required to define a container, or at least careful checking of namespaces in the kernel to verify permissions of the orchestrator so it can't change its own container ID. A bind mount of nsfs may be necessary in the container orchestrator's mount namespace. This write can only happen once per process.
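From the orchestrator's side, the registration described above could look roughly like the following sketch. The file name `audit_containerid`, the decimal ASCII encoding and the helper name `register_container` are assumptions made for illustration; the proposal only says that a u64 is written to a special per-process file under proc.

```python
import os

U64_MAX = 2**64 - 1


def register_container(pid, contid, proc_root="/proc",
                       filename="audit_containerid"):
    """Write an audit container identifier for the process `pid`.

    `filename` is a hypothetical name; the proposal does not name the file.
    The decimal ASCII encoding is also an assumption.
    """
    if not isinstance(contid, int) or not 0 <= contid <= U64_MAX:
        raise ValueError("audit container identifier must fit in a u64")
    path = os.path.join(proc_root, str(pid), filename)
    # On a real kernel this write would need CAP_AUDIT_CONTROL and would be
    # expected to fail on a second attempt for the same process (write-once).
    with open(path, "w") as f:
        f.write(str(contid))
    return path
```

The range check happens in userspace here only so that an oversized identifier is rejected before anything is written; the kernel would have to validate the value in any case.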

Note: The justification for using a u64 is that it minimizes the information printed in every audit record, reducing bandwidth, and limits comparisons to a single u64, which will be faster and less error-prone.

Require CAP_AUDIT_CONTROL to be able to carry out the registration. At that time, record the target container's user-supplied audit container identifier along with a target container's parent process (which may become the target container's "init" process) process ID (referenced from the initial PID namespace) in a new record AUDIT_CONTAINER with a qualifying op=$action field.

Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid container ID present on an auditable action or event.

Forked and cloned processes inherit their parent's audit container identifier, referenced in the process' task_struct. Since the audit container identifier is inherited rather than written, it can still be written once. This will prevent tampering while allowing nesting. (This can be implemented with an internal settable flag upon registration that does not get copied across a fork/clone.)

Mimic setns(2) and return an error if the process has already initiated threading or forked since this registration should happen before the process execution is started by the orchestrator and hence should not yet have any threads or children. If this is deemed overly restrictive, switch all of the target's threads and children to the new container ID.
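The inheritance, write-once and no-threads-or-children rules from the two paragraphs above can be modelled in a few lines. This is a toy userspace model, not kernel code; the class `Task`, its attribute names and the exception types are all invented for illustration.

```python
class Task:
    """Toy model of the per-task state described above (not kernel code)."""

    def __init__(self, contid=None):
        self.contid = contid          # audit container identifier, None if unset
        self.contid_written = False   # internal flag, not copied on fork
        self.children = []
        self.threads = 1

    def fork(self):
        # The child inherits the identifier but not the "written" flag,
        # so it can still be registered once itself (nesting).
        child = Task(self.contid)
        self.children.append(child)
        return child

    def set_contid(self, contid, has_cap_audit_control=True):
        if not has_cap_audit_control:
            raise PermissionError("CAP_AUDIT_CONTROL required")
        if self.contid_written:
            raise PermissionError("identifier already written for this task")
        if self.children or self.threads > 1:
            # The restriction the proposal models on setns(2).
            raise RuntimeError("task already has threads or children")
        self.contid = contid
        self.contid_written = True
```

In this model an orchestrator forks a target, registers it once, and any process the target later forks starts with the same identifier but may itself be registered once into a nested container.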

Trust the orchestrator to judiciously use and restrict CAP_AUDIT_CONTROL.

When a container ceases to exist because the last process in that container has exited, log the fact to balance the registration action. (This is likely needed for certification accountability.)

At this point it appears unnecessary to add a container session identifier since this is all tracked from loginuid and sessionid to communicate with the container orchestrator to spawn an additional session into an existing container which would be logged. It can be added at a later date without breaking API should it be deemed necessary.

The following namespace logging actions are not needed for certification purposes at this point, but are helpful for tracking namespace activity. These are auxiliary records that are associated with namespace manipulation syscalls unshare(2), clone(2) and setns(2), so the records will only show up if explicit syscall rules have been added to document this activity.

Log the creation of every namespace, inheriting/adding its spawning process' audit container identifier(s), if applicable. Include the spawning and spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track container IDs apart from processes since incoming packets may cause an auditable event before being associated with a process. Since a namespace can be shared by processes in different containers, the namespace will need to track all containers to which it has been assigned.

Upon registration, the target process' namespace IDs (in the form of a nsfs device number and inode number tuple) will be recorded in an AUDIT_NS_INFO auxiliary record.
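For reference, the (device number, inode number) tuples mentioned above can already be read from userspace by calling stat on the entries under `/proc/<pid>/ns`. A minimal sketch follows; the function name `namespace_ids` is made up here, and the record format the kernel would emit is not shown because the proposal does not define it.

```python
import os


def namespace_ids(pid="self", proc_root="/proc"):
    """Return {namespace name: (device number, inode number)} for a process."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    ids = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return ids  # no procfs here, or not permitted
    for name in sorted(names):
        try:
            # stat follows the link, so this describes the nsfs inode itself.
            st = os.stat(os.path.join(ns_dir, name))
        except OSError:
            continue
        ids[name] = (st.st_dev, st.st_ino)
    return ids
```

Two processes are in the same namespace of a given type exactly when these tuples match for that entry.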

Log the destruction of every namespace that is no longer used by any process, including the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]

Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action) the parent and child namespace IDs for any changes to a process' namespaces.
[setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and distinguish them with an op=$action field depending on the fields required for each message type.

The audit container identifier will need to be reaped from all implicated namespaces upon the destruction of a container.
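The bookkeeping implied here, where a namespace remembers every container it has been assigned to and a destroyed container is reaped from all of them, can be sketched as a map from namespace ID to a set of identifiers. This is only a toy model; `NamespaceTracker` and its method names are invented for illustration and say nothing about how the kernel would store this.

```python
from collections import defaultdict


class NamespaceTracker:
    """Toy model: each namespace remembers every container assigned to it."""

    def __init__(self):
        # (device, inode) tuple -> set of audit container identifiers
        self.containers_by_ns = defaultdict(set)

    def assign(self, ns_id, contid):
        self.containers_by_ns[ns_id].add(contid)

    def containers_for(self, ns_id):
        # E.g. used to annotate a packet event that has no process yet.
        return set(self.containers_by_ns.get(ns_id, ()))

    def reap_container(self, contid):
        # On container destruction, drop its identifier everywhere.
        for ns_id in list(self.containers_by_ns):
            self.containers_by_ns[ns_id].discard(contid)
            if not self.containers_by_ns[ns_id]:
                del self.containers_by_ns[ns_id]
```

A network namespace shared by two containers would report both identifiers for an incoming packet until one of the containers is destroyed and reaped.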

This namespace information adds supporting information for tracking events not attributable to specific processes.

Changelog:

(Upstream V3)
- switch back to u64 (from pmoore, can be expanded to u128 in future if need arises without breaking API. u32 was originally proposed, up to c36 discussed)
- write-once, but children inherit audit container identifier and can then still be written once
- switch to CAP_AUDIT_CONTROL
- group namespace actions together, auxiliary records to namespace operations.

(Upstream V2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and children into same container

Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.

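I am trying to understand the back and forth on the ID size. From an orchestrator POV anything that requires tracking a node specific ID is not ideal. Orchestrators tend to span many nodes, and containers tend to have IDs that are either UUID or have a Hash (like SHA256) as identifier.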

The problem here is two-fold:

a) Your auditing requires some mapping to be useful outside of the system. If you aggregate audit logs outside of the system or you want to correlate the system audit logs with other components dealing with containers, now you need a place where you provide a mapping from your audit u64 to the ID a container has in the rest of the system.

b) Now you need a mapping of some sort. The simplest way a container orchestrator can go about this is to just use the UUID or Hash representing their view of the container, truncate it to a u64 and use that for Audit. This means there are some chances there will be a collision and a duplicate u64 ID will be used by the orchestrator as the container ID. What happens in that case?
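To put a rough number on the collision concern in (b), the truncation and the usual birthday-bound estimate can be sketched as follows. This assumes the truncated values behave like uniformly random 64-bit numbers (reasonable for the leading bits of a SHA-256 digest); the function names are invented here. It does not say what the kernel should do on a duplicate, only how likely one is.

```python
import math


def truncate_id_to_u64(hex_id):
    """Keep the first 64 bits of a hex identifier (e.g. a SHA-256 digest)."""
    return int(hex_id[:16], 16)


def collision_probability(n, bits=64):
    """Birthday-bound estimate that n uniformly random IDs are not all distinct."""
    return -math.expm1(-n * (n - 1) / (2 * 2**bits))
```

Under that assumption the estimate stays below one in ten million for a million identifiers, and reaches roughly 39% at 2^32 identifiers.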


Paul, can you justify this somewhat larger inconvenience for some relatively minor convenience on our part? u64 vs u128 is easy for us to accommodate in terms of scalar comparisons. It doubles the information in every container id field we print in audit records. A c36 is a bigger step.
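The size difference under discussion can be made concrete by counting the characters each candidate would occupy in a printed record field. This sketch assumes "c36" refers to the 36-character textual form of a UUID, and it shows both decimal and hexadecimal widths because the thread does not say which representation audit would print.

```python
import uuid


def field_widths():
    """Maximum printed width, in characters, of each candidate ID format."""
    u64_max = 2**64 - 1
    u128_max = 2**128 - 1
    return {
        "u64 decimal": len(str(u64_max)),
        "u64 hex": len(format(u64_max, "x")),
        "u128 decimal": len(str(u128_max)),
        "u128 hex": len(format(u128_max, "x")),
        "c36": len(str(uuid.uuid4())),  # assumed meaning of "c36"
    }
```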


Paul, can you justify this somewhat larger inconvenience for some relatively minor convenience on our part?

Done in direct response to Simo.

But to be clear Richard, we've talked about this a few times, it's not a "minor convenience" on our part, it's a pretty big convenience once we start having to route audit events and make decisions based on the audit container ID information. Audit performance is less than awesome now, I'm working hard to not make it worse.

Post by Richard Guy Briggs
u64 vs u128 is easy for us to accommodate in terms of scalar comparisons. It doubles the information in every container id field we print in audit records.


Paul, can you justify this somewhat larger inconvenience for some relatively minor convenience on our part?

Done in direct response to Simo.

Sorry but your response sounds more like waving away than addressing them, the excuse being: we can't please everyone, so we are going to please no one.

Post by Paul Moore
But to be clear Richard, we've talked about this a few times, it's not a "minor convenience" on our part, it's a pretty big convenience once we start having to route audit events and make decisions based on the audit container ID information. Audit performance is less than awesome now, I'm working hard to not make it worse.

I can see why you do not want to have arbitrary length strings, but a u128 sounded like a reasonable compromise to me as it has enough room to be able to have unique cluster-wide IDs which a u64 definitely makes a lot harder to provide w/o tight coordination.

Post by Richard Guy Briggs
Paul, can you justify this somewhat larger inconvenience for some relatively minor convenience on our part?

Done in direct response to Simo.

Sorry but your response sounds more like waving away than addressing them, the excuse being: we can't please everyone, so we are going to please no one.

I obviously disagree with the take on my comments but you're free to your opinion.

I believe saying we are pleasing no one isn't really fair now is it? Is there any type of audit container ID now? How would you go about associating audit events with containers now? (spoiler alert: it ain't pretty, and there are gaps I don't believe you can cover) This proposal provides a mechanism to do this in a way that isn't tied to any one particular concept of a container and is manageable inside the kernel.

If you have a need to track audit events for containers, I find it extremely hard to believe that you are not at least partially pleased by the solutions presented here. It may not be everything on your wishlist, but when did you ever get *everything* on your wishlist?

Post by Paul Moore
But to be clear Richard, we've talked about this a few times, it's not a "minor convenience" on our part, it's a pretty big convenience once we start having to route audit events and make decisions based on the audit container ID information. Audit performance is less than awesome now, I'm working hard to not make it worse.

Sounds like a security vs performance trade off to me.

Welcome to software development. It's generally a pretty terrible hobby and/or occupation, but we make up for it with long hours and endless frustration.

Post by Richard Guy Briggs
u64 vs u128 is easy for us to accommodate in terms of scalar comparisons. It doubles the information in every container id field we print in audit records.

... and slows down audit container ID checks.

Are you saying a cmp on a u128 is slower than a comparison on a u64 and this is something that will be noticeable?

Do you have a 128 bit system? I don't. I've got a bunch of 64 bit systems, and a couple of 32 bit systems too. People that use audit have a tendency to really hammer on it, to the point that we get performance complaints on a not infrequent basis. I don't know the exact number of times we are going to need to check the audit container ID, but it's reasonable to think that we'll expose it as a filter-able field which adds a few checks, we'll use it for record routing so that's a few more, and if we're running multiple audit daemons we will probably want to include LSM checks which could result in a few more audit container ID checks. If it was one comparison I wouldn't be too worried about it, but the point I'm trying to make is that we don't know what the implementation is going to look like yet and I suspect this ID is going to be leveraged in several places in the audit subsystem and I would much rather start small to save headaches later.

We can always expand the ID to a larger integer at a later date, but we can't make it smaller.

Ok, I can see your point though I do not agree with it.

I can see why you do not want to have arbitrary length strings, but a u128 sounded like a reasonable compromise to me as it has enough room to be able to have unique cluster-wide IDs which a u64 definitely makes a lot harder to provide w/o tight coordination.

I originally wanted it to be a 32-bit integer, but Richard managed to talk me into 64-bits, that was my compromise :)

As I said earlier, if you are doing container auditing you're going to need coordination with the orchestrator, regardless of the audit container ID size.

I obviously disagree with the take on my comments but you're free to your opinion.

I believe saying we are pleasing no one isn't really fair now is it? Is there any type of audit container ID now? How would you go about associating audit events with containers now? (spoiler alert: it ain't pretty, and there are gaps I don't believe you can cover) This proposal provides a mechanism to do this in a way that isn't tied to any one particular concept of a container and is manageable inside the kernel.

If you have a need to track audit events for containers, I find it extremely hard to believe that you are not at least partially pleased by the solutions presented here. It may not be everything on your wishlist, but when did you ever get *everything* on your wishlist?

I am going to back Paul 100% on this point. The container community's emphatic position that containers are strictly a user-space construct makes it impossible for the kernel to provide any data more sophisticated than an integer, and any processing based on that data cleverer than a check for equality.

Sounds like a security vs performance trade off to me.

Without the kernel having a "container" policy to work with there is no "security" it can possibly enforce.

Well, of course you are going to please the audit subsystem, I understand that. I think there is a problem of expectations. Some people, me included, hoped to have a way to identify a container with the help of the kernel.

Post by Paul Moore
Is there any type of audit container ID now? How would you go about associating audit events with containers now?

We do not have a good way, there are some dirty tricks like inferring the container identity via cgroup names, but that is ... eww.

This is why, given audit has the same need of user space, there was some hope we could agree on an identifier that could be used by both. It would make correlating audit logs and other cluster-wide events simpler. That is all.

Post by Paul Moore
(spoiler alert: it ain't pretty, and there are gaps I don't believe you can cover) This proposal provides a mechanism to do this in a way that isn't tied to any one particular concept of a container and is manageable inside the kernel.

I like the proposal for the most part, we are just discussing the nature of the identifier, which is a minor detail in the end.

Post by Paul Moore
If you have a need to track audit events for containers, I find it extremely hard to believe that you are not at least partially pleased by the solutions presented here. It may not be everything on your wishlist, but when did you ever get *everything* on your wishlist?

It is true, and I am sorry if I came out demanding or abrasive. It was not my intention. Of course a u64 that has to be mapped is still better than nothing. It does cause a lot more work in user space, but it is not impossible to deal with.

Do you have a 128 bit system?

no, but all 64bit systems have an instruction that allows you to do atomic 128 compare and swap (IIRC?).

Post by Paul Moore
I don't. I've got a bunch of 64 bit systems, and a couple of 32 bit systems too. People that use audit have a tendency to really hammer on it, to the point that we get performance complaints on a not infrequent basis. I don't know the exact number of times we are going to need to check the audit container ID, but it's reasonable to think that we'll expose it as a filter-able field which adds a few checks, we'll use it for record routing so that's a few more, and if we're running multiple audit daemons we will probably want to include LSM checks which could result in a few more audit container ID checks. If it was one comparison I wouldn't be too worried about it, but the point I'm trying to make is that we don't know what the implementation is going to look like yet and I suspect this ID is going to be leveraged in several places in the audit subsystem and I would much rather start small to save headaches later.

We can always expand the ID to a larger integer at a later date, but we can't make it smaller.

Well looking through the history of in kernel identifiers I know it is hard also to increase size, because userspace will end up depending on a specific size ... and this is the only reason I am really debating this. If it were really easy to change I wouldn't bother to do it now.

I originally wanted it to be a 32-bit integer, but Richard managed to talk me into 64-bits, that was my compromise :)

As I said earlier, if you are doing container auditing you're going to need coordination with the orchestrator, regardless of the audit container ID size.

Ok, I guess that's as good as I can get it for now, thank you for your patient explanations.

I am trying to understand the back and forth on the ID size.

I'm just now getting a chance to read Richard's latest draft, but I wanted to comment on this quickly.

There are two main reasons for keeping this a 32 or 64 bit integer:

1) After the initial "be able to associate audit events with a container" stage, we are going to look into supporting multiple audit daemons on the system so that you could run an audit daemon inside a container and it would collect events generated by the container (we're tentatively calling this "phase 2", feel free to insert your own "magic happens" joke). There are a lot of things that need to happen in phase two, one of these things is the addition of an audit event routing mechanism that will send audit records to the right audit daemons (the "host" daemon will always see everything), in order to do this we will need to be able to quickly compare audit container IDs, this means an integer.

2) Whatever we pick for an audit container ID it is going to be wrong for at least one container orchestrator. There is no "one" solution here, so we are providing a small and flexible mechanism that higher level orchestrators can use to provide a more complete solution.
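The routing idea in reason 1 can be sketched in a few lines of Python. This is only a userspace toy model under stated assumptions: the AuditRouter class, the daemon names, and the "host" constant are all hypothetical, and nothing here reflects how the kernel would actually implement "phase 2". It shows why an integer key is convenient: delivery reduces to an equality lookup on the audit container ID, with the host daemon always included.

```python
from collections import defaultdict

HOST = "host"  # hypothetical name for the daemon that sees everything


class AuditRouter:
    """Toy model: map audit container IDs to the daemons that want them."""

    def __init__(self):
        # contid (int) -> list of daemon names subscribed to that container
        self._subscribers = defaultdict(list)

    def subscribe(self, contid: int, daemon: str) -> None:
        self._subscribers[contid].append(daemon)

    def route(self, record_contid):
        """Return the daemons that should receive a record.

        `record_contid` is None for events with no container ID.
        """
        targets = [HOST]
        if record_contid is not None:
            # .get() avoids creating empty entries for unknown IDs
            targets.extend(self._subscribers.get(record_contid, []))
        return targets


if __name__ == "__main__":
    router = AuditRouter()
    router.subscribe(42, "auditd-in-container-42")
    print(router.route(42))
    print(router.route(7))
```

Each additional place the ID is consulted (filters, routing, LSM checks) is another comparison of this kind, which is the cost being weighed against a wider identifier.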

specific ID is not ideal. Orchestrators tend to span many nodes, and containers tend to have IDs that are either UUID or have a Hash (like SHA256) as identifier.

You're helping me prove my reason #2.

Post by Simo Sorce
a) Your auditing requires some mapping to be useful outside of the system. If you aggregate audit logs outside of the system or you want to correlate the system audit logs with other components dealing with containers, now you need a place where you provide a mapping from your audit u64 to the ID a container has in the rest of the system.

Yep, see my reason #2. I want us to have something that "works" for a single system as well as something that can be leveraged by higher level tools for large networks of machines.

I realize it's easy, and tempting, to expand the scope of this effort; but if we are to have any success it is only going to be through some discipline. We need to focus on a small solution which addresses the basic needs and hopefully remains flexible enough for any potential expansion while staying palatable to the audit folks and the general kernel community.

Post by Simo Sorce
b) Now you need a mapping of some sort. The simplest way a container orchestrator can go about this is to just use the UUID or Hash representing their view of the container, truncate it to a u64 and use that for Audit. This means there are some chances there will be a collision and a duplicate u64 ID will be used by the orchestrator as the container ID. What happens in that case?

That is a design decision left to the different container orchestrators.

Please let's have a description of the problem you are trying to solve.

A proposed solution without talking about the problem space is useless. Any proposed solution could potentially work.

I know to these exist. There is motivation for your work. What is the motivation? What problem are you trying to solve?

In particular what information are you trying to get into logs that you can not get into the logs today?

I am going to try to give this the attention it deserves but right now I am having to deal with half thought out patches for information leaks from speculative code paths, so I won't be able to give this much attention for a little bit.

Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them. The Linux audit system needs a way to be able to track the container provenance of events and actions. Audit needs the kernel's help to do this.

Two small comments below, but I tend to think we are at a point where you can start cobbling together some prototype/RFC patches. Surely there are going to be a few changes, and new comments, that come out once we see an initial implementation so let's see what those are.

Post by Richard Guy Briggs
The registration is a u64 representing the audit container identifier written to a special file in a pseudo filesystem (proc, since PID tree already exists) representing a process that will become a parent process in that container. This write might place restrictions on mount namespaces required to define a container, or at least careful checking of namespaces in the kernel to verify permissions of the orchestrator so it can't change its own container ID. A bind mount of nsfs may be necessary in the container orchestrator's mount namespace. This write can only happen once per process.

Note: The justification for using a u64 is that it minimizes the information printed in every audit record, reducing bandwidth and limits comparisons to a single u64 which will be faster and less error-prone.

I know Steve generally worries about audit record size, which is a perfectly valid concern in this case. I also worry about the additional overhead when we start routing audit records to multiple audit daemons (see my other emails in this thread).
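For readers following the write-once rule quoted above, here is a small Python model of the semantics as the proposal describes them: a process' audit container ID can be written once, children inherit the value, and an inherited value can still be written once because the "already written" flag is not copied across fork. The Task class and its method names are hypothetical, and the CAP_AUDIT_CONTROL check and the proc file itself are deliberately left out of this sketch.

```python
class Task:
    """Toy model of a process carrying an audit container ID."""

    def __init__(self, contid=None):
        self.contid = contid    # None means no container ID assigned
        self._written = False   # set on registration, NOT copied on fork

    def set_contid(self, contid: int) -> None:
        """Register a u64 container ID; allowed once per task."""
        if not 0 <= contid < 2**64:
            raise ValueError("audit container ID must fit in a u64")
        if self._written:
            raise PermissionError("container ID already written for this task")
        self.contid = contid
        self._written = True

    def fork(self):
        """Child inherits the ID but starts with a clear write flag."""
        return Task(contid=self.contid)


if __name__ == "__main__":
    orchestrator_child = Task()
    orchestrator_child.set_contid(1234)
    nested = orchestrator_child.fork()
    nested.set_contid(5678)  # allowed: inherited, not yet written
    print(orchestrator_child.contid, nested.contid)
```

In this model a task cannot change an ID that was explicitly written to it, while a nested orchestrator can still assign a new ID to a child it has just forked.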

Post by Richard Guy Briggs
...
When a container ceases to exist because the last process in that container has exited log the fact to balance the registration action. (This is likely needed for certification accountability.)

On the "container ceases to exist" point, I expect this "container dead" message to come from the orchestrator and not the kernel itself (I don't want the kernel to have to handle that level of bookkeeping). I imagine this should be similar to what is done for VM auditing with libvirt.

Two small comments below, but I tend to think we are at a point where you can start cobbling together some prototype/RFC patches. Surely

Agreed.

LGTM.

Post by Paul Moore
there are going to be a few changes, and new comments, that come out once we see an initial implementation so let's see what those are.