Quoting Joseph Salisbury (<email address hidden>):
> One additional question, do you happen to know if this is a regression?
> Did this not happen with previous releases/kernels?

This is not a regression; it has never worked right.

We believe the problem is that if a task is !dumpable, then the kernel
marks some of its /proc/pid files as owned by the global host root,
which is not mapped into a user namespace. If that is the case, then
the question is whether it is safe to mark them owned by the container
root; or whether we can distinguish between tasks which became dumpable
before switching namespaces; or whether there is something else we can
do.
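
A minimal sketch for checking this from userspace, assuming Linux with /proc mounted. The helper names task_is_dumpable() and proc_file_owner() are made up for illustration; only prctl(2) and stat(2) are real interfaces. It reads the calling task's dumpable flag and reports which uid owns one of its /proc files, so the two can be compared inside and outside a user namespace.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/stat.h>

/* Hypothetical helper: returns the dumpable flag of the calling task
 * (0 = not dumpable, 1 = dumpable), or -1 on error. */
int task_is_dumpable(void)
{
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}

/* Hypothetical helper: returns the uid owning `path` as seen from the
 * caller's user namespace, or -1 if the file cannot be stat()ed. */
long proc_file_owner(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_uid;
}

/* Prints the flag and the owner of /proc/self/status side by side. */
void report_proc_ownership(void)
{
    printf("dumpable=%d owner(/proc/self/status)=%ld\n",
           task_is_dumpable(), proc_file_owner("/proc/self/status"));
}
```

Calling report_proc_ownership() before and after prctl(PR_SET_DUMPABLE, 0, ...) in a container should show whether the owner changes to an unmapped uid, though the exact value reported for an unmapped owner depends on the kernel.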

Quoting Seth Forshee (<email address hidden>):
> I tried the kernel patch from the mailing list, but that doesn't fix the
> problem. It does fix permissions for most /proc/pid/* files in setuid
> processes, but the console problems remain.

stderr actually is mapped to a pty. The problem seems to be that getty can't set /dev/console as its controlling terminal because it's already the controlling tty for init, which is in a different process group. Thus getty ends up with no controlling tty, this is inherited by bash, and thus bash cannot set up job control.

Quoting Seth Forshee (<email address hidden>):
> stderr actually is mapped to a pty. The problem seems to be that getty
> can't set /dev/console as its controlling terminal because it's already
> the controlling tty for init, which is in a different process group.
> Thus getty ends up with no controlling tty, this is inherited by bash,
> and thus bash cannot set up job control.

Interesting.

Note that what you describe should also be the case if using a regular
container:

sudo lxc-create -t ubuntu-cloud -n u1
sudo lxc-start -n u1

Is the process group of init somehow ending up different in the user
namespace case? Or else why would this only be a problem in the
user namespace case?

On Tue, Jan 14, 2014 at 08:42:06PM -0000, Serge Hallyn wrote:
> Note that what you describe should also be the case if using a regular
> container
>
> sudo lxc-create -t ubuntu-cloud -n u1
> sudo lxc-start -n u1
>
> Is the process group of init somehow ending up different in the user
> namespace case? Or else why would this only be a problem in the
> user namespace case?

It is different. Here are the controlling ttys without user namespaces:

init should have its controlling terminal cleared when it calls
setsid(), so either it isn't calling setsid() or else setsid() is
failing. The reasons setsid() would fail are that the process is already
a session leader or that a process group with the same id already
exists. I haven't found how user namespaces would have any effect on
those things, however.

The same basic sequence of events happens with and without user namespaces. init sheds its tty with setsid() but then opens /dev/console, which has the effect of making /dev/console its controlling tty. Later getty also opens /dev/console and tries the TIOCSCTTY ioctl on the fd. At this point I think the following code in the kernel's handling of that ioctl comes into play:

I.e. getty doesn't have CAP_SYS_ADMIN and thus can't steal the console from init. I'm not sure what the fix is yet: whether there's something we can do here to allow root within a namespace to steal the console, or whether upstart just needs to explicitly shed the console after opening it.
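
To make the failure mode concrete, here is a small userspace model of the decision described above. It is not the kernel source: struct tty_model and set_ctty_model() are hypothetical names, sessions are reduced to plain integers, and the real ioctl performs further checks (for example that the caller is a session leader) that are left out. The only rule it encodes is that a tty already owned by another session can be taken only when arg == 1 and the caller has CAP_SYS_ADMIN.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's tty state. */
struct tty_model {
    int session; /* id of the session owning the tty, 0 if none */
};

/*
 * Model of a TIOCSCTTY request (assumption: heavily simplified).
 * Returns 0 if the caller's session now owns the tty, -EPERM if the
 * tty belongs to another session and may not be stolen.
 */
int set_ctty_model(struct tty_model *tty, int caller_session,
                   int arg, bool has_cap_sys_admin)
{
    if (tty->session == caller_session)
        return 0; /* already our controlling tty */

    if (tty->session != 0) {
        /* Owned by another session: stealing needs arg == 1 and
         * CAP_SYS_ADMIN, otherwise the request is refused. */
        if (!(arg == 1 && has_cap_sys_admin))
            return -EPERM;
    }

    tty->session = caller_session;
    return 0;
}
```

In this model, init holding the console (session 1) and getty asking without the capability corresponds to set_ctty_model(&console, 2, 1, false), which is refused and leaves getty without a controlling tty.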

On Wed, Jan 15, 2014 at 06:37:41PM -0000, Serge Hallyn wrote:
> If it is possible to get to the inode backing the tty at this point
> then we should be able to do inode_capable(tty_inode(tty),
> CAP_SYS_ADMIN), which should be safe and adequate, right?
>
> But I don't think we can get inode from tty. However we can get the

I'm new to how capabilities are handled with user namespaces, but at a
glance I think inode_capable() looks sufficient. We can't get the inode
from the tty but it could easily be passed as an argument to the function
containing that code.

> tty->session which is a struct pid*. So we can check whether we have
> ns_capable(ns_of_pid(tty->session), CAP_SYS_ADMIN)

Except that we're not interested in the capabilities of tty->session but
of current since current is the one doing the stealing. So that should
probably be ns_capable(current_user_ns(), CAP_SYS_ADMIN).

I'm thinking though we also need to verify that tty->session is in the
same namespace, otherwise nothing seems to prevent a less privileged
namespace from doing mknod and stealing any tty from another namespace,
which seems like a serious security issue. So something along the lines
of:

On Wed, Jan 15, 2014 at 07:53:54PM -0000, Seth Forshee wrote:
> On Wed, Jan 15, 2014 at 06:37:41PM -0000, Serge Hallyn wrote:
> > If it is possible to get to the inode backing the tty at this point
> > then we should be able to do inode_capable(tty_inode(tty),
> > CAP_SYS_ADMIN), which should be safe and adequate, right?
> >
> > But I don't think we can get inode from tty. However we can get the
>
> I'm new to how capabilities are handled with user namespaces, but at a
> glance I think inode_capable() looks sufficient. We can't get the inode
> from the tty but it could easily be passed as an argument to the function
> containing that code.
>
> > tty->session which is a struct pid*. So we can check whether we have
> > ns_capable(ns_of_pid(tty->session), CAP_SYS_ADMIN)
>
> Except that we're not interested in the capabilities of tty->session but
> of current since current is the one doing the stealing. So that should
> probably be ns_capable(current_user_ns(), CAP_SYS_ADMIN).
>
> I'm thinking though we also need to verify that tty->session is in the
> same namespace, otherwise nothing seems to prevent a less privileged
> namespace from doing mknod and stealing any tty from another namespace,
> which seems like a serious security issue. So something along the lines
> of:
>
> if (arg == 1 &&
>     (capable(CAP_SYS_ADMIN) ||
>      (current_user_namespace() == ns_of_pid(tty->session) &&
>       ns_capable(current_user_ns(), CAP_SYS_ADMIN)))) {
>         /* steal tty */
> }
>
> Or am I being too paranoid?

mknod isn't possible from a userns, otherwise we'd have a lot more
problems than just tty devices (think what would happen if I could mknod
sda in a container).

Quoting Seth Forshee (<email address hidden>):
> On Wed, Jan 15, 2014 at 06:37:41PM -0000, Serge Hallyn wrote:
> > If it is possible to get to the inode backing the tty at this point
> > then we should be able to do inode_capable(tty_inode(tty),
> > CAP_SYS_ADMIN), which should be safe and adequate, right?
> >
> > But I don't think we can get inode from tty. However we can get the
>
> I'm new to how capabilities are handled with user namespaces, but at a
> glance I think inode_capable() looks sufficient. We can't get the inode
> from the tty but it could easily be passed as an argument to the function
> containing that code.

The question actually remains: what do we need privilege toward? If
user A has file F open, and we are going to steal F from A... IIUC we
should already have checked for permission to access F, right? So now the
question is only whether we can take something from A, or whether A is
more privileged than us.

> > tty->session which is a struct pid*. So we can check whether we have
> > ns_capable(ns_of_pid(tty->session), CAP_SYS_ADMIN)
>
> Except that we're not interested in the capabilities of tty->session but

The ns_capable line doesn't check the capabilities of tty->session,
but rather current's capabilities targeted toward the user namespace
which owns tty->session.

> of current since current is the one doing the stealing. So that should
> probably be ns_capable(current_user_ns(), CAP_SYS_ADMIN).

That would check the privilege of current toward his own userns. Any
unprivileged user can clone(CLONE_NEWUSER) and have that test evaluate
to true.

> I'm thinking though we also need to verify that tty->session is in the
> same namespace, otherwise nothing seems to prevent a less privileged
> namespace from doing mknod and stealing any tty from another namespace,
> which seems like a serious security issue. So something along the lines
> of:
>
> if (arg == 1 &&
>     (capable(CAP_SYS_ADMIN) ||
>      (current_user_namespace() == ns_of_pid(tty->session) &&
>       ns_capable(current_user_ns(), CAP_SYS_ADMIN)))) {
>         /* steal tty */
> }
>
> Or am I being too paranoid?

That would be the point of doing:

ns_capable(ns_of_pid(tty->session), CAP_SYS_ADMIN)

If you are in a child userns of init, you do not have CAP_SYS_ADMIN
toward init's pidns.
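
The difference between checking against your own namespace and checking against the namespace that owns tty->session can be illustrated with a toy model. This is a simplified sketch of how a namespace-targeted capability check walks up the hierarchy, not the kernel code: struct userns_model, struct task_model and ns_capable_model() are hypothetical names, and a single boolean stands in for the whole capability set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace. */
struct userns_model {
    const struct userns_model *parent; /* NULL for the initial namespace */
    unsigned int owner_uid;            /* uid in the parent that created it */
};

/* Hypothetical model of a task's credentials. */
struct task_model {
    const struct userns_model *ns; /* namespace the task lives in */
    unsigned int euid;             /* effective uid in that namespace */
    bool cap_sys_admin;            /* holds CAP_SYS_ADMIN in its own ns */
};

/*
 * Does task `t` have CAP_SYS_ADMIN toward namespace `target`?
 * Walk from the target up to the initial namespace:
 *  - reaching the task's own namespace: its own capability decides;
 *  - the task owning a namespace directly below its own: granted;
 *  - running out of ancestors: denied.
 */
bool ns_capable_model(const struct task_model *t,
                      const struct userns_model *target)
{
    const struct userns_model *ns;

    for (ns = target; ns != NULL; ns = ns->parent) {
        if (ns == t->ns)
            return t->cap_sys_admin;
        if (ns->parent == t->ns && ns->owner_uid == t->euid)
            return true;
    }
    return false;
}
```

With an initial namespace and one child, container root passes the check toward its own namespace (which is why that test is trivially satisfied after CLONE_NEWUSER) but fails it toward the initial namespace, which is the property the tty->session variant relies on.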

I've added an upstart task to the bug. After looking a bit more, it seems upstart tries to always open terminal devices with O_NOCTTY, so the tty ownership by init is likely unintentional and therefore a bug. I haven't been able to find where in upstart this is happening, but on the kernel side I can tell that it's due to an open() without O_NOCTTY. So while I think the kernel change makes sense, it seems like it's more of a workaround for a bug in upstart.

I figured out what's happening. lxc sets up /dev/kmsg as a symlink to /dev/console, init fopens kmsg, and suddenly it owns the console. Not sure whether the fix is to handle kmsg differently or special-case it in upstart to be opened with O_NOCTTY. I'll leave it to Serge and James to figure that out, and in the meantime I'll attend to the kernel patch.