Question

I would like to launch a Docker container which runs a process which may or may not clone itself.

Is it possible to set up a user with normal permissions plus CLONE_NEWUTS, so that the login user in my container does not have to be a superuser?

Initially this question was titled "CLONE_NEWUTS permission only", which is incorrect. @sourcejedi has answered in good faith below and improved my understanding considerably.

EDIT-1

I've found where the su flags are held: /usr/include/linux/sched.h. I expect the answer to be something along the lines of monkey patching a specific user's permissions on container create. I'm going to go along that route for now and see where it takes me.

EDIT-2

I found where the user/file capabilities can be set. I see I have a lot of reading to do, but I think that a specific permission (capability) can be given to a file (which will be executable in this case). From the capabilities manpage:
Starting with kernel 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute.
so, it should be possible to apply a specific capability (whose flag is found in the above header) to either a file or user. Which one, I'm not sure yet but it's getting pretty fun and deep to find this out.

EDIT-3

As @sourcejedi has pointed out, I have misinterpreted my needs. The information necessary is in man limits.conf, in which one may run a process at a specified user level, in this case root.

1 Answer

There is not a capability to specifically allow calling clone() with the CLONE_NEWUTS flag.

CLONE_NEWUTS is used to create and enter a new "UTS namespace". All of the namespace types require CAP_SYS_ADMIN to create, with one exception: The upstream Linux kernel allows unprivileged users to create and enter a new user namespace.
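As a rough way to check this on your own system, the following sketch calls unshare(2) with CLONE_NEWUTS in a forked child and reports the resulting errno. The helper name try_unshare is made up for this illustration; the flag values are the ones defined in linux/sched.h, and the libc call is reached through ctypes on the assumption that the platform's C library exports unshare.

```python
import ctypes
import os

# Flag values as defined in <linux/sched.h>.
CLONE_NEWUTS = 0x04000000
CLONE_NEWUSER = 0x10000000


def try_unshare(flags):
    """Call unshare(flags) in a forked child (hypothetical helper).

    Returns 0 on success, the errno value on failure, or -1 if the
    child did not exit normally. The calling process is left untouched.
    """
    pid = os.fork()
    if pid == 0:
        code = 255
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            rc = libc.unshare(flags)
            code = 0 if rc == 0 else ctypes.get_errno()
        finally:
            # Always leave the child here so it never runs the caller's code.
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Calling try_unshare(CLONE_NEWUTS) as an unprivileged user would be expected to return errno.EPERM, while try_unshare(CLONE_NEWUSER) may return 0 if the kernel and distribution allow unprivileged user namespaces. Sandboxes such as seccomp profiles can change either result.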

When you create a user namespace, you can allow yourself root / full capabilities inside that namespace, including CAP_SYS_ADMIN. If your system supports this, you can see it with unshare -r, which opens a root shell in a new user namespace.

The intended way for unprivileged users to use namespaces was from inside a new user namespace. However, some Linux distributions configure the kernel to disallow this feature.
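To make the idea concrete, here is a minimal sketch of roughly what unshare -r does, extended with a new UTS namespace so the hostname can be changed without real root. The function name and the choice to report errors through the child's exit code are assumptions of this example, not part of any tool. It only works where unprivileged user namespaces are enabled.

```python
import ctypes
import os
import socket

# Flag values as defined in <linux/sched.h>.
CLONE_NEWUTS = 0x04000000
CLONE_NEWUSER = 0x10000000


def _write(path, text):
    with open(path, "w") as f:
        f.write(text)


def sethostname_in_userns(name):
    """Set the hostname inside new user + UTS namespaces (hypothetical helper).

    Runs in a forked child. Returns 0 on success, an errno value on
    failure, or -1 if the child did not exit normally.
    """
    # IDs must be read before unshare(); afterwards they appear unmapped.
    uid, gid = os.geteuid(), os.getegid()
    pid = os.fork()
    if pid == 0:
        code = 255
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) != 0:
                code = ctypes.get_errno()
            else:
                # Map our own uid/gid to root inside the new namespace.
                _write("/proc/self/setgroups", "deny")
                _write("/proc/self/uid_map", "0 %d 1" % uid)
                _write("/proc/self/gid_map", "0 %d 1" % gid)
                # Only reached inside the new UTS namespace.
                socket.sethostname(name)
                code = 0
        except OSError as exc:
            code = exc.errno or 255
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

If sethostname_in_userns("sandbox") returns 0, the child changed its own hostname while socket.gethostname() in the parent still reports the original name. A non-zero result such as errno.EPERM suggests that user namespaces are restricted on that system.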

CAP_SYS_ADMIN is used as a catchall for anything without a more specific capability. It is far too powerful. You should assume it can be used to take over other programs and hence gain any other capability.[1]
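If you want to check whether a process inside your container actually holds CAP_SYS_ADMIN, one option is to read the CapEff line from /proc/<pid>/status. The sketch below parses that text; the function names are invented for this example, and 21 is the bit number of CAP_SYS_ADMIN in linux/capability.h.

```python
CAP_SYS_ADMIN = 21  # bit number from <linux/capability.h>


def effective_caps(status_text):
    """Return the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(status_text, cap):
    """True if capability bit `cap` is set in the effective set."""
    return bool((effective_caps(status_text) >> cap) & 1)
```

For the current process you could pass in the contents of /proc/self/status, e.g. has_cap(open("/proc/self/status").read(), CAP_SYS_ADMIN). A mask of 00000000a80425fb, which is what a container started with Docker's default capability set commonly shows, does not have bit 21 set.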

If unprivileged users could create all the types of namespace directly, it would raise issues where namespaces could be used to confuse a setuid program into performing privileged actions that it was not supposed to. Cross-reference: "Why does unsharing mount namespace require CAP_SYS_ADMIN?"

The other option is to use a helper executable with setuid/capabilities which only allows a specific task, much like sudo can be configured to allow running specific privileged commands only. This is the approach taken by bubblewrap, which is used by Flatpak.

The bubblewrap README also provides some references about the security concerns which caused Linux distributions to restrict user namespaces.

I think this story overlaps with the reasons that "Docker in Docker" is not really supported / is not possible without disabling important security features in the main Docker daemon. Although it is not quite the same.

[1] For example, CAP_SYS_ADMIN is the capability used to mount block filesystems, which kernel developers consider impossible to reliably secure against malicious FS images.

Inside a new user namespace, CAP_SYS_ADMIN does not allow you to mount block filesystems. But if you create a new mount namespace as well, e.g. with unshare -rm, CAP_SYS_ADMIN will allow you to create bind mounts, mount the proc filesystem, and, in kernel 4.18 or above, mount FUSE filesystems.

Docker also uses LSM-based security - SELinux or AppArmor - on systems where those are available. It's possible these layers could restrict CAP_SYS_ADMIN in some ways. This is much more obscure than Docker's other security layers. If you relied on the detailed workings of specific LSMs, that seems to defeat one of the points of building a convenient portable Docker container.

@MarkJL Unless you specifically configure Docker, it does not use user namespaces. Do not grant CAP_SYS_ADMIN to processes inside Docker without very careful consideration. For example, CAP_SYS_ADMIN is the capability used to mount block filesystems, which kernel developers consider impossible to reliably secure against malicious FS images. Inside a user namespace, CAP_SYS_ADMIN does not allow you to mount block filesystems.
– sourcejedi, Apr 28 '19 at 18:10

To be clear, do you mean there was a mistake in your question?
– sourcejedi, Apr 28 '19 at 19:18

@MarkJL You don't need to use CLONE_NEWUTS if all you want to do is clone a process. It looks like I guessed wrong about what questions to ask you :-). If you choose to ask a new question, you might try to include a little more redundancy, e.g. a reference for why you think CLONE_NEWUTS is required to call clone(). This can make it easier to distinguish when someone has a conceptual error from when they are not writing very clearly because they are not very good with English :-).
– sourcejedi, Apr 28 '19 at 20:57

@MarkJL I edited the answer a little more. I think you will find the bubblewrap doc interesting, but this answer is still not positive about how to achieve your goal :-).
– sourcejedi, Apr 28 '19 at 22:24