Category: development

I’ve spent some of the last few weeks working on a replacement for runC, the most widely used OCI runtime for running containers. It might not be very well known to end users, but it is a key component: every Docker container ultimately runs through runC.

Having containers run according to a common specification allows some pieces to be replaced without any difference in behavior.

The OCI runtime specs describe what a container looks like once it is running. For instance, they list all the mount points, the capabilities left to the process, the process that must be executed, the namespaces to create and so on.
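To make the list above more concrete, here is a trimmed sketch of what such a description can look like in an OCI bundle's config.json. The field names follow the OCI runtime specification, but the values (the /bin/true process, the single capability, the temporary bundle directory) are only illustrative assumptions, and a bundle that really runs usually needs more mounts and settings than shown here.

```shell
# Illustrative only: write a minimal OCI-style config.json into a scratch bundle.
# The bundle location and all values are assumptions made for this sketch.
bundle=$(mktemp -d)
mkdir -p "$bundle/rootfs"

cat > "$bundle/config.json" <<'EOF'
{
  "ociVersion": "1.0.0",
  "process": {
    "user": { "uid": 0, "gid": 0 },
    "args": ["/bin/true"],
    "cwd": "/",
    "capabilities": {
      "bounding": ["CAP_KILL"],
      "effective": ["CAP_KILL"],
      "permitted": ["CAP_KILL"]
    }
  },
  "root": { "path": "rootfs", "readonly": true },
  "mounts": [
    { "destination": "/proc", "type": "proc", "source": "proc" }
  ],
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "mount" },
      { "type": "network" }
    ]
  }
}
EOF

echo "bundle written to $bundle"
```

The process to execute, the capabilities it keeps, the mount points and the namespaces each map to one section of the file.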

While the rest of the container ecosystem is written in Go, from Docker to Kubernetes, I think that for such a low-level tool C still makes more sense. runC itself uses C for its lower-level tasks: it forks itself once the configuration is done and sets up the environment in C before launching the container process.

I’ve tried running a container that only runs /bin/true 100 times sequentially, and the results are quite good:

Most of the time spent running a container seems to go into the creation of the network namespace. I had expected some cost in the Go->C process handling, but I am surprised by the results when the network namespace is not used, as crun is almost twice as fast as runC.
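A comparison like this can be reproduced with a small timing loop. The sketch below is an assumption about how such a measurement could be scripted, not the exact commands used for the numbers above: run_n_times is a hypothetical helper, and the `true` command stands in for the real runtime invocation.

```shell
# Hypothetical helper: run a command N times in sequence, discarding its output.
# Stops and returns non-zero as soon as one run fails.
run_n_times() {
    n=$1
    shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 || return 1
        i=$((i + 1))
    done
}

# Placeholder workload: replace `true` with the runtime command being measured.
start=$(date +%s)
run_n_times 100 true
end=$(date +%s)
echo "100 runs took $((end - start)) seconds"
```

Whole-second resolution is coarse, but it is enough to see a difference once each run costs tens of milliseconds.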

It is still work in progress and not ready for production, but the upstream version of OpenShift Origin already has experimental support for running OpenShift Origin using system containers. The “latest” Docker images for origin, node and openvswitch, the 3 components we need, are automatically pushed to docker.io, so we can use these for our test. The rhel7/etcd system container image is instead pulled from the Red Hat registry.

The Vagrantfile will provision three virtual machines based on the `fedora/25-atomic-host` image. One machine will be used as the master, the other two as nodes. I am using static IPs for them so that it is easier to refer to them from the Ansible playbook and no DNS configuration is required.

The machines can finally be provisioned with vagrant as:

# vagrant up --provider libvirt

At this point you should be able to log in to the VMs as root using your ssh key:

for host in 10.0.0.{10,11,12};
do
ssh -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@$host "echo yes I could login on $host"
done
yes I could login on 10.0.0.10
yes I could login on 10.0.0.11
yes I could login on 10.0.0.12

Our VMs are ready. Let’s install OpenShift!

This is the inventory file used for openshift-ansible; store it in a file named origin.inventory:

The new configuration required to run system containers is quite visible in the inventory file. `use_system_containers=True` is required to tell the installer to use system containers, `system_images_registry` specifies the registry from where the system containers must be pulled.

And we can finally run the installer, using python3, from the directory where we forked openshift-ansible:

$ oc login --insecure-skip-tls-verify=false 10.0.0.10:8443 -u user -p OriginUser
Login successful.
You don't have any projects. You can try to create a new project, by running
oc new-project
$ oc new-project test
Now using project "test" on server "https://10.0.0.10:8443".
You can add applications to this project with the 'new-app' command. For example, try:
oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
to build a new example application in Ruby.
$ oc new-app https://github.com/giuseppe/hello-openshift-plus.git
--> Found Docker image 1f8ec11 (6 days old) from Docker Hub for "fedora"
* An image stream will be created as "fedora:latest" that will track the source image
* A Docker build using source code from https://github.com/giuseppe/hello-openshift-plus.git will be created
* The resulting image will be pushed to image stream "hello-openshift-plus:latest"
* Every time "fedora:latest" changes a new build will be triggered
* This image will be deployed in deployment config "hello-openshift-plus"
* Ports 8080, 8888 will be load balanced by service "hello-openshift-plus"
* Other containers can access this service through the hostname "hello-openshift-plus"
* WARNING: Image "fedora" runs as the 'root' user which may not be permitted by your cluster administrator
--> Creating resources ...
imagestream "fedora" created
imagestream "hello-openshift-plus" created
buildconfig "hello-openshift-plus" created
deploymentconfig "hello-openshift-plus" created
service "hello-openshift-plus" created
--> Success
Build scheduled, use 'oc logs -f bc/hello-openshift-plus' to track its progress.
Run 'oc status' to view your app.

After some time, we can see our service running:

oc get service
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hello-openshift-plus 172.30.204.140 8080/TCP,8888/TCP 46m

Are we really running on system containers? Let’s check it out on master and one node:

(The atomic command has a breaking change upstream, so with future versions of atomic we will need -f backend=ostree to filter system containers, as ostree is clearly not a runtime.)

bubblewrap is a sandboxing tool that allows unprivileged users to run containers. I was recently working on a way to allow unprivileged users to take advantage of bubblewrap to run regular system images that use systemd. To do so, it was necessary to modify bubblewrap to keep some capabilities in the sandbox.

Capabilities are the mechanism, available since Linux 2.2, that the kernel uses to split the power of root into a finer-grained set of permissions that each thread can have. Combined with Linux namespaces, it is safe to let unprivileged users hold some of them. For example, CAP_SETUID, which allows the calling process to manipulate process UIDs, is safe to use in a new user namespace, as the set of permitted UIDs is restricted to those that exist in the new user namespace.
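As a small illustration of how these per-thread sets are exposed, the kernel publishes them as hexadecimal bitmasks in /proc/PID/status (the CapInh, CapPrm, CapEff and CapBnd lines), and CAP_SETUID is capability number 7. The sketch below checks whether that bit is set in the effective set of the current shell; has_cap is a hypothetical helper written for this example.

```shell
# Hypothetical helper: succeed if bit number $2 is set in the hexadecimal
# capability mask $1 (as printed in the Cap* lines of /proc/PID/status).
has_cap() {
    mask=$1
    bit=$2
    [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]
}

CAP_SETUID=7   # capability number as defined in <linux/capability.h>

# Inspect the effective set of the current shell, where /proc is available.
if [ -r "/proc/$$/status" ]; then
    eff=$(awk '/^CapEff:/ { print $2 }' "/proc/$$/status")
    if has_cap "$eff" "$CAP_SETUID"; then
        echo "CAP_SETUID is in the effective set ($eff)"
    else
        echo "CAP_SETUID is not in the effective set ($eff)"
    fi
fi
```

Run as a regular user outside any sandbox, the effective set is normally empty; run inside a new user namespace that keeps the capability, the same check should succeed.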

The changes required in bubblewrap are not yet merged upstream. In the rest of this post I will refer to the modified bubblewrap simply as bubblewrap.

The set of capabilities that bubblewrap leaves in the process is controlled with --cap-add; new namespaces are required to use these caps. The special value ALL adds all the caps that are allowed by bubblewrap.

A development version of systemd is required to run in the modified bubblewrap. There are patches in systemd upstream that allow systemd to run without requiring CAP_AUDIT_* and to not fail when setgroups is disabled, as is the case when running inside bubblewrap (to address CVE-2014-8989). The setgroups restriction may be lifted in the future in some cases; this is still under discussion.

For my tests, I’ve used Docker to compose the container. The following Dockerfile contains no metadata directives, as they are not used anyway when exporting the rootfs.

To install the latest systemd, once you’ve cloned its repository, from the source directory you can simply do:

./autogen.sh
make -j $(nproc)
# $rootfs must be set to the path of the container rootfs
make install DESTDIR="$rootfs"

to install it in the container rootfs.

If the files /etc/subuid and /etc/subgid are present, the first interval of additional UIDs and GIDs for the unprivileged user invoking bubblewrap is used to set the additional users and groups available in the container. This is required for the system users needed for systemd.
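To see which interval would be picked, the first matching entry can be looked up by hand. Each line of /etc/subuid and /etc/subgid has the form name:start:count. The sketch below uses a hypothetical helper, first_subid_range, and only matches entries by user name; it is an approximation of the lookup described above, not bubblewrap's actual code.

```shell
# Hypothetical helper: print "start count" for the first entry belonging to
# user $1 in a subuid/subgid-style file $2 (lines look like name:start:count).
first_subid_range() {
    user=$1
    file=$2
    awk -F: -v u="$user" '$1 == u { print $2, $3; exit }' "$file"
}

# Show the first additional UID and GID intervals for the current user, if any.
for f in /etc/subuid /etc/subgid; do
    if [ -r "$f" ]; then
        echo "$f: $(first_subid_range "$(id -un)" "$f")"
    fi
done
```

An entry such as name:100000:65536 means that 65536 IDs starting at 100000 are delegated to that user.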

At this point, everything is in place and we can use bubblewrap to run the new container as an unprivileged user:

Every programmer at some point comes across the Brainfuck programming language and is surprised by how few instructions are needed for a Turing-complete language: 6 in the case of Brainfuck (plus 2 more for I/O operations).

I have recently found an old project of mine that I used to learn how to write a GCC frontend; it took a while to adapt it to work with a newer GCC version. The code is available on github. The only positive side of this project, if any, is that it can easily be used as a starting point for adding a frontend to GCC or, in this case, to compile a Brainfuck interpreter written in Brainfuck!

I don’t remember how I got to this code, except that I helped myself with some C preprocessor macros, and I remember one important spec: the input is NUL terminated. Looking at the code is not very helpful. This is probably one of the cases where the compiled version is more understandable than the code itself.

Assuming you already have compiled GCC with the brainfuck frontend (there are instructions on the github project page on how to do it) and that you are able to compile brainfuck files:

$ gcc brainfuck-interpreter.bf -o brainfuck-interpreter

You should now have an executable in the current directory: brainfuck-interpreter. It can be used to interpret a simpler program; let’s try the usual “Hello World!”. The code is short enough that we can feed it straight from stdin to the interpreter. The terminating NUL byte is very important, or the interpreter will just crash. I/O for Brainfuck doesn’t handle errors and EOF 🙂
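One way to produce such NUL-terminated input from the shell is printf with an octal escape. The sketch below is an assumption about how the input could be prepared, not the original command: nul_terminate is a hypothetical helper, and the sample program is a tiny one that outputs the letter A in standard Brainfuck (8 times 8, plus 1, gives 65).

```shell
# Hypothetical helper: print the program text $1 followed by a single NUL byte.
nul_terminate() {
    printf '%s\000' "$1"
}

# A tiny Brainfuck program used only as sample input.
program='++++++++[>++++++++<-]>+.'

# Inspect the bytes that would be sent; the last one should be \0.
nul_terminate "$program" | od -c

# With the compiled interpreter in the current directory, the same bytes
# could be piped to it:
# nul_terminate "$program" | ./brainfuck-interpreter
```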

If everything works as expected, the image should be ready after that command and we can run it as:

sudo docker run --rm -ti emacs

Repeating the same command twice won’t have any effect, unless --force is specified, if there is no new OStree commit available. ostree-docker-builder stores this information in the image itself using a Docker LABEL.

Tagging

Another feature is the automatic tagging of images: when --tag is specified, the built image will be tagged with the name provided as argument to --tag and automatically pushed to the configured Docker registry.

Advantages of ostree-docker-builder

There are mainly two advantages in using ostree-docker-builder instead of a Dockerfile:

The same tool to generate both the OS image and the containers

Use OStree to track what files were changed, added or removed. If there are no differences then no image is created

Special thanks to Colin Walters for his suggestions while I was experimenting with ostree-docker-builder and on how to take advantage of the OStree checksum.

Coming as a surprise, this year we have got 4 students to work full-time on wget during the summer. That is more than all the students who have ever worked on wget during previous Summers of Code combined!

The accepted projects cover different areas: security, testing, new protocols and some speed-up optimizations. Our hope is that we will be able to use the new pieces as soon as possible; this is why we ask students to keep their code always rebased on top of the current wget development version.

Improve Wget’s security

The project aims at adding HSTS support to wget and enhancing FTP security through FTPS.

Speed up Wget’s Download Mechanism

Support two performance enhancements: conditional GET requests and TCP Fast Open.

At this point, modify ./fedora-cloud-atomic.ks to point to our OStree repository.

This is how I modified the file to point to the OStree repo accessible at http://192.168.125.225:37375/. Use the correct settings for your machine, and the port used to serve the OStree repository that we noted before.