This is part one of a series of blog posts detailing how docker creates containers.
We will dig deep into the various pieces that are stitched together to see what
it takes to make docker run ... awesome.

First, what is a container?

I think the various pieces of technology that goes into creating a container is fairly
commonplace. You should have seen a few cool looking charts in various presentations
about Docker where you get a quick "Docker uses namespaces, cgroups, chroot, etc."
to create containers. But why does it take all these pieces to create a contaienr?
Why is it not a simple syscall and it's all done for me?
The fact is that container's don't exist, they are made up. There is no such
thing as a "linux container" in the kernel. A container is a userland concept.

Namespaces

In part one I'll talk about how to create Linux namespaces in the context of how they
are used within docker. In later posts we will look into how namespaces are combined
with other features like cgroups and an isolated filesystem to create something useful.

First off we need a high level explanation of what a namespace does and why it's useful.
Basically, a namespace is a scoped view of your underlying Linux system. There are a
few different types of namespaces implemented inside the kernel. As we dig into
each of the different namespaces below you can follow along by running
docker run -it --privileged --net host crosbymichael/make-containers.
This has a few preloaded files and configuration to get your started. Even though we
will be creating namespaces inside an container that docker runs for us, don't let that
trip you up. I opted for this approach as providing a container preloaded with all
the dependencies that you need to run the examples is why we are doing this in the
first place. To make things a little easier, I'm using the --net host flag so that
we are able to see your host's network interfaces within our demo container. This will
be useful in the network examples. We also need to provide the --privilged flag so that
we have the correct permissions to create new namespaces within our container.

If you are interested in what the Dockerfile looks like then here it is:

I'll be doing the examples in C, as it's sometimes easier
to explain the lower level details better than the abstractions that Go provides.
So lets start...

NET Namespace

The network namespaces provides your own view of the network stack of your system. This
can include your very own localhost. Make sure you are in the crosbymichael/make-containers
and run the command ip a to view all the network interfaces of your host machine.

Ok cool, so this is all the network interfaces currently on my host system. Yours may look
a little different but you get the idea. Now let's write some code to create a new network
namespace. For this we will write a skeleton of a small C binary that uses the clone syscall. We will
start by using clone to run binaries that are already installed inside our demo container.
The file skeleton.c should be in the working directory of the demo container.
We will use this file as the basis of all our examples. Here is the code incase you don't feel
like running the container right now.

#define _GNU_SOURCE#include <stdio.h>#include <stdlib.h>#include <sched.h>#include <sys/wait.h>#include <errno.h>#define STACKSIZE (1024*1024)staticcharchild_stack[STACKSIZE];structclone_args{char**argv;};// child_exec is the func that will be executed as the result of clonestaticintchild_exec(void*stuff){structclone_args*args=(structclone_args*)stuff;if(execvp(args->argv[0],args->argv)!=0){fprintf(stderr,"failed to execvp argments %s\n",strerror(errno));exit(-1);}// we should never reach here!exit(EXIT_FAILURE);}intmain(intargc,char**argv){structclone_argsargs;args.argv=&argv[1];intclone_flags=SIGCHLD;// the result of this call is that our child_exec will be run in another// process returning it's pidpid_tpid=clone(child_exec,child_stack+STACKSIZE,clone_flags,&args);if(pid<0){fprintf(stderr,"clone failed WTF!!!! %s\n",strerror(errno));exit(EXIT_FAILURE);}// lets wait on our child process here before we, the parent, exitsif(waitpid(pid,NULL,0)==-1){fprintf(stderr,"failed to wait pid %d\n",pid);exit(EXIT_FAILURE);}exit(EXIT_SUCCESS);}

This is a small C binary that will allow you to run processes
like ./a.out ip a. It uses the arguments that you pass on the cli as the arguments to whatever
process you want to use. Don't worry about the specific implementation too much as it's the
changes we will be making that are the interesting aspects. Remember, this will execute the
binary and arguments of whatever program you want, this means if you want to run one of these demos
below and have it spawn a shell session so that you can poke around in your new namespace then go
ahead. It is a great way to explore and inspect these different namespaces at your own pace.
So to get started let's make a copy of this file to start working with the network namespace.

> cp skeleton.c network.c

Ok, within this file there is a very
special var called clone_flags. This is where most of our changes will happen throughout this
post. Namespaces are primarily controlled via the clone flags. The clone flag for the network
namespace is CLONE_NEWNET. We need to change the line in the file int clone_flags = SIGCHLD; to
int clone_flags = CLONE_NEWNET | SIGCHLD; so that the call to clone creates a new network namespace
for our process. Make this change in network.c then compile and run.

The result of this run now looks very different from the first time we ran the ip a
command. We only see a loopback interface in the output. This is because our process
that was created only has a view of its network namespace and not of the host.
And that's it. That is how you create a new network namespace.

Right now this is pretty useless as you don't have any usable
interfaces. Docker uses this new network namespace to setup a veth
interface so that your container has it's own ip address
allocated on a bridge, usually docker0.
We won't go down the path of how to setup interfaces in namespaces at this time. We can save
that for another post.

So now that we know how a network namespace is created lets look at the mount namespace.

MNT Namespace

The mount namespace gives you a scoped view of the mounts on your system. It's often
confused with jailing a process inside a chroot or similar. This is not true!
The mount namespaces does not equal a filesystem jail. So the next time you hear someone
say that a container uses the mount namespace to "jail" the process inside it's own root
filesystem you can call bullshit because they don't know what they are talking about. Do it, it's fun :)

Let's start by making a copy of skeleton.c again for our mount related changes.
We can do a quick build
and run to see what our current mount points looks like with the mount command.

This is what the mount points look like from within my demo container, yours may look different.
In order to create a new mount namespace we use the flag CLONE_NEWNS. You may notice something may look
weird here. Why is this flag not CLONE_NEWMOUNT or CLONE_NEWMNT? This is because the mount
namespace was the first Linux namespace introduced and the name was just an undersight.
If you write code you will understand that as you start building a feature or an application,
you don't often have a full picture of the end result. Anyway, let's add CLONE_NEWNS to our
clone_flags variable. It should look something like int clone_flags = CLONE_NEWNS | SIGCHLD;.

Lets go ahead and build mount.c again and run the same command.

> cp skeleton.c mount.c
> gcc -o mount mount.c
> ./mount mount

Nothing changed. Whaaat??? This is because the process that we run inside the new mount
namespace still has a view on /proc of the underlying system. The result is that the
new process sort of inherits a view on the underlying mounts.
There are a few ways that we can prevent this, like using pivot_root, but we will leave
that for an additional post on filesystem jails and how chroot / pivot_root interact with
the container's mount namespace.

However, one way that we can try out our new mount namespace is to, well, mount something.
Let's create a new tmpfs mount in /mytmp for this demo. We will do this mount
in C and continue to run our same mount command as the args to our mount binary. In order to do a mount
inside of our mount namespace we need to add the code to the child_exec function,
before the call to execvp. The code within the child_exec function is run inside the newly
created process, i.e., inside our new namespace. The code in child_exec should look like this:

// child_exec is the func that will be executed as the result of clonestaticintchild_exec(void*stuff){structclone_args*args=(structclone_args*)stuff;if(mount("none","/mytmp","tmpfs",0,"")!=0){fprintf(stderr,"failed to mount tmpfs %s\n",strerror(errno));exit(-1);}if(execvp(args->argv[0],args->argv)!=0){fprintf(stderr,"failed to execvp argments %s\n",strerror(errno));exit(-1);}// we should never reach here!exit(EXIT_FAILURE);}

We need to first create a directory /mytmp before we compile and run with the changes above.

I cut out the common output above from the first time I ran mount.
The result is that you should see a new mount point for our tmpfs mount.
Nice! Go ahead and run mount in the current shell just for comparison.
Notice how the tmpfs mount is not displayed? That is because we created the mount inside
our own mount namespace, not in the parent's namespace.

Remember how I said that the mount namepaces does not equal a filesystem jail? Go ahead and run our
./mount binary with the ls command. Everything is there. Now you have proof!

UTS Namespace

The next namespace is the UTS namespace that is responsible for system identification. This includes the
hostname and domainname. It allows a container to have it's own hostname independently
from the host system along with other containers. Let's start by making a copy of skeleton.c and running
the hostname command with it.

> cp skeleton.c uts.c
> gcc -o uts uts.c
> ./uts hostname
development

This should display your system's hostname (development in my case). Like earlier, let's add
the clone flag for the UTS namespace to the clone_flags variable. The flag should be CLONE_NEWUTS.
If you compile and run then you should see the exact same output. This is totally fine.
The values in the UTS namespace are inherited from the parent.
However, within this new namespace, we can change the hostname without it affecting the parent or other
container's that have a separate UTS namespace.

Let's modify the hostname in the child_exec function. To do that, you will need to add
the #include <unistd.h> header to gain access to the sethostname function, as well as
the #include <string.h> header to use strlen needed by sethostname.
The new body of the child_exec function should look like the following:

// child_exec is the func that will be executed as the result of clonestaticintchild_exec(void*stuff){structclone_args*args=(structclone_args*)stuff;constchar*new_hostname="myhostname";if(sethostname(new_hostname,strlen(new_hostname))!=0){fprintf(stderr,"failed to execvp argments %s\n",strerror(errno));exit(-1);}if(execvp(args->argv[0],args->argv)!=0){fprintf(stderr,"failed to execvp argments %s\n",strerror(errno));exit(-1);}// we should never reach here!exit(EXIT_FAILURE);}

Ensure that clone_flags within your main look like int clone_flags = CLONE_NEWUTS | SIGCHLD; then
compile and run the binary with the same args. You should now see the value that we set returned
from the hostname command.
To verify that this change did not affect our current shell go ahead and run hostname and make
sure that you have your original value back.

> gcc -o uts uts.c
> ./uts hostname
myhostname
> hostname
development

Awesome! We are doing good.

IPC Namespace

The IPC namespace is used for isolating interprocess communication, things like SysV message queues.
Let's make a copy of skeleton.c for this namespace.

> cp skeleton.c ipc.c

The way we are going to test the IPC namespace is by creating a message queue on the host, and
ensuring that we cannot see it when we spawn a new process inside it's own IPC namespace.
Let's first create a message queue in our current shell then compile and run our copy of the
skeleton code to view the queue.

Without a new IPC namespace you can see the same message queue that was created.
Now let's add the CLONE_NEWIPC flag to our clone_flags var to create a new IPC namespace for
our process.
The clone_flags var should look like int clone_flags = CLONE_NEWIPC | SIGCHLD;.
Recompile and run the same command again:

Done! The child process is now in a new IPC namespace and has completely separate view
and access to message queues.

PID Namespace

This one is fun. The PID namespace is a way to carve up the PIDs that one process can view and
interact with. When we create a new PID namespace the first process will get to be the loved PID 1.
If this process exits the kernel kills everyone else within the namespace.
Let's start by making a copy of skeleton.c for our changes.

> cp skeleton.c pid.c

To create a new PID namespace, we will have to set the clone_flags with CLONE_NEWPID.
The variable should look like int clone_flags = CLONE_NEWPID | SIGCHLD;. Let's test
by running ps aux in our shell and then compile and run our pid.c binary with
the same arguments.

WTF??? We expected ps aux to be PID 1 or atleast not see any other pids from the parent.
Why is that? proc. The process that we spawned still has a view of /proc from the parent, i.e.
/proc mounted on the host system. So how do we fix this? How to we ensure that our new process
can only view pids within it's namespace? We can start by remounting /proc.
Because we will be dealing with mounts, we can take the opportunity to take what we learned from the
MNT namespace and combine it with our PID namespace so that we don't mess with the /proc of our
host system.

We can start by including the clone flag for the mount namespace along side the
clone flag for pid. It should look something like
int clone_flags = CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;. We need to edit the
child_exec function and remount proc. This will be a simple unmount and mount syscall
for the proc filesystem. Because we are creating a new mount namespace we know that this will
not mess up our host system. The result should look like this:

Perfect! Our new PID namespace is now fully operational with the help of the mount namespace!

USER Namespace

The last namespace is the user namespace. This namespace is the new kid on the block and allows you to have
users within this namespace that are not equal users outside of the namespace. This is accomplished
via GID and UID mappings.

This one has a simple demo application without specifying a mapping, even though it's a totally
useless demo. If we add the flag CLONE_NEWUSER to our clone_flags then run something like id or ls -la you will notice that we get nobody within the user namespace. This is because the current user is undefined
right now.

This is a very simple example of the user namespace but you can go much deeper with it. We will
save this for another post but the idea and hopes for the user namespace is that this will allow
us to run as "root" within the container but not as "root" on the host system. Don't forget you
can always change ls -la to bash and have a shell inside the namespace to poke around and learn
more.

In the end...

So to recap we went over the mount, network, user, PID, UTS, and IPC Linux namespaces.
The majority of the code that we changed was not much, just adding a flag most of the time.
The "hard work" is mostly managing the interactions between the various kernel subsystems
in order to meet our requirements. Like most
of the descriptions before this, namespaces are just one of the tools that we use to make a
container. I hope the PID example is a glimpse of how we use multiple namespaces together
in order isolate and begin the creation of a container.

In future posts we will go into detail on how we jail the container's processes inside a root filesystem,
aka a docker image, as well as using cgroups and Linux capabilities. By the end we should be able
to pull all these things together to create a container.

Also a thanks to tibor and everyone that helps review my brain dump of a first draft ;)

Other articles

Docker is a large project that touches many different layers of the system. Everything from a REST api to low level filesystem and execution calls. Docker is also open source so where are all the awesome packages that make docker dock? Why do the docker developers not contribute back reusable ...

1: Don't boot init

For me, my favorite feature of Go is the static binary. I love compiling once and just dropping the binary into a container running the 2mb busybox image and having my app up and running. No pip, no dlls, no problem. However, this does come with a small price. If ...

The proper way to do port allocation in docker is to EXPOSE a private port in a container and let docker map the public port on the host. This can be a pain for reverse proxies, load balances, etc but with the events endpoint in docker you can listen for ...

1: Use the cache

So we finally have the new docker logo but I have not seen any wallpapers or anything else. The image posted is not very high resolution so I did some quick editing to fill in the background. All credit goes to the original post and author, I just added more ...

Docker is an amazing project. There is plenty of great documentation and tutorials online to get you up and running with docker, building images, and deploying your apps. There is not much that I can say that others have not said better. Therefore, I am going to start a series ...

In the past I have not been a huge fan of "the cloud" for all of the work that I need to do. I still prefer native over web apps but I also like the convenience of having my data accessible everywhere. However, over the past few months I have ...

Go's rpc package has an example for the default rpc implementation but nothing for the jsonrpc package. Here is how to get it work. The rpc example uses some helpers built into the rpc package, I just looked at that code in Go's stdlib to figure out the ...

If you don't know about Go it's a programming language for now. Most of the languages that we use everyday are old, 10+ years old. After so many years languages need to start to think about news ways to solve the same problems. Go is a new language ...

Working with the SecureString class in .NET is fairly easy. This class allows you to keep string data encrypted in memory. Maybe you load connection strings, passwords, or financial data into memory when your application starts or loads an object so the least you can do is use the SecureString ...

Servers

Running your own infrastructure is awesome, if your a hacker like me. If your a square and you don't want to mess with your own hardware you have two choices if you are going to host your website or services online. If you don't feel like spending ...

Python is my favorite language to work with. It is so versatile. I can automate tasks, write quick scripts, process data, and even run a web application. It is very portable and I love the syntax. For a web framework I use web.py because it stays out of your ...

I have been working on two iOS applications over the past few days and have been longing for a LINQ like syntax in Objective-C. LINQ is an amazing way to work with collections and it really adds a productivity boost to your coding. Since Objective C does not have LINQ ...

Design patterns can help you solve complex coding problems with proven solutions. This will be a mini series about different design pattern implementations in Objective - C. I will be using a simple console app and updating the code on github with each new pattern.

Factory Pattern

CodeAssistant is a simple tool to test code snippets in virtually any language on Windows. It has been in development for four months and has been a great learning experience for domain driven development and .NET's WPF.

From my point of view the reason why we create multithreaded applications is so the user experience does not suffer. We have operations that need completed but running these on the main thread will block the UI and piss off our users.

I just puked up some old code for a little app that displays your iCal events in the terminal. I have it as one of my startup scripts so when I start a new bash session I see my events for the next 7 days. It's a very simple ...

If you use Terminal on your Mac like I do. It is the first application that is run at starup and the last to close. When you start Terminal in a new window or tab that is a great time to review your computer stats, iCal events, and disk sizes ...

The Problem

So like many people my life is stored in my gmail. I trust google to keep my data safe but since so many things, from contacts to many different online accounts, are all registered under your gmail address it is important to have that data whenever you need it. There ...

So this is much easier if you have a static site that is generated using pelican or jerkyll. Both of these apps are great for generating blogs out of markdown or other simple text markup language. I use pelican because it is built with python and that is what I ...

The thing about dvsc that makes me nervous is that all the code and revision history is on your local computer. This makes it very fast when committing, branching, and merging but what if you have a drive failure or worse? All your code is gone if you don't ...

Everyone needs a fast instance in a datacenter to work with. Amazon's EC2 or any other VPS provider will give you access. Sometimes you need to transfer files around fast or do some data processing. Having a fast instance will help you get your job done faster.