My goal with these posts has been to focus on the primitives and to show how a Kubernetes cluster handles networking internally as well as how it interacts with the upstream or external network. Now that we’ve seen that, I want to dig into a networking plugin for Kubernetes – Calico. Calico is interesting to me as a network engineer because of the wide variety of functionality that it offers. To start with though, we’re going to focus on a basic installation. To do that, I’ve updated my Ansible playbook for deploying Kubernetes to incorporate Calico. The playbook can be found here. If you’ve been following along up until this point, you have a couple of options.

Rebuild the cluster – I emphasized when we started all this that the cluster should be designed exclusively for testing. Starting from scratch is always the best in my opinion if you’re looking to make sure you don’t have any lingering configuration. To do that you can follow the steps here up until it asks you to deploy the KubeDNS pods. You need to deploy Calico before deploying any pods to the cluster!

Download and rerun the playbook – This should work as well but I’d encourage you to delete all existing pods before doing this (even the ones in the kube-system namespace!). There are configuration changes that occur on both the master and the minion nodes, so once the playbook has run you’ll want to make sure that all the services have been restarted. The playbook should do that for you but if you’re having issues check there first.

Regardless of which path you choose, I’m going to assume from this point on that you have a fresh Kubernetes cluster which was deployed using my Ansible role. Using my Ansible role is not a requirement, but it does some things for you which I’ll explain along the way, so no worries if you aren’t using it. The goal of this post is to talk about Calico; the lab being used is just a detail if you want to follow along.

So now that we have our lab sorted out – let’s talk about deploying Calico. One of the nicest things about Calico is that it can be deployed through Kubernetes. Awesome! The recommended way to deploy it is to use the Calico manifest which they define over on their site under the Standard Hosted Installation directions. If you’re using my Ansible role, a slightly edited version of this manifest can be found on your master in /var/lib/kubernetes/pod_defs. Let’s take a look at what it defines…

That’s a lot so let’s walk through what the manifest defines. The first thing the manifest defines is a config-map that Calico uses to define high level parameters about the Calico installation. Calico relies on an etcd key-value store for some of its functions so this is where we define the location of that. In this case, I’m using the same one that I’m using for Kubernetes. Again – this is a lab – they don’t recommend you do that in non-lab environments. So in my case, I point the etcd_endpoints parameter to the host ubuntu-1 on port 2379. Since we’re using cert based auth for etcd I also need to tell Calico where the certs are for that. To do that you just need to un-comment lines 46-48 in the config-map. Do not change these values on the assumption that they need to point at a real file location on the host!

The second item the manifest defines is a Kubernetes secret which we populate with the etcd TLS information if we’re using it. We are, so we need to populate these fields (lines 46-48) with base64 encoded versions of each of these items. Again – this is something that Ansible will do for you if you use my role. If not, you need to manually insert the values (I removed them from the file just to save space). We haven’t talked about secrets specifically but they are a means to share secret information with objects inside the Kubernetes cluster.
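If you are filling in the secret by hand, the encoding step can be sketched in a few lines of Python. This is only an illustration of the base64 step described above; the function names and the sample bytes are my own placeholders, and the file paths you pass in depend on where your etcd certs live.

```python
import base64
from pathlib import Path


def encode_for_secret(raw: bytes) -> str:
    """Return the single-line base64 string that a Secret data field expects."""
    return base64.b64encode(raw).decode("ascii")


def encode_file(path: str) -> str:
    """Read a cert or key file from disk and base64-encode its contents."""
    return encode_for_secret(Path(path).read_bytes())


if __name__ == "__main__":
    # Placeholder bytes, used only to show the shape of the output.
    print(encode_for_secret(b"example certificate bytes"))
```

Calling encode_file once for the CA, once for the cert, and once for the key gives the three strings to paste into the secret.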

The third item the manifest defines is a daemon-set. Daemon-sets are a means to deploy a specific workload to every Kubernetes node or minion. So say I had a logging system that I wanted on each system. Deploying it as a daemon-set allows Kubernetes to manage that for me. If I join a new node to the cluster, Kubernetes will start the logging system on that node as well. So in this case, the daemon-set is for Calico and consists of two containers. The node container is the brains of the operation and does most of the heavy lifting. This is also where we changed the CALICO_IPV4POOL_CIDR parameter from the default to 10.100.0.0/16. This is not required but I wanted to keep the pod IP addresses in that subnet for my lab. The install-cni container takes care of creating the correct CNI definitions on each host so that Kubernetes can consume Calico through a CNI plugin. Once it completes this task it goes to sleep and never wakes back up. We’ll talk more about the CNI definitions below.

The fourth and final piece of the manifest defines the Calico policy controller. We won’t be talking about that piece of Calico in this post so just hold tight on that one for now.

First we notice that the eth0 interface is actually one end of a VETH pair. We see that its peer is interface index 5, which on the host is an interface called cali182f84bfeba. So the container’s network namespace is connected back to the host using a VETH pair. This is very similar to how most container networking solutions work with one minor change. The host side of the VETH pair is not connected to a bridge. It just lives by itself in the default or root namespace. We’ll talk more about the implications of this later on in this post. Next we notice that the pod received an IP address of 10.100.163.129. This doesn’t seem unusual since that falls within the pod CIDR we had defined in previous labs, but if you look at the kube-controller-manager service definition, you’ll notice that we no longer configure that option…

Notice that the --cluster-cidr parameter is missing entirely and that the --allocate-node-cidrs parameter has been changed to false. This means that Kubernetes is no longer allocating pod CIDR networks to the nodes. So how are the pods getting IP addresses now? The answer to that lies in the kubelet configuration…

Our --network-plugin setting changed from kubenet to cni. This means that we’re using native CNI in order to provision container networking. When doing so, Kubernetes acts as follows…

The CNI plugin is selected by passing Kubelet the --network-plugin=cni command-line option. Kubelet reads a file from --cni-conf-dir (default /etc/cni/net.d) and uses the CNI configuration from that file to set up each pod’s network. The CNI configuration file must match the CNI specification, and any required CNI plugins referenced by the configuration must be present in --cni-bin-dir (default /opt/cni/bin).
If there are multiple CNI configuration files in the directory, the first one in lexicographic order of file name is used.
In addition to the CNI plugin specified by the configuration file, Kubernetes requires the standard CNI lo plugin, at minimum version 0.2.0
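The "first one in lexicographic order" rule quoted above is easy to trip over if more than one file ends up in the directory, so here is a rough Python sketch of the selection. It is a model of the documented behavior, not the kubelet’s code; the list of accepted extensions and the example file names are assumptions on my part.

```python
import os
from typing import Iterable, Optional

# Assumption: the extensions treated as CNI configuration files.
CONF_EXTENSIONS = (".conf", ".conflist", ".json")


def pick_cni_conf(filenames: Iterable[str]) -> Optional[str]:
    """Return the config file that sorts first by name, or None if there is none."""
    candidates = sorted(n for n in filenames if n.endswith(CONF_EXTENSIONS))
    return candidates[0] if candidates else None


def pick_from_dir(conf_dir: str = "/etc/cni/net.d") -> Optional[str]:
    """Apply the same rule to a real directory listing."""
    return pick_cni_conf(os.listdir(conf_dir))


if __name__ == "__main__":
    # Hypothetical directory contents; only the first two are config files.
    listing = ["99-loopback.conf", "10-calico.conf", "calico-kubeconfig", "calico-tls"]
    print(pick_cni_conf(listing))  # prints 10-calico.conf
```

This is why config files are usually given a numeric prefix: a leftover file from an earlier lab with a lower number would win over the Calico one.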

Since we didn’t specify --cni-conf-dir or --cni-bin-dir the kubelet will look in the default path for each. So let’s check out what’s in the --cni-conf-dir (/etc/cni/net.d) now…

As we can see from the log of the container on each host, the CNI container created the binaries if they didn’t exist (these may have already existed if you were using the previous lab build). It then created the CNI policy and the associated kubeconfig file for CNI to use. It also created the /etc/cni/net.d/calico-tls directory and placed the certs required to talk to etcd in that directory. It got this information from the Kubernetes secret /calico-secrets which is really the information from the secret calico-etcd-secrets that we created in the Calico manifest. The secret just happens to be mounted into the container as calico-secrets. The CNI definition also specifies that a plugin of calico should be used, which we’ll find does exist in the /opt/cni/bin directory. It also specifies an IPAM plugin of calico-ipam, meaning that Calico is also taking care of our IP address assignment. One other interesting thing to point out is that the CNI definition lists the information required to talk to the Kubernetes API. To do this, it’s using the default pod token. If you’re curious how the pods get the token to talk to the API server check out this piece of documentation that talks about default service accounts and credentials in Kubernetes. Lastly – the install-cni container created a kubeconfig file which specifies some further Kubernetes connectivity parameters.
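To make the link between the CNI definition and the binaries concrete, here is a small Python sketch that reads a CNI config and lists which plugin binaries it expects to find. The top-level type field and the ipam.type field come from the CNI specification; the sample config below is a stripped-down illustration that I made up, not a copy of the file install-cni writes.

```python
import json
from typing import Iterable, List

# Illustrative config only. The network name is a placeholder and the real
# Calico file carries additional settings that are left out here.
SAMPLE_CONF = """
{
  "name": "example-network",
  "type": "calico",
  "ipam": {"type": "calico-ipam"}
}
"""


def plugin_binaries(conf_text: str) -> List[str]:
    """Return the main plugin name followed by the IPAM plugin name, if any."""
    conf = json.loads(conf_text)
    names = [conf["type"]]
    ipam_type = conf.get("ipam", {}).get("type")
    if ipam_type:
        names.append(ipam_type)
    return names


def missing_binaries(conf_text: str, bin_dir_listing: Iterable[str]) -> List[str]:
    """Return the plugins the config names that are absent from the bin directory."""
    present = set(bin_dir_listing)
    return [name for name in plugin_binaries(conf_text) if name not in present]


if __name__ == "__main__":
    print(plugin_binaries(SAMPLE_CONF))
```

Feeding missing_binaries the output of os.listdir on /opt/cni/bin is a quick way to confirm that both names the definition refers to are actually installed on a node.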

So running the Calico manifest did quite a lot for us. Each node has the Calico CNI plugins and the means to talk to the Kubernetes API. So now that we know that Calico is driving the IP address allocation for the pods, what about the actual networking side of things? Let’s take a closer look at the routing for the net-test container…

Nothing matching that IP address here. So what’s going on? How can a container route to an IP that doesn’t exist? Let’s walk through what’s happening. Some of you reading this might have noticed that 169.254.1.1 is an IPv4 link-local address. The container has a default route pointing at a link-local address, meaning that the container expects this IP address to be reachable on its directly connected interface, in this case, the container’s eth0 interface. The container will attempt to ARP for that IP address when it wants to route out through the default route. Since our container hasn’t talked to anything yet, we have the opportunity to attempt to capture its ARP request on the host. Let’s set up a TCPDUMP on the host ubuntu-3 and then use kubectl exec on the master to try talking to the outside world…

In the top output you can see we have the container send a single ping to 4.2.2.2. This will surely follow the container’s default route and cause it to ARP for its gateway at 169.254.1.1. In the bottom output you see the capture on the host ubuntu-3. Notice we did the capture on the interface cali182f84bfeba which is the host side of the VETH pair connecting the container back to the root or default network namespace on the host. In the output of the TCPDUMP we see the container with a source of 10.100.163.129 send an ARP request. The reply comes from 2e:7e:32:de:8c:a3 which, if we reference the above output, we’ll see is the MAC address of the host side of the VETH pair, cali182f84bfeba. So you might be wondering how on earth the host is replying to an ARP request for an IP address it doesn’t have configured on any interface. The answer is proxy-arp. If we check the host side VETH interface we’ll see that proxy-arp is enabled…

By enabling proxy-arp on this interface Calico is instructing the host to reply to the ARP request on behalf of someone else, that is, by proxy. The rules for proxy-ARP are simple. A host which has proxy-ARP enabled will reply to ARP requests with its own MAC address when…

The host receives an ARP request on an interface which has proxy-ARP enabled.

The host knows how to reach the destination

The interface the host would use to reach the destination is not the same one that it received the ARP request on

So in this case, the container is sending an ARP request for 169.254.1.1. Despite this being a link-local address, the host would attempt to route this following its default route out the host’s physical interface. This means we’ve met all three requirements so the host will reply to the ARP request with its MAC address.
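The three rules can be modeled in a few lines of Python. Treat this as a simplified sketch of the rules as listed above, not a description of what the kernel actually does. The routing table is a toy one, and the uplink interface name uplink0 is a placeholder I made up; the cali182f84bfeba name and the addresses are the ones from the lab.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    prefix: str      # destination network, e.g. "0.0.0.0/0"
    interface: str   # interface the host would send matching traffic out of


def lookup(routes: List[Route], target: str) -> Optional[Route]:
    """Longest-prefix match: the most specific route containing target wins."""
    addr = ipaddress.ip_address(target)
    matches = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r.prefix).prefixlen)


def answers_arp(routes: List[Route], in_interface: str,
                proxy_arp_enabled: bool, target: str) -> bool:
    """Return True when the host would answer the ARP request by proxy."""
    if not proxy_arp_enabled:                  # rule 1: proxy-ARP must be on
        return False
    route = lookup(routes, target)
    if route is None:                          # rule 2: host must have a route
        return False
    return route.interface != in_interface     # rule 3: route must leave elsewhere


# Toy host routing table; "uplink0" is a hypothetical physical interface name.
HOST_ROUTES = [
    Route("0.0.0.0/0", "uplink0"),
    Route("10.100.163.129/32", "cali182f84bfeba"),
]

if __name__ == "__main__":
    print(ipaddress.ip_address("169.254.1.1").is_link_local)
    print(answers_arp(HOST_ROUTES, "cali182f84bfeba", True, "169.254.1.1"))
```

Removing the default route from HOST_ROUTES, or adding a 169.254.0.0/16 route that points back at the cali interface, flips the result to False, which lines up with rules two and three.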

Note: If you’re curious about these requirements go ahead and try them out yourself. For requirement 1 you can disable proxy-arp on the interface with: echo 0 > /proc/sys/net/ipv4/conf/<interface name goes here>/proxy_arp. For requirement 2 simply remove the host’s default route (make sure you have a 10’s route or some other means to reach the host before you do that!) like so: sudo ip route del 0.0.0.0/0. For the third requirement point the route 169.254.0.0/16 at the VETH interface itself like this: sudo ip route add 169.254.0.0/16 dev <Calico VETH interface name>. If you do any of these, the container will no longer be able to access the outside world. Part of me wonders if this makes it a bit fragile but I also assume that most hosts will have a default route.
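If you want to check the proxy-arp flag from a script rather than by hand, a minimal Python sketch might look like the following. It reads the same /proc path used in the note above; the conf_root parameter is something I added so the function can be pointed at a test directory, and the function name is my own.

```python
from pathlib import Path


def proxy_arp_enabled(interface: str,
                      conf_root: str = "/proc/sys/net/ipv4/conf") -> bool:
    """Return True when the interface's proxy_arp flag reads as 1."""
    flag = Path(conf_root) / interface / "proxy_arp"
    return flag.read_text().strip() == "1"
```

On one of the lab hosts you would call it with the name of a Calico VETH interface, for example proxy_arp_enabled("cali182f84bfeba").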

The ARP process for the container would look like this…

In this case, the proxy ARP requirements are met since the host has a default route it can follow for the destination of 169.254.1.1, so it replies to the container with its own MAC address. At this point, the container believes it has a valid ARP entry for its default gateway and will start initiating normal traffic toward the host. It’s a pretty clever configuration but one that takes some time to understand.

I had mentioned above that the host side of the container VETH pair just lived in the host’s default or root namespace. In other container implementations, this interface would be attached to a common bridge so that all connected containers could talk to one another directly. In that scenario, the bridge would commonly be allocated an IP address giving the host an IP address on the same subnet as the containers. This would allow the host to talk (do things like ARP) to the container directly. Having the bridge also allows the containers themselves to talk directly to one another through the bridge. This describes a layer 2 scenario where the host and all containers attached to the bridge can ARP for each other’s IP addresses directly. Since we don’t have the bridge, we need to tell the host how to route to each container. If we look at the host’s routing table we’ll see that we have a /32 route for the IP of our net-test container…

Notice that these destinations are reachable through the tunl0 interface which is Calico’s IPIP overlay transport tunnel. This means that we don’t need to tell the upstream or physical network how to get to each pod CIDR range since it’s being done in the overlay. This also means that we can no longer reach the pod IP addresses directly from the external network. This conforms more closely with what the Kubernetes documentation describes when it says that the pod networks are not routable externally. In our previous examples they were reachable since we were manually routing the subnets to each host.

We’ve just barely scratched the surface of Calico in this post but it should be enough to get you up and running. In the next post we’ll talk about how Calico shares routing and reachability information between the hosts.

Hi, very enlightening article. I deployed Calico manually following the documentation on a simple cluster (1 master and 1 worker), and the strange issue I have is that when I fire up 2 net-test images, they can’t ping each other. Inter-pod communication on the same worker seems to be broken. In terms of the control plane, both cali interfaces have been created and proxy-arp has been enabled on both of them, and I also have the 2 routes pointing to dev scope link. My worker has multiple interfaces. I tried 4.2.2.2 but could not get any echo reply. It seems that this is a forwarding issue. Any idea or hint where this issue could come from? I suspect an iptables issue. (I have a similar issue with kubenet and need to manually add rules for cbr0: sudo iptables -A FORWARD -i cbr0 -j ACCEPT and sudo iptables -A FORWARD -o cbr0 -j ACCEPT.) Would it be possible to dump your iptables? Thanks anyway for ALL of these articles.

Thanks for your reply. I found the cause of this problem. When you do a manual install (in order to understand the automatic installation), everything is blocked by default by iptables because of the default drop network policy. The documentation explicitly mentions that all authorized traffic must be explicitly allowed, but it is unclear how to do that, at least for a Calico newcomer. Once I applied a default profile per namespace (kube-system, default namespace), everything was back up. To see whether there is an iptables issue, just run iptables-save and analyse the packet “jump” sequences. I have another issue now. I tweaked my installation so much that I don’t know why the /26 routes and blackhole routes are not installed by the bird container. Any idea where to look or how to troubleshoot this? The Calico logs are not meaningful enough (at least from my perspective). Thanks! Bgrds, Frederic

Found the issue. The problem was that my minions had multiple interfaces and Felix resolves the node using gethostbyname, while my calico-node was registered using the IP address of the interface. Setting FELIX_FELIXHOSTNAME solved the issue. When you see this kind of behaviour, just “recursively ls” etcd. Cheers, Frederic