Kubernetes services and ingress under X-ray

Posted on January 30, 2017 | 20 minutes | 4232 words | Milos Gajdos

I haven’t blogged here for over 2 years. It’s not that I had nothing to say, but every time I started writing a new post I never pushed myself into finishing it, so most of the drafts ended up rotting in my private GitHub gists. Although my interests have expanded way beyond the Linux container space, my professional life has remained tied to it.

Over the past two years I have been quite heavily involved in the Kubernetes (K8s) community. I helped to start the Kubernetes London Meetup as well as Kubecast, a podcast about all things K8s. It’s been amazing to see the community (not only in London) grow so fast in such a short time.

More and more companies are jumping on board to orchestrate their container deployments to address the container Jevons paradox. This is great for the project, but it’s not a free lunch for the newcomers. New and shiny things often make newcomers anxious, especially when there are a lot of new concepts to grasp before becoming fully productive. Changing the mindset is often the hardest thing to do.

Over the past few months I have been noticing one thing in particular. K8s abstracts a lot of infra through its API. This is similar to other modern platforms such as Cloud Foundry. Hiding things away makes “traditional” Ops teams feel uneasy. The idea of “not caring” what’s going on underneath the K8s roof is unsettling. This is hardly surprising, as most of us polished our professional skills whilst debugging all kinds of crazy OS and hardware issues (hello Linux on desktop!); we naturally tend to dig deep when new tech comes up. It’s good to be prepared when disaster strikes.

Quite a few people have asked me recently, both in person and via Twitter DMs, about what’s going on underneath K8s when an HTTP request arrives in the cluster from external traffic, i.e. traffic from outside the K8s cluster. People wanted to know how all the pieces such as service, ingress and DNS work together inside the cluster. If you are one of the curious folks, this post might be for you. We will put the service requests through X-ray!

Cluster Setup

This post assumes we run our own “bare metal” standalone K8s installation. If you don’t know how to get K8s running in your own infrastructure, check out the Kubernetes the Hard Way guide by Kelsey Hightower, which can be easily translated into your own environment.

We will assume we have both the control plane and 3 worker nodes up and running:

We will also assume that we have the DNS and K8s dashboard add-ons set up in the cluster. As for DNS, we will use the off-the-shelf kube-dns. (Remember, the add-ons run as services in the kube-system namespace.)

Cluster state and configuration

A K8s cluster stores all of its internal state in an etcd cluster. The idea is that you should interact with K8s only via the API provided by the API service. The API service abstracts away all manipulation of the K8s cluster state by reading from and writing into the etcd cluster. Let’s explore what’s stored in the etcd cluster after a fresh installation:

$ etcdctl --ca-file=/etc/etcd/ca.pem ls
/registry

The /registry key is where all the magic happens in K8s. If you are at least a bit familiar with K8s, listing the contents of this key reveals a tree structure of keys named after familiar K8s concepts:

Let’s have a look at what’s hiding underneath the /registry/services key, which is what we are interested in in this post. We will list the services key space recursively, i.e. sort of like when you run ls -lR on your command line:

The output of this command has uncovered a wealth of information. For starters, we can see the two service namespaces, default and kube-system, under the specs key. We can assume that the configuration of each K8s service is stored in the value under the key named after that service.
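To make the key layout a bit more tangible, here is a minimal Python sketch that groups service keys by kind and namespace. The key paths are an assumption about how the /registry/services subtree is laid out (specs and endpoints, then namespace, then service name); the sample keys are illustrative and are not output copied from a real cluster.

```python
from collections import defaultdict

# Illustrative keys only. The path layout is an assumption for this sketch:
# /registry/services/<kind>/<namespace>/<service name>
SAMPLE_KEYS = [
    "/registry/services/specs/default/kubernetes",
    "/registry/services/specs/kube-system/kube-dns",
    "/registry/services/endpoints/default/kubernetes",
    "/registry/services/endpoints/kube-system/kube-dns",
]


def group_service_keys(keys):
    """Group etcd keys as {kind: {namespace: [service names]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for key in keys:
        parts = key.strip("/").split("/")
        # Skip anything that does not look like a service key.
        if len(parts) != 5 or parts[:2] != ["registry", "services"]:
            continue
        _, _, kind, namespace, name = parts
        tree[kind][namespace].append(name)
    return {kind: dict(namespaces) for kind, namespaces in tree.items()}


if __name__ == "__main__":
    for kind, namespaces in group_service_keys(SAMPLE_KEYS).items():
        for namespace, names in namespaces.items():
            print(kind, namespace, names)
```

Feeding it the real key listing from your own cluster should give you a quick overview of which services exist in which namespace.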

Another important key in the output above is endpoints. I’ve noticed in the community that not a lot of people are familiar with the K8s endpoints API resource. This is because normally you don’t need to interact with it directly, at least not when doing the usual K8s work like deploying and managing apps via kubectl. But you do need to be familiar with it when debugging malfunctioning services or building ingress controllers or custom load balancers.

Endpoints are a crucial concept for K8s services. They represent a list of IP:PORT mappings created automatically when you create a new K8s service (unless the service is defined without a pod selector). A K8s service selects a particular set of pods and maps them into endpoints.

In the context of a K8s service, endpoints are basically service traffic routes. A K8s service must keep an eye on its endpoints at all times. It watches its particular endpoints key, which notifies it when some pods in its list have been terminated or rescheduled on another host (in which case they most likely get a new IP:PORT allocation). The service then routes the traffic to the new endpoint instead of the old [dead] one. In other words, K8s services are K8s API watchers; strictly speaking, the watching is done on the service’s behalf by kube-proxy, which we will meet shortly.
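The selection step described above can be sketched in a few lines of Python. This is a simplified model, not the actual K8s implementation: the Pod class, the pod names, labels and IP addresses are all hypothetical, and real endpoints carry more information than a plain IP:PORT string.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    ip: str
    labels: dict
    ready: bool = True


def matches(selector, labels):
    """A pod matches when it carries every key/value pair in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def build_endpoints(selector, target_port, pods):
    """Return the sorted IP:PORT list for ready pods matching the selector."""
    return sorted(
        f"{pod.ip}:{target_port}"
        for pod in pods
        if pod.ready and matches(selector, pod.labels)
    )


if __name__ == "__main__":
    pods = [
        Pod("nginx-1", "10.200.0.5", {"run": "nginx"}),
        Pod("nginx-2", "10.200.1.7", {"run": "nginx"}),
        Pod("other", "10.200.1.8", {"run": "something-else"}),
    ]
    print(build_endpoints({"run": "nginx"}, 80, pods))

    # Simulate nginx-2 being rescheduled onto another host with a new IP.
    pods[1] = Pod("nginx-2", "10.200.2.9", {"run": "nginx"})
    print(build_endpoints({"run": "nginx"}, 80, pods))
```

The second call returns a different list, which is exactly the kind of change a watcher on the endpoints key gets notified about.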

In our cluster we only have the kubernetes service running right now. It is running in the default namespace. Let’s check its endpoints using kubectl:

Now that we have scrutinized K8s services a bit, let’s move on and create our own K8s service and try to route some traffic to it from outside the cluster.

Services, kube-proxy and kube-dns

We will create a simple service which will run two replicas of nginx and we will scrutinize the request flow within the K8s cluster. The following command will create a K8s deployment of 2 replicas of nginx servers running in separate pods:

We could equally check the contents of the /registry/deployments and /registry/replicasets keys, but let’s skip that for now. The next step is to turn the nginx deployment into a service. We will call it nginx-svc and expose it on port 8080 inside the cluster:

We could also query etcd and see that the K8s API service has taken care of creating the particular service and endpoints keys and populating them with the correct connection mappings.

Now, here comes the first newcomer “gotcha”. When the service is created it is assigned a Virtual IP (VIP). Many people try to ping this IP and fail miserably. This leads them to do all kinds of debugging until they get frustrated and give up. The service VIP is only really useful in combination with the service port (we will get back to this later on in the post), so pinging it gets you nowhere. However, accessing the service endpoints from any pod in the cluster works perfectly fine. We will see that later on.
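One way to picture why ping fails: the rules that handle service traffic match on protocol, destination IP and destination port together, and an ICMP echo request has no port at all. The following Python sketch models that lookup. The VIP, the endpoint addresses and the rule table are hypothetical values chosen for illustration; only port 8080 comes from our nginx-svc example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    protocol: str                    # "tcp", "udp" or "icmp"
    dst_ip: str
    dst_port: Optional[int] = None   # ICMP has no ports


# Hypothetical service table: (protocol, VIP, port) -> backend endpoints.
SERVICE_RULES = {
    ("tcp", "10.32.0.50", 8080): ["10.200.0.5:80", "10.200.1.7:80"],
}


def lookup(packet, rules=SERVICE_RULES):
    """Return the backends for a packet, or None when no rule matches."""
    return rules.get((packet.protocol, packet.dst_ip, packet.dst_port))


if __name__ == "__main__":
    print(lookup(Packet("tcp", "10.32.0.50", 8080)))   # matches the service
    print(lookup(Packet("icmp", "10.32.0.50")))        # ping: no rule matches
    print(lookup(Packet("tcp", "10.32.0.50", 9090)))   # wrong port: no match
```

Only the first packet finds a rule; the ping and the wrong-port connection both fall through.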

If you don’t specify a service type, K8s uses the ClusterIP option by default, which means that the new service is exposed only within the cluster. It’s kind of like an internal K8s service, so it’s not particularly useful if you want to accept external traffic:

Let’s move to more interesting service exposure options now. If you want to expose your service to the outside world you can use either the NodePort or the LoadBalancer type. Let’s have a look at the NodePort service first. We will delete the service we created earlier:

The NodePort type, according to the documentation, opens a service port on every worker node in the K8s cluster. Now, here comes another newcomer gotcha. What a lot of people ask me is, “ok, but how come I can’t see the service port listening on any of the worker nodes?” Often, people simply run netstat -ntlp and grep for the exposed service port; in our case that would be port 8080. Well, the bad news is, they won’t find any service listening on port 8080. This is where the magic of kube-proxy happens. Instead, the service port is mapped to a different port on the node, the NodePort. You can find the NodePort by describing the service:

Now that you have a port open on every node you can configure your external load balancer or edge router to route the traffic to any of the K8s worker nodes on the NodePort. Simples! And indeed this is what we had to do in the past, before ingress was introduced.
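The three ports involved (NodePort, service port and the port the container listens on) are easy to mix up, so here is a small Python model of the mapping. It is only a sketch: the NodePortService class, the node port 31234 and the pod IPs are made up for illustration, while the 30000-32767 range is the default NodePort range in Kubernetes.

```python
from dataclasses import dataclass, field

NODE_PORT_RANGE = range(30000, 32768)  # Kubernetes' default NodePort range


@dataclass
class NodePortService:
    name: str
    port: int          # service port, reachable on the VIP inside the cluster
    node_port: int     # port opened on every worker node
    target_port: int   # port the containers listen on
    pod_ips: list = field(default_factory=list)

    def __post_init__(self):
        if self.node_port not in NODE_PORT_RANGE:
            raise ValueError(f"node port {self.node_port} is outside the default range")


def route_from_outside(service, dst_port, choice=0):
    """Map a connection to <any node>:dst_port onto a pod IP:PORT.

    Only the node port matches; the service port is not opened on the node.
    """
    if dst_port != service.node_port or not service.pod_ips:
        return None
    pod_ip = service.pod_ips[choice % len(service.pod_ips)]
    return f"{pod_ip}:{service.target_port}"


if __name__ == "__main__":
    svc = NodePortService("nginx-svc", 8080, 31234, 80, ["10.200.0.5", "10.200.1.7"])
    print(route_from_outside(svc, 8080))    # service port on a node: nothing there
    print(route_from_outside(svc, 31234))   # node port: lands on a pod
```

This mirrors the gotcha above: grepping for port 8080 on a node finds nothing, while the NodePort does lead to a pod.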

The “problem” with the NodePort type is that the load balancer (or proxy) that routes the traffic to the worker nodes needs to balance between the K8s cluster nodes, which in turn load balance the traffic across pod endpoints. There is also no easy way of adding TLS or more sophisticated traffic routing. This is what the ingress API resource addresses, but let’s talk about kube-proxy first, as it’s the most crucial component with regard to K8s services and also a bit of a source of confusion for newcomers.

kube-proxy

kube-proxy is a special daemon (application) running on every worker node. It can run in two modes [configurable via the --proxy-mode command line switch]:

userspace

iptables

In the userspace mode, kube-proxy runs as a userspace process, i.e. a regular application. It terminates all incoming service connections and creates a new connection to a particular service endpoint. The advantage of the userspace mode is that, because the connections are created from a userspace process, kube-proxy can retry with a different endpoint if a connection fails.
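The retry behavior can be illustrated with a short Python sketch. It is not kube-proxy’s code, just the general idea: proxy_connect and flaky_dial are hypothetical names, and the dial callable stands in for whatever actually opens the backend connection.

```python
def proxy_connect(endpoints, dial):
    """Try each endpoint in turn and return the first one that accepts.

    `dial` is any callable that takes an "IP:PORT" string and either
    returns a connection object or raises OSError.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, dial(endpoint)
        except OSError as exc:
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))


def flaky_dial(endpoint):
    """Stand-in dialer for the example: the first pod is dead."""
    if endpoint == "10.200.0.5:80":
        raise ConnectionRefusedError("connection refused")
    return "connection to " + endpoint


if __name__ == "__main__":
    print(proxy_connect(["10.200.0.5:80", "10.200.1.7:80"], flaky_dial))
```

The client never sees the refused connection to the first pod, because the proxy quietly moves on to the next endpoint.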

In iptables mode, the traffic routing is done entirely in kernel space via quite complex iptables kung-fu. Feel free to check the iptables rules on each node. This is way more efficient than moving the packets from the kernel to userspace and then back to the kernel, so you get higher throughput and better latency. The downside is that the service can be more difficult to debug, because you need to inspect logs from iptables and maybe do some tcpdumping or whatnot.

The moral of the story is: there will always be a kube-proxy running on worker nodes regardless of what mode it is running in. The difference is that in userspace mode it acts as a TCP proxy intercepting and forwarding traffic, whilst in iptables mode it configures iptables rather than proxying connections itself. The traffic forwarding is done by iptables automagically.
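If you do look at the iptables rules, you will likely find one rule per endpoint using the statistic match in random mode, with probabilities that look uneven at first glance (for example 1/3, 1/2 and then an unconditional rule). The Python sketch below shows why such a chain still spreads traffic evenly. It is a worked calculation under that assumption about the rule layout, not a dump of real rules.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Per-rule match probability for n endpoints, evaluated top to bottom."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_shares(probabilities):
    """Overall share of traffic that each rule ends up receiving."""
    shares, remaining = [], Fraction(1)
    for probability in probabilities:
        shares.append(remaining * probability)
        remaining *= 1 - probability
    return shares


if __name__ == "__main__":
    probabilities = rule_probabilities(3)
    print([str(p) for p in probabilities])
    print([str(s) for s in effective_shares(probabilities)])
```

Each rule only sees the traffic the previous rules did not take, so the growing probabilities cancel out and every endpoint receives an equal share.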

kube-dns

Now, kube-proxy is just one piece of the K8s service puzzle. Another one is kube-dns, which is responsible for DNS service discovery. If the kube-dns add-on has been set up properly you can access K8s services using their names directly. You don’t need to remember the VIP:PORT combination; the name of the service will suffice. How is this possible? Well, when you use kube-dns, K8s “injects” certain name service lookup configuration into new pods that allows you to query the DNS records in the cluster. Let’s have a look at our familiar tutum/curl pod we created to test services.

You can see that the IP address of the kube-dns service (see at the top that this is the kube-dns VIP) has been injected into the new pod along with some lookup domains. kube-dns creates an internal cluster DNS zone which is used for DNS and service discovery. This means that we can access the services from inside the pods via the service names directly:

Ok, so this is really handy. No more remembering IP addresses, no more crafting and hacking our own /etc/hosts files within the pods - this almost feels like “No-Traditional-Ops” (oops) ;-)
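How does a bare service name turn into something the cluster DNS can answer? The resolver inside the pod walks the injected lookup domains. Here is a minimal Python sketch of that expansion, following the usual resolv.conf search and ndots rules. The search domains and the ndots value are assumptions (typical values for a pod in the default namespace of a cluster whose domain is cluster.local); check /etc/resolv.conf in your own pods for the real ones.

```python
# Assumed search list for a pod in the "default" namespace.
SEARCH_DOMAINS = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]


def candidate_names(name, search_domains=SEARCH_DOMAINS, ndots=5):
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        return [name]              # already absolute, no expansion
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


if __name__ == "__main__":
    for candidate in candidate_names("nginx-svc"):
        print(candidate)
```

The first candidate for nginx-svc is nginx-svc.default.svc.cluster.local., which is the record the cluster DNS serves for a service of that name in the default namespace.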

We won’t talk about the LoadBalancer type in this post as it’s only handy when running your cluster on one of the supported cloud providers and, like I said, this post is about running K8s on a bare metal deployment - we have no luxury of ELBs and the like! Either way, now we should be well equipped to take the next step and look into the magic of ingress.

Services and ingresses

Let’s talk about the Ingress resource and how it can address the “shortcomings” of the NodePort service type. Don’t forget to check the documentation about the Ingress API resource. I will try to summarize the most important bits and show you how it works underneath.

Ingress is an API resource which represents a set of traffic routing rules that map external traffic to K8s services. Ingress allows external traffic to land in the cluster in a particular service. Ingress on its own is just one part of the puzzle. It merely creates the traffic route maps. We need one more piece to make this work: ingress controllers. Ingress controllers are responsible for the actual traffic routing. So we need to:

Create Ingress (API object)

Run Ingress controller

What we do in practice actually happens in the opposite order. First you create an ingress controller which handles the traffic and wait until it’s ready. Then you create the ingress and “open” the route for the incoming traffic. This order makes sense: you need to have your traffic controller ready to handle the traffic before you “open the door”.

Ingress

Let’s create a simple ingress to route the traffic to the nginx-svc service we created earlier. Before that we need to create a default backend. The default backend is a special service endpoint which handles the traffic that arrives at the ingress but does not match any of the configured routes in the ingress route map. It is sort of like the default “fail over” host known from various application and HTTP servers. We will use the default-http-backend available in the official documentation. We will expose it as a new service:

Now, before we create the ingress API object we need to create an ingress controller. We don’t want to be caught off guard exposing the service before we are ready to handle it. You are spoilt for choice here. You can use the Rancher one or the NGINX Inc. one, or build your own more specialized controller. In this guide we will stick to the basic nginx ingress controller available in the Kubernetes repo. So let’s create it now:

Notice that the nginx-controller we have created is just a simple application running in a K8s pod which has some special powers, as we will see later on. What ingress controllers do underneath is, they first register themselves in the list of controllers via the API service and store some configuration there. We can list the available controllers in the cluster by listing our familiar etcd cluster registry. In this case we are interested in the /registry/controllers/default key (default implies the default K8s namespace):

Great, so both default-http-backend and nginx-ingress-controller have registered themselves correctly. We should be ready to create some ingress rules and bring the external traffic into the cluster now. For the purpose of this post I will use the following ingress:

What this does is create an ingress API resource which maps all the incoming requests that have the HTTP Host header set to foobar.com to our nginx-svc service. All other requests coming to this ingress point are routed to the default-http-backend. Please note that we are mapping only the root URL, but you have the option to map particular URL paths. Let’s go ahead and create the ingress now:

Excellent! The ingress has now been created as per the YAML config shown earlier. As is always the case, every ingress stores its configuration in the etcd cluster. So let’s have a look there:

We can see that the ingress maps all the traffic for foobar.com to our nginx-svc service as expected, and that it is available through an external IP address which has been redacted in this post. This is the IP to which you would point your DNS records and which routes to the externally exposed ingress controller IP address.
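The routing decision an ingress controller makes for each request can be modeled in a few lines of Python. This is a sketch of the idea rather than what the nginx controller actually runs: the rule table shape and the port of the default backend are assumptions, while foobar.com, nginx-svc on port 8080 and default-http-backend are the names used in this post.

```python
DEFAULT_BACKEND = "default-http-backend:80"   # port 80 is an assumption

# Hypothetical rule table: host -> list of (path prefix, backend).
RULES = {
    "foobar.com": [("/", "nginx-svc:8080")],
}


def pick_backend(host, path, rules=RULES, default=DEFAULT_BACKEND):
    """Return the backend for a request; the longest matching prefix wins."""
    hostname = host.split(":", 1)[0].lower()   # drop an optional :port
    matching = [
        (prefix, backend)
        for prefix, backend in rules.get(hostname, [])
        if path.startswith(prefix)
    ]
    if not matching:
        return default
    return max(matching, key=lambda rule: len(rule[0]))[1]


if __name__ == "__main__":
    print(pick_backend("foobar.com", "/"))        # routed to nginx-svc
    print(pick_backend("example.org", "/"))       # unknown host: default backend
```

Requests whose Host header is foobar.com end up at nginx-svc, and everything else falls through to the default backend.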

Why is this etcd key so important? Well, the ingress controllers are actually K8s API watcher applications which watch for changes under /registry/ingress in order to keep an eye on particular K8s service endpoints. That was a mouthful! The key takeaway here is: ingress controllers monitor service endpoints, i.e. they don’t route traffic to the service, but rather to the actual service endpoints, i.e. pods. There is a good reason for this behavior.

Imagine one of your service pods dies. Until the K8s service notices it’s dead it won’t remove it from its list of endpoints. The K8s service is, as we should know by now, just another API watcher which simply watches /registry/endpoints. This key is updated by the K8s controller-manager. Even if the K8s controller-manager does pick up the endpoints change, there is no guarantee that kube-proxy has picked it up and updated the iptables rules accordingly - kube-proxy can still route the traffic to the dead pods. So it’s safer for the ingress controllers to watch the endpoints themselves and update their routing tables as soon as the controller-manager updates the list of endpoints.
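To illustrate what “watching the endpoints themselves” boils down to, here is a small Python sketch of a controller-side routing table that is kept in sync by a stream of watch events. The ADDED, MODIFIED and DELETED event types are the ones the K8s watch API uses; the event dictionary shape, the UpstreamTable class and the addresses are simplified assumptions for this example.

```python
WATCH_EVENT_TYPES = ("ADDED", "MODIFIED", "DELETED")


class UpstreamTable:
    """Keeps a service -> endpoints map in sync with watch events."""

    def __init__(self):
        self.upstreams = {}

    def apply(self, event):
        """Apply one watch event and return the service's current endpoints."""
        event_type = event["type"]
        service = event["service"]
        if event_type not in WATCH_EVENT_TYPES:
            raise ValueError(f"unexpected watch event type: {event_type}")
        if event_type == "DELETED":
            self.upstreams.pop(service, None)
        else:
            # ADDED and MODIFIED both carry the full, current endpoint list.
            self.upstreams[service] = sorted(event["endpoints"])
        return self.upstreams.get(service, [])


if __name__ == "__main__":
    table = UpstreamTable()
    print(table.apply({"type": "ADDED", "service": "default/nginx-svc",
                       "endpoints": ["10.200.0.5:80", "10.200.1.7:80"]}))
    # One pod died and was rescheduled with a new IP.
    print(table.apply({"type": "MODIFIED", "service": "default/nginx-svc",
                       "endpoints": ["10.200.1.7:80", "10.200.2.9:80"]}))
```

A real controller would regenerate and reload its proxy configuration after each change; the point here is only that the dead pod disappears from the routing table as soon as the event arrives.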

Now, what is all this buzz around ingress controllers? Well, for starters they can terminate the traffic and load balance it across service endpoints. The load balancing can be quite sophisticated and it’s entirely up to the ingress controller’s design. Finally, you can have them do the SSL/TLS termination heavy lifting and relieve the actual services of doing so. You can indeed get the TLS configuration done through the secrets API.

Now that we have both the Ingress and the Ingress controller in place we should be able to curl our nginx-svc directly from outside the cluster, as long as we set the HTTP Host header to foobar.com. Let’s try to do that:

Conclusion

Thanks for staying with me until the end! Let’s quickly summarize what we learnt in this post. Everything that happens in the K8s cluster is done through the API service. API resources are implemented as API watchers that watch a particular set of keys by registering watches with the K8s API service. The K8s API service in turn reads from and writes into the etcd cluster, which stores all of the cluster’s internal state.

K8s service traffic is routed directly to the service’s pods, which are the service’s endpoints, either via sophisticated iptables kung-fu set up by kube-proxy or via kube-proxy itself. Service name addressing is done through DNS discovery provided by a DNS add-on; in the simplest case this is kube-dns, but you are spoilt for choice, so pick the one that suits you best.

Ingress is an API resource which lets you map external traffic to K8s services by way of ingress controllers. Ingress controllers are applications deployed in pods that work as K8s API watchers, monitoring the /registry/ingress key through the K8s API service and updating their routes based on service endpoint changes.