Another Java technology blog from a developer far away from home

If you work with Docker, there’s no doubt you’ve heard about Kubernetes. I won’t introduce this amazing gift from Google. This post is about a particular resource, the Ingress, and its controller. Introduced as a beta feature in Kubernetes 1.2, the Ingress resource is the missing piece for opening your cluster to the world.

Sure, if you’re lucky enough to run Kubernetes in a supported cloud environment (AWS, Google Cloud…), you can provision a load balancer automatically when creating a new Service. But… that’s it. If you need SSL termination, or just some simple routing rules, you’re stuck. This is where the Ingress resource steps in.

The beauty of the Ingress Controller is the freedom of choice. Google offers its own implementation (based on NGINX), but NGINX Inc. offers one too, as do Rancher, HAProxy, Vulcand, etc.

The main differences between the implementations are the tweaking possibilities (basic authentication, rewrite rules…). But they all have something in common: the way they denormalize a Kubernetes Service into a backend configuration (NGINX, HAProxy, etc.).

Let’s imagine you have a Service my-wordpress-svc with 2 pods running behind it:
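The configuration generated by an NGINX-based Ingress Controller could look roughly like this — a sketch, where the pod IPs, port, and hostname are made up for illustration:

```nginx
# Hypothetical generated configuration: the upstream lists
# the pod endpoints directly (IPs are made up).
upstream my-wordpress-svc {
    server 10.244.1.12:80;  # pod 1
    server 10.244.2.7:80;   # pod 2
}

server {
    listen 80;
    server_name blog.example.com;

    location / {
        proxy_pass http://my-wordpress-svc;
    }
}
```

Every time a pod is added or removed, the Ingress Controller has to regenerate this upstream block and reload NGINX.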

But wait a minute, why don’t we use the Service virtual IP (VIP) in the upstream configuration, instead of fetching the endpoints? It sounds like a good idea:
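With the Service VIP, the generated upstream would collapse to a single entry, and kube-proxy would do the load balancing. Again a sketch, with a made-up cluster IP:

```nginx
# Hypothetical alternative: point the upstream at the Service VIP
# and let kube-proxy balance across the pods (IP is made up).
upstream my-wordpress-svc {
    server 10.100.42.17:80;  # Service cluster IP
}
```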

No need to update the configuration when scaling up/down or updating a deployment: the Service VIP doesn’t change.

No risk of sending requests to non-existent pods (the Ingress Controller is not always synchronised with the API; it’s just a watcher in the end, so there can be a delay).

No risk of unexpected behaviour (on connection error, timeout, HTTP 502, 503, 504…, NGINX automatically retries requests, even non-idempotent ones, by passing them to the next server).

In practice, by using the Service VIP, the Ingress Controller would not have to worry about any pod change happening, and that would guarantee a seamless scaling/deployment of your service.
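For reference, the retry behaviour mentioned above is governed by NGINX’s proxy_next_upstream directive. A sketch of how it is configured (the upstream name is the made-up one from earlier); note that since NGINX 1.9.13, non-idempotent requests (POST, LOCK, PATCH) are no longer retried unless explicitly allowed:

```nginx
location / {
    proxy_pass http://my-wordpress-svc;
    # Pass the request to the next upstream server on errors,
    # timeouts, and 5xx responses. The "non_idempotent" flag
    # re-enables retries for POST/LOCK/PATCH requests.
    proxy_next_upstream error timeout http_502 http_503 http_504 non_idempotent;
}
```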

Well, that’s what I thought, and that’s what other people thought too, and we were wrong. The idea sounds good in theory, but there is no guarantee at all when using the Service VIP: no more than when letting the Ingress Controller maintain the pod list itself.

Why is that? First, let’s describe how scaling up works. Imagine this scenario:

1. The Replication Controller creates 2 pods
2. The pods become ready
3. The controller manager updates the Endpoints
4. Kube-proxy detects the Endpoints change and updates iptables
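You can observe the endpoint update from the outside: once the pods are ready, their IPs show up in the Endpoints object of the Service. A hypothetical session, with the made-up names and IPs used above:

```
$ kubectl get endpoints my-wordpress-svc
NAME               ENDPOINTS                       AGE
my-wordpress-svc   10.244.1.12:80,10.244.2.7:80    5m
```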

Now, what about scaling down? We set the number of replicas to 1:

1. The Replication Controller deletes 1 pod
2. The pod is marked as Terminating
3. The Kubelet observes that change and sends SIGTERM to the pod
4. The Endpoint controller observes the pod change (Terminating) and removes the pod from the Endpoints
5. Kube-proxy observes the Endpoints change and updates iptables
6. The pod receives SIGKILL after the grace period

The important part here is that steps 3 and 4 (the SIGTERM and the endpoint update) are triggered in parallel. There is no synchronisation; one can happen before the other. That means your pod might be shutting down while the endpoints have not been updated yet.

If the endpoints have not been updated, the Service VIP will still send traffic to those pods. And even if the endpoints have been updated, there is no guarantee that kube-proxy has picked up the change yet, so the iptables rules will still route traffic to the dying pods.
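To make this concrete, here is roughly the shape of the iptables rules kube-proxy maintains for a Service with two endpoints — a simplified sketch, with made-up chain names and IPs. Until kube-proxy rewrites these rules, traffic to the VIP keeps being DNAT-ed to the terminating pod:

```
# Traffic to the Service VIP is dispatched to per-endpoint chains
-A KUBE-SERVICES -d 10.100.42.17/32 -p tcp --dport 80 -j KUBE-SVC-WORDPRESS
-A KUBE-SVC-WORDPRESS -m statistic --mode random --probability 0.5 -j KUBE-SEP-POD1
-A KUBE-SVC-WORDPRESS -j KUBE-SEP-POD2
# Each endpoint chain DNATs to a pod IP
-A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.244.1.12:80
-A KUBE-SEP-POD2 -p tcp -j DNAT --to-destination 10.244.2.7:80
```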

Just like the Ingress Controller, kube-proxy is an API watcher. It does not work synchronously with scaling up/down; it detects changes at some point and tries to apply them.

That’s the reason why using the Service VIP in the configuration generated by the Ingress Controller does not provide more resiliency.

So if it’s the same, why do Ingress Controllers bypass the Service and maintain the endpoint list themselves?

Advantages of bypassing the Service VIP

There are multiple advantages to having the pod endpoints in the NGINX upstream configuration: