
Welcome to the second half of this two-part series on troubleshooting Kubernetes services with sysdig. In the first part, we studied how a Kubernetes service works: in particular, we followed an HTTP transaction across a Kubernetes service, starting from the “magic” DNS resolution done by SkyDNS, and then watched the Kubernetes proxy intercept the request by leveraging Linux Netfilter and forward it to the backend pods.

After studying the behavior under normal conditions, we are now ready to troubleshoot some real application failures. The key here is that we now know what ought to be expected from a properly behaving system, because we’ve thoroughly analyzed it already. As we’ll see, this workflow of “troubleshooting by visually spotting differences” will be very powerful.

Scenario #1: Wrong port configuration

Here’s the setup: I’ve deployed a new Kubernetes cluster with pretty much the same application as in Part 1: two nginx pods behind a single Kubernetes service that load balances them. But this is what I get now:

Let’s explore with sysdig what happens under the hood. Once again, I’m going to use the same command lines as before for capturing the system events and reading them during the analysis. For brevity, I’m going to omit the steps of the troubleshooting that match the behavior of the previous use case, jumping right to the core of the issue.

Here we can see curl, after having correctly done the DNS resolution, trying to connect to the service virtual IP address (which matches the output of kubectl):

While curl is connecting to the service, we can see the proxy process accepting the connection (correctly translated by iptables, like before), and the proxy process immediately tries to connect to one of the nginx pods (172.17.0.6), in order to forward traffic to it:
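To see what the proxy experienced at the socket level, here is a minimal stand-in you can run outside Kubernetes: it dials a port on which nothing is listening, just like the proxy dialing port 8080 on a pod where nginx only listens on 80. (The port-picking trick is my own; nothing here comes from the actual capture.)

```python
import socket

# Grab a port the kernel knows is free, then close the listener so
# nothing is accepting on it -- a stand-in for "nginx isn't on 8080".
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# This is the proxy's situation: dialing a port with no listener.
client = socket.socket()
client.settimeout(2)
try:
    client.connect(("127.0.0.1", port))
    outcome = "connected"
except ConnectionRefusedError:
    outcome = "connection refused"
finally:
    client.close()
print(outcome)
```

On loopback the kernel answers immediately with a reset, so the failure is instant; across a real network, firewalls often drop such packets instead, and the proxy would hang until a timeout.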

Do you see why it didn’t work? The proxy is trying to reach my pods on port 8080, not port 80, where nginx is actually listening! In this case, I forgot to specify --target-port=80 when declaring my service, so by default Kubernetes used the port exposed by the service itself as the target port. When things like this happen, troubleshooting can be very painful!
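In manifest form, the fix is making the target port explicit. A sketch of what the corrected service could look like (the service name and pod labels are my assumptions, not taken from the capture):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mynginx          # assumed name
spec:
  selector:
    app: nginx           # assumed label on the two nginx pods
  ports:
  - port: 8080           # port the service exposes
    targetPort: 80       # without this line, Kubernetes defaults to 8080
```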

Scenario #2: Flushed iptables rules

Let’s try another mystery issue. In this example, curl is giving me the same error as before:

But this is pretty much the only thing that happens. The proxy doesn’t seem to receive any incoming connection or generate any network traffic like it did in the previous cases, so it definitely can’t forward any requests to the backend pods. It’s almost as if iptables were misconfigured and the routing logic for the virtual IP address simply weren’t there. Let’s check:

Completely empty! No wonder curl gives me a timeout error: without those iptables rules, it’s actually trying to reach a host with IP address 10.0.0.226 over my regular network routing table, and no such host exists. The culprit here is a custom script I ran that flushes all the iptables rules. Although Kubernetes recreates them periodically to ensure long-term consistency, I ran curl inside this vulnerable time window, so the traffic wasn’t intercepted by the proxy. The solution is to make my script less disruptive and avoid flushing the iptables rules for the Kubernetes chains!
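For reference, on a healthy node the NAT table contains redirect rules for each service virtual IP. With the userspace proxy of this era they follow the KUBE- chain naming convention and look roughly like this iptables-save fragment (the service port and the proxy’s local port here are placeholders, not values from my capture):

```
*nat
:KUBE-PORTALS-CONTAINER - [0:0]
:KUBE-PORTALS-HOST - [0:0]
-A PREROUTING -j KUBE-PORTALS-CONTAINER
-A OUTPUT -j KUBE-PORTALS-HOST
# traffic to the service VIP gets redirected to the proxy's local port
-A KUBE-PORTALS-CONTAINER -d 10.0.0.226/32 -p tcp --dport 80 -j REDIRECT --to-ports 36789
COMMIT
```

After my script ran, none of these rules were present, which is exactly what the empty output above showed.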

Scenario #3: Accessing a service from the host

So far I’ve been accessing the service using a client pod running inside Kubernetes itself. Sometimes it might be convenient to access the service directly from the host. This is what happens if I run the same curl command from my host:

But this is clearly wrong: notice how there isn’t any activity from the skydns process, as we saw before, and yet curl received a DNS response. Clearly curl is contacting another DNS server, 172.16.189.2, which is the one offered by my provider. In fact, curl then tries to connect to a completely unrelated host on another network:

This will eventually fail, because that host either doesn’t exist or is simply letting all connections on port 8080 time out. Correlating this with the normal baseline behavior from before, I can see that skydns listens for requests on localhost (:::53 in the initial trace), so the fix is as simple as adding “nameserver 127.0.0.1” to my /etc/resolv.conf file and trying again:
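The resulting /etc/resolv.conf looks like this, with the local skydns entry listed first so the resolver tries it before falling back to the provider’s server:

```
nameserver 127.0.0.1      # skydns, answers cluster DNS names
nameserver 172.16.189.2   # provider DNS, everything else
```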

With this little change, now I can contact my Kubernetes service, using its user-friendly DNS name, right from my host.

Conclusions

As we’ve seen, Kubernetes brings a lot of components that distributed application developers can leverage to be more efficient, and one of those is definitely the concept of Kubernetes services. We have also seen how these components, while working in an almost magical way, can be genuinely hard to troubleshoot when something doesn’t work as expected. Hopefully, container-native troubleshooting tools like sysdig can help you overcome these difficulties and deploy Kubernetes in production with confidence.

Sysdig is available open source at sysdig.org, or sign up for a free trial of Sysdig Cloud today and monitor your entire distributed Kubernetes cluster in a single pane of glass!