10 Steps to Debugging Containerized Applications

Aug 15, 2017

I’ve recently been building a new product using Rails based on an approach that I call Breaking the Monolith. Rather than build a traditional Rails monolith, I use multiple microservices / small Rails applications and deploy them all with Docker into a distributed system. The hardest part of distributed systems is always debugging, so I’ve written this guide as a step-by-step approach for going from a misbehaving application right down to a malfunctioning container – this is the actual process that I follow.

All of this is being done on Ubuntu under AWS but the debugging process applies to any *nix type environment or Platform as a Service (PaaS).

Disclaimer: But You Don’t SSH into Containers…

A lot of the examples below are based around SSH’ing into a server and diagnosing the error in context. I’ve seen a lot of things since the advent of containers that seemingly want you to believe that, in this brave new world, you just don’t SSH in anymore. Now, perhaps I am doing things wrong, but I have not found that to be the case at all. It may be that once I am out of active development I will no longer be SSH’ing into servers but, for now, SSH is still a dear old friend.

01: Failure Context - 504 Gateway Error

The general symptom of a failure on this application is a 504 Gateway Error, which basically means that the application load balancer (ALB) isn’t receiving a response back from one of the HTTP subsystems.

02: Check the URL in Development after a Server Restart

Every single time you do a deploy with Docker, your entire gem stack along with any initializers is rebuilt, and that means that a stack-level change you made in development but failed to catch can break everything. So the first diagnosis step is to stop the development server and make sure that things come back up correctly. Each of my applications runs on a different port so I can’t give a single example here – ctrl+c Puma, restart it, and check the status.

Note: If you don’t have a simple health check for your application then I would strongly recommend it. Here’s a simple gist showing a /health url for a Rails application.

If your Rails app is running on say port 3200 then you can just do:

curl http://localhost:3200/health

and you should see:

ok

03: Test the Health Check Logged into the Server

Once you’ve verified that things are correct in development, the next step is to log into the server and run the same curl test on the server where the failure is occurring. In order to make this type of debugging extremely simple for me, I run all my applications server side on exactly the same port structure that I do in development. Even HTTP services like the main web site run on their development port, since I can let the load balancer handle the translation back to port 80. Having a deployment environment that mirrors development is a huge conceptual boon. Assuming our same 3200 port example, we would:

curl http://localhost:3200/health

and you should see:

ok

Note: If the failure is happening solely within the same subsystem then this usually is sufficient to reveal the problem.
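To run this check across every service on a box in one go, a tiny loop helps. This is a sketch of my own, not anything Dockerano generates: the check_health helper and the example port list are assumptions – substitute your own ports.

```shell
#!/usr/bin/env bash
# Sketch: curl a service's /health endpoint and report whether it answered.
# The check_health name and the timeout value are my own assumptions.
check_health() {
  local port="$1"
  local body
  if body="$(curl -fsS --max-time 5 "http://localhost:${port}/health" 2>/dev/null)"; then
    echo "port ${port}: ${body}"
  else
    echo "port ${port}: DOWN"
    return 1
  fi
}

# Usage (on the server, with your own port list):
#   for port in 3200 3300 3400; do check_health "$port"; done
```

Because every application runs on the same port server side as in development, the same loop works unchanged in both places.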

04: Check the Application Docker Logs

The next step is the application-level Docker logs. My deployer engine, Dockerano, generates a per-application shell script called dshell which tails the logs for the “main” container.
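I can’t reproduce Dockerano’s actual script here, but a minimal dshell-style wrapper might look like this sketch; the app_logs name and the `${app}_main` container-naming pattern are assumptions of mine:

```shell
#!/usr/bin/env bash
# Hypothetical dshell-style wrapper: tail the logs of an application's
# "main" container. The name pattern is an assumption about your setup.
app_logs() {
  local app="$1"
  local cid
  cid="$(docker ps -q --filter "name=${app}_main" | head -n 1)"
  if [ -z "$cid" ]; then
    echo "no running container matching ${app}_main" >&2
    return 1
  fi
  docker logs --tail 100 -f "$cid"
}

# Usage: app_logs seirawatchwebapp
```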

05: Check the Free Disk Space

I spent a lot of time on this project trying desperately to use T2.micro instances because, well, they’re cheap and, at best, it was a false economy. Severe bloat within the Docker AUFS filesystem found me continually running out of disk space after multiple deploys even though my containers were actually tiny. This is a known Docker Moby issue that has been open for over a year and a half now and is still unassigned to anyone.

In order to avoid this bug, I ended up moving from multiple T2.micro instances to a single m4.large instance and then doubling the underlying storage from 8 gigs to 16. And, when I did that, a lot of my issues just magically disappeared. Being cheap truly was a false economy here because I ended up with fewer instances and not only did my reliability go up but my bill went down.
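A quick way to see whether you are hitting this is to watch how full the filesystem holding Docker’s data is. This is a sketch of mine: the 85% threshold and the /var/lib/docker path (Docker’s default on Ubuntu) are assumptions, so adjust for your own layout.

```shell
#!/usr/bin/env bash
# Sketch: warn when the filesystem holding Docker's data is nearly full.
# The 85% threshold and /var/lib/docker path are assumptions.
disk_usage_pct() {
  # Print the use% (minus the % sign) for the filesystem holding $1.
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

usage="$(disk_usage_pct /var/lib/docker 2>/dev/null)"
[ -n "$usage" ] || usage="$(disk_usage_pct /)"

if [ "$usage" -ge 85 ]; then
  echo "WARNING: ${usage}% used - time to reclaim space or add disk"
else
  echo "${usage}% used"
fi
```

On Docker 1.13 or later, `docker system prune` can reclaim some of the space taken by stopped containers and dangling images, though in my experience the AUFS bloat above eventually wins regardless.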

06: Check the CPU and RAM Usage

If you don’t have htop installed on all your instances then you really, really should. htop kicks the absolute snot out of classic top. Install it with:

sudo apt-get install htop

And then invoke it with:

htop

At this point you can easily see the underlying machine load, etc.
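When you just want a one-shot snapshot for a script or a quick glance rather than an interactive UI, a few standard commands cover the same ground:

```shell
# One-shot snapshots of machine load, for when htop isn't practical:
uptime                                   # load averages (1, 5, 15 min)
ps aux | head -n 10                      # process list snapshot
command -v free >/dev/null && free -m || true   # RAM in MB (Linux only)
```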

07: Look at Individual Container Status

If you’re having an issue with a given application then you want to look at all the containers for that application. The easiest way is to grep by name. Let’s say that your underlying application is called seirawatchwebapp:

docker ps -a | grep seirawatchwebapp

The thing to be concerned about in my output was container 23c1b98a2add: its status showed it continuously restarting, and a container generally shouldn’t be doing that.
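If you would rather script the “is anything restart-looping?” check than eyeball the output, something like this sketch works; the restarting_containers helper and its format string are my own, relying on the fact that docker ps reports a “Restarting” status for looping containers:

```shell
#!/usr/bin/env bash
# Sketch: list containers for an application that are stuck restarting.
# Relies on docker ps's STATUS column containing "Restarting".
restarting_containers() {
  local app="$1"
  docker ps -a --format '{{.ID}} {{.Names}} {{.Status}}' \
    | grep "$app" \
    | grep "Restarting" || true
}

# Usage: restarting_containers seirawatchwebapp
```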

08: Application Level logs - Timber.io

I’ve recently started using Timber.io which is a cross application logging environment and I’ve been very, very happy with it. If you haven’t looked at Timber.io for your Rails development, I’d recommend it. Even the free tier is actually quite useful.

Timber.io is a full web app rather than a command line tool so you need to log into the Timber service and then select your application where you want to view the logs.

09: Check Your Error Logger

If you run an exception tracker such as Rollbar or Sentry, check it now: an infrastructure-level failure often shows up first as a burst of application exceptions, and the backtraces can point you straight at the failing subsystem.

10: The Answer: Check All Your Containers

What I’m building is a multi-container system, a distributed system in truth, with formal APIs between each of the components and what this means is that a container failure in subsystem X can affect subsystem Y or subsystem Z without it being clear as to why. The trouble with this type of debugging is getting a high enough level view to understand it as a whole.

The easiest way to do this on a single machine is to use the docker stats command:

docker stats

The curious thing in my case was that one container, c520fc5504f1, was showing “--” for CPU % and all the other metrics. Let’s zoom in on that one. Personally, I find the docker stats view more granular than I need and missing the application-specific details that I want, so my deployer generates a dstats shell script that narrows the output down to a single application.
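Again, I can’t show the generated dstats script itself, but a minimal version of the idea might look like this sketch; the app_stats helper and the name filter are assumptions of mine:

```shell
#!/usr/bin/env bash
# Hypothetical dstats-style wrapper: one-shot docker stats restricted to
# one application's containers. The name filter is an assumption.
app_stats() {
  local app="$1"
  local cids
  cids="$(docker ps -q --filter "name=${app}")"
  if [ -z "$cids" ]; then
    echo "no running containers matching ${app}" >&2
    return 1
  fi
  # Word splitting on the id list is intentional here.
  docker stats --no-stream $cids
}

# Usage: app_stats seirawatchwebapp
```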

What was going on was clearly at the Sidekiq stack layer: some kind of failed connection to the underlying containerized Redis instance. Once we knew that, troubleshooting was actually pretty simple; it eventually turned out to be a missing pair of files, config/sidekiq.yml.erb and config/initializers/sidekiq.rb, that had been overlooked in my initial configuration.

11: Go Nuclear - Restart the Docker Daemon

The absolute nuclear approach here is to simply restart the docker daemon itself. On Ubuntu, this is:

sudo service docker restart

I’m not going to go so far as to say that you never have to restart the Docker daemon, but it is just that – rare. Your problems are far, far, far more likely to be application-side errors, even when it looks like Docker is at fault. I’ve mistakenly pointed the finger at Docker too many times – and I was wrong.
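One intermediate step before going fully nuclear: restarting the daemon bounces every container on the host, so it’s worth trying to bounce just the suspect container first. The bounce helper below is my own sketch, taking the container id you found via docker ps:

```shell
# Less drastic than restarting the whole daemon: bounce one container
# and confirm it came back. The bounce name is my own.
bounce() {
  docker restart "$1" && docker ps --filter "id=$1"
}

# Usage: bounce c520fc5504f1
# Only if that doesn't help: sudo service docker restart
```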

Pitch for a Friend: Learn from Nick

All my Docker knowledge came from Nick Janetakis’ Dive into Docker course and he does a great job teaching about Docker. He also kibitzed with me on this debugging process, although he never saw the final draft before it went live. Any errors are mine, not his. Strongly recommended.