Debugging an outage without an internet connection

The Monday of this week, I was drafted in to help resolve a production incident on a system that I had helped build before I moved teams. What makes this unusual is that I had no way to actually debug it at the time. So here’s a small experience post about what I did and what I learned.

If I was debugging alone, I would likely be jumping back and forth between my hunches, trying to cross them off as quickly as possible. The nature of this new dynamic required a much slower and measured approach.

Don’t say “Can you read out the logs from 3 o’clock to 5 o’clock?”

We’re trying to identify possible causes of the issues, and these (caveat: in my opinion) are better phrased as questions rather than requests

Do say “There was a deploy at 3 o’ clock today, is there anything unusual in the app log?”

If you have the source code available to you, referring to files and line numbers is really helpful — this enabled us to identify particular lines of configuration that might be causing the issue.

Lesson 2: Say why you’re asking the question

I’m not going to treat my colleague like he’s just a pair of hands for me, and I felt it was really important to clarify why I was asking the question. It gives an opportunity to short-circuit the query if it’s something that’s already been eliminated.

Don’t say “Can you search the logs for AWS request errors?”

Bonus points: ask about the output, not the act. I don’t really mind how we get to the answer, more that we eliminate or prove a potential cause.

Do say “I think that the problem might be being caused by a lack of correct permissions for the AWS credentials. Are there AWS permission denied errors in the app log?”

Lesson 3: They are the system owners, and the experts. Treat them as such.

I was pulled in because I’d worked on the system before but well over a year ago. The system might have changed in ways that I don’t know about, so it was important for me to recognise that my mental model might be out of date and I needed to tailor my questions as such.

Do say “I remember it behaving like X due to Y. Is this still true?”

In this scenario, there were a number of things I could probably eliminate based on my previous knowledge of what failure causes would look like for network issues, datastore connections, etc, but I didn’t want to assume anything.

Given that we’re trying to solve the issue in the shortest timeframe available, this process can make things even more stressful if things get out of hand.

There are several things you can do but they depend on how the other person works — for example, are they someone who likes to talk a lot, or more measured with the way they speak?

In the former, I would try to keep a conversation going with my thought process to normalise the conversation, but I wouldn’t do this in the latter scenario.

I hope will be a useful read for people who do on-call and who might encounter something like this — tl;dr try your best to help, be clear and up-front with questions, show empathy and care to not make things more stressful.