10 Tough Incident Manager Interview Questions (Download PDF)

I was on my way to lunch when I got the call…

When things break, you need someone on the bridge call who not only understands the protocol but also has the skills and experience to drive the investigation in the right direction and through to closure.

Putting the pieces back together!

Incident response is critical, which is why incident managers are a must-have for companies that run 24/7/365 web services, where a single device going offline can impact millions of users (or customers).

Think for a moment about the impact on customers if the website for your bank, online school, or even Amazon were to go offline. HUGE!

I’ve worked in places where downtime was measured in dollars, not minutes. And let me also tell you there isn’t anything to compare to the STRESS of being on a bridge call with VIPs having amygdala hijacks every 5 minutes!

So what is the skill set of an incident manager?

Unlike with VMware interviews, the person you’re looking for needs more than technical expertise and in-depth, hands-on skills. That said, most solid incident managers have risen through the ranks from a hands-on role into an incident manager role.

Ideally, the perfect prospect you are looking for has a broad perspective on all technologies such as networks, servers, virtualization, cloud, database, applications and web servers. And, let’s not forget both Windows and Linux variations.

However, like I said, technical skills are not enough!

Continue reading, and I promise to share the secret ingredients for incident managers, all revealed in the 10 interview questions below and intended to help you screen your job applicants for the best candidate(s).

Incident Manager Interview Questions

Note for Interviewing Managers: There are no set answers to these questions, so you’ll need to pay close attention to the responses and body language during your interview. Anyone with experience will bubble to the top and have good ‘been there and done that’ examples to share.

Scenario (Leadership):

It’s 2 o’clock in the afternoon on Monday. And to make things interesting, it’s month-end. You have just been called and asked to join a bridge call. There’s already a team troubleshooting an outage on your central tier 1 application that is used for invoicing customers. Severity 1!

Question #1:

How would you coordinate a large group of technical contributors during this high-severity incident and retain control of a fast-paced conference call?

Scenario (Analysis):

Over the last 2 days, you’ve been on many bridge calls that all seem to have the same cause. Now it’s time to lead the investigation into how this problem happened in the first place.

Question #2:

How would you lead an incident investigation (a.k.a. Root Cause Analysis or RCA)?

Scenario (Tact):

You’ve been called to a bridge call that is already in progress, and things are not going well. You can sense people on the call are tense and withdrawn because a high-level person, who is very frustrated, has taken over the call. You ask for a summary of where things stand. Immediately, the VIP tells you it’s broken and needs to get fixed ASAP!

Question #3:

How would you maintain a professional demeanor and attitude while firmly telling this person that you will take it from here?

Scenario (Confidence):

You have been invited to an important meeting to share your investigation findings for a full-scale outage that was caused by a storage failure. The CIO and his executive staff are at the meeting to hear your analysis. After introductions, your boss turns the conversation over to you. You have brought your laptop and have your RCA projected on the big screen.

Question #4:

Please give an example of a time when you faced a situation similar to this one and had to show the ability and confidence to act decisively and exercise influence over a wide range of individuals at all levels of technical and business leadership.

Scenario (Judgement):

You are managing an incident that has been going on for over 2 hours, and now things are starting to heat up. You have the network admin checking a possible switch issue and a DC specialist checking the physical connection on an ESXi host that seems to be having network connection problems, which are impacting 15 VMs. Unfortunately, 2 of the hosted VMs are key systems for separate critical applications that are not yet redundant. As if this weren’t enough, while all this is happening, you are getting text messages from the boss wanting status updates.

Question #5:

Please share an example of a time when you had to multi-task and make sound judgments in a fast-paced, high-stress environment while keeping people informed.

Scenario (Diversity/Culture):

You have been assigned to the global incident response team, which has staff spread out across the US, India, Mexico, and Brazil. When a severity 1 or 2 issue happens, people from each location are asked to join a bridge because the problem can be anywhere due to the distributed workflow designed into the application.

Question #6:

Share an example of a time when you had to interact with people or groups of widely varying disciplines, cultures, and backgrounds. How did you influence them to follow your lead?

Scenario (Awareness):

In today’s world, most environments use virtualization for hosting their servers and applications. Many operations also use cloud services such as AWS and Google for IaaS. This creates a new challenge for understanding the infrastructure topology of virtual servers. On top of that, some operations have added Docker containers and PaaS to an already complex world.

Question #7:

This question has 2 parts.

First, briefly explain your technical background and be specific about the functional breadth of your expertise. How would you ask the right questions about a virtual server, and even question the responses from the admin if you thought something didn’t sound right?

Second, if the overwhelming evidence points to a defective KVM host, yet the admin insists it’s OK, how would you challenge the admin’s assessment? Give an example.

Scenario (Depth):

If you’ve been around IT staff for any amount of time, you may have noticed our level of passion and ownership. Now imagine that you are the new incident manager on the call who has to work through each technology stack from the ground up until you find the problem. You’re working with a diverse group of personalities that includes admins and engineers from the data center, network, server, database and application teams, all wanting to prove it isn’t their problem. And in some cases, you may have a couple of managers, directors, and business partners on the call.

An incident response bridge has been opened for a tier 1 online application that is running in a local vSphere cloud on VMs (IaaS). You have been called to lead the call.

Question #8:

Please demonstrate your telephone and oral skills by sharing an example of how you would begin the incident investigation and then move through each technology group.

Scenario (Trust/Respect):

An incident manager who is new to the team may have problems getting people to follow their lead, which is why building trust by meeting with service and application owners is essential. But even more important is building rapport with key players on all the different teams.

Question #9:

This question has 2 parts.

First, if you were selected to be our new incident manager, how would you establish strong interpersonal relationships with our technical staff and managers?

Second, provide an example of your social skills and your ability to learn complex systems, and estimate how quickly you would be able to get up to speed.

Scenario (Communication):

You’ve just spent all night on a bridge call troubleshooting an application problem that could have been resolved in 10 minutes if the right admin had joined the call. But unfortunately, they did not, and what you had to work with was the junior admin who was on-call. It’s a good thing you are technical because, in the end, you had to log in to Windows Server and fix the issue yourself.

Question #10:

This question has 2 parts.

First, how would you handle communication to the senior level staff waiting for the problem to be solved?

Second, if you found out the key person was just not answering the call to join the bridge, how would you handle the communication with the admin’s manager after the incident was resolved?

Retrospective:

I’ve given you a list of 10 questions that each target a unique attribute: Leadership, Judgement, Depth, Trust, etc. And here is my point.

Whether you are a small mom and pop shop doing the hiring, or a multi-billion dollar corporation, I understand how costly incidents are!

That is why these incident manager interview questions come from my own experiences. They are crafted to help managers quickly determine the level of expertise an applicant has.

On the other hand, if you are an aspiring incident manager, they will help you understand what’s expected so you can get the right training. Here’s a great place where you can find ON DEMAND IT Training in case you want to improve your technical skills in Linux, cloud, web services, etc.

Final thought…

This is a fact! Finding the right person to be your incident manager could save you hundreds or even millions of dollars in penalty fees for breaching your SLA. Let that sink in…


Thank you for your interest and please feel free to contact me if you need someone to screen your candidates.

Honesty Disclosure: VMinstall.com is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for us to earn fees by linking to Amazon.com and affiliated sites. Ads and prices last updated on 2018-05-24 at 12:28.


I’ve been on incident calls for every type of outage that happens in IT: storage, network, VMware, DB, apps, cloud, telecom, ISP, water leaks, fires, etc. The one thing I can say is key to handling these situations is to stay calm.

In some of my experiences, I’ve had VIPs yelling on the call and even threatening to fire people. None of that helps the situation but it happens.

The key is to stay calm and only have the people on the call that need to be there. If possible keep all extra people off the call so they are not asking questions that distract from figuring out the problem. I call these rabbit holes, and when a VP or CIO gets on the call and goes down a rabbit hole everything from there is just guessing.

So stay calm, keep only the key players on the call, and if you’re not familiar with the technology, work systematically up the OSI model: physical, data link, network, transport, session, presentation, and application. Or, if you know the app, go straight to the issue.
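
For anyone who likes to see that bottom-up walk written down, here is a minimal Python sketch of the idea. It is only an illustration, not a tool I am claiming to use on calls: the function name `first_failing_layer` and the example checks are hypothetical, and on a real bridge each "check" would be a question put to the team that owns that layer rather than a line of code.

```python
from typing import Callable, Dict, Optional

# The seven OSI layers, ordered bottom to top.
OSI_LAYERS = [
    "physical", "data link", "network", "transport",
    "session", "presentation", "application",
]


def first_failing_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Walk the layers bottom-up and return the first one whose check fails.

    Layers with no registered check are skipped. Returns None when every
    registered check passes.
    """
    for layer in OSI_LAYERS:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None


# Hypothetical results gathered on the call (True means "looks healthy").
example_checks = {
    "physical": lambda: True,    # e.g. cabling and link lights confirmed
    "data link": lambda: True,   # e.g. switch port is up
    "network": lambda: False,    # e.g. host cannot reach its gateway
    "transport": lambda: True,
}

suspect = first_failing_layer(example_checks)
print(suspect)
```

The point of the ordering is that a failure low in the stack explains symptoms higher up, so the walk stops at the lowest layer that looks unhealthy instead of chasing every reported symptom.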