As our application teams move to a cloud and services world, where they are no longer bound by physical capacity and its procurement, they will face new opportunities and decisions. One such decision is whether to consume our Service Monitoring Service, build their own monitoring capabilities, or consume a Service Monitoring Service from elsewhere. To make sure that our service lands well in support of our business, we must build the application teams’ trust, deliver capabilities that make the application teams want our service, and avoid the landmines (events that damage trust) that would push the application teams toward going their own way even when doing so is counterproductive to our business goals. Of course, if the right business decision is for the application teams to go elsewhere, we should help them move with all expediency, because our top goals must always be focused on business outcomes.

I am stating the obvious here, but to build trust in our Service Monitoring Service, we must run it as... a Service. That means that we do all of the things an application team would/should do for a business service.

Of course, we will need a Service Description. This should include the normal ITIL content, and in particular, it must clearly and simply define the following:

How to onboard a new application/service onto our Service Monitoring Service. We will want to make sure the steps are clear and simple. The worst thing we could do, if we expect the application teams to use our service, is have a complex (or undefined), lengthy, red-tape-laden onboarding process.

How to onboard a new monitor onto our Service Monitoring Service. The development teams will be moving very quickly, and we want to enable that agility, not block it. In a future post, I will cover this in more detail and will include some suggestions that can enable the development teams to onboard new rules without any effort or change by the Service Monitoring Service team.

How to update business rules within our Service Monitoring Service. The application teams need a simple way to edit them with minimal complexity, minimal lead time, and minimal red tape.

We will also want to assign a Service Manager (or Service Owner) for our Service Monitoring Service; someone who is ultimately accountable and will stand in front of our customers and our executives and attest to its quality, or lack thereof. Along with this accountability comes implied authority. This person is empowered to work with other Service Owners from underpinning areas (like Network and Telephony) to drive improvement requirements into their areas as well and is expected to break down any walls that stand in the way of us meeting our Service Monitoring Service objectives.

Of course, we will need Key Performance Indicators (KPI) for our service, and we will need to be very open with our achievements as well as our failures. Before we begin the discussion about our KPI, let us look at a simple timeline diagram:

We all know that KPI vary, yet there are some high priority opportunities including the following:

But what should our targets for the above KPI be? What is acceptable latency for end-to-end time? Is 5 minutes acceptable to our customers? 2 minutes? It is likely that we will eventually land on an end-to-end target of 2 minutes with an Input-to-Alert target of 1 minute and Alert-to-Output target of 1 minute. Even then, if the application team has an availability target of 99.99%, which allows only about 52.6 minutes of downtime over a year, a single 2 minute delay consumes roughly 4% of that allowance, which is a considerable amount of time.
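To make the arithmetic behind that comparison concrete, here is a minimal sketch. It assumes the application team's availability target is measured over a fixed window; the post does not state the window, so both the 365-day year and the 30-day month below are assumptions, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes, ignoring leap years
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes, assuming a 30-day month


def downtime_budget_minutes(availability_target, period_minutes):
    """Minutes of downtime allowed by an availability target over a period."""
    return (1.0 - availability_target) * period_minutes


def budget_share(latency_minutes, availability_target, period_minutes):
    """Fraction of the downtime budget consumed by one monitoring delay."""
    budget = downtime_budget_minutes(availability_target, period_minutes)
    return latency_minutes / budget


# A 99.99% target and the 2 minute end-to-end monitoring target from the text.
yearly_budget = downtime_budget_minutes(0.9999, MINUTES_PER_YEAR)
yearly_share = budget_share(2.0, 0.9999, MINUTES_PER_YEAR)
monthly_budget = downtime_budget_minutes(0.9999, MINUTES_PER_MONTH)
monthly_share = budget_share(2.0, 0.9999, MINUTES_PER_MONTH)

print(f"Yearly budget: {yearly_budget:.2f} min, one delay uses {yearly_share:.1%}")
print(f"Monthly budget: {monthly_budget:.2f} min, one delay uses {monthly_share:.1%}")
```

The share grows quickly as the measurement window shrinks, which is one reason to agree on the window with the application teams before setting the latency target.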

H. Availability.
a. The fun question is always “how do we measure availability in a way that quantifies the user experience?” For our Service Monitoring Service, the question remains the same. If we step back and think about what the service does, the answer becomes quite simple: it takes inputs (e.g. events) and turns them into outputs (e.g. tickets and phone calls). So the way to measure availability for our service is to calculate the “% of inputs that made it to the desired output within the target timeframe”.

So for example,
i. For all inputs that were to be ticketed (X), how many were ticketed within the allotted time (Y)? (Y) divided by (X) = % availability for ticket output
ii. For all inputs that were to be paged (M), how many were paged within the allotted time (N)? (N) divided by (M) = % availability for page output
iii. Etc.

b. Our aspirational target should be 100% for each output. Reality may dictate that we also set interim targets to drive improvements over time, but that does not change the fact that our aspirational target should be 100% and we should manage aggressively towards this. Our customers (the application teams) will rightly have zero tolerance for failure, as their availability has a firm dependency on our availability. For example, if they do not know that they are impacted because they did not get paged for their alerts that fired, they have no opportunity to meet their targets.
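The per-output availability calculation described in item a can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `InputRecord` structure, the output labels, and the sample timestamps are all hypothetical, and the 2 minute target is borrowed from the end-to-end target discussed earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class InputRecord:
    """One input (e.g. an event) that was supposed to produce an output."""
    output_type: str                   # hypothetical labels such as "ticket" or "page"
    received_at: datetime              # when the input arrived
    delivered_at: Optional[datetime]   # when the output was produced; None if never


def availability_by_output(records, target):
    """Return {output_type: fraction of inputs delivered within target}."""
    totals = {}
    on_time = {}
    for rec in records:
        totals[rec.output_type] = totals.get(rec.output_type, 0) + 1
        delivered = rec.delivered_at is not None
        if delivered and rec.delivered_at - rec.received_at <= target:
            on_time[rec.output_type] = on_time.get(rec.output_type, 0) + 1
    return {kind: on_time.get(kind, 0) / totals[kind] for kind in totals}


# Made-up sample data: four inputs to be ticketed, two to be paged.
t0 = datetime(2024, 1, 1, 12, 0, 0)
records = [
    InputRecord("ticket", t0, t0 + timedelta(seconds=30)),
    InputRecord("ticket", t0, t0 + timedelta(seconds=90)),
    InputRecord("ticket", t0, t0 + timedelta(seconds=150)),  # late
    InputRecord("ticket", t0, t0 + timedelta(seconds=60)),
    InputRecord("page", t0, t0 + timedelta(seconds=45)),
    InputRecord("page", t0, None),                           # never paged
]

availability = availability_by_output(records, target=timedelta(minutes=2))
for kind, value in availability.items():
    print(f"{kind}: {value:.1%}")
```

An input that never reaches its output counts against availability in the same way as one that arrives late, which matches the "within the target timeframe" wording of the definition.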

I. Reliability.
This is a simple opportunity. For every output, how many inputs did not succeed within the target time, i.e. (X minus Y) for tickets and (M minus N) for pages?
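As a small sketch of that count, assuming the totals (X, M) and the on-time counts (Y, N) have already been collected per output type, the misses are a simple subtraction. The function name and the numbers below are illustrative only.

```python
def missed_outputs(attempted, on_time):
    """Return {output_type: inputs that missed the target time}."""
    return {kind: attempted[kind] - on_time.get(kind, 0) for kind in attempted}


attempted = {"ticket": 200, "page": 50}   # X and M (made-up figures)
on_time = {"ticket": 197, "page": 49}     # Y and N (made-up figures)

misses = missed_outputs(attempted, on_time)
for kind, count in misses.items():
    print(f"{kind}: {count} missed")
```

Reporting the raw count alongside the percentage keeps individual failures visible even when the availability figure looks high.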

This level of measurement is neither purely simple nor overly difficult. And in terms of building trust, when we measure and report this level of data to our customers (our application teams), they will be impressed with the focus on service quality that we have. To take it a step further, we should treat every instance where we do not hit our target as an Incident and then as a Problem for which we share the root cause, improvement actions, and improvement results with our customers.

Now, let’s take a step back for a moment. How many of us are running our monitoring platform this way today? How many of us measure outcome-based availability this way? Do we even know how many minutes of our customers’ availability targets we are consuming with the simple act of turning events into outputs? The likely answer is that not many of us are. And is it any wonder that our application teams are considering alternatives?

The quantum leap forward that we all need to make should be getting clearer. We can do it! And when we do, trust will follow.

More blog posts in the Building Service Monitoring as a Service with an Eye on the Cloud series
