Making the Service Monitoring Service Viral

Earlier in the series, we discussed the importance of enabling features in our Service Monitoring Service that make the application teams want to use the service rather than building their own “page myself” scripts. There are many features we could add, but three should be prioritized once our base service is in place.

We need an API

Even if our Service Monitoring Service is 100% available, if the application teams get bogged down in our processes for adding and removing rules, we hurt our business and we will fail as a provider.

We need to stand by two of our core principles: “monitoring rules should live with engineering” and “the monitoring service is abstracted from the rules.”

Our API must enable the application teams to interact with our service:

Read Alerts

Interact with Alerts (e.g. clear alerts, change status of alerts)

Write Alerts. This alone will fundamentally change the way the application teams interact with our service. We will continue to maintain event- and email-based inputs, but this is the feature that will “make it click” for the application teams, enabling them to completely abstract the monitoring rules and intellectual property from the monitoring platform. The rules should and will live on as part of the application code base. When the application needs to alert, it will simply call our service to invoke the action.
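To make the three capabilities concrete, here is a minimal in-memory sketch of what the service's API surface might look like. Everything here is hypothetical (the class and method names are illustrative, not a real product API); a real implementation would sit behind HTTP endpoints, but the shape of the contract is the same.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Alert:
    alert_id: int
    source: str
    priority: int
    message: str
    status: str = "open"

class MonitoringService:
    """Hypothetical stand-in for the Service Monitoring Service API."""

    def __init__(self) -> None:
        self._alerts: Dict[int, Alert] = {}
        self._next_id = 1

    # Write Alerts: the application holds the rule; when the rule
    # trips, the application calls us to invoke the action.
    def write_alert(self, source: str, priority: int, message: str) -> int:
        alert = Alert(self._next_id, source, priority, message)
        self._alerts[alert.alert_id] = alert
        self._next_id += 1
        return alert.alert_id

    # Read Alerts, optionally filtered by source.
    def read_alerts(self, source: Optional[str] = None) -> List[Alert]:
        return [a for a in self._alerts.values()
                if source is None or a.source == source]

    # Interact with Alerts (e.g. clear an alert, change its status).
    def set_status(self, alert_id: int, status: str) -> None:
        self._alerts[alert_id].status = status

svc = MonitoringService()
aid = svc.write_alert("billing-app", priority=2, message="queue depth > 10k")
svc.set_status(aid, "cleared")
```

Note what is absent: the service never sees the rule that decided a queue depth over 10k is alert-worthy. That logic stays in the billing application's code base.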

We need a Correlation Layer

We all know why correlation is so important... fewer alerts reduces costs, improves employee morale, and, most importantly, helps ensure that we do not lose the important alerts in the noise.

Application teams need a way to say “if X, Y, and Z all fire within A minutes of one another, combine them into one alert.”
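As a sketch of how such a correlation rule might be expressed (the function name and event shape are assumptions for illustration), the check reduces to: have all required alerts fired, and does the spread of their timestamps fit inside the window?

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

def correlate(events: List[Tuple[str, datetime]],
              required: List[str],
              window: timedelta) -> Optional[str]:
    """If every alert in `required` fired within `window` of one
    another, return a single combined alert; otherwise return None."""
    latest = {name: ts for name, ts in events}  # most recent firing per alert
    if not all(name in latest for name in required):
        return None
    stamps = [latest[name] for name in required]
    if max(stamps) - min(stamps) <= window:
        return "combined: " + " + ".join(required)
    return None

now = datetime(2024, 1, 1, 12, 0)
events = [("X", now),
          ("Y", now + timedelta(minutes=2)),
          ("Z", now + timedelta(minutes=4))]
combined = correlate(events, ["X", "Y", "Z"], timedelta(minutes=5))
```

Here X, Y, and Z all fired within a 5-minute window, so they collapse into one combined alert instead of paging three times.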

We need a Net-New-Alert-Layer

Similar to the Correlation Layer, but slightly different, our application teams need to be able to say “if these three Priority 2 alerts fire, go ahead and ticket them as normal AND IF they all fire within X minutes of one another, ALSO fire a net-new-alert as a Priority 0 alert.” This will enable the application team to codify scenarios that have historically required human oversight and human recognition.
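The difference from correlation is worth spelling out in code: nothing is suppressed or combined; the original alerts are ticketed exactly as before, and an additional, higher-priority alert is emitted on top. A minimal sketch, with hypothetical names throughout:

```python
from datetime import datetime, timedelta
from typing import List, Tuple

def apply_net_new_rule(fired: List[Tuple[str, int, datetime]],
                       members: List[str],
                       window: timedelta) -> List[Tuple[str, int]]:
    """Ticket every fired alert as normal; if all `members` fired
    within `window` of one another, ALSO emit a net-new Priority 0."""
    actions = [(name, prio) for name, prio, _ in fired]  # ticket as normal
    times = {name: ts for name, _, ts in fired if name in members}
    if all(m in times for m in members):
        stamps = [times[m] for m in members]
        if max(stamps) - min(stamps) <= window:
            actions.append(("net-new: " + "+".join(members), 0))
    return actions

now = datetime(2024, 1, 1, 12, 0)
fired = [("disk-full", 2, now),
         ("queue-backlog", 2, now + timedelta(minutes=3)),
         ("latency-spike", 2, now + timedelta(minutes=7))]
actions = apply_net_new_rule(
    fired, ["disk-full", "queue-backlog", "latency-spike"],
    timedelta(minutes=10))
```

In this example the three Priority 2 alerts each get their normal ticket, and because all three landed within 10 minutes, a fourth, Priority 0 alert fires as well: the codified version of the on-call engineer who used to notice "when these three happen together, it's the big one."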

We need to update our API to allow the application teams to add, change, and remove correlation and net-new-alert rules.
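The rule-management surface can stay small. One possible shape, again purely illustrative (a real service would persist rules and authenticate callers), is a plain CRUD store that the application teams drive themselves, with no workflow on our side:

```python
from typing import Dict, List

class RuleStore:
    """Hypothetical self-service store for correlation and
    net-new-alert rules, keyed by a team-chosen rule name."""

    def __init__(self) -> None:
        self._rules: Dict[str, dict] = {}

    def add_rule(self, name: str, rule: dict) -> None:
        self._rules[name] = rule

    def change_rule(self, name: str, rule: dict) -> None:
        if name not in self._rules:
            raise KeyError(name)
        self._rules[name] = rule

    def remove_rule(self, name: str) -> None:
        del self._rules[name]

    def list_rules(self) -> List[str]:
        return sorted(self._rules)

store = RuleStore()
store.add_rule("checkout-storm", {
    "type": "net-new-alert",
    "members": ["disk-full", "queue-backlog", "latency-spike"],
    "window_minutes": 10,
    "new_priority": 0,
})
```

The rule content is an opaque document to us; we store and evaluate it, but its meaning belongs to the team that wrote it, which keeps us on the right side of the rules-versus-platform line discussed below.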

From a roadmap perspective, iterative delivery is preferred: get the base Service Monitoring Service in place, add API Read, add API Interact and API Write, add Correlation and Net-New-Alert, and finally update the API to account for Correlation and Net-New-Alert.

Avoiding Landmines

As with any service, we have unlimited landmines to avoid. Luckily, there are a few key ones that we can focus on to give us the best opportunity to succeed.

Simplicity is key

We need to lead the paradigm shift that “simple is beautiful.” The simpler our service, its processes, and its inputs and outputs are, the more successful we will be.

Self-Service is a must

If the application teams have to go through a workflow with us to onboard rules, to clear alerts, to write alerts, etc., we will fail. For example, if an application team has to fill out a web request form, wait 48 hours for a response from us, and then wait for human work to be booked on our part to implement their request, we will have succeeded only in adding needless overhead and in driving the team to implement a faster solution on their own.

A simple fact is that the application teams will be agile with or without us. As we covered at the beginning of this blog series, there are great business reasons that we want them to be successful with us. So we must be agile and we must remove unnecessary steps.

NOTE: I am not implying that there should be no change management or change control for rule updates. What I am stating is that we should not have a cumbersome workflow to interact with our service, and we definitely should not stand behind a mantra of control to try and justify a cumbersome monitoring service. We should make our service simple to interact with, and we should make sure the right processes, tollgates, and controls are in place for our business. It is an “and” rather than an “or”.

Avoid the urge to get back into the rules business

As “monitoring people”, we have always been accountable for the rules. But the reality is that the people who know the apps the best should build the rules.

We need to clearly draw the line between rules and platform, and we need to enforce that line.

NOTE: we can and should consult with the application teams on best practices for eventing and alerting, but we must resist the urge to do it for them.

Beware of dependencies

For example, if your monitoring service sends emails, do not rely on an email service that you are monitoring with your service. Why? Because if email is down, the alert emails will not be sent.

Looking ahead

In future posts in this series, we will explore how to enable the application development teams to succeed with our Service Monitoring Service. And finally, we will discuss how even perfect monitoring can still end in failure. Happy thinking!

More blog posts in the Building Service Monitoring as a Service with an Eye on the Cloud series