Pre-release tests are essential, but the ability to debug, monitor and observe your application suite post-release is what allows you to detect, and quickly fix, the production problems that will inevitably arise.

Introduction

Much has been written about how to ensure quality in the software we write and deploy to production: unit tests, integration tests, Pact and consumer-driven contracts, manual and exploratory testing done by QA teams. The pre-production phase of testing is something I’ve focussed on a lot too. I’ve given talks on testing and quality and blogged on Test Doubles, Testing exceptions and various open-source testing libraries including AssertJ, FEST, Hamcrest and EasyMock.

This post however, is essentially about testing and debugging in production.

Disclaimer

This is not an original work. I have heavily used several sources, particularly:

And other sources listed in the references section. Those articles were the inspiration for this post, and in some cases I may have simply copied and pasted parts (though I have tried not to). In the worst cases I have bastardized the sources into points the original authors likely never intended. I hope they can forgive me. They say imitation is the sincerest form of flattery 🙂 I’ve found the best way for me personally to understand something is to copy, paste, slice, dice and rehash…

Embracing failure

As we use increasingly complex tech stacks, architectures and cloud deployments, the number of things that can go wrong also increases. We live in an era when failure is the norm: a case of when, not if. And things will break in ways you haven’t imagined.

Failures are easiest to deal with (detect, diagnose and recover from) when they are explicit and fatal. A fatal error stops the system. No more insanity or unpredictable behavior. Maybe you’ll get a core dump to assist with debugging. As The Pragmatic Programmer put it: “Dead programs tell no lies! Crash, Don’t Trash.”

Non-fatal and implicit errors are much harder to deal with, however.
Non-fatal errors, where a system continues despite a failure, may cause cascading problems and data corruption. Implicit errors, where a system continues to operate “correctly” but not “well” (e.g. slowly), can be especially difficult to debug and root-cause.

A small error in one service cascades and causes catastrophic failures in another. Butterflies cause hurricanes. (See The Hurricane’s Butterfly)

Whatever the error type (implicit, explicit, predicted, or a not-so-pleasant surprise), we need to embrace the failures by designing services to behave gracefully when failure inevitably happens.

Avoid failures when you can, gracefully degrade if you can’t, tolerate the unavoidable, and overall try to build systems to be debuggable (or “observable”) in production for when all hell breaks loose.

Avoiding failures can be as simple as retrying. Graceful degradation can include techniques such as timeouts, circuit breaking (e.g. Hystrix), rate limiting and load shedding. Tolerating failure can include mechanisms such as region failover, eventual consistency (it’s OK if we can’t do it now, we’ll do it eventually) and multi-tiered caching (e.g. if you’re relying on a data store and it’s down, can you write to a cache as an interim alternative?). (See more at Monitoring in the time of Cloud Native)
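As a minimal sketch of the simplest of these, avoiding failure by retrying, here is a retry helper with exponential backoff (the function names, attempt counts and delays are illustrative, not taken from any particular library):

```python
import time

def retry(operation, attempts=3, base_delay=0.1):
    """Retry a failing operation with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the failure surface
            time.sleep(base_delay * (2 ** attempt))

calls = {"count": 0}

def flaky():
    # Fails twice, then succeeds -- simulating a transient error.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry(flaky))  # prints "ok" after two retries
```

A real implementation would also add jitter and cap the total delay, so that many clients retrying in lockstep don’t hammer an already struggling service.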

Quantify the wellness of your app

All of this tolerance however comes at the cost of increased overall complexity, and the corresponding difficulty of reasoning about the system. It also limits how fast new products & features can be developed/released and increases dev costs.

So what should we aim for? Judiciously and proactively decide on trade-offs: the risk the business is willing to accept versus the expense of building ever more available, and concomitantly complex, services. Make a service reliable enough, but no more.

So, how do you define what your goals are? Well, a good place to start is by recognizing and acknowledging where you are! What is your current level of performance? There is little point aiming for “five nines” when you are currently down for hours every day.

And a good place to start to measure your current state, or the wellness of your system, is to use KPIs.

KPIs

KPIs, or Key Performance Indicators, are basically important metrics about your system.

Some commonly used KPIs are:

Number of Users

Requests Per Second

Response Time

Latency

Also, and what we will focus on here,

Number of errors

Mean Time to Detect and Restore (MTTD/MTTR)

Application Performance Index (Apdex)

Errors

Error counts (along with code coverage) are among the most common metrics used to monitor software quality. And I get that errors are important and should be dealt with. “No broken windows” (to reference The Pragmatic Programmer, again). But error counts may not be the best, and are certainly not the only, metrics for examining the health of your system.

For example, if an error occurs in your infrastructure but there is no user impact, do you care?
There are other metrics beyond errors to be concerned about, ones I would argue are more important.

MTTD/MTTR

MTTD: Mean Time To Detect

MTTR: Mean Time To Restore

How do you record MTTD and MTTR? If you have automated monitoring tools that can detect the downtime and subsequent service restoration, great, but manually recording is also an option. Either way, documenting the times is a key part of an RCA (Root Cause Analysis).

Calculating involves recording three key event times:

Problem start time (start)

Problem detection time (detect)

Problem resolution time (resolve)

And then calculating as follows:

MTTD = detect – start

MTTR = resolve – start

(Note I have yet to find a good, definitive definition of how to calculate TTR for an incident; I’m sure there are places that calculate it as “resolve – detect”, rather than the “resolve – start” I am recommending above. If you know of a good article or book discussing this, I would love to hear.)
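The single-incident arithmetic is trivial; the “mean” comes from averaging TTD and TTR across incidents. A sketch using the formulas above, with hypothetical timestamps:

```python
from datetime import datetime

# Hypothetical incident timeline (the times are illustrative).
start = datetime(2020, 1, 15, 14, 0)     # problem begins
detect = datetime(2020, 1, 15, 14, 25)   # alert fires / someone notices
resolve = datetime(2020, 1, 15, 15, 10)  # service restored

ttd = detect - start    # time to detect
ttr = resolve - start   # time to restore, measured from start as recommended above

print(f"TTD: {ttd}, TTR: {ttr}")  # TTD: 0:25:00, TTR: 1:10:00
```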

A final point on MTTR is that it really helps to have CI/CD in place!

MTTF

Another related metric is MTTF (Mean time to failure). How much time passes between failures?

Given the choice between tracking MTTF and MTTR, you should track MTTR. Why is MTTR more valuable?

1) Perception.
Imagine two scenarios:
* Your site is down for 24 hours
* Your site is down for less than a second several times every day
People may never notice the latter, or at least not be inconvenienced by it.
Everyone, including your customers, your CEO, and perhaps your stock price, will notice the former.

2) Wrong Incentives.
The best way to keep a system stable is to never change it! Tracking time between failures may incentivize teams to release less often. But releasing less often is the antithesis of the DevOps movement.

Apdex

An industry standard to measure users’ satisfaction with the response time of web applications and services. It’s a simplified Service Level Agreement (SLA) solution that gives application owners better insight into how satisfied users are.

Apdex is a measure of response time against a set threshold. It measures the ratio of satisfactory response times to unsatisfactory response times, where response time is measured from an asset request to completed delivery back to the requestor.

The application owner defines a response time threshold T. All responses handled in T or less time satisfy the user. You can define Apdex T values for each of your New Relic apps, with separate values for APM apps and Browser apps. You can also define individual Apdex T thresholds for key transactions.

For example, if T is 1.2 seconds and a response completes in 0.5 seconds, then the user is satisfied. All responses greater than 1.2 seconds dissatisfy the user. Responses greater than 4.8 seconds frustrate the user.
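Putting those thresholds together, the standard Apdex formula counts a “tolerating” response (between T and 4T) as half a satisfied one: Apdex_T = (satisfied + tolerating/2) / total. A small sketch:

```python
def apdex(response_times, t):
    """Apdex_T = (satisfied + tolerating/2) / total samples.

    Satisfied: response time <= T.  Tolerating: T < response time <= 4T.
    Anything slower is frustrated and scores zero.
    """
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# With T = 1.2s: two satisfied, one tolerating, one frustrated.
samples = [0.5, 1.0, 2.0, 6.0]
print(apdex(samples, t=1.2))  # 0.625
```

An Apdex of 1 means every user was satisfied; 0 means every user was frustrated.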

SLAs

OK, so we have discussed KPIs. SLAs, or Service Level Agreements, are agreements between a service provider and a client on the acceptable values of those measurements. For example: we will be 99.9% available on average in a given month, or we will refund some portion of your subscription fee. See my post on AWS S3 SLAs for example.

Since SLAs typically deal with some form of compensation if the agreement is not met by the service provider, I don’t think they are particularly suited for use with internal microservices. KPIs may be a better fit there.

One thing to bear in mind about SLAs is that if you depend on several components, your availability is the product of, and hence always less than, your dependencies’ SLAs.
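A quick sketch of how this compounds (the SLA figures are hypothetical, and the product assumes the dependencies fail independently):

```python
# Availability is bounded by the product of dependency availabilities.
dependencies = [0.999, 0.999, 0.995]  # hypothetical dependency SLAs

availability = 1.0
for a in dependencies:
    availability *= a

# Three "pretty good" dependencies already drag you below any one of them.
print(f"{availability:.4%}")  # 99.3011%
```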

What if your KPIs are in fact indicating non performance? Your Apdex is much closer to 0 than 1? Things are slow, erratic, or plain old failing? To use the analogy from earlier, given a hurricane, how do you find the butterfly?

Monitoring and Observability

Monitoring

I think it is safe to say that monitoring (and the logging and alerting it entails) is very mainstream.

Some definitions of monitoring may include:

Observe and check the progress or quality of something over a period of time;

Keep under systematic review

Maintain regular surveillance over

For our services, we often “monitor” something to:

confirm it is acting in an expected way i.e., report the overall health of systems

to check if it is failing in a specific manner, e.g. Splunk alerts for a specific error message

This ties in with the “explicit” errors we were discussing earlier. You log specific errors, and alert on those specific errors. But monitoring for expected success and expected failures only gets you so far.
What about the unexpected? Maybe you can just create a Splunk query for:

“host=*myservice* error”

But this approach doesn’t work so well when failures become less well defined, more nebulous. When failures become more implicit. Step in Observability…

Observability

Observability is about being able to understand how a system is behaving in production. An observable system makes enough data available about itself that you can generate information to answer questions that you had never even thought of when the system was built.

Some definitions of observability that I’ve seen include:

Provide highly granular insights into the behavior of systems along with rich context,

Provide visibility into implicit failure modes

Provide on the fly generation of information required for debugging

Another way to think of observability is monitoring (as described above), plus the ability to debug, understand and analyze dependencies (source: Cindy Sridharan’s Monitoring in the time of Cloud Native)

Debugging: debugging unexpected, rare and/or implicit failure modes

Understanding: using the data to understand our system as it exists today, even during normal, steady state. e.g. How many requests do we receive per day and what is the typical response time?

Dependency analysis: understanding service dependencies. Is my service being impacted by another service, or worse, contributing to the poor performance of another service?

The Three Pillars of Observability

The three pillars of observability are

Logs

Metrics

Tracing

Logs

Start with logging. “Some event happened here!” We log key events as an easy way to track an app’s activity. Logs are a basic and ubiquitous aspect of software, so I am not going to cover them much here.

Metrics

How many of these events happened? Metrics are simply the measurement of something: data aggregated (“statistical aggregates”) from measured events, which can be used to identify trends.

You can record metrics for almost anything you are interested in, but the four golden signals (à la Google SRE-book) are : latency, saturation, traffic, and errors.
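As a toy sketch of what such aggregation means in practice (a real system would use a metrics library such as a Prometheus or StatsD client, not hand-rolled counters), here is an in-process aggregator covering two of the golden signals, traffic and errors, plus latency:

```python
from collections import defaultdict

class Metrics:
    """Toy aggregator: traffic (request count), errors, and latency samples."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies = []

    def record_request(self, latency_ms, error=False):
        self.counters["requests"] += 1
        if error:
            self.counters["errors"] += 1
        self.latencies.append(latency_ms)

    def snapshot(self):
        # A monitoring system would scrape or ship this periodically.
        n = len(self.latencies)
        return {
            "requests": self.counters["requests"],
            "errors": self.counters["errors"],
            "avg_latency_ms": sum(self.latencies) / n if n else 0.0,
        }

m = Metrics()
m.record_request(12.0)
m.record_request(48.0, error=True)
print(m.snapshot())  # {'requests': 2, 'errors': 1, 'avg_latency_ms': 30.0}
```

(In practice you would track latency percentiles rather than the average, since averages hide exactly the slow outliers you care about.)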

Traces

Tracing is about tracking a request through a system. For example, from the web server, through various microservices, to your database, and back. As well as tracking the request path, however, a trace can also provide visibility into the request structure. By that I mean you can track the bifurcation/forking and asynchronicity inherent in multi-threaded environments.

Tracing is often supported through the use of Correlation IDs (also known as Transit IDs): random identifiers (e.g. GUIDs) generated at the entry point to (or as early as possible in) a distributed system, and passed through each layer as a way of linking the multiple calls that constitute the lifecycle of a request. Correlation IDs are an integral part of microservices, as Sam Newman, author of the excellent Building Microservices, has emphasized.
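As a sketch of the propagation idea (the X-Correlation-ID header name is a common convention, not a standard):

```python
import uuid

def handle_request(headers):
    """Reuse an incoming correlation ID, or mint a fresh one at the edge."""
    correlation_id = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    # Pass the ID on every downstream call (and include it in log lines),
    # so all work done on behalf of this request can be linked back together.
    downstream_headers = {"X-Correlation-ID": correlation_id}
    return downstream_headers

# A request arriving at the edge gets a fresh ID...
edge = handle_request({})
# ...and every internal hop propagates the same one.
hop = handle_request(edge)
assert hop["X-Correlation-ID"] == edge["X-Correlation-ID"]
```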

Finally, I like this diagram from Peter Bourgon that shows the relationship of the 3 components of Observability:

Monitoring vs Observability

You can think of monitoring as being more proactive. You write code that logs specific messages and errors, and create alerts around those messages and errors. At service deployment time for example, it is not unusual to keep an eye on the logs for the messages that reflect the system is acting as expected, or for the “known” errors.

Observability is more reactive. You “observe” the system in production, trying to debug and understand it, particularly when things go wrong.

Observability is about the unknown unknowns. A system is observable when you can ask any arbitrary question about it and dive deep, explore, follow breadcrumbs.

And while monitoring and observability have many things in common, the nuanced terminology can be very useful for distinguishing use cases and maturity. Or not.

I personally think the terminology is useful, but that mastering the tools needed is more important. It doesn’t matter what you call it, but when you are dealing with a production outage, being able to quickly detect, diagnose and resolve is critical.