Mapping Plumbr to Monitoring Terminology

There are four “cornerstones” upon which the whole galaxy of monitoring services is built: Availability, Latency, Throughput, and Capacity. Whether our applications are monoliths or microservices, these parameters help define how well they are performing. Variations on these exact measures exist (details follow), but the theme remains largely the same.

It is important to note that evolving cloud-based systems and architectures have made monitoring more complex. These metrics aren’t as trivial as measuring CPU usage, free RAM, inodes, etc. Microservices, containers, FaaS, and other modern infrastructure choices have altered the landscape, and it may no longer be easy to find the right things to measure. Measurement is simple in the case of web servers or app servers. It gets complex as we add databases, caching mechanisms, load balancers, and the other subsystems that any web application needs to run in production at a reasonable scale. The next layer of complexity arises from separating the signal from the noise.

Defining Availability

The accepted definition of application availability is the fraction of time during which an application is able to meet its operational requirements. Any period during which a user is unable to use the application counts as time when it is unavailable. Uptime and its converse, downtime, are typically used to express availability or the lack of it. Availability is usually expressed as a percentage, often in “nines” notation such as the famous five nines (99.999%). The ideal case is 100%, which is not realizable in practice. Each additional nine means an order-of-magnitude increase in reliability. While 99.999% allows for roughly 5 minutes of downtime per year, adding just one more nine brings that down to about 30 seconds.
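The arithmetic behind the nines can be checked with a few lines of code. The following is a minimal sketch, not part of any monitoring product; the function names and the choice of a 365.25-day year are assumptions made for illustration.

```python
# Assumption: an average year of 365.25 days (leap years included).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_budget_seconds(nines: int) -> float:
    """Yearly downtime, in seconds, allowed by an availability of `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines  # five nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


def availability_percent(uptime_seconds: float, total_seconds: float) -> float:
    """Availability as the percentage of time the application was usable."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return 100.0 * uptime_seconds / total_seconds


if __name__ == "__main__":
    for nines in range(2, 7):
        budget = downtime_budget_seconds(nines)
        print(f"{nines} nines: {budget:.1f} seconds of downtime per year")
```

Under these assumptions, five nines works out to about 5.3 minutes of downtime per year and six nines to about 32 seconds, in line with the rounded figures quoted above.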

Defining Latency

Latency is primarily a phenomenon of the underlying network, but the concept applies to applications as well. For web applications, latency refers to the time it takes to load a web page. A page may consist of numerous assets, and a user may have to wait for all of them to load and render before the page becomes usable. The time that a user is forced to remain idle is the measure of latency in web applications. In trading systems, it is vital to measure the time between the moment a trading decision is made and the moment the order is executed. The goal is to keep this metric as low as possible.
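At its simplest, measuring latency means recording a timestamp before an operation starts and another after it completes. The sketch below is illustrative only: `measure_latency` and `place_order` are hypothetical names, and the order placement is simulated with a short sleep rather than a real trading call.

```python
import time


def measure_latency(operation):
    """Run `operation` and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    result = operation()
    elapsed = time.perf_counter() - start
    return result, elapsed


def place_order():
    """Hypothetical stand-in for the work between decision and execution."""
    time.sleep(0.01)  # simulated delay, not a real order
    return "executed"


if __name__ == "__main__":
    status, seconds = measure_latency(place_order)
    print(f"order {status} after {seconds * 1000:.1f} ms")
```

A monotonic clock such as `time.perf_counter` is used here because wall-clock time can jump when the system clock is adjusted, which would distort short intervals.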

Defining Throughput

Throughput is a metric that originated at a time when software systems were typically used for batch processing. It measures the volume of output a system produces for a given volume of input. In interactive systems, such as typical web applications, throughput is more of a scaling factor than something that is optimized on its own. Measuring a system’s behaviour under different levels of load can give good insight into its architecture, reliability, and performance.
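One simple way to observe throughput over time is to count completed requests per fixed time window. The sketch below assumes that request completion times are available as numeric timestamps in seconds; the function name and the 60-second default window are illustrative choices, not a description of how any particular tool does it.

```python
from collections import Counter


def throughput_per_window(timestamps, window_seconds=60.0):
    """Count completed requests in each fixed-size time window.

    Returns a dict mapping each window's start time to the number of
    requests that completed inside that window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts = Counter()
    for ts in timestamps:
        # Floor each timestamp to the start of the window it falls into.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    completions = [0, 10, 59, 60, 125]  # made-up completion times in seconds
    for start, count in throughput_per_window(completions).items():
        print(f"window starting at {start:.0f}s: {count} requests")
```

Comparing these counts across periods of different load is one way to see where a system stops scaling.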

Defining Capacity

Keeping tabs on the resources that are available to servers lies at the core of capacity metrics. Measures of CPU, disk space, memory footprints, and other system level parameters constitute capacity. If capacity is insufficient, other metrics such as availability or latency may suffer. Capacity also has its origins in older resource utilization paradigms. It is important to be able to forecast patterns of usage for the various systems and subsystems. Even in the age of cloud systems, keeping the compute bill transparent and predictable relies on capacity planning and resource usage tracking.

There are some popular variants for these common measures that are in practice today.

1. The Four Golden Signals

Google, in its popular SRE book, defines latency, traffic, errors, and saturation as good metrics for all user-facing web applications.

Plumbr is a real-user monitoring system. While this means that we also monitor the health of web applications, we put the focus on the user's end rather than on the server side of things. Here’s how we stack up against the four primary paradigms of monitoring:

1. Availability – Plumbr captures all types of errors in applications. In addition to capturing an error, Plumbr attaches all the context information it has collected. A consolidated list of errors gives visibility into all the times an application failed to fulfill the needs of a user. Importantly, the impact of the errors is measured in terms of actual affected users. This allows for rapid setting of the right priority to fix detected errors.

2. Latency – Plumbr monitors applications and identifies slow performance. Whether these are interactions that took longer than expected to complete, or server calls whose responses were delayed, Plumbr records every occurrence. The most common bottlenecks, be it a browser SSL handshake delay or server-side lock contention, are automatically recognized and exposed, along with the impact of each bottleneck.

3. Throughput – Plumbr exposes throughput as the number of users of a frontend service or the number of API calls received at the server side. These are captured over time, as are the slow or failing requests. Degradations in throughput are typically noticeable as an increase in slow user interactions or API calls.

4. Capacity – Plumbr integrates well with other systems that specialize in monitoring the capacity of various system resources. Additionally, Plumbr provides memory content insights for Java applications that struggle with memory leaks, Garbage Collection overhead, or suffer OutOfMemoryErrors.
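The idea of prioritizing errors by the number of users they actually affect, mentioned under Availability above, can be sketched as follows. This is an illustration only and not Plumbr's API: the event format (pairs of error name and user id), the function name, and the sample data are all assumptions.

```python
from collections import defaultdict


def rank_errors_by_affected_users(events):
    """Rank errors by the number of distinct users who experienced them.

    `events` is an iterable of (error_name, user_id) pairs, a hypothetical
    format chosen for this sketch.
    """
    users_by_error = defaultdict(set)
    for error_name, user_id in events:
        users_by_error[error_name].add(user_id)
    impact = [(name, len(users)) for name, users in users_by_error.items()]
    # Most affected users first; ties are broken alphabetically by error name.
    return sorted(impact, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample_events = [  # made-up data
        ("TimeoutError", "u1"),
        ("TimeoutError", "u1"),
        ("TimeoutError", "u1"),
        ("NullPointerException", "u1"),
        ("NullPointerException", "u2"),
    ]
    for name, affected in rank_errors_by_affected_users(sample_events):
        print(f"{name}: {affected} affected user(s)")
```

In the sample data the timeout occurs more often, but it affects a single user, so it ranks below the error that reached two users. Counting distinct users rather than raw occurrences is what makes the ranking reflect user impact.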

Whether you monitor using these typical metrics as signals or devise your own ways and means of building monitoring systems is immaterial. To make software faster and more reliable for users, monitoring is a necessary part of every engineer’s toolchain. Using real-user monitoring shows a higher degree of maturity in the tools chosen.