An Introduction to Infrastructure and Application Monitoring

by Justin Ellingwood

In this series, we explore what metrics and monitoring are and how to best use them to gain visibility into your systems and the responsiveness of your team. Collecting metrics from your infrastructure gives you insight into the health and performance of your systems. Metrics can be used to create dashboards to troubleshoot issues or give a summary of the state of your applications and resources. Alerts can be defined to notify you as soon as situations require your attention.

Understanding the state of your infrastructure and systems is essential for ensuring the reliability and stability of your services. In this guide, we will discuss what metrics, monitoring, and alerting are. We will talk about why they are important, what types of opportunities they provide, and the type of data you may wish to track. We will be introducing some key terminology along the way and will end with a short glossary of some other terms you might come across while exploring this space.

Metrics are the primary material processed by monitoring systems to build a cohesive view of the systems being tracked. In this guide, we will start by discussing a popular framework used to identify the most critical metrics to track. Afterwards, we will walk through how those indicators can be applied to components throughout your deployment. This will focus on the fundamental resources of individual servers at first and then adjust the scope to cover increasingly large areas of concern.

In this guide, we will talk about the components of monitoring systems and how to use them to implement your monitoring strategy. We will cover the basic responsibilities and elements of an effective, reliable monitoring system. Then, we'll talk about how best to translate your monitoring policies into dashboards and alert policies that provide your team with the information they need without requesting their attention at unwarranted times.

In this guide, we will take a look at how monitoring and metrics collection change for highly distributed architectures and microservices. The growing popularity of cloud computing, big data clusters, and instance orchestration layers has forced operations professionals to rethink how to design monitoring at scale and tackle unique problems with better instrumentation. We will talk about what makes new models of deployment different and what strategies can be used to meet these new demands.