Monitorama 2018 Recap

Monitorama is one of my favorite conferences to attend. It is about one of my favorite subjects (monitoring) and is hosted in one of my favorite cities (Portland). It always surprises me how similar the themes are across the talks at a single-track conference like this; one might think that the speakers get together beforehand to decide which points to communicate. As a speaker, I can tell you this doesn’t happen, but it isn’t a happy coincidence either. A lot of thought and consideration goes into deciding what the themes of a conference should be.

Below are the highlights and themes that emerged for me over the three days.

Evolution of Monitoring Needs

Applications change, organizations mature, and monitoring needs change with them. This doesn’t mean the previous solution you implemented was bad; it simply means that your needs have changed. When implementing a monitoring solution, organizations need to weigh the impact of delays, cost, and scalability. There is value in implementing a solution quickly and at minimal cost even if it may not scale indefinitely. When the organization grows and the system can no longer keep up, a new solution needs to be found. The original choice was the right one for the timing and cost constraints of the day; the organization’s needs simply grew beyond it.

An early stage start-up may choose to invest in purchasing a solution that doesn’t have all the bells and whistles because it meets their immediate needs and price point. In early stages, many companies prefer to spend more money on hiring engineers to develop features and functionality rather than building internal tools or buying a “high-end” solution. Money is spent on the must-haves rather than the nice-to-haves. As companies get older and change, the nice-to-haves become must-haves.

Many people at the conference referred to Google’s Service Reliability Hierarchy. That hierarchy treats monitoring as a single layer, but for me there is a hierarchy within monitoring as well, and it plays a part in organizations outgrowing their monitoring solution. Organizations start out focused on availability, and as they grow and mature, they consider other aspects such as latency, third-party components, and understanding the competition.

The way we monitor and operate our systems is based on what we have learned from previous incidents and outages. The more we learn, the more our monitoring solutions will need to change.

Observability is a relatively new term, and as a result people define and describe it in multiple ways. It was interesting to hear how different speakers defined what observability means to them. It also raises a great deal of concern: if we are using a word that means different things to different people, confusion will follow.

In monitoring, we often talk about the concept of alert fatigue. The purpose of alerting is to be notified when something is not working properly; knowing what action to take from an alert is a key component of resolving incidents quickly. If you don’t understand why an alert is firing, you can’t tell whether it’s real or what action to take to resolve it. Alerts that are difficult to act on end up being ignored.

Having the correct context within an alert drives more efficient decision making. Context provides the ‘why’ behind the alert, which in turn helps to drive decisions. If we are unable to make decisions, the likelihood of mistakes occurring increases. But it’s not just about providing the right context; the information has to be communicated in a way that everybody understands.
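To make the idea of context within an alert concrete, here is a minimal Python sketch. It was not presented at the conference; the `Alert` class, its field names, and the example URL are all hypothetical, chosen only to illustrate carrying the "why" and a first action alongside the alert itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    """A hypothetical alert record; every field name here is illustrative."""
    name: str
    summary: str                 # what is wrong, in plain language
    impact: str = ""             # who or what is affected (the "why")
    suggested_action: str = ""   # a first step for the responder
    runbook_url: str = ""        # where to find more detail

    def missing_context(self) -> List[str]:
        """Return the names of context fields that were left empty."""
        context_fields = ("impact", "suggested_action", "runbook_url")
        return [f for f in context_fields if not getattr(self, f)]

    def is_actionable(self) -> bool:
        """Treat an alert as actionable only when no context is missing."""
        return not self.missing_context()


def render(alert: Alert) -> str:
    """Format the alert as plain text, skipping empty context fields."""
    lines = [f"[{alert.name}] {alert.summary}"]
    if alert.impact:
        lines.append(f"Impact: {alert.impact}")
    if alert.suggested_action:
        lines.append(f"Suggested action: {alert.suggested_action}")
    if alert.runbook_url:
        lines.append(f"Runbook: {alert.runbook_url}")
    return "\n".join(lines)


if __name__ == "__main__":
    alert = Alert(
        name="HighLatency",
        summary="p99 latency above threshold",
        impact="Checkout requests are slow for some users",
        suggested_action="Check recent deploys to the checkout service",
        runbook_url="https://example.invalid/runbooks/latency",  # placeholder
    )
    print(render(alert))
```

A check like `missing_context()` could be run when alerts are defined, so that an alert without an impact statement or a suggested action is flagged before it ever pages anyone.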

Context often comes from our understanding of previous experiences. New team members won’t have this context, so we need to figure out a way to share our collective knowledge and context with those just starting out. Franka Schmidt (@franschm) shared one method of helping new employees learn from incidents they were not involved in: the on-call simulator. I loved the idea of creating a game to walk people through diagnosis and troubleshooting using real-world examples.

One reason people go to events like Monitorama is to spend time with others in their community and to learn. These social aspects can be harder than the technology we work with on a day-to-day basis.

So many of the talks stressed the importance of learning, sharing that knowledge with others, and building inclusive communities. Learning new concepts isn’t an easy thing to do, but Logan MacDonald provided some great tips on how to remember concepts and convert short-term memories into long-term learning.

But it’s not enough to learn and hold onto that knowledge; what we learn needs to be shared with others. Socializing our knowledge can make it easier for others to learn. Explore the community: look at which ideas people are gravitating toward and which ideas and concepts they are walking away from. What can you learn from this?

One of the most important aspects of building a community is inclusion. If your community does not make people feel welcome and invites only select individuals, then it is not much of a community. Monitorama does a good job of creating an inclusive environment. Some of the ways this showed up this year:

A large number of speakers were first-time speakers or relatively new to speaking at conferences. The organizers created new stars and powerful voices by giving fresh faces an opportunity to share their stories, instead of building a line-up of well-known names. Seeing the same speakers at events year after year doesn’t make others feel included.

A diverse speaker line-up including both gender and racial diversity

A women and non-binary breakfast with close to 40 people attending

Open invitations on the Slack channels for people to join groups for lunch to dive deeper into certain topics