The Ambiguity of Service Availability

In a previous blog entry, I promised to describe the subtle issue that caused a debate about the meaning of the term “availability.” That entry provided a deceptively simple mathematical definition of availability, along with pointers to resources for understanding the topic in more depth.

Sometimes it’s hard to tell whether a system is actually performing useful work! A system’s users may be unhappy even when your monitoring systems tell you that your site is serving traffic and answering each request with a successful response (such as HTTP 200 OK). For example, your site might be misconfigured to show a default page to all users.

In many cases, this kind of problem can be solved with monitoring. If the monitoring system tests the content of responses, then we can detect HTTP 200 responses that return a default page instead of useful information. If our monitoring system tests content for each client type, we can tell when an important constituency (like users with iPhones or Android devices) might be unhappy.
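To make the idea concrete, here is a minimal sketch of a content-aware check. It is not the monitoring we actually ran; the names ProbeResult, is_healthy, and unhappy_clients, and the marker strings, are assumptions made up for illustration. The check treats a response as healthy only if the status is 200, the body does not look like the default page, and the body contains text that a real page should have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """What a (hypothetical) probe recorded for one client type."""
    status: int
    body: str


def is_healthy(result, required_marker, default_page_marker):
    """Return True only if the response looks like a real, useful page."""
    if result.status != 200:
        return False
    # A 200 that serves the default page is still a failure for users.
    if default_page_marker in result.body:
        return False
    return required_marker in result.body


def unhappy_clients(results_by_client, required_marker, default_page_marker):
    """Return the sorted client types whose probe failed the content check."""
    return sorted(
        client
        for client, result in results_by_client.items()
        if not is_healthy(result, required_marker, default_page_marker)
    )
```

With one probe result per client type, unhappy_clients reports which constituencies are being served something useless, even when every status code says 200.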

But things can get even more complicated…

For the last few years, Raymie and I were leaders at a company that sold big data analytics (in the form of Hadoop, Hive, Spark, and similar systems) to other companies. There were times when our service was functioning as designed, but our customers were still unhappy. That’s because they were using our service to run their own programs, written in SQL, Java, Scala, Python, and other languages. We saw many situations where a bug in a program caused problems not just for the person running the program, but also for other users at the same company. For example, a bug in a SQL JOIN statement could easily generate enough data to fill up a petabyte-sized file system.

It was this kind of subtlety that was at the heart of the debate about the meaning of “availability.” How far does a service engineering team need to go to ensure that the users of a system are happy? To the extent that “customer happiness” can be measured and it’s possible to negotiate a related service level objective (SLO), it makes sense for service engineers to accept the challenge of keeping users happy. For example, we changed our product so that it automatically notified customers when they filled up a file system and prompted them to delete data (or to purchase a larger data plan). With this feature in place, individual users may still have been unhappy that their programs could not write data, but their unhappiness was decoupled from the definition of availability.
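One way to picture that decoupling is a measurement that leaves customer-caused failures out of the calculation. The sketch below is an assumption on my part, not the formula from the earlier entry or the one we negotiated with customers; the Request type, the failure_cause label, and the "quota_exceeded" value are all hypothetical. It computes availability as the fraction of counted requests that succeeded, where failures attributed to an excluded cause are not counted at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    succeeded: bool
    failure_cause: str = ""  # hypothetical label, e.g. "quota_exceeded"


# Assumed set of causes that the SLO attributes to the customer.
CUSTOMER_CAUSES = frozenset({"quota_exceeded"})


def availability(requests, excluded_causes=CUSTOMER_CAUSES):
    """Fraction of counted requests that succeeded.

    Failed requests whose cause is in excluded_causes are dropped from
    both the numerator and the denominator. Returns None if nothing
    remains to count.
    """
    counted = [
        r for r in requests
        if r.succeeded or r.failure_cause not in excluded_causes
    ]
    if not counted:
        return None
    good = sum(1 for r in counted if r.succeeded)
    return good / len(counted)
```

Passing an empty set for excluded_causes gives the naive number, so the two can be compared side by side. The hard part is not the arithmetic but agreeing on which causes belong in the excluded set.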

When running complex services, there are many similar sources of ambiguity in the definition of availability. Resolving the relationship between customer happiness and availability often requires cooperation between customer success managers, product managers, developers, and service engineers. The process involves resolving customer problem reports, recognizing patterns of unhappiness, building features that improve the customer experience, and ultimately providing less ambiguous ways to define and to measure availability.