Showcasing observability to supervisors

Cædman Oakley

Published: September 28, 2017

Introduction

Caedmon: So I’m going to be as short as I possibly can because I know you guys want to get back on the loop. And also, there’s cocktails.

Ilan: There is beer after this. Beer after this.

Caedmon: There is beer after this, so I’ll try and be short. The reason that Ilan is stumbling over my name is, he knows me clearly as Caeds, and they keep putting Caedman on my slides now, which is very confusing for him.

Example: Castlight Health

So, Castlight Health is a platform for large-scale self-insured employers to be able to allow their employees to navigate easily through their benefits, their health, and their wellness. We have just done a migration across from… That didn’t work. That was interesting. We’ve just done a migration across from New Relic to Datadog.

Tech: Left one, right?

Caedmon: Yep, that’s where I was. And of course, we have some interesting little stories and anecdotes along the way.

The Base State

We are going to talk about the base state, where we actually started off. We have this thing called a “SOAlith”, which is kind of a weird name. The journey that we took, and some of the results. And, more about how we talk about this to managers and supervisors, to actually get observability in. One of the things that we’ve definitely talked about a lot today is context. It’s not just enough to monitor, but it’s like knowing the context of your infrastructure. And the same thing is true about when you’re talking about how you are going to manage observability and monitoring and alerting to the wider organization.

So we had this massive thing. It’s a monolith. Everybody’s done this at some point. You’ve had a monolith that you kind of think about, oh, we want to try and break this down into smaller stages and actually have a micro service architecture.

Everything deployed everywhere

And the first thing that you do, is you go, “Right, we’ve got this monolith. We’ll add another feature to it and stick it over here.” And then that’s like service-oriented, right? And that’s really sort of a step in the… It’s simple to develop. You can throw things on it. It’s a step in the right direction. But, it’s sort of a wrong step, right? And that’s exactly where we are. Literally everything at Jiff, which was the company that Castlight bought that I joined, literally everything was deployed everywhere. So, the exact same services, deployed everywhere.

Monitoring all the moving parts

Our monitoring was kind of minimal as well. We had single systems that were monitored, once there was an error on them. Right? There were certain constraints that were monitored as well. Hey, we’ve got some database constraints that we need. They were, again, monitored typically after error, right?

And we had zero understanding of the overall load or system architecture. Bearing in mind that we did actually have tooling in place that should have allowed us to do this, but we just didn’t actually use it effectively. We had APM in place. And that was used only as a debugging tool. Right? And we had a single APM dashboard that said, “Hey, look, this is what it kind of looks like right now.” And the only thing that anybody would ever do is go, “Oh, it looks different. Caeds, do you know anything that is going on?” Right? There is a quote that I like from Charity Majors, of Honeycomb, which is, “A dashboard is an artifact of failure.” Right? And that’s exactly where we were.

I don’t think that’s necessarily strictly true. I think if we plan our dashboards out right, we get better information and we get observability into things. And that’s exactly why we are here.

Alerting only when something’s gone wrong

Here was the state of alerting. We had two systems, Icinga and Nagios. Alerting was done via email, and on the dashboards only. There was no PagerDuty. There was no SMS-ing. There was no Slack. There was nothing. Right? And alerts were only tied to something that is going wrong. And that’s true.

Alerts are useless unless urgent, important, actionable, and real

We didn’t actually alert on any predictive stuff. We didn’t actually say, “Hey, look, this system looks like it’s going to run out of memory soon, you should probably go and have a look at it.” That’s kind of something you should know. Right? And, everything alerted. Right? It was like, “Okay, you’ve got to 70% memory usage. Alert. You’ve got to 99% memory usage. Alert.”

There was no difference in the urgency at all. Right? Alerts are useless, unless they are urgent, important, actionable, and real. I don’t want to be told that my dev cluster has 15 deployments going on it, and that’s scarily different. Because, that’s my dev cluster, I don’t really care. I do want to be told that there’s a deployment going on in production, when nobody knows about it. Right? This is, like I said, I’m going to be fairly fast and fairly quick.
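As a rough illustration of the point about urgency, here is a minimal Python sketch of routing an alert by environment, actionability, and severity. The `Alert` fields, the 70 and 95 percent thresholds, and the "page" / "notify" / "drop" outcomes are all assumptions made for the example; none of them come from the Castlight setup described in the talk.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    environment: str   # hypothetical label, e.g. "dev" or "production"
    metric: str        # e.g. "memory"
    value: float       # usage as a percentage, 0-100
    actionable: bool   # can a human actually do something about it?


def route(alert: Alert) -> str:
    """Return "page", "notify", or "drop" for an alert.

    Thresholds are illustrative assumptions, not recommended values.
    """
    # Noise from non-production clusters, or alerts nobody can act on,
    # should never wake anyone up.
    if alert.environment != "production" or not alert.actionable:
        return "drop"
    if alert.value >= 95.0:
        return "page"      # urgent: interrupt a human now
    if alert.value >= 70.0:
        return "notify"    # worth knowing, not worth a 3 a.m. page
    return "drop"
```

With this shape, 70% and 99% memory usage on a production host no longer produce the same alert, and the same readings on a dev cluster produce none.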

Don’t let tools dictate future

So, this is our journey. Our journey was really based off of one simple thing, what were we trying to achieve. Right? And this is the thing that we need to sell to managers, and the thing that we need to sell to supervisors, is, yes, we’ve got a current tool set. And we think that that’s what we want. But, we’re going to start from scratch. We’re going to actually evaluate everything from the ground up.

What are we trying to achieve? Don’t let your current toolset dictate your future. Right? If your toolset says, “APM is the way to go,” or very specifically, you have to only ever monitor CPU usage, and you’ll be fine. Right? That’s a problem. Right? Especially if you have databases that are highly memory-intensive.

Evaluation to plan for future

So again, this is our evaluation. What are we trying to achieve? That’s the very first step. And selling that and saying, going to the manager and saying, “Hey, look, you know what, I’ve got these things. I know that we want to be resilient. I know that we want to be robust. I know that we want to be scalable. That’s what I’m trying to achieve. I’m trying to see what’s happening in the infrastructure so I can better predict the future. Better plan for the future.”

A manager is going to come back and say, “Okay, why? What do you need from that? How does your current toolset… We spent N dollars on X tooling. So, why is that not doing that for you?”

So you need to know, what does the toolset need to do? That’s fairly easy, right? What’s your budget? Does your budget actually include being able to do this change right now? Or, is it something that you are going to have to wait for, right? I’ve managed to set up everything right now at Castlight so that we are on January 15th renewal dates. So, anybody who’s a vendor here? That means that, up until January 15th, I can’t really do anything. My budget comes around at that time.

Know your audience

The other part of this is to make sure that your audience is aware of what you are trying to do, and who your audience is. It’s great that, you know, we can throw Datadog dashboards up there. We can throw sparklines for all sorts of different types of data. We can show wonderful little AWS CLI things to our own people. Right? And taking that information to our manager, who worries about money, is going to go, “Yeah, cool. What does it do? How much is it costing me?” Right?

Toolset problem

The current toolset doesn’t allow us visibility, didn’t allow us visibility. All it told us was what is happening right now. It didn’t tell us what was going on in the infrastructure. It didn’t allow us to see our hosts. Partly, that’s an educational problem. Partly, that’s a tooling problem. Partly, that’s history. Partly, that’s the way in which the environment was set up. But you have to sit there and go, “Okay, is this something that we need? Can we actually see what’s going on?” And then the tooling didn’t actually allow us to use, to address our current issues anyway.

We used the APM tool 5 times, from November 1st to June 1st. And that’s just in the last 12 months. Okay? And so then we had a cost, okay, in order to move from tooling A to tooling B, or to upgrade in tooling A to tooling B. What was our cost in terms of work time? How much time is this going to take? The big thing there is to try and sell a manager that pain up front is way easier than pain over time.

It is much better to spend 3 weeks rewriting something, and hurting, and doing all of the stuff that you’ve got to do, than to have all of the pain dribbled in over PagerDuty and on-call, and 3 in the morning, and 5 in the morning, and on the weekends, for maybe 20 minutes each time.

Resilience engineers

We call our SREs resilience engineers. Site reliability engineering is a really, really useful term, and everybody knows it. But, with the DevOps movement being much more widespread and across the organization, we call ourselves resilience engineers because we help motivate and educate engineers to be much more concerned about how their systems work. So that’s why you’ll see resilience a lot. Resilience engineers’ metrics are things like: How’s the system’s health? What’s the latency like? What’s the efficiency of the system? Am I getting good throughput?

Managers’ metrics are much more about cost; stability, which is a very nebulous term that generally boils down to being functionally usable; and productivity. How productive is the team? Can I get away with only having 3 people on this team and supporting 180 engineers? Or, do I need 18 people on this team?

So, the first port of call that we took, and this was how we managed to market it, was, okay, what do we actually need to do? From an infrastructure and SRE point of view, we already knew this. It’s like, we need to know what is happening. We need to know, is this different than normal behavior? Somebody define “normal,” right? Is there a way that we can define normal in our systems or not? Can we see what we need with Datadog?

Or do we need to do something else? Do we need to go down the SOAlith’s path? Do we need to go down the New Relic path? Do we need to create our own tooling? And, you know, with our SOAlith, what’s the actual cost of breaking this apart? How much does it cost to actually run a SOAlith?

We were using very, very large machines. We were using c4.4xlarges for most of our machines, which is not the greatest thing in the world. So, what’s the actual cost of running that, versus splitting it out into smaller machines, and seeing that cost? And that meant that we basically had like a good set of tools that we were kind of using.
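One way to make the "somebody define normal" question concrete is to compare a new reading against a baseline built from recent history. The sketch below is only an illustration of that idea: the function name, the sample values, and the cutoff of three standard deviations are assumptions, and the talk does not say how Castlight actually defined normal.

```python
from statistics import mean, stdev


def is_abnormal(history, current, threshold=3.0):
    """Return True if `current` sits more than `threshold` sample
    standard deviations away from the mean of `history`.

    `history` needs at least two readings. The default threshold of
    3.0 is an illustrative assumption.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > threshold
```

For example, with recent CPU readings of `[50, 52, 48, 51, 49]`, a new reading of 51 falls inside the baseline and a reading of 80 falls well outside it.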

Providing information that managers want

We did a big bunch of investigation. Yes, we ended up with Datadog at the end of it. There are lots of reasons for that. It’s not just the tooling. Getting buy-in was the final piece, really. You’ve got a manager who’s interested in something. Some managers are really interested in making sure that they can serve 30,000 users per second. And they want to know that their bandwidth is high, and that’s great.

So we were getting graphs to show that. And that’s what we did. We basically generated a bunch of metrics, and a bunch of graphs, to go to our managers and say, “Okay, hey, Manager A, you are really concerned on AWS cost. Here’s our AWS costing stuff.” “Hey, Manager B, you’re really, really interested in the way in which the platform interacts with a single user.” “Hey, Manager C, you really, really want to see mobile usage stats, and how much traffic, and how much engagement we’re getting.” The important thing there was to be able to do those graphs quickly and effectively, so using the APIs to generate the graphs.
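To give a feel for what generating graphs through an API can look like, here is a small Python sketch that builds one dashboard definition per manager as a plain dictionary. The field names are loosely modeled on Datadog's dashboard JSON and should be treated as assumptions to check against the current API documentation; the dashboard titles and metric queries are hypothetical, and nothing here sends a request anywhere.

```python
def build_dashboard(title, graphs):
    """Build a dashboard definition as a plain dict.

    `graphs` is a list of (graph_title, metric_query) pairs.
    The key names below are assumed, not verified against any API.
    """
    return {
        "title": title,
        "layout_type": "ordered",
        "widgets": [
            {
                "definition": {
                    "type": "timeseries",
                    "title": graph_title,
                    "requests": [{"q": query}],
                }
            }
            for graph_title, query in graphs
        ],
    }


# Hypothetical per-manager dashboards; metric names are placeholders.
MANAGER_DASHBOARDS = {
    "Manager A: AWS cost": [
        ("Estimated charges", "avg:example.aws.estimated_charges{*}"),
    ],
    "Manager C: mobile usage": [
        ("Mobile requests", "sum:example.mobile.requests{*}"),
        ("Mobile engagement", "avg:example.mobile.sessions{*}"),
    ],
}

dashboards = [build_dashboard(t, g) for t, g in MANAGER_DASHBOARDS.items()]
```

Keeping the definitions as data means a new manager's dashboard is one more entry in the mapping, which is what makes producing these graphs quick.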

The whole thing about being able to generate graphs from Terraform, that was awesome, by the way, earlier on. Thank you, that was amazing. That really…that’s something that, if we don’t take that away today, like, please start thinking about that in the next two, three, four days, not weeks. All right? It’s important.

Monitoring that allows you to be proactive

And show that your monitoring and your tooling can allow you to be proactive. Right? We really need to be proactive in what we’re doing. Because, I don’t know about anybody else here, but I hate being woken up at 3 in the morning. Because some system, that really honestly doesn’t need to be up, is telling me that it’s not up. And, ah, I could have told you that yesterday. If you had told me that we were at 60% of CPU usage when we were doing this one thing, and somebody told me that they were going to load another file. You know? Okay.
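A simple way to picture "I could have told you that yesterday" is to fit a straight line to recent readings and estimate when it crosses a limit. This is a minimal sketch using an ordinary least-squares fit; the function name, the hourly sampling, and the 90 percent limit in the example are assumptions, and real usage is rarely this linear.

```python
def hours_until(samples, limit):
    """Estimate hours from the last sample until a linear trend hits `limit`.

    `samples` is a list of (hour, value) pairs. Returns None when there
    are too few points or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope and intercept of value against time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing = (limit - intercept) / slope
    last_t = samples[-1][0]
    # Already past the limit counts as zero hours remaining.
    return max(0.0, crossing - last_t)
```

With CPU at 60, 62, 64, and 66 percent over four consecutive hours, the fitted trend reaches a 90 percent limit about 12 hours after the last reading, which is a daytime conversation rather than a 3 a.m. page.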

So the pressure zones that came out of that. How do we manage deployments? What are the toolset constraints? Where’s the automation? Because DevOps doesn’t work without automation. We don’t work without automation. Unless we want to turn all of engineering into, like, our stuff, and just have a tiny engineering department.

Monitoring helps control costs

And then the last thing, and this was really important for us because we were post-acquisition, what does the actual infrastructure cost? We went from a privately held company, of around about 100 people, to a publicly held company, of about 600 people. Overnight.

And suddenly, we went from “Hey, we are gonna have some big traffic. I don’t care how much money you throw at it, just solve it. I want it to work.” To, “Just exactly how much did that thumb drive cost?”

So, we’re looking for zero downtime deployment. We’re looking to have the same tools in the cloud and in the data center. We are a hybrid organization. We run in AWS and in our own data centers, in Denver and Phoenix. Okay?

Automating dashboards and alerts

We needed automation to be able to set up the dashboards. We wanted to automate new dashboards. And we wanted to automate the alerting. That’s fairly normal. From a toolset point of view, we needed to reduce complexity.

Introducing new tools

If you imagine taking two organizations who have got their own methodology of doing stuff, you are taking all of that stuff, and trying to cram everybody together and go, “Everybody use everything.” One of the biggest things that we saw was that, when we had two teams that were being merged into one, with two different toolsets, it was actually better to find a third toolset that did most things, and give everybody the pain. Because if you did it the other way, what you ended up with was Team A going, “That’s not the way to do it.” And Team B going, “I don’t want to do it anyway.” Right?

The other toolset constraint that we had, of course, was to increase efficiency. What I mean by efficiency in this state, what I mean by efficiency here is, when I see a problem, I can get to the source of it very quickly.

Back in the ‘SOAlith world’

Imagine in the SOAlith world, that you’ve got… Sorry, I’m on call, so that’s me being told something needs my attention somewhere. They can deal with it for the time being. In the SOAlith world, if you have a number of different services… Excuse me. I’m going to turn that all the way off. There we go.

If you have a number of different services that are all running on the same machine, and they’re all split into different places… And the typical SOAlith looks kind of like this. Big load balancer, lots of little load balancers, that then point to every single machine. And there are good reasons for doing that, by the way. There are some very good reasons for doing that. There are lots of architects around here who can tell you why.

But if you do that, it means that if Service A is having a problem, but you see it in Service B’s interactions, it takes a long time to figure out, “Oh, it’s actually not Service B, it’s Service A that is giving me the issue.” Service B looks like it’s running out of memory, but it’s not, it’s being used by Service A. Hence, thank you very much for Datadog’s wonderful processes. I’m looking forward to that. Thank you.
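The Service A versus Service B confusion comes from looking at host-level memory when several services share a machine. As an illustration only, the sketch below sums per-process memory by service so the real consumer is visible; the input format, service names, and numbers are invented for the example and do not describe any particular monitoring tool's output.

```python
from collections import defaultdict


def memory_by_service(processes):
    """Sum resident memory per service and sort largest first.

    `processes` is a list of (service_name, rss_mb) pairs, one per
    process sampled on a single host.
    """
    totals = defaultdict(int)
    for service, rss_mb in processes:
        totals[service] += rss_mb
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample from one shared host.
sample = [
    ("service-a", 3000),
    ("service-b", 400),
    ("service-a", 2500),
]
ranking = memory_by_service(sample)
```

Here `ranking` puts service-a first, so the host that looked like a Service B memory problem is attributed to Service A.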

Cost considerations

Again, I’m going to come back to the money thing, because the money thing is the thing that drives business, unfortunately. I wish it was just building cool stuff. Total cost post acquisition was a huge pain point for us. Deployment cost. How much does it cost for me to deploy this to every single node? And we all know what the pricing is of our competitors. And then, we needed to reflect and predict change in cost. So, we could actually see it when we spun up a new instance. What is that going to look like? Ahead of time, not, “Oh, I spun it up. I threw that on there. And here’s my bill for another $185 for an hour.” You’re like, “What did I do?”
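Seeing a cost change ahead of time can be as simple as pricing the current and proposed fleets before anything is launched. The following Python sketch does that arithmetic; the instance labels and hourly prices are placeholders rather than real AWS prices, and 730 hours per month is an averaging assumption.

```python
HOURS_PER_MONTH = 730  # approximate average; an assumption


def monthly_cost(fleet, hourly_prices):
    """Monthly cost of a fleet given as {instance_type: count}."""
    hourly = sum(count * hourly_prices[itype] for itype, count in fleet.items())
    return hourly * HOURS_PER_MONTH


def projected_change(current, proposed, hourly_prices):
    """Monthly cost difference of moving from `current` to `proposed`."""
    return monthly_cost(proposed, hourly_prices) - monthly_cost(current, hourly_prices)


# Placeholder prices in dollars per hour, not real quotes.
prices = {"big": 1.0, "small": 0.25}
delta = projected_change({"big": 10}, {"big": 10, "small": 4}, prices)
```

In this made-up case, adding four small instances raises the monthly bill by 730 dollars, a number that can be shown to a manager before the instances exist rather than after the invoice arrives.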

When we did this, we then found that there are some feedback loops that we don’t necessarily normally run into. And behavior became the clearest indicator of success.

Management use as indicator of success

I’ve worked at a number of different organizations now who have used Datadog in the past and currently. Ilan and I used to work at a company together. The biggest indicator that we knew that Datadog was actually working well for the organization was when we saw the CEO, well, actually, Ilan saw the CEO, I shouldn’t say we, it was him, who saw the CEO, using a dashboard to show traffic and show engagement on his own laptop.

My manager is the one now who comes to me and goes, “Can you show me again where that AWS costing thing is? Where did we get to? Have we seen a trend down? Are we seeing a trend up?” He wants to see it. And he shows it to his other managers. And to his peers. That’s great.

Real information

Other behaviors, if you’re sick at home and you somehow end up on call, which happens, being able to pinpoint exactly where things are from the safety of your couch with your cat and your chicken noodle soup, and tell your engineer, especially your prima donna engineer, this is an actual anecdote, that no, he has a thread leak. You can see it. It’s right in this graph right here. And it happened in staging. And it happened in QA before that. And you told them about it. Means that that engineer then looks to you and listens to you and asks for that graph again and again and again.

Again, this is kind of that artifact of a failure thing. But, it’s still that point where it’s like, “Oh, I have real information. I can actually observe what’s going on.” So, real-world results. I’ll talk about the big one at the end. There’s two big ones.

Infrastructure plan based on metric comparison

So, there is an infrastructure plan now. Right? We can actually show, using this, that breaking the SOAlith into smaller pieces works. On our initial breakout, we took one service out. We went from 35 c4.2xlarges to 6. It peaks at eight with auto-scaling. We can generate new auto-scaling metrics based on metric comparison.

So we can actually see the metrics from today versus the metrics from yesterday, and go, “Oh, when we did this, we are using 50 percent CPU, but we are using the same amount of memory, the same amount of disk, and the old version was using 75 percent of CPU across the service.” “Oh, and now it’s using, well, it’s still using 73 percent.” Okay, so it’s not this thing that’s the problem, it’s something else.
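The today-versus-yesterday comparison described above can be sketched as a diff of two metric snapshots. This is an illustrative example only: the metric names, the values, and the tolerance of five percentage points are assumptions, not figures from the talk beyond the 75 and 50 percent CPU readings it mentions.

```python
def compare(before, after, tolerance=5.0):
    """Return {metric: (before, after)} for metrics that moved by more
    than `tolerance` percentage points between two snapshots.

    Only metrics present in both snapshots are compared.
    """
    changed = {}
    for name in before.keys() & after.keys():
        if abs(after[name] - before[name]) > tolerance:
            changed[name] = (before[name], after[name])
    return changed


# Snapshots as percentages; memory and disk values are made up.
yesterday = {"cpu": 75.0, "memory": 60.0, "disk": 40.0}
today = {"cpu": 50.0, "memory": 60.0, "disk": 40.0}
moved = compare(yesterday, today)
```

Here only CPU shows up in `moved`, which matches the reasoning in the talk: if a change leaves the comparison empty, the thing that was changed is probably not the problem.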

Observability reduces costs

But this thing is now actually useful and separated. So, we have an ongoing project to dismember the SOAlith. The direct observable cost to managers was that, two weeks after our original Datadog deployment, on our 460 servers at the time, we managed to save 93 percent of our annual Datadog cost. That was it. We basically looked at the infrastructure and went, “Oh, we don’t need that.” Done.

So effectively, we broke even within two weeks. That’s annual AWS spend. It’s gone up. We keep adding more instances. We’ve got more stuff going through Datadog. That’s pretty cool. We’ve got the whole Denver data center to roll everything out to as well. But we can at least manage where we are going now.

And the future thinking of this is, because our infrastructure is observable, because people have context about what it is that we’re doing, the environments are now owned by those who can actually observe the environment. Which means that we get much more collaboration between the SRE and the normal, “normal,” product development engineering organizations. When we say we want to break this service away from the SOAlith, engineers don’t go, “That’s going to take weeks of work.” They go, “Sure, I’ll see if I can get you an engineer next sprint.”

This is our wonderful graph. This is actually, basically, only, I think it’s about a month’s worth of stuff, from pricing. You can see that, when we do deployments, we have a nice little drop, and then an increase in cost. That’s the big yellow line at the top.

I’ve taken away all the numbers so you can’t actually see how much money we spend. I’ve also taken away the stats on the load, so you can’t see how much system load there is. But we basically hold AWS costings very easily. And this is the dashboard that my manager wants every single time. Every time we do a deployment, he’s like, “Okay, what did we end up at? Where are we over the last month? Are we tracking on budget? Are we under budget?”

Find metrics that help the business

The biggest takeaway here, really, is that, figure out what it is that makes the business want to interact with you, and that you can give the business. Because, it’s great that we know exactly which traffic is going where, exactly what service is doing whatever. But when you’re talking to a manager, when you’re talking to a sales person, when you’re talking to a financial person, they don’t care about that.

When you’re talking to a front-end person, they care about, “Well, how many hits did I get?” I don’t care I was only using 3 percent of the resources on this box. It doesn’t matter. Could I serve my 50,000 users? Sure. That’s pretty much it.