I’ve spent the last few months going into some technical aspects of DevOps here, and I’m going to pivot a little bit this week and go back to a more philosophical DevOps topic.

A colleague swung by my desk today to get an opinion on a situation, and it basically shakes out to this:

If I need to release a fix to the public that resolves a critical issue and has gone through its appropriate soak cycle, but introduces a new minor issue, do I release the initial fix or wait for both issues to be resolved and soaked?

At first glance it seems like a bit of a no-brainer, but as you peel back the layers and apply different software methodologies things can get a bit muddled.

Up until this week we’ve been utilizing edge-optimized custom domains within our API Gateways. This has been really easy to set up using CloudFormation and has been a great way for us to tightly control the URLs used to access our REST platform.

In order to support a more global expansion of the platform on AWS, we’re ditching the edge-optimized custom domains and moving to the Regional custom domains that AWS launched at re:Invent 2017.

This caused me one major headache – the AWS::ApiGateway::DomainName resource in CloudFormation currently has no way to look up the underlying URL so I can add the appropriate Route53 record.

Before I get into the sample code and solution, let’s do a quick introduction to the difference between edge and regional API Gateway endpoints.

Depending on your view, the speed at which AWS updates and changes can either be a complete nightmare or something that keeps you coming into work every day. Luckily for me, I see it as the latter.

Constant AWS updates mean that every time I come back to revisit a problem, or I’m surfing the documentation around CloudFormation, there’s always a new way to solve something or a different and better way to do things.

When I first started tackling ACM certificates in CloudFormation you couldn’t specify Subject Alternative Names as part of the request – which left you creating a separate certificate for every subdomain you needed a certificate to cover. Luckily, back in the ’70s Larry Tesler invented Copy & Paste, so it wasn’t too big a deal…at least until you ran into the maximum number of certificates you can request in a year (which thankfully was increased from when I first started doing ACM requests in CloudFormation, so you shouldn’t realistically hit that limit).

Now that Subject Alternative Names (SANs) are supported, this simplifies my CloudFormation quite a bit, but it gets a bit tricky for doing the certificate approvals since I’m going to be adding subdomains and I need the certificate approval to come to an email address on a domain that actually exists.

First off, apologies for the brief hiatus. I hit a bit of a busy period with work and fell off the posting wagon.

AWS recently introduced support for API Gateway to use a Lambda custom authorizer from another AWS account. Previously the Lambda custom authorizer had to exist in the same AWS account as the API Gateway, which caused problems in our architecture since we want to use a singular token service for REST APIs across all accounts.

We originally solved this problem with what we dubbed the Auth Proxy. The Auth Proxy Lambda lives in an S3 bucket in a shared account and can be deployed with CloudFormation by referencing its location. The bucket policy is configured to allow CloudFormation to get the zip package during a deployment. Finally, when we run the CloudFormation stack we look up the ARN of the Token Lambda, store it as an environment variable for the Auth Proxy, and then add permissions for the Auth Proxy in that account to invoke the Token Lambda.

Phew, that’s a lot of steps for something that would be so much easier if we could just point API Gateway at the Lambda in the other account. Now that API Gateway supports exactly that, we needed to tackle the delicate process of opening up cross-account permissions.

In a recent chat with our AWS Solutions Architect, he pointed me in the direction of some really cool open source DevOps tools from the guys over at Stelligent. They have a bunch of neat utilities and frameworks on their public GitHub, and they do a bunch of very interesting DevOps podcasts that are worth taking some time to listen to.

Anyways, one of the utilities I really like is cfn-nag, which is designed to scan CloudFormation templates for known vulnerabilities and bad practices. This fills a gap in my DevOps portfolio: I know most best practices and what I should or shouldn’t do from an infrastructure standpoint (most of the time), but my developers shouldn’t have to be AWS security experts. Since stacks are maintained by developers after I hand them off, I need a way to ensure they don’t accidentally introduce security holes into the platform. And since one of my mantras is to automate everything, I can’t expect developers to remember to scan their stacks before they get deployed.

Generally speaking I’m pretty lucky because all our micro services operate in their own accounts (check out AWS Super Glue: The Lambda Function for why we do this), so I don’t have to worry too much about the uniqueness of AWS resources. There are a few exceptions, of course, for resources that must be unique across the entire AWS platform, like S3 bucket names and CloudFront CNAMEs. Because of that luck I haven’t had to solve the problem of stack outputs and the requirement for them to be unique within an AWS account. Our micro services are fairly lightweight, so I can get away with all my infrastructure existing within a single CloudFormation script. In my recent Web App Blue/Green deployment, however, I started running into stack outputs colliding.

Before we get too far into this, let’s take a step back and bring the CloudFormation rookies up to speed. If you just want to check out my solution, go ahead and skip down.

As part of an upcoming post around how we achieved Blue/Green functionality within AWS, I wanted to cover off a bit of a technical hurdle we overcame this week: how to build and test a web app in AWS CodeBuild.

So what’s the big deal? AWS CodeBuild lets you use a whole bunch of curated containers that have all kinds of frameworks and tools built in. If that’s not enough for you out of the box, the buildspec gives you excellent control over running scripts, including installing packages in Ubuntu. And if none of that works for you, you can simply curate your own Docker image and publish it to the Elastic Container Registry (ECR) (though more on the limitations of ECR in CI/CD in another post) or Docker Hub. With all these tools and approaches at my disposal, there’s no way I can’t have an AWS CodeBuild environment that meets my needs. But, spoiler alert, the environment isn’t my problem – it’s the build and test framework my developers have implemented.

About halfway through our development cycle, in a meeting with our AWS Solutions Architect we received what we affectionately refer to as the “AWS Bomb”.

Up until that point we had been developing our platform with the idea that all the micro services, and the resources required to run them, should exist within a single AWS account. Just about all of the AWS documentation up until this point, and guidance from AWS, revolved around the idea of a single account…however AWS had just come up with a new architecture to help keep cloud platforms secure and minimize the impact (or blast radius) of a security breach. If you’re interested in reading more about what this architecture looks like, or why you should use it, check out the Multiple Account Security Strategy Solution Brief from AWS.

We immediately started working towards adopting this new architecture which came with a whole other host of headaches that we slowly worked through and solved on a micro service by micro service basis. This was fairly easy for the micro services as they generally only communicate within their own account, and when they have to go to another micro service they can just do so with the external API routes.

Where we ran into some interesting challenges was in our CI/CD process. Luckily we were just starting to design and develop the CI/CD process, so we didn’t have to start over with this new architecture in mind, but it did mean we had to pioneer some new things that AWS hadn’t covered in their built-in CI/CD tooling.

I realized that in writing this blog, I assumed that everyone knew why the DevOps model is so important, and what CI/CD means, and why we do it.

That’s probably a bad assumption. As a new(ish) development ethos, DevOps isn’t widely adopted and it can be difficult to sell DevOps as a revolution in development to an organization. I’m going to take a bit of time to explain why we do DevOps here, and what our guiding principles are whenever we develop our CI/CD processes.

Why we DevOps

Okay, I know some people are going to say “hey, James…DevOps isn’t a verb!” but it’s a made-up word and a contraction of Development and Operations anyway, so I’m gonna use it in weird ways simply because I can.

The core tenet of DevOps is exactly that: amalgamating Development and Operations teams. For us, this allows our development teams to develop and push to AWS from anywhere, anytime. It’s achievable because we automate just about everything, and the automation allows us to put in all kinds of security blankets around builds and testing to help developers avoid pushing bad code directly to the public. This is a huge improvement over waterfall and really empowers our developers to drive change in the product at a rapid pace. If someone is excited about a new feature and works until 4AM getting it working, why wait another year before actually releasing it? Here’s the best part: when developers know that the code they commit is going up to the cloud platform right away, they’re more likely to check and recheck for good measure before it goes (even with the automated safety nets).

My team is all about removing barriers for Developers, while maintaining safety of the platform for our customers. My job is done right if a Developer doesn’t even notice all the checks and balances that go on…that is unless they make a mistake and we catch it.

Finally, we cover off the Operations side of things. With all this automation done, the demand for monitoring and notification is high.

If a build fails in the cloud, and nobody is around to see it, did it really fail? -Me, just now

We utilize CloudWatch metrics extensively through DataDog to handle the vast majority of our day-to-day cloud operations. Everything that goes on shows up on a fancy dashboard I’ve built, and I can see the health of the platform at a glance. Beyond that, we have a number of monitors and alarms that will send emails and messages to our operations Slack channel whenever something has gone wrong. We make extensive use of outlier detection to determine what kinds of behavior are out of the norm.

In addition, we have a number of services and checks that we have designed and implemented to watch the health of our actual service. It’s hard to use third-party tools out of the box when you’re developing a first-party product, so this one went hand in hand between the DevOps team and the engineers working on the microservices. They put in API routes to query health, and we put in automated checks and alarms that leverage those routes. There’s a handful of other Lambdas that check the status of things like our CodeBuild projects to notify teams when builds and deployments fail, or notify the DevOps and media teams when the STUN services are experiencing issues.
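A check in this setup doesn’t need to be fancy; the heart of it is a poller that hits a service’s health route and flags anything that isn’t a clean 200. A stripped-down sketch – the URL and the `{"status": "ok"}` response shape are hypothetical, not our actual API contract:

```python
import json
import urllib.request


def check_health(url, timeout=5):
    """Poll a health route; anything but a 200 whose body reports
    status 'ok' counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read())
            healthy = resp.status == 200 and body.get("status") == "ok"
    except Exception:
        # Connection failures, timeouts, and non-2xx responses all
        # land here and count as unhealthy
        healthy = False
    return {"url": url, "healthy": healthy}
```

A scheduled Lambda would run something like this against each service’s route and publish to the alerting channel whenever `healthy` flips to False.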

This proactive monitoring and alarming lets us address and fix issues at the development level, before a customer ever even contacts our support team. We typically identify an issue, and at a minimum backlog a fix, before our customers ever notice there was a problem. This makes for a really great experience for everyone: developers, operations, support, managers, and most importantly our customers.

The Tenets of our DevOps Practice

Now, with all that introduction to what DevOps is and why it’s so important and awesome out of the way…let’s get into the meat and potatoes of my DevOps practice. These are sort of my commandments when it comes to designing, developing, and implementing DevOps and CI/CD processes.

Automate everything. Everything that can be automated, should be. Even if it starts as a manual process to figure out how to do it, we always immediately turn that into code and make it run automatically. If we didn’t, DevOps just wouldn’t scale with our platform.

Remove barriers. The last thing I want to be is the guy holding the master key for everything, because I don’t want my phone ringing in the middle of the night because an emergency code fix has to be pushed into the cloud. If I followed commandment #1, then there’s no reason to call me. Just go ahead and push the code.

Implement safety nets. Nobody is perfect, and we all make mistakes. As soon as you admit that, and allow developers to make mistakes, they’re going to be much faster at iterating on code and producing some neat innovations. Let them make mistakes, just catch them before they become a problem for the public.

Never stop improving. The speed of cloud development demands that we are never fully done with our work. AWS releases new services, improves security, and gives us new features and functionality. We are always dedicated to going back and re-working code and processes to use the latest and greatest in services and features. The more I can push off to AWS to handle instead of my custom code, the better.

Always be teaching. Especially in a company where DevOps isn’t a thing yet, it’s important to always be teaching others how to do their own DevOps work. Not only does this make the concept of DevOps proliferate through your organization, but it also shares the burden of work across the entire development team. It takes time, but I promise you will reap the rewards when developers maintain their own CI/CD pipelines, leaving you free to focus on commandment #4.

Secure everything. It’s worth the effort to start with the least possible permissions across the board up front, rather than having to go back and dial things in later. Dialing in later causes outages, downtime, and unintended behavior. Spend the extra bit of time in the cycle up front and make sure you’re not leaving things wide open (especially S3 buckets).

Generally these are sort of my top level drivers for everything. There will be “sub” commandments within each of those to drive me down the right path, but so long as I keep these in mind with every top level decision I make – I know I’m moving in the right direction.

Got a good one I missed? Send me an email I’d love to hear what some of your guiding practices are.

Captain’s log, April 20th 2018. It’s day ??? of winter. I’ve packed my entire desk up and decided to move to Sunnyvale. There’s no way there’s winter in a place that has “Sunny” in the name of the city, right?

This post is going to be a bit more philosophical than practical, but I’m hoping to give people some insight into the culture we’re trying to cultivate within the CloudLink team and are trying to instill back into the wider organization.

I’ll start with a little back story:

On April 4th, 2018 the team had what my boss, standing next to my desk in a slight feverish panic, called an “operational event”. Also known as an outage. This was for all intents and purposes the very first outage for us in our public cloud infrastructure. Sure, we’d had a few things here and there that may have caused a slight service degradation, or a planned maintenance window that we did some work within, but this was the first bona fide time that something went wrong and we didn’t know about it. I’ll spare you all the nitty gritty details here. Long story short: we had a maintenance script that ran at 9 PM, it deleted a bunch of users, and at 9:30 AM the next morning we realized what had happened. By 10:30 AM everything was back up and running. We actually spent more time discussing the impact and how widespread the issue was than it took to fix the problem.

That we had an outage isn’t really the point of this blog post, it’s more about how we handled it and what we did about it.

We’re trying to build a culture of transparency, and to avoid falling into the trap of finding someone to blame for a problem. This is a fairly popular model in some of the newer, hipper technology companies like Netflix, and we’re finding it a very positive and constructive approach. I’ll focus on each of those things independently.

Transparency

Right off the bat, all of our laundry is aired automatically on our public status page https://status.mitel.io. This page is dynamically driven, so if we’re having a problem it doesn’t require an engineer or support/operations person to actually flip the switch from green to red. That’s a fairly scary step, but we stand behind the product we’ve developed and our culture of quick resolutions enough that this isn’t too big a deal. Nothing ever really stays red without some sort of explanation as to why, which is where our incident reporting comes in. We’ve adopted the idea that if a problem is affecting more than one customer, it’s an incident and we post about it. Much better for a customer or partner to see a red flag and then immediately see that we know about it and are working towards a resolution, as opposed to everything looking peachy on our status page while their application is not working.

There’s a theory out there called the Service Recovery Paradox that says a customer will think more highly of a company after they experience an outage with their service. The reasoning is that a successful recovery from a fault leads to an increased feeling of confidence in the company. I believe this is true, but only to the extent that the company is transparent about the issue. If there was a problem, a customer experienced it, and it got fixed – without ever a word from the company – that customer is probably going to assume the company never knew about it and it magically fixed itself. Even if the problem did magically fix itself – which in some cases it does – it’s still beneficial to explain what happened.

This is where the three pillars of transparency come in, and if you read the postmortem linked above, I wrote it following this model (with the exception of the apology pillar).

Apologize

Show your understanding of what happened

Explain your remediation plan

I won’t go into detail about these, but you should go watch the first 10 minutes of this video for a really good explanation of them, which is exactly where we got this philosophy from.

In this specific case we haven’t decided if we’re doing public postmortems yet, but for the purposes of ensuring the company still has faith in our product, it’s important to cover these off at least internally for now. The powers that be need to see that we understand exactly what happened, and how we’re going to get better to help keep it from happening in the future. We really do hope to extend this transparency to our customers, and we’ll continue to bang the drum of culture change to allow us to do that.

Playing the No-Blame-Game

This is a tough one. As part of any investigation into an outage, root cause analysis, or fix development, it’s easy to slip into the mode of trying to find out who caused the problem and blame them, as opposed to fixing the DevOps process that could have prevented bad code from getting out into the wild in the first place, or the platform for not being robust enough to handle whatever happened. Avoiding blame shifts a developer or team from learning via a negative experience (which is never a good way to learn) to learning through a positive one. It becomes a technical challenge to overcome (which engineers love…don’t you?) as opposed to being that one thing in your career you never forget and hang your head in shame over (I have a few of those myself). By taking this approach, not only does everyone feel more comfortable knowing they can make mistakes without getting fired, but it actually improves your overall product. You focus on making the platform more robust, and on making the automated DevOps processes more intelligent about the type of work you’re doing. Everyone wins.

Out of this entire experience, we ended up with 9 actionable backlog items to address the outage. As such, our processes and DevOps automation are better than ever, and it highlighted exactly why we utilize the release cadence that we do.

So that’s a little bit of what we’re trying to do from a DevOps culture perspective to improve how we do things at Mitel. We’re always learning and always growing, so things will change and improve over time.