In helping enterprises transform their engineering organizations and apply DevOps practices on AWS, we are often introduced to all types of legacy systems and processes.

Let’s imagine an enterprise that has hundreds of brownfield applications and services and wants to move some of these applications to AWS. Let’s also assume they don’t see AWS as just another data center with APIs, but as a platform on which to transform their applications and services. That said, their existing group of engineers is excited but not yet ready to move all of their applications to a serverless paradigm – at least for application development. They want to ensure there are tight controls and auditing on all changes to the software systems, including infrastructure, application code, configuration, and data.

Seeing a cloud provider like AWS as ‘just another data center’ is a common misperception of enterprises when first embarking on the cloud.

Based on these constraints, and from a transformation perspective, what kinds of organization, process, culture, and tooling changes might you make?

While the company names change, this is a pretty typical baseline scenario that we encounter with our customers. There are often many other constraints as well, such as limits on which tools and AWS services can be used – whether those limits are real or only perceived. For this scenario, we’re assuming that this enterprise wants to leverage as many AWS capabilities as possible.

Wants, Assumptions, and Constraints

Here’s a list of the wants, assumptions, and constraints of this enterprise:

Embrace patterns used by other enterprises that have successfully leveraged DevOps practices on AWS

Strategic applications/services are moving to the cloud, and the enterprise would like to best leverage AWS’ capabilities rather than use a “lift and shift” model or treat the cloud like a data center

Leverage the use of containers for brownfield and greenfield applications

Modify the backend tooling that enables Continuous Delivery to leverage AWS managed services as a way of scaling capabilities and change

Experiment with different organizational, process, tooling, and cultural changes with a small number of teams at first. Learn and iterate based on feedback

Break down systems into microservices – using the Strangler pattern – so that product teams can deploy changes to production multiple times a day

Ensure security and compliance policies are the same or better by properly leveraging AWS services

Ensure there’s an end-to-end deployment pipeline that automates all the steps between a commit and production with the exception of a manual approval prior to release. This deployment pipeline embeds the codification of build, deploy, test, security checks, and other steps. All of this code is versioned and the environments are locked down from modification unless it’s through this versioned automation

Ability for product teams to use self-service tools and documentation to get up and running with all infrastructure and pipeline resources without human intervention

Restructure how established teams in the enterprise work with these new product teams to ensure these product teams can deploy to production without human intervention – with the exception of a business approval before deploying to production

Ensure key continuous delivery practices are embraced on product teams including committing to the mainline multiple times per day, stopping the line, immutable environments, feature toggling, and test-driven development.

DevOps is about Speeding up Effective Feedback

Figure 1 illustrates a typical software development lifecycle for a web application or service.

On one side you have customers and on the other, developers. A developer comes up with an idea for a new feature, implements it, builds it, tests it, and puts it through a release process until it gets delivered to production, where customers start to use it.

It’s only once it gets into the hands of your customers that you start to learn from it.

Developers might then obtain customer usage data or direct feedback from customers and start to make educated decisions on what to do next. They might decide to refine a feature, improve it, or develop a new one. This is where the loop starts again.

There are two important considerations in this lifecycle.

The first is that how quickly you’re able to get through this loop determines how responsive you are to customers and how quickly you can innovate.

The second is that – from your customers’ perspective – you’re only delivering value when you’re spending time on developing high-quality features. Any time you spend building the pipeline itself or manually pushing changes through it is not delivering value to them.

Therefore, you want to maximize the time spent on feature development and minimize the non-value-added time spent on building, deploying, testing, and releasing software.

Ultimately, any efficiency that can speed up effective feedback between users and engineers is DevOps. These efficiencies might be organizational, cultural, process, or tooling improvements.

The rest of this post describes an ideal set of recommendations based on the assumptions and constraints of an enterprise seeking to move applications to the cloud in the most effective manner. Often, we have other constraints such as teams wanting to use existing tooling that requires us to help them stand up environments on which these tools run (for example, standing up Jenkins on EC2 or ECS clusters and managing other attributes of high availability and fault tolerance). In addition, some enterprises aren’t willing to change their organizational models to support effective DevOps practices. I liken this to the often quoted line: “Insanity is doing the same thing over and over again and expecting different results.” As you might imagine, these enterprises often don’t receive the benefits they’re seeking. So, this blog post covers some typical constraints and an ideal and realistic mindset that helps embrace these changes.

mu

Based on the identified assumptions and constraints, this enterprise is choosing a framework to manage the provisioning and management of all of its AWS resources. We could write much of this infrastructure as a bespoke solution using some combination of native CloudFormation templates, bash scripts, configuration files, a high-level programming language like Python, and other tools that support running builds, deployments, and different types of tests in a fully automated manner. Instead, we chose a framework that embeds these practices in code, which keeps us from repeating ourselves across the multiple product teams that need the same behavior.

The framework we’re choosing in this scenario is called mu – an open source framework that supports DevOps on AWS infrastructures and pipelines. With mu, you use declarative configuration files to provision environments, pipelines, and services. These primitives provision all the infrastructure necessary to run a full application/services stack on AWS. This includes VPCs, databases (RDS and DynamoDB), ECS, ECR, EKS, Fargate, ECS tasks and services, load balancing, and deployment pipelines. mu is stateless and generates AWS CloudFormation templates as code. mu is flexible through the use of mu extensions, which can be defined to provision resources that are not native to the framework. Below is an example of a mu.yml file that defines environments, a service, and a pipeline. With this definition, mu provisions and integrates over 20 AWS services so that an application/service can be deployed to production any time there’s a business decision to do so.
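A minimal mu.yml along these lines might look like the following sketch. The service name, port, and GitHub repository are placeholders, and the keys follow mu’s documented environments/service/pipeline primitives – check the mu wiki for the authoritative schema:

```yaml
# mu.yml -- illustrative sketch; names and repo are placeholders
environments:
- name: acceptance
- name: production

service:
  name: my-service          # placeholder service name
  port: 8080                # port the container listens on
  healthEndpoint: /health   # endpoint the load balancer checks
  pathPatterns:
  - /my-service             # ALB path routed to this service
  pipeline:
    source:
      provider: GitHub
      repo: myorg/my-service   # placeholder repository
```

Running `mu env up` and `mu pipeline up` style commands against a definition like this drives the CloudFormation stack creation; the mu wiki’s basic example (linked below) shows the exact workflow.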

For more information on creating a basic mu stack, see the basic example on the mu wiki.

AWS Account Management

One of the ways of organizing your AWS accounts is to separate accounts by function and by application/service. The AWS Landing Zone strategy recommends an approach of utilizing AWS Organizations, Service Control Policies, organizational accounts for security, logging, etc. and an account per deployable service using a vending machine pattern. This helps manage AWS resources more effectively and limit the blast radius if there are security incidents. Figure 2 illustrates this approach.

All of these resources can be defined in code via tools like AWS CloudFormation.

Pipeline Factory

A pipeline factory is a construct that generates deployment pipelines for applications/services; these pipelines can be modified/extended with additional stages and actions based on an application’s specific requirements. These deployment pipelines build quality into the product, provide quick and effective feedback, require minimal manual interaction, use the same process (and binaries) from commit to production, and allow teams to deliver (almost) any version at any time. [Source]

For more information on Pipeline Factories with mu, see Pipeline Factory on the mu wiki.

Networking

In AWS, networking can be defined in code. For example, you can define VPCs, Subnets, Security Groups, Internet Gateways, Route Tables and Routes, and NACLs in CloudFormation.
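For instance, a minimal CloudFormation fragment defining a VPC and a public subnet looks like this (logical names and CIDR ranges are illustrative):

```yaml
Resources:
  AppVpc:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16        # illustrative address range
      EnableDnsSupport: true
      EnableDnsHostnames: true
  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref AppVpc
      CidrBlock: 10.0.1.0/24        # carve a public subnet out of the VPC
      MapPublicIpOnLaunch: true     # instances here get public IPs
```

A production-grade network adds private subnets, route tables, gateways, and NACLs, which is exactly the boilerplate the next paragraph describes mu generating for you.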

mu provides support for VPCs and associated resources so that instead of writing many hundreds or thousands of lines of CloudFormation code, you only need to provide a few lines of YAML configuration and get some built-in best practices for defining networks on AWS.

By using the environments primitive in mu, you get a VPC public/private setup by default as shown here:

---
environments:
- name: acceptance

If you want to override the default VPC configuration, you can embed CloudFormation parameters into the mu configuration as shown here:
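A sketch of such an override is shown below. The stack name key (`mu-vpc-acceptance`) and the parameter names are assumptions for illustration; consult the mu wiki for the exact identifiers that mu’s VPC template accepts:

```yaml
# Illustrative override -- stack name and parameter keys are assumptions
environments:
- name: acceptance

parameters:
  mu-vpc-acceptance:          # CloudFormation stack to override
    InstanceTenancy: default  # example VPC parameter
    SshAllow: 10.0.0.0/8      # example: restrict SSH to internal ranges
```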

For more information on VPC networking automation with mu, see Environments->VPC on the mu wiki.

Load Balancing

Elastic Load Balancing (ELB) automatically distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, and IP addresses. It can handle the varying load of your application traffic in a single Availability Zone or across multiple Availability Zones. [Source]

You can provision load balancers in AWS using CloudFormation directly as part of a fully automated deployment pipeline.

With mu, you can define configuration that provisions and allows you to configure Application Load Balancers (ALBs) for your applications/services.

service:
  ...
  # The paths to match on in the ALB and route to this service.
  pathPatterns:
  - /bananas
  - /apples
  # The hostnames to match on in the ALB and route to this service.
  hostPatterns:
  - my-service.*
  # The priority for resolving the pathPatterns from the ALB
  priority: 25

Compute

You use compute resources to run applications, whether that’s on EC2, containers, or Lambda. For containers on AWS, mu supports ECS, EKS, and ECS with Fargate. ECS is a scheduling and orchestration service for Docker containers, EKS is a managed Kubernetes engine, and Fargate provides serverless provisioning of the instances that make up ECS clusters.

Since the enterprise wants to leverage AWS as much as possible, a solution we might suggest is to use ECS with Fargate.

To do this in mu, you simply list the provider as ecs-fargate in the environments primitive in the mu configuration.
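For example (the environment name is illustrative; the `provider` key is mu’s documented mechanism for selecting ECS, EKS, or Fargate):

```yaml
environments:
- name: acceptance
  provider: ecs-fargate   # run this environment's containers on ECS with Fargate
```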

For examples of using different providers, see the mu FAQ and search for Fargate, EKS, and EC2.

Databases

With AWS, there are many types of database options from which to choose. Most notably: RDS and DynamoDB.

RDS supports Aurora, MariaDB, MySQL, PostgreSQL, Oracle, and SQL Server. It also has an Aurora Serverless option. RDS handles a lot of the database management for you, and with Aurora Serverless, you don’t need to worry about scaling or fault tolerance either.

DynamoDB is a fully managed NoSQL database that scales out and in based on your desired capacity.

mu supports both RDS and DynamoDB through its service primitive. An example configuring a MySQL RDS database in mu is shown below.
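A sketch of that configuration follows. The `database` block hangs off mu’s service primitive; the database name, instance class, and storage values here are placeholders, and the exact supported keys are documented on the mu wiki:

```yaml
# Illustrative sketch -- values are placeholders
service:
  name: my-service
  database:
    name: myappdb            # placeholder schema name
    engine: mysql            # provision a MySQL RDS instance
    instanceClass: db.t2.small
    allocatedStorage: 20     # GiB
```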

For more information on services with mu, see Services on the mu wiki.

Service Discovery

Service discovery allows a service to register itself with a predictable DNS name that other services can refer to in order to make connections – particularly as these services scale out or in.

With ECS Service Discovery, the ECS service automatically registers itself with a predictable and friendly DNS name in Amazon Route 53. As your services scale up or down in response to load or container health, the Route 53 hosted zone is kept up to date, allowing other services to look up where they need to make connections based on the state of each service.

mu uses ECS Service Discovery. You can see an example of how to configure this capability in mu below.
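As a hypothetical sketch, the configuration might look like the following. The `namespace` key and its value are assumptions, not confirmed mu schema; consult the mu wiki page linked below for the actual keys:

```yaml
# Hypothetical sketch -- 'namespace' is an assumption; see the mu wiki
environments:
- name: acceptance
  namespace: internal.example.com   # private Route 53 DNS namespace (illustrative)

service:
  name: my-service   # other services would resolve this via DNS (illustrative)
```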

For more information on Service Discovery with mu, see Service Discovery on the mu wiki.

Extensibility

With any platform or framework you want the flexibility to extend the functionality when you want to use different tools or approaches.

mu provides this capability through extensions, which provide the ability to include custom CloudFormation configuration to run as part of a mu invocation. See the example of referring to an existing extension below.
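Extensions are referenced by URL from the mu configuration, as sketched here (the repository URL is a placeholder, not a real extension):

```yaml
# Illustrative -- the URL is a placeholder for a published mu extension
extensions:
- url: https://github.com/myorg/my-mu-extension/archive/master.zip
```

mu merges the CloudFormation defined in the extension into the stacks it generates, so teams can layer custom resources on top of the framework’s defaults.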

DNS

“Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS) web service. It is designed to give developers and businesses an extremely reliable and cost effective way to route end users to Internet applications by translating names like www.example.com into the numeric IP addresses like 192.0.2.1 that computers use to connect to each other. Amazon Route 53 is fully compliant with IPv6 as well”. [Source]

You can automate the provisioning of many features of Route 53 using AWS CloudFormation.

mu makes this automated provisioning even easier by using the environments primitive.
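For instance, associating an environment’s load balancer with a Route 53 hosted zone can be sketched as follows. The `loadbalancer`/`hostedzone` keys are drawn from mu’s environments schema as an assumption, and the domain is a placeholder:

```yaml
# Illustrative sketch -- keys are assumptions; domain is a placeholder
environments:
- name: production
  loadbalancer:
    hostedzone: example.com   # mu creates DNS records in this hosted zone
```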

Encryption at Rest

AWS makes encryption at rest easy using the Key Management Service (KMS), which is a managed service that makes it easy to create and control the encryption keys used to encrypt data, and uses FIPS 140-2 validated hardware security modules to protect the security of your keys [Source]. KMS can be used to encrypt data/files at rest for the following services: CloudTrail, DynamoDB, EBS, RDS, S3, and others.
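For example, an illustrative CloudFormation fragment that creates a KMS key and uses it to encrypt an S3 bucket by default might look like this (resource names are placeholders):

```yaml
Resources:
  DataKey:
    Type: AWS::KMS::Key
    Properties:
      Description: Key for encrypting application data at rest
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
        - Sid: AllowAccountAdmin     # minimal policy: account root administers the key
          Effect: Allow
          Principal:
            AWS: !Sub arn:aws:iam::${AWS::AccountId}:root
          Action: kms:*
          Resource: '*'
  DataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketEncryption:
        ServerSideEncryptionConfiguration:
        - ServerSideEncryptionByDefault:
            SSEAlgorithm: aws:kms    # server-side encryption with the KMS key below
            KMSMasterKeyID: !Ref DataKey
```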

Secrets Management

As part of applications and deployments, you’ll often need to store secret configuration or parameters such as passwords in an encrypted format. To do this in AWS, you can use AWS Secrets Manager and AWS Parameter Store.
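As a minimal illustration, a CloudFormation fragment that creates a generated database password in Secrets Manager might look like this (the secret name and username are placeholders):

```yaml
Resources:
  AppDbSecret:
    Type: AWS::SecretsManager::Secret
    Properties:
      Name: /myapp/db/password            # placeholder secret name
      GenerateSecretString:
        SecretStringTemplate: '{"username": "admin"}'   # placeholder username
        GenerateStringKey: password       # key to hold the generated value
        PasswordLength: 32
        ExcludeCharacters: '"@/\'
```

Applications then retrieve the secret at runtime rather than baking credentials into code or configuration.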

In addition, you might extend capabilities to use Secrets Manager or Parameter Store in different contexts.

Development Environments

AWS Cloud9 is an Integrated Development Environment (IDE) for writing, running, and debugging code, all from your web browser as part of the AWS platform and console. Under the hood, Cloud9 is running on an EC2 instance and you can configure it to run within a VPC.

It provides the following benefits:

Code with others in the same environment in real time.

Deep integration with Serverless tools such as AWS Lambda

Direct terminal access with tools such as the AWS CLI and Git already configured

Security and Compliance

DevSecOps is about making security an integral part of software systems rather than some check tacked on right before (or after) a software system goes to production. It embraces the “shift left” concept of moving security checks earlier and making them integral to the software development lifecycle.

Practically speaking, teams that embrace the DevSecOps approach define security as code. For example, using tools like cfn_nag to run static security analysis checks against CloudFormation code prior to instantiating an environment on AWS, or automatically running through the OWASP Top 10 for application security. In addition, security outside of deployment pipelines is also defined in code, whether it be AWS Organizations, IAM, Macie, GuardDuty, or KMS – to name a few. Here’s an example of referring to the cfn_nag mu extension in order to automatically run security checks against all CloudFormation templates:
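The extension is referenced by URL in the mu configuration; the URL below is an assumption for illustration – use the published location of the cfn_nag mu extension:

```yaml
# Illustrative -- the extension URL is an assumption
extensions:
- url: https://github.com/stelligent/mu-cfn_nag/archive/master.zip
```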

Compliance

With AWS, we can define our compliance as code using AWS Config Rules. Config Rules provides built-in managed rules and the ability to create custom rules with the AWS Config Rules Development Kit (RDK), which helps developers set up, author, and test custom Config rules. The RDK contains scripts to enable AWS Config, create a Config rule, and test it with sample ConfigurationItems.
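As a sketch, a built-in managed rule that checks that attached EBS volumes are encrypted can be enabled with a few lines of CloudFormation (the logical name is illustrative; ENCRYPTED_VOLUMES is one of AWS Config’s managed rule identifiers):

```yaml
Resources:
  RequireEbsEncryption:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: encrypted-volumes        # illustrative rule name
      Source:
        Owner: AWS                             # use an AWS-managed rule
        SourceIdentifier: ENCRYPTED_VOLUMES    # checks EBS volumes are encrypted
```

Custom rules follow the same shape with `Owner: CUSTOM_LAMBDA` and a Lambda function authored with the RDK.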

Segregation of Duties

Segregation of Duties (SoD) is a common concern particularly in enterprises where there’s significant risk if systems are breached.

Here are concrete steps for adhering to SoD while embracing the DevOps mindset of speeding up feedback loops:

All changes to the software system of record (the application/service code, configuration, infrastructure, data, and anything else that makes up a software system) are made only through code committed to a version control system. Following the principle of least privilege, some roles might have read-only access to production systems as well.

For every commit, someone who wasn’t involved in authoring the change reviews the code immediately after the commit to version control – for example, through a pull request in the version control system, such as GitHub, CodeCommit, or Bitbucket

The entire workflow is fully automated and specific controls are in place that log and limit access to production environments. All pipeline events are logged and can be audited at any point. Traceability exists for linking code commits to features, tests, and other artifacts in the issue tracking systems (e.g. JIRA or others)

Engineering Practices

There are numerous everyday engineering practices that embrace the DevOps mindset. I’ve highlighted a few of the important practices in this section.

Everything as Code

Here’s a six-part heuristic that we often employ with enterprises. The six parts are: document, test, code, version, continuous and monitor. This is another way of encouraging the enterprise to use good software engineering practices when it comes to infrastructure and the rest of the software system. Here are some details and examples:

Document – This might include READMEs, architecture diagrams and written instructions.

Test – All parts of the software system are tested – application code, infrastructure, security, and the pipeline itself.

Code – All software system assets are defined in code – infrastructure, configuration, data, and the deployment pipeline. For example: AWS CloudFormation.

Version – All software system assets are versioned – application and test code, configuration, infrastructure and data. For example: GitHub or CodeCommit.

Continuous – In this context, Continuous means you’re building, testing, deploying and releasing your software with every good change. For example: AWS CodePipeline provides the kind of organization and structure to help you orchestrate the sequence and execution of all of the code.

Monitor – Enterprises might monitor using CodePipeline itself along with other tools such as CloudWatch, CloudWatch Logging, CloudTrail and many of the built-in logging services to AWS.

Immutable Infrastructure

Immutable Infrastructure is when you deploy new infrastructure when making changes instead of making changes to an existing environment. Most teams apply this pattern by only making these infrastructure changes through versioned automation. There are no manual changes made to environments. Furthermore, the infrastructure is always brought up in its entirety. [Source] AWS CloudFormation is an example of a tool in which you can define infrastructure as code to provide this immutability.

Stop the Line

Fix software delivery errors as soon as they occur; i.e., stop the line. No one checks in on a broken build, as the fix becomes the highest priority [Source]. This practice is essential for any well-functioning pipeline and team, as it reduces the complexity and cost of changes – errors aren’t compounded by being stacked up with many other changes. This practice works well with committing in small batches.

Automated Software Delivery Metrics

In the book, Accelerate, by Nicole Forsgren, et al., four metrics are described as a way of measuring effective software delivery. They are: lead time for changes, deployment frequency, time to restore service, and change failure rate. Each is described in more detail below:

Lead time for changes – “the time it takes to go from code committed to code successfully running in production”. For example, in AWS you can calculate these times by using the CodePipeline API to get the time a revision was started and the time that same revision is deployed to production.

Deployment frequency – “A proxy for batch size since it is easy to measure and typically has low variability. By “deployment” we mean a software deployment to production or to an app store.” To calculate deployment frequency in AWS, you can count the number of times revisions were deployed to production via CodePipeline.

Time to restore service – On average, how long it takes to restore service when an incident occurs. With AWS CodePipeline, you can run synthetic tests in production and calculate the time between when a failure was identified and when it was fixed.

Change failure rate – How often do deployment failures occur in production that require immediate remedy (particularly, rollbacks)? For example, you can use the data from time to restore service to calculate the change failure rate over time based on the deployment frequency.

Test-Driven Development

This was already addressed in the six-part heuristic, but it’s worth pointing out again, as enterprises will want to apply test-driven development (TDD) to all parts of the software system. This might include unit, infrastructure, acceptance, load, and security testing.

A common practice to employ with TDD is known as red-green-refactor in which you start by writing a failing test, make it green, then refactor the code. This approach often helps engineers consider the design of the component they’re developing and ensures that tests can be run as part of a continuous delivery lifecycle. Here’s an example of a static infrastructure test using RSpec:

describe('dromedary_security_group') do
  it 'will not allow all traffic on 22' do
    # ... fetch the port-22 ingress rules for the security group (elided) ...
    cidr = twentytwo.first.ip_ranges.first.cidr_ip
    expect(cidr).not_to eq "0.0.0.0/0"
  end
end

Organizational Models

There are many different strategies and team topologies to choose from that embrace a DevOps mindset.

For this enterprise, we’re choosing a bit of a hybrid of Type 1: Dev and Ops Collaboration and Type 3: Ops as Infrastructure-as-a-Service (Platform) as described at DevOps Topologies, in which we’re leaning toward the Type 3 model. We’re choosing this approach because it has the greatest chance of scaling in an enterprise that is new to many of the DevOps concepts.

In this scenario, we establish a platform team with some key capabilities that we know all product teams will use and create these features as services that can be consumed without human intervention.

The AWS multi-account vending machine pattern is employed so that the product team has 1-n accounts to support their development and getting the application and its services into production. Security and compliance are baked into the configuration of these accounts, including the real-time compliance checks previously discussed to ensure compliance across AWS accounts. What’s more, product teams can use the pipeline factory to generate pipelines, which they can extend with additional stages and actions to run tests and other checks specific to the services they’re developing. The product teams have access to all the code to make changes, and there are guardrails in place via AWS Service Catalog, Service Control Policies, IAM permissions, and Config Rules to ensure they work according to the policies of the enterprise.

There will be a group of engineers/coaches who work with a handful of product teams onboarding them to the platform and iterating through improvements to the capabilities of the platform. The product team has the freedom and the responsibility that a platform like this encourages.

Learning at Scale

Often in enterprises, there’s a small percentage of early adopters and a vast majority of others who ride the adoption curve. The first few teams you engage with often need hand-holding, which includes coding, testing, etc. These first teams are often the most progressive and open-minded about the change in mindset that DevOps requires.

This is a time to start getting the word out about how to embrace DevOps by holding workshops, dojos, or other immersive training. It’s good to cover the key concepts and benefits but most engineers want to get their “hands dirty” and work through some scenarios in anticipation of applying it to their team later.

Experiment and Scale

In our experience, enterprises often want to solve the entire “problem” and move to a DevOps way of working all at once. In other words, they want to solve it from an enterprise perspective in one go. This usually has limited success. We find it’s important to have an enterprise vision for DevOps, but in terms of implementing specific practices, it’s best to start small, experiment, and improve.

The first few pilot teams should do their best to carve out a small service with limited dependencies and start incorporating changes when it comes to DevOps practices and tooling. They should be able to deploy the service to production but also have the ability to conduct small experiments in terms of DevOps practices along the way. This is where tracking the software delivery metrics can help in guiding behavior on where to focus improvement efforts.

Summary

In this post, I described how to leverage AWS services, organize teams, make changes to engineering practices, and improve processes based on an enterprise with certain requirements and constraints. To be clear, this is only one type of implementation based on some example assumptions and constraints. Every enterprise is different.