
Today we are blogging a little differently than usual. I’m Jim Hirschauer, the Operations Guy, and this is my esteemed colleague Dustin Whittle, the Developer. In this blog post we’re going to discuss how we would take an application from inception through development, testing, QA, and into production. We’ll each comment on the different stages and give our perspective on the tools we need at each stage and how they help with automation, testing, and monitoring. Along the way we’ll call out the potential collaboration points to identify the areas where the DevOps approach provides the most value.

The software development loop looks like this:

Inception and Working with a Product Team

From an operational perspective, my first instinct is to understand the application architecture so that I can start thinking about the proper deployment model for the infrastructure components. Here are some of my operational questions and considerations for this stage:

Are we using a public or private cloud?

What is the lead time for spinning up each component and ensuring that they comply with my company’s regulations?

When do I need to provide a development environment to my dev team or will they handle it themselves?

Does this application perform functions that other applications or services already handle? Operations should have high-level visibility into the application and service portfolio.

From a development perspective, my first milestone is to make sure the ops team fully understands the application and what it takes to deploy it to a pre-production environment. This is where we, the developers, sync with the product and ops teams and make sure we are aligned.

Planning for the product team:

Is the project scope well-defined? Is there a product requirements document?

Do we have a well-defined product backlog?

Are there mocks of the user experience?

Planning for the ops team:

What tools will we use for deployment and configuration management?

How will we automate the deployment process and does the ops team understand the manual steps?

How will we integrate our builds with our continuous integration server?

How will we automate the provisioning of new environments?

Capacity planning: Do we know the expected production load?

There’s not a ton of activity at this stage for the operations team. This is really where the DevOps synergy comes into play. DevOps is simply operations working together with engineers to get things done faster in an automated and repeatable way. When it comes to scaling, the more automation in place the easier things will be in the long run.

Development and Scoping Production

This should start with a conversation between the dev and ops teams to settle domain ownership. Depending on your organization and your peers’ strengths, this is a good time to decide who will be responsible for automating the provisioning and deployment of the application. The ops questions for deploying complex web applications:

How do you provision virtual machines?

How do you configure network devices and servers?

How do you deploy applications?

How do you collect and aggregate logs?

How do you monitor services?

How do you monitor network performance?

How do you monitor application performance?

How do you alert and remediate when there are problems?

During the development phase, the operations-focused staff normally make sure the development environment is managed and actively work to set up the test, QA, and production environments. This can take a lot of time if automation tools aren’t used.

Here are some tools you can use to automate server build and configuration:

Meanwhile, the operations staff should also make sure that the developers have access to tools that will help them with release management and application monitoring and troubleshooting. Here are some of those tools:

Testing and Quality Assurance

Once developers have built unit and functional tests, we need to ensure the tests run after every commit and that we don’t allow regressions in our promoted environments. In theory, developers should do this before they commit any code, but oftentimes problems don’t show up until you have production traffic running on production infrastructure. The goal of this step is to simulate, as far as possible, everything that can go wrong, and to find out what happens and how to remediate it.

The next step is to do capacity planning and load testing to be confident that the application doesn’t fall over when it is needed most. There are a variety of tools for load testing:

Multi-Mechanize: Multi-Mechanize is an open source framework for performance and load testing. It runs concurrent Python scripts to generate a load (synthetic transactions) against a remote site or service. Multi-Mechanize is most commonly used for web performance and scalability testing, but can be used to generate workloads against any remote API accessible from Python.

Google PageSpeed Insights: PageSpeed Insights analyzes the content of a web page, then generates suggestions to make that page faster. Reducing page load times can reduce bounce rates and increase conversion rates.
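To make the idea of concurrent scripts generating synthetic transactions concrete, here is a minimal sketch using only the Python standard library. It is not Multi-Mechanize's API; `run_load` and `fake_transaction` are hypothetical names chosen for illustration, and the stand-in transaction just sleeps instead of calling a real site.

```python
import threading
import time
from statistics import mean


def run_load(transaction, workers=4, iterations=5):
    """Call `transaction` from several threads and collect timings."""
    timings = []  # seconds per successful call
    errors = []   # exceptions raised by failed calls
    lock = threading.Lock()

    def worker():
        for _ in range(iterations):
            start = time.perf_counter()
            try:
                transaction()
            except Exception as exc:  # a failed transaction counts as an error
                with lock:
                    errors.append(exc)
                continue
            elapsed = time.perf_counter() - start
            with lock:
                timings.append(elapsed)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return {
        "calls": len(timings) + len(errors),
        "errors": len(errors),
        "avg_seconds": mean(timings) if timings else None,
        "max_seconds": max(timings) if timings else None,
    }


def fake_transaction():
    """Stand-in for a real request against the service under test."""
    time.sleep(0.01)


if __name__ == "__main__":
    print(run_load(fake_transaction, workers=4, iterations=5))
```

In a real test you would swap `fake_transaction` for a function that performs an actual request against a test environment, and raise `workers` until response times or error counts start to degrade.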

The last step of testing is discovering all of the possible failure scenarios and coming up with a disaster recovery plan. For example, what happens if we lose a database or a data center, or have a 100x surge in traffic?

During the test and QA stages, operations needs to play a prominent role. This is often overlooked by ops teams, but their participation in test and QA can make a meaningful difference in the quality of the release into production. Here’s how.

If the application is already in production (and monitored properly), operations has access to production usage and load patterns. These patterns are essential to the QA team for creating a load test that properly exercises the application. I once watched a functional test where 20+ business transactions were tested manually by the application support team. Directly after the functional test I watched the load test that ran the same 2 business transactions over and over again. Do you think the load test was an accurate representation of production load? No way! When I asked the QA team why there were only 2 transactions they said “Because that is what the application team told us to model.”

The development and application support teams usually don’t have time to sit with the QA team and give them an accurate assessment of what needs to be modeled for load testing. Operations teams should act as the intermediary and provide business transaction information from production, or from development if this is an application that has never seen production load.
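As a rough illustration of modeling a load test on production patterns, the sketch below samples transactions in proportion to how often they occur in production. The transaction names and counts are made up for the example; real numbers would come from your monitoring data.

```python
import random
from collections import Counter


def build_load_mix(production_counts, total_requests, seed=None):
    """Return a list of transaction names sampled in production proportions."""
    rng = random.Random(seed)
    names = list(production_counts)
    weights = [production_counts[name] for name in names]
    return rng.choices(names, weights=weights, k=total_requests)


# Hypothetical hourly counts, for illustration only.
PRODUCTION_COUNTS = {
    "login": 5000,
    "search": 12000,
    "add_to_cart": 2500,
    "checkout": 500,
}

if __name__ == "__main__":
    mix = build_load_mix(PRODUCTION_COUNTS, total_requests=1000, seed=42)
    # Show how many times each transaction would run in the load test.
    print(Counter(mix).most_common())
```

A load test driven by a mix like this exercises every business transaction roughly as often as real users do, rather than replaying the same two over and over.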

Production

Production is traditionally the domain of the operations team. For as long as I can remember, development teams have thrown applications over the production wall for the operations staff to deal with when there are problems. Sure, some problems like hardware issues, network issues, and cooling issues rest purely on the shoulders of operations, but what about all of those application-specific problems? For example, the application may consume far too many resources, have connection issues with the database due to a misconfiguration, or simply lock up and need a restart.

I recall getting paged in the middle of the night for application-related issues and thinking how much better each release would be if the developers had to support their applications once they made it to production. It was really difficult back in those days to say with any certainty that the problem was application related and that a developer needed to be involved. Today’s monitoring tools have changed that and allow for problem isolation in just minutes. Because developers in financial services organizations are not allowed access to production servers, having the proper tools is all the more important.

Production DevOps is all about:

deploying code in a fast, repeatable, scalable manner

rapidly identifying performance and stability problems

alerting the proper team when a problem is detected

rapidly isolating the root cause of problems

automatic remediation of known problems and rapid manual remediation of new problems (runbooks and runbook automation)

Your application must always be available and operating correctly during business hours (this may be 24x7 for your specific application).

In case of failures, alerting tools are crucial to notify the ops team of serious issues. The operations team will usually have a runbook to turn to when things go wrong. A best practice is to collaborate on incident response plans.
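Here is a small sketch of how runbook automation and alert routing could fit together: known problems get an automated remediation, and everything else pages the team that owns it. All problem names, team names, and remediation functions are hypothetical placeholders, not part of any particular tool.

```python
def restart_service(alert):
    """Placeholder remediation; a real one would call your deploy tooling."""
    return f"restarted {alert['service']}"


def clear_temp_files(alert):
    """Placeholder remediation for a full disk."""
    return f"cleared temp files on {alert['host']}"


# Known problems mapped to automated runbook steps (names are hypothetical).
RUNBOOK = {
    "app_hung": restart_service,
    "disk_full": clear_temp_files,
}

# Which team owns which kind of problem when a human is needed.
OWNERS = {
    "app_hung": "dev",
    "db_connection_error": "dev",
    "disk_full": "ops",
}


def handle_alert(alert, notify):
    """Auto-remediate known problems; otherwise page the owning team."""
    action = RUNBOOK.get(alert["problem"])
    if action is not None:
        return {"handled": "auto", "detail": action(alert)}
    team = OWNERS.get(alert["problem"], "ops")  # default to the ops on-call
    notify(team, alert)
    return {"handled": "paged", "detail": team}


if __name__ == "__main__":
    def print_notify(team, alert):
        print(f"paging {team}: {alert['problem']} on {alert['host']}")

    hung = {"problem": "app_hung", "service": "web", "host": "app01"}
    db = {"problem": "db_connection_error", "service": "web", "host": "app01"}
    print(handle_alert(hung, print_notify))
    print(handle_alert(db, print_notify))
```

Keeping the runbook and the ownership table as plain data makes them easy for dev and ops to review together, which is one practical way to collaborate on incident response.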

Maintenance

Finally, we’ve made it to the last major category of the SDLC: maintenance. As an operations guy, my mind focuses on the following tasks:

Capacity planning: Do we have enough resources available to the application? If we use dynamic scaling, this becomes less a question of raw capacity and more a task of ensuring that scaling is working properly.

Patching: Are we up to date with patches on the infrastructure and application components? This is supposed to help with performance and/or security and/or stability but it doesn’t always work out that way.

Support: Are we current with our software support levels (i.e., have we paid and are we on supported versions)?

New releases (application updates): New releases always made me cringe since I assumed the release would have issues the first week. I learned this reaction from some very late nights immediately following those new releases.
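For the capacity planning task above, a back-of-the-envelope check can be sketched as follows. The model is deliberately simple and is an assumption on our part: each instance handles a fixed number of requests per second, and we hold back a fraction of that as headroom. The function names and figures are illustrative only.

```python
import math


def instances_needed(peak_rps, rps_per_instance, headroom=0.3):
    """Instances required to serve peak load while keeping spare headroom."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = rps_per_instance * (1 - headroom)  # capacity we plan to use
    return math.ceil(peak_rps / usable)


def capacity_report(peak_rps, rps_per_instance, current_instances, headroom=0.3):
    """Compare what we need at peak with what we currently run."""
    needed = instances_needed(peak_rps, rps_per_instance, headroom)
    return {
        "needed": needed,
        "current": current_instances,
        "shortfall": max(0, needed - current_instances),
    }


if __name__ == "__main__":
    # Hypothetical figures; real ones come from load tests and monitoring.
    print(capacity_report(peak_rps=1000, rps_per_instance=100,
                          current_instances=8, headroom=0.5))
```

With dynamic scaling in place, the same calculation is still useful as a sanity check that the scaling limits you have configured actually cover the expected peak.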

As a developer, the biggest issue during the maintenance phase is working with the operations team to deploy new versions and make critical bug fixes. The other primary concern is troubleshooting production problems. Even when no new code has been deployed, sometimes failures happen. If you have a great process, application performance monitoring, and a DevOps mentality, collaborating with ops to resolve the root cause of failures becomes easy.

As you can see, the dev and ops perspectives are pretty different, but that’s exactly why those two sides of the house need to tear down the walls and work together. DevOps isn’t just a set of tools, but a philosophical shift that requires buy-in from all folks involved to really succeed. It’s only through a high level of collaboration that things will change for the better. AppDynamics can’t change the mindset of your organization, but it is a great way to foster collaboration across all of your organizational silos. Sign up for your free trial today and make a difference for your organization.
