Thursday, 26 April, 2018 UTC


I don’t think there is a correct or one-size-fits-all definition of Shift-Left, nor is there one for DevOps, Cloud-Native or any of the other heavily used terms. What I do know is that there are experts in our industry trying to make it easier for engineers to get faster, self-service feedback on the quality, performance and scalability of their proposed code changes.

ADP, LLC, a Dynatrace customer, is on a transformation journey to NextGen Payroll Innovation for Global Platform, with advanced payroll features and modern architecture. The new ADP Payroll Innovation platform uses AWS public cloud. Part of this journey includes a transformation of their development practices, their delivery pipeline and how they provide feedback to engineering and business earlier in the development cycle!

This blog, therefore, is about Venugopal Kalicheti, Director Performance Engineering at ADP. It will examine how he and his team members, Kandaswamy Selvaraj, Principal Architect and Satish Gundreddy, Principal Developer, are leveraging cloud services from AWS, containers and monitoring from Dynatrace to provide faster performance, scalability and infrastructure feedback to the engineering and development teams. The views expressed on this blog are those of the blog authors, and not necessarily those of ADP.

Instead of executing performance tests at the end of the Kanban time-box or at the end of the release cycle, the team runs performance tests for every build deployment, plus daily scheduled runs against the latest code changes of each microservice. This gives both the infrastructure and development teams more and faster feedback about the impact of code or configuration changes on performance, scalability and resource consumption.

In a remote desktop session, their performance engineering team showed me their Dynatrace environment. I took a couple of screenshots and notes and hope that this will inspire others to redefine performance engineering with the power of the cloud, containers and Dynatrace AI!

Let’s start by explaining what type of environment Venu and his team are providing to the engineering teams to get faster performance feedback. They run a Dynatrace Managed installation in their AWS VPC, running 10 different Dynatrace Tenants to separate the different pre-production environments they are monitoring (Feature1, Feature2, Feature n, Performance, etc.). In the future they will most likely adopt Dynatrace Management Zones, which make handling different environments easier within a single Dynatrace Tenant.

On the infrastructure side, their environments leverage Terraform and Ansible to deploy EC2 instances on which they run Mesos orchestration to host their Spring Boot Java services in Docker containers!

Here are some of the benefits they get out of the Dynatrace cloud and container monitoring capabilities:

One challenge in very dynamic container environments is keeping track of all the currently deployed containers, where they run and how many resources they consume. Making sure that the containers that should be running have enough resources and can reach all the services they depend on is another challenge in highly dynamic environments. Dynatrace provides that type of visibility out-of-the-box!

Dynatrace provides full container visibility by either being deployed on the Docker host or running as a Docker container itself.

For a Docker host, Dynatrace provides visibility into the actual containers deployed on that host and shows how many resources these containers consume right now and over time:

Easy overview of all containers overall, grouped by image or by host. All these metrics are also automatically fed into the Dynatrace AI for automatic anomaly detection.

Benefit #3: Dynatrace leverages CloudWatch for Metrics and for Tags

ADP has also set up the Dynatrace AWS CloudWatch integration, which not only automatically pulls in key CloudWatch metrics of AWS Services such as S3, EBS, RDS, DynamoDB, Lambda, ELB … but also pulls in Tags from EC2 instances, which get automatically applied to the monitored hosts where a Dynatrace OneAgent is installed. This makes managing all performance data much easier because dashboards, filters and notifications can be set up using the same tags as already defined on AWS.

And here is the Dynatrace AWS Overview showing how many AWS services and resources the ADP teams are using in their Performance-Dev Environment:

Most of this data is pulled from AWS CloudWatch and gets combined with data captured from Dynatrace OneAgent, Plugins or pushed via the Dynatrace REST API.

TIP: If you want to learn more about basic AWS monitoring, I suggest you walk through my 101 AWS Monitoring GitHub Tutorial or watch my 101 AWS Monitoring Performance Clinic on YouTube. There, I explain the integration with CloudWatch, how to deploy OneAgents on EC2 and how to monitor applications deployed on Beanstalk, ECS or Lambda!

#2: Ensuring Healthy Cloud Infrastructure with help of Dynatrace AI

While having more monitoring data available is great, thanks to the Dynatrace OneAgent and the AWS CloudWatch integration, it doesn’t mean that Venu’s team has to spend more time analyzing more data points. This is where the Dynatrace automatic baselining, anomaly and root cause detection helps.

I asked Venu if he had an example of how the Dynatrace AI helps his infrastructure team maintain a stable and healthy environment for the services that the development teams run on top.

He opened the Dynatrace Problem View, clicked on the Infrastructure filter and then walked me through the following screenshots. He showed me how Dynatrace detected a network connectivity issue of several haproxy instances running in Docker containers orchestrated by Mesos across several EC2 instances.

The Dynatrace OneAgent monitors every single container and all processes running in these containers. OneAgent automatically detects technologies and services such as haproxy, message queues, web- or application servers, databases, …

The Dynatrace Anomaly Detection understands which metrics are important for each type of service and reports an anomaly if a metric shows problematic or unusual behavior. Thanks to this auto-detection capability, ADP’s infrastructure team can react much more quickly to infrastructure-related problems, before they start impacting the services that run on that infrastructure:

Automatic detection of Connectivity issues on this haproxy that runs in a container on Mesos on EC2

Benefit #2: Dynatrace Automatic Dependency and Impact Detection

Thanks to Smartscape, the infrastructure team not only knows which infrastructure components or critical services are currently in an unhealthy state; Smartscape also shows all the services that haproxy connects to and that depend on it. Based on that information, it is easier to understand the potential impact on higher-level services, applications or even end users. This also helps to prioritize remediation actions, whether executed manually or automated.

Smartscape visualizes where the problematic haproxy actually “lives” and which other services are depending on it

TIP: If you want to learn more about how Dynatrace helps your IT Service and Operations (ITSM), check out the information on our ServiceNow Integration or the YouTube tutorial on how you can integrate Dynatrace with any CMDB. If you want to see other examples of detected infrastructure problems, check out my recent blog on AI In Action: RabbitMQ, Cassandra and JVM Memory.

#3: Shift-Left Performance Feedback with the help of Dynatrace AI

What runs on top of this dynamic cloud & container infrastructure? I am sure you guessed the answer: the services that Venu’s development teams are trying to get performance feedback on. Most of the services they implement with Spring Boot expose REST APIs for their B2B offerings. Some of these APIs have well-defined SLAs, which is why Venu decided to define several custom thresholds for the different REST endpoints.
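To make the idea of per-endpoint thresholds concrete, here is a minimal Python sketch. It is only an illustration: ADP configures these thresholds inside Dynatrace, not in code like this. The endpoint paths, the fallback value and the function name are hypothetical; only the 200ms figure matches the SLA mentioned later in this post.

```python
from typing import Dict, List

# Hypothetical endpoints; 200 ms mirrors the SLA mentioned in this post.
SLA_MS: Dict[str, float] = {
    "/payroll/v1/runs": 200.0,
    "/payroll/v1/employees": 200.0,
}
DEFAULT_SLA_MS = 500.0  # assumed fallback for endpoints without an explicit SLA


def find_sla_violations(response_times_ms: Dict[str, List[float]]) -> Dict[str, int]:
    """Count, per endpoint, how many requests exceeded that endpoint's threshold."""
    violations: Dict[str, int] = {}
    for endpoint, samples in response_times_ms.items():
        limit = SLA_MS.get(endpoint, DEFAULT_SLA_MS)
        count = sum(1 for t in samples if t > limit)
        if count:
            # Only endpoints with at least one violation are reported.
            violations[endpoint] = count
    return violations
```

Keeping the thresholds in one mapping per endpoint, rather than one global limit, is what lets a strict B2B API and a more tolerant internal endpoint be judged by different standards in the same test run.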

When developers make code changes, those get automatically deployed with the next scheduled build and get automatically tested. LoadRunner and JMeter are used to generate the load against their various REST APIs. These tests typically run for a little over an hour, after which developers proactively go to the Dynatrace dashboards to analyze how their code was performing, where the hotspots are and where there might be scalability issues.

While Dynatrace gives them access to all data through the dashboards, the team started to see the benefit in time savings when using the Dynatrace AI, which automatically detected problems and root causes. The time saved can be better spent on building new features instead of manually analyzing the same metrics, log files, stack traces or CPU samples every time a test executes.

The following screenshot shows an automatically detected problem that happened during one of the automated load test runs. There was a 47% slowdown of a specific REST API endpoint caused by a CPU spike on that EC2 Linux machine, where Mesos hosts the Tomcat process in a container:

Dynatrace automates all the manual work a performance engineer would do, highlighting slowdowns on individual endpoints and surfacing the potential root cause.
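For readers wondering what a figure like the 47% slowdown above means arithmetically, here is a small Python sketch. It is plain percentage math, not the Dynatrace baselining algorithm, and the baseline and observed response times in the usage note are made-up numbers; the post does not state the actual values.

```python
def slowdown_percent(baseline_ms: float, observed_ms: float) -> float:
    """Relative increase of the observed response time over the baseline, in percent."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    # Multiply before dividing to keep the result exact for round inputs.
    return (observed_ms - baseline_ms) * 100.0 / baseline_ms
```

With a hypothetical baseline of 100ms and an observed response time of 147ms, `slowdown_percent(100.0, 147.0)` yields a 47% slowdown. A negative result would indicate a speed-up.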

Benefit #2: Dynatrace Pro-Active Alerting of Dev Teams

Thanks to the host, process and service tagging capabilities of Dynatrace, each service is tagged with the name of the team responsible. In case Dynatrace detects a problem, the team automatically gets notified thanks to the Alerting Profile feature in Dynatrace. Alerting Profiles allow sending problem notifications ONLY to those teams of services where a problem was detected. The notification (email, JIRA, Slack …) also gets sent out immediately when the problem is detected and not only at the end of the test. This also speeds up the feedback loop cycle time in case a code change has an obvious issue which can be detected by Dynatrace within minutes.

Dynatrace can notify teams immediately when the problem is detected. This shortens the feedback loop.
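The routing idea behind this, where only the team that owns an affected service gets notified, can be sketched in a few lines of Python. This is not how Alerting Profiles are implemented or configured; the `team:` tag convention, the shape of the problem dictionaries and the channel names are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def route_problems_by_team(
    problems: Iterable[dict], team_channels: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group problem titles by the notification channel of the owning team."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for problem in problems:
        for tag in problem.get("tags", []):
            # Assumed convention: ownership is carried in a tag such as "team:payments".
            if not tag.startswith("team:"):
                continue
            team = tag.split(":", 1)[1]
            channel = team_channels.get(team)
            if channel:
                # Teams without a configured channel are skipped, not broadcast to.
                routed[channel].append(problem["title"])
    return dict(routed)
```

The design point is that a problem without a matching team tag notifies nobody here, which is why consistent tagging of hosts, processes and services matters so much for this kind of setup.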

The CPU spike on that host probably makes you wonder: is it a problem with the infrastructure, the container or the actual app? When clicking into that root cause box in the problem ticket, we end up seeing all the captured data from that EC2 Linux machine. We clearly see that there are a lot of other processes and containers running on that same box, all competing for CPU, memory, disk and the network. It is very interesting to learn that the same box also runs Swagger (on Node.js), Kafka (two Jetty instances) and Filebeat (Go) besides the Tomcat that hosts the service under test:

The Dynatrace OneAgent gives full visibility into every process and container running on this EC2 Linux machine

A click on “Consuming processes” gives us a detailed CPU breakdown of all processes & containers on that machine – clearly highlighting that the cause is our Apache Tomcat process:

Dynatrace gives us key resource metrics for every process over time. Easy to spot that it was indeed Tomcat consuming all that CPU!

Now that we know the problem lies within Tomcat, it is easy to find out what caused it. Dynatrace provides several hotspot detection options such as the response time analysis. Code execution is the clear “winner” in this case, followed by two SQL statements:

Response time analysis highlights the top hotspots down to SQL statements, queue access, service calls or method execution.

As the SLA for their REST API endpoints is 200ms, it is interesting to learn why most of this time is consumed in code execution. More interesting is WHERE in the code the time is spent. Clicking on “Code execution” in the response time analysis infographic brings up the method hotspots view with a detailed breakdown. We have a winner: Hibernate!

44.9% of the total code execution time is spent retrieving data from the database through Hibernate.

The above screenshot shows the hotspot across all the requests that exceeded the 200ms SLA. This is very useful and makes it easier to analyze and fix hotspots that are impacting many transactions and not just individual requests.

Thanks to the Dynatrace PurePath technology, every single transaction is also available for inspection. Seeing the PurePath allows engineers to better understand the sequence of code execution which is very useful in distributed or asynchronous transactions.

Dynatrace PurePath gives you full insight into every single end-to-end transaction. Very useful for engineers to understand where time is spent!

Tip: I get a lot of questions from users who integrate Dynatrace with their load testing tools. Make sure to check out my blog on Load Testing Redefined or watch my Load Testing YouTube tutorial. If you want to learn more about diagnostics options with Dynatrace, check out Basic Diagnostics with Dynatrace.

What else can be done with Dynatrace?

While Venu and his team already leverage Dynatrace to speed up continuous feedback cycles to engineering as well as to the infrastructure team, there is more that can be done. Here are some additional ideas:

#1: Use the Dynatrace REST API to automate “sanity checks”:

How many containers are running vs. how many should run?

Do we run too many containers on a single host?
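The two sanity checks above could look roughly like the following Python sketch. It deliberately leaves out the step of fetching the data through the Dynatrace REST API, since the post does not go into endpoint details. The input shape (a list of image and host pairs), the expected counts and the per-host limit are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical input shape: one (image, host) pair per running container.
Container = Tuple[str, str]


def check_container_counts(
    running: List[Container], expected: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return image -> (running, expected) for every image whose count is off."""
    actual = Counter(image for image, _host in running)
    return {
        image: (actual.get(image, 0), want)
        for image, want in expected.items()
        if actual.get(image, 0) != want
    }


def overloaded_hosts(running: List[Container], max_per_host: int) -> List[str]:
    """Return the hosts that run more containers than the assumed limit."""
    per_host = Counter(host for _image, host in running)
    return sorted(host for host, n in per_host.items() if n > max_per_host)
```

Run as a pipeline step after each deployment, checks like these could fail the build early when, for example, a service came up with fewer containers than planned or the scheduler packed too many containers onto one EC2 instance.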

#2: Use the Dynatrace REST API to automate “deployment optimizations”:

Read up on my thoughts about Building the Unbreakable Delivery Pipeline

Learn how Dynatrace can be integrated into Atlassian DevOps Tools

If you have any further questions let us know. I’m happy to give you more insights into how Shift-Left can be done with Dynatrace and happy to share your own implementation. Just let me know!

The views expressed on this blog are those of the blog authors, and not necessarily those of ADP. The content on this blog is “as is” and carries no warranties. ADP does not warrant or guarantee the accuracy, reliability, and completeness of the content on this blog.