Introduction to DevOps Analytics

Discover the hidden value of using historical data to analyze and estimate a probabilistic measure of the success of a new release. This Refcard goes in-depth on the process to make the delivery lifecycle more efficient, the best analytics-as-a-service (AaaS) tools to use, and dashboards that will effectively help you display and understand your data.

Introduction to DevOps Analytics

Discover the hidden value of using historical data to analyze and estimate a probabilistic measure of the success of a new release. This Refcard goes in-depth on the process to make the delivery lifecycle more efficient, the best analytics-as-a-service (AaaS) tools to use, and dashboards that will effectively help you display and understand your data.

The Hidden Value of Your CI/CD Pipeline

Today’s world is driven by rapid change. Failing to keep up with the pace of the evolution of technology and business models might lead even a large corporation to collapse in the blink of an eye. DevOps allows software, and thus businesses, to increase the speed of development and adapt to an ever-changing market, maintaining a level of quality and reliability that meet customers’ expectations.

There is a lot of pressure on dev teams to release on time, and decreasing the time-to-market is essential in today’s highly competitive environment.

When something goes wrong with a software release, the costs associated can be overwhelming, going from urgently producing patches to rolling back to the previous version when possible, if not having to re-engineer or even redesign significant portions of the software.

What we lack is proper visibility into the development process that would allow us to estimate the risks attached to a software release correctly. We want to have historical data to analyze and estimate a probabilistic measure of the success of a new release.

The Continuous Integration and Continuous Delivery (CI/CD, for short) pipeline is the engine that drives the DevOps methodology and allows teams to streamline the iterations of software changes from the initial idea to the rollout to the entire customer base. With a fast and efficient CI/CD pipeline, all the people involved in the design, implementation, testing, and rollout of software changes can work together and communicate seamlessly at any stage.

The CI/CD pipeline produces a lot of logs and events during every stage, including audits and comments on the issue, reviews on the code changes, build logs and tests result up to the deployment, and production logs. The sheer amount and size of that information lead many organizations to stash or even discard most of them. The collection, storage, and analysis of that data allow us to understand and improve the entire E2E DevOps pipeline.

The aim of DevOps analytics is to uncover the hidden value of CI/CD logs and events and demonstrate various benefits at different levels.

Each employee needs to see what is relevant to them in a form that is suitable to their skills and understanding:

Release Engineer

Product Manager

Delivery Manager

Head of Engineering

Form

Deeply technical with deep links

Functional feature-based view

Timeline view of deliveries

Deliveries by dev teams

Value

Improve pipeline stability

Reduce time-to-market, maximize customer satisfaction

Prevent delays and minimize post-release issues

Maximize team cooperation, minimize risks related to a release

Section 2

Continuous Improvement DevOps Lifecycle

To implement DevOps analytics successfully, we should define an evolutionary process to focus on the final goal: speeding up the delivery cycle without compromising quality.

Let's walk through a real-life example. The project: development of the Gerrit Code Review Open-Source product; it has involved over 600 developers from all over the world with over 300 build jobs and millions of tests executed.

Section 3

The Process Explained, Step-by-Step:

1. Define goals and KPIs. Assess the current situation and define the parameters that you want to measure, typically based on time and quality. Start small, with a maximum of two KPIs per iteration. Example:

RT: Review Time: The interval between the first commit pushed and the merge into the main branch.

DQ: Deploy Quality: The percentage of features deployed divided by the number of bugs.

2. Extract logs and collect events. Understand what you have and collect it by keeping the metadata associated with the dimensions you are willing to measure. Example:

Commits and reviews: Extract all the version control code commits and the reviews activity associated.

3. Calculate KPIs and publish to dashboards. The amount of data accumulated needs to be transformed, processed, and re-aggregated on a constant basis. Example:

Get Gerrit Code Reviews metadata, and aggregate by Jira Issues associated with them.

Calculate the RT (average, p95, p99), DQ (average, p95, p99).

4. Make changes to the process. Identify which part of the process is more urgent to be addressed and improved. Focus on a single aspect to improve during each cycle. Example:

Improve RT by defining new policies

See if DQ increases or not

5. Assess the success of KPIs. Assess the change in trends for the next observation period after the change and see what has improved. Failures are part of learning.

6. Restart from Point 1.

Section 4

Batch and Stream Processing Capabilities

The CI/CD pipeline generates a colossal amount of data while requiring real-time collection and analysis of KPIs. Trying to analyze all the data at once would produce a lengthy and complicated job that would struggle to keep up with the pace of incoming data.

Think about the processing of the data as a funnel with a small opening. If too much liquid is poured into the funnel and the opening is not large enough, the liquid would overflow.

We need to get the data onto a distributed storage system to allow for parallel filtering and correlation tasks. We call this process the “batch processing layer” or BP. The incoming data is collected and stored in a time-series using a distributed storage implementation such as S3 or HDFS. BP is typically implemented using a parallel execution engine, such as YARN or Mesos, to run on a large-scale cluster to crunch all the historical data and distill the information that we need for analysis.

BP has a significant startup phase overhead because it needs to allocate the cluster, distribute the load, and run and collect the results. However, it does not need to run all the time because it works on a batch of data. Using a Cloud-based processing service such as Amazon Elastic MapReduce (aka EMR) would allow avoiding the setup and maintenance of the cluster and reducing costs by paying only for the amount of time needed for crunching the data and producing the results.

Secondly, we need to fill the gaps between batch executions. From when the BP starts and ends, new data is coming through, and we need to have a mechanism to adjust the results when new data comes in. A new “stream processing” layer — or SP — can manage this type of data flow. The amount of data is limited, and the way to transport and process the data is different.

SP needs to be activated on-demand as data is coming through and needs to be able to adjust the results obtained from BP with extra delta-results coming from the additional data received. Speed is more important than accuracy; the next BP phase can always adjust any small skew in the results. Data gets collected using a Pub/Sub mechanism such as Apache Kafka or Amazon Kinesis, and processing happens “on-the-fly” through stream-based elaboration.

BP and SP together are the data sources that collect the distilled data for further analysis.

Section 5

Data Dimensions

The CI/CD pipeline produces a lot of unstructured data that contains multiple facets. They all represent different views and aspects of what happens and how changes have evolved over time. When collecting the data, all these aspects need to be put into different axes that can be retrieved more efficiently during the analysis phase. We call each axis a dimension.

CI/CD pipeline data dimensions.

Time

All data is evidence of what has happened, and thus the timestamp is the first dimension we need when collecting and organizing the events.

The BP (Batch Processing Layer) can have a hierarchical data structure where data gets archived under the top-down folder structure of their associated timestamp.

Example: Year/month/day/hour.

Project

The project represents the set of repositories that are the source of the code or features produced, gives a vertical view of where the CI/CD pipeline is heading to, and groups all the events that relate to and influence its evolution.

Example of entities inside a single project: Jira projects, Confluence spaces, Jenkins pipelines, and a set of Git repositories.

System

The system identifies where the CI/CD pipeline is getting executed and gives a view of how the system performs and reacts based on the artifact produced.

Example of entities inside a single system dimension: Git Servers, Jenkins masters and slaves, Docker containers, and other deployment environments.

People

This represents the stakeholders and players that make sure that the CI/CD pipeline works, including the social links and organizational view of the people around the project.

Example of the people dimension: User identities, their organizational role, and group hierarchy.

Section 6

Log and Events Extractions

VCS Commit Info

The information of what code changes over time is extracted from the Version Control System (VCS). An advanced version control system, such as Git, contains the following information to be collected:

Field

Information associated

Commit SHA1

Global unique global identifier of the code change

Author

Who originally wrote the code

Committer

Who participated in the editing and merge of the code

Subject

The headline of what aspect of the functionality is involved

Issue ID

References to the feature associated with the code change

Branch

Line of development for all the features contained in a release

Dimensions involved: Time, repository, people.

Code Review Events

Extract and archive all interactions around the code, including feedback from other members of the team. When using Git, this translates to the extraction of the comments and associated score:

Field

Information associated

Commit SHA1

List of code changes associated with the review

Change/Pull-Request Id

Stream of commits associated to a feature under review

Reviewer

Other people involved in the validation of the code

Comment

Textual feedback provided on the overall change or individual lines of code

Label score

Overall voting (positive, negative, neutral) on one aspect of the change

CI Build Metadata and Logs

Stream and store the logs and metadata associated with all the builds triggered on your Continuous Integration system. Each build log contains precious information such as the metadata that helps correlate upstream code changes:

Field

Information associated

Commit SHA1

List of code changes associated with the build

Environment

Label of where the build gets executed

Stages execution times

How much time each stage of the build takes

Stage result

Success or failure of one of the stage of the build

Test results

Results of the execution of the code tests

Example: Jenkins CI has a logstash appender plugin that allows for streaming of the build logs and metadata to a topic stream. Test results are often published as JUnit XML results.

Issue Tracker Events

Sudit of the issue tracker transitions can tell where the time is spent in the early stages, even before the start of development.

The information captured is:

Field

Information associated

Project name

The high-level project that all the stories belong to

Issue ID

Unique ID of the feature inside the project

Headline

One-liner of what the story is about

Reporter / Assignee

The person that created and then worked on the story

Old/New Status

The transition of the story through the pipeline stages

The correlation of all the events with a single source story in the issue tracker allows the Product Manager to query the data from a business perspective while collapsing the project and source code view.

Section 7

Feeding Events Into a Data Lake

All the events happening on the CI/CD pipeline have different formats, availability, aggregation, and schemas. The first stage is the collection and storing of raw flat files in a time-based directory structure.

The traditional data warehouse model would approach the problem by developing some long and expensive ETL (Extract-Transform-Load) jobs that would feed a universal business intelligence multi-dimensional cube for analysis on a relational database. That approach would not scale well because of the following problems:

The DevOps Analytics lifecycle is based on continuous improvement: An ETL job would be forced to recalculate all the data from scratch every time it got new metrics.

The costs and time for maintaining the ETL job would impact the ROI of the analytics exercise.

A better approach is to split the Transformation phase (T) into two parts:

Small-T: Preserve the data as much as possible with maximum detail. Transformation is limited to the reduction of time-based granularity, ID normalization across systems, and data-type conversions.

Big-T: Join data from different sources and compute KPI values relevant to the business.

The E(small-t)L phase has the following characteristics:

Reduces disk space while keeping the same level of information.

Optimized for speed and selection.

The result is a massive reservoir of data called “Data Lake” that opens the door to further analysis in a very flexible and open way by all the other downstream analytics tools.

The Big-T phase, aka the KPI extraction, selects data from the Data Lake using a higher-order language and displays them into interactive dashboards.

Section 8

KPI Extractions

The goal of this phase is to distill information from the Data Lake and aggregate data across three dimensions: Projects, Systems, and People. Each of the target stakeholder roles need a different view of the same data on those dimensions.

Dashboards: Display and Understand Your Data

Data visualization is an easy way of radiating information and keep your KPIs visible, facilitating continuous monitoring, and focusing on the objectives set for you or your team.

It allows an easy way to compare data, spot patterns, and find correlations among different metrics.

Different information can be extracted depending on the number of data dimensions used:

1D (e.g.: pie charts, stacked bar charts): Gives an idea about the distribution of data point occurrence. It can be used, for example, to easily spot an uneven distribution of events. In the following graph it is easy to spot the predominance of commits from a single contributor in a project:

This can give an idea of the level of collaboration among the person in a team. A predominance of commits by one member of the team might indicate a poor collaboration and cohesion among the members of the team.

2D (e.g.: histograms): Show trends of the relation between 2 metrics, such time and the number of events. Peaks (sudden changes in the trends) and flatness over time can be meaningful patterns to spot. Here is an example showing the number of commits over time for a project:

Usually in a project, there are peaks of commits in the initial stage and around delivery times, the throughput is constant the rest of the time. A decrease in the throughput in the middle phase of a project might highlight a issue. Of course, this metric alone is not enough to reach a conclusion, it needs to be corroborated with other metrics, but it can ring an alarm bell.

3D (i.e.: heatmaps, treemaps): Show correlations among 3 metrics, such as time, commit count, and author. Here is an example:

This can show the contribution of a single team member over time. A 1D graph would give a snapshot at the time of this situation, while a 3D one can show the evolution over time. It can be used to spot changes in team activities and behaviors.

Since these metrics can inform measurements about team dynamics and code quality, KPIs can be set and easily monitored.

There are several open-source tools to achieve this; for example, Kibana or Tableau. The examples shown before were built using the former. In Kibana, a dashboard is a collection of visualisations. This gives the flexibility of composing different dashboards for different needs, reusing in components. Kibana is highly integrated with Elastisearch and Logstash, and the 3 are referred together as the ELK stack for log analytics.

Section 10

Tools and AaaS Solutions

All the techniques and tools described so far are freely available in the open-source community or on commercial standalone applications.

Open-Source vs. Commercial Tools

See below for a sample collection of open-source vs. commercial tools you may use to implement a successful DevOps Pipeline with Analytics:

Either choice for any category would require people with the necessary skills to build, install, and maintain the entire software stack and keep it active on a 24x7 basis.

The Skills Gap Problem

For large companies with a dedicated team configuring and managing the entire pipeline, developing the skills to master all the tools is a plus. However, it is always better to start the iterations with something pre-configured or ready “out-of-the-box” to provide useful insights into the CI/CD pipeline from the very beginning. Recent studies highlighted that only 1 in 5 businesses have the skills needed to start and complete a successful DevOps Analytics implementation by themselves.

While investing in training your team and learning new tools, you need something to use as a reference to learn and iterate on the identification of what are the most relevant KPIs for your DevOps analytics improvement lifecycle.

Analytics-as-a-Service

Analytics-as-a-Service (aka AaaS) comes to the rescue, where you can get started very quickly and leverage your existing assets and data and, at the same time, learn new concepts, train your team without impacting your costs, and keep your ability to follow your BAU and project activities.

AaaS tools are cloud-based services that provide you with the final result of what your main stakeholders need, updated in real-time and in the form that your users need. You can get AaaS regardless of whether your services are currently running on a public/private cloud or are deployed on your own data centers. AaaS refers in fact to the analysis and publication of the results of the data coming from your systems, but do not require you to run or operate them.

That is great because it allows you to keep your existing CI/CD pipeline infrastructure while giving you exactly what is missing: the insights on how to improve it.