Git protocol v2 landed in Gerrit 3.1 on the 11th of October 2019. This is the last email from David Ostrovsky concluding a thread of discussion about it:

It is done now. Git wire protocol v2 is a part of open source Gerrit and will beshipped in upcoming Gerrit 3.1 release.
And, it is even enabled per default!
Huge thank to everyone who helped to make it a reality!

A big thanks to David and the whole community for the hard work in getting this done!

This was the 3rd attempt to get the feature in Gerrit after a couple of issues encountered along the path.

Why Git protocol v2?

The Git protocol v2 introduces a big optimization in the way client and server communicate during clones and fetches.

The big change has been the possibility of filtering server-side the refs not required by the client. In the previous version of the protocol, whenever a client was issuing a fetch, all the references were sent from the server to the client, even if the client was fetching a single ref!

In Gerrit this issue was even more evident, since, as you might know, Gerrit leverages a lot the refs for its internal functionality, even more with the introduction of NoteDb.

Whenever you are creating a Change in Gerrit you are updating/creating at least 3 refs:

refs/changes/NN/<change-num>/<patch-set>

refs/changes/NN/<change-num>/meta

refs/sequences/changes

In the Gerrit project itself, there are currently about 104K refs/change and 24K refs/change/*/meta. Imagine you are updating a repo which is behind just a couple of commits, you will get all those references which will take up most of your bandwidth.

Git protocol v2 will avoid this, just sending you back the references that the Git client requested.

Is it really faster?

Let’s see if it really does what is written on the tin. We have enabled Gerrit v2 at the end of 2019 on GerritHub.io, so let’s test it there. You will need a Git client from version 2.18 onwards.

We have already discussed in previous posts how important it is to speedup the feedback loop in your Software Development Lifecycle. Having early feedbacks gives you the chance of evaluating your hypothesis and eventually change direction if needed.

The more information you have, the smarter can be your decisions.

We recently added in our Gerrit DevOps Analytics the possibility of extracting data coming from Code Reviews’ metadata to extend the knowledge we can get out of Gerrit.

Furthermore, it is possible to extract meta-data from repositories not necessarily hosted on the Gerrit instance running the analytics processing. This is a big improvement since it allows to fully analyse repositories coming from any Gerrit server.

Hashtags aggregation

One important type of meta-data contained in the Code Reviews is the hashtag.

Hashtags are freeform strings associated with a change, like on social media platforms. In Gerrit, you explicitly associate hashtags with changes using a dedicated area of the UI; they are not parsed from commit messages or comments.

Similar to topics, hashtags can be used to group related changes together and to search using the hashtag: operator. Unlike topics, a change can have multiple hashtags, and they are only used for informational grouping; changes with the same hashtags are not necessarily submitted together.

You can use them, for example, to mark and easily search all the changes blocking a particular release:

Hashtags can also be used to aggregate all the changes people have been working on during a particular event, for example, the Gerrit User Summit 2019 hackathon:

The latest version of the Gerrit Analytics plugin exposes the hashtags attached to their respecting Git commit data. Let’s explore together some use cases:

The most popular Gerrit Code Review hashtags over the last 12 months

Throughput of changes created during an event

see for example the Palo alto hackathon (#palo-alto-2018). We can see at the end of the week the spike of changes to release Gerrit 2.16.

The extend of time for a feature

Removing GWT was an extensive effort which started in 2017 and ended in 2019. It took several hackathons to tackle the removal as shown by the hashtags distribution. Some changes were started in one hackathon and finalised in the next one.

Those were some example of useful information on how to leverage the power of GDA.

The above examples are taken from the GDA dashboardprovided and hosted by GerritForge on https://analytics.gerrithub.io which mirror commits and reviews on a regular basis from the Gerrit project and its plugin ecosystem.

How to setup GDA on Gerrit

Hashtag extraction is currently available from Gerrit 3.1 onwards. You can download the latest version released from the Gerrit CI.

To enable hashtag extraction you need to enable the feature extraction in the plugin config file as follow:

# analitycs.config[contributors] extract-hashtags =true

For more information on how to configure and run the plugin, look at the analytics plugin documentation.

Conclusion

Data is the goldmine of your company. You need more and more of it for making smarter decision. The latest version of the GDA allows you to leverage even more data produced during the code review process.

You can explore the potential of the information held in Gerrit on the analytics dashboard provided by GerritForge on analytics.gerrithub.io.

As a Gerrit administrator, making sure there is no performance impact while upgrading from one version to another can be difficult.

It is essential to:

have a smooth and maintainable way to reproduce traffic profiles to stress your server

easily interpret the results of your tests

Tools like wrk and ab are simple and good to run simple benchmarking tests, but when it comes to more complex scenarios and collection of client-side metrics, they are not the best tools to use.

Furthermore, they only support Close Workload Models, which might not always fit the behaviour of your system.

For those reasons in GerritForge we started to look at more sophisticated tools, and we started adopting Gatling, an open-source load testing framework.

The tool

The Gatling homepage describes it this way:

“Gatling is a highly capable load testing tool. It is designed for ease of use, maintainability and high performance…

Out of the box, Gatling comes with excellent support of the HTTP protocol…..

As the core engine is actually protocol-agnostic, it is perfectly possible to implement support for other protocols…

Based on an expressive DSL, the scenarios are self-explanatory. They are easy to maintain and can be kept in a version control system…”In this article, we focus on the maintainability and protocol agnosticism of the tool.

What about Git?

Gatling natively supports HTTP protocol, but since the core engine is protocol-agnostic, it was easy to write an extension to implement the Git protocol. I started working on the Gatling git extension in August during a Gerrit hackathon in Sweden, and I am happy to see that is starting to get traction in the community.

This way we ended up among the official Gatling extension on the official Gatling homepage:

The code of the git extension if opensource and free to use. It can be found here, and the library can be downloaded from Maven central. In case you want to raise a bug, you can do it here.

Maintainability

Gatling is written in Scala, and it expects the load tests scenarios to be written in Scala. Don’t be scared; there is no need to learn crazy functional programming paradigms, the Gatling DSL does a good job in abstracting the underneath framework. To write a scenario you just have to learn the building blocks made available by the DSL.
Here a couple of snippets extracted from a scenario to understand the DSL is:

Jenkins integration

Running load tests can be tedious and time-consuming, but yet essential to spot any possible performance regression in your application.
Providing the least possible friction is essential to incentivize people in running them. If you are already using Jenkins in your company, you can leverage the Gatling plugin to scale your load quickly and provide easy access to metrics.

A real use case: Gerrit v3.0 Vs Gerrit v3.1 load test…in production!

Let’s go through a real case scenario to show how useful and easy to read are the metrics provided by Gatling.

The closest environment to your production one is…production!

gerrithub.io runs Gerrit in a multi-site configuration, and this gives us the luxury of doing canary releases, only upgrading a subset of the master nodes running Gerrit. One of the significant advantages is that we can run A/B tests in production.

That allows us also to run meaningful load tests against the production environment. See below a simplified picture of our Gerrit setup where it is possible to see the canary server with a higher Gerrit version.

We ran against 2 servers in Germany the same load tests which:

Create 100 chained Change Sets via REST API

Submit all the changes together via REST API

We then compared the server-side and client-side metrics to see if a good job has been done with the latest Gerrit version.

Server-side metrics

We can see the improvement in the latest Gerrit version. The mean requests time is almost halved and, of course, the overall duration is decreased.

Client-side metrics

Let’s see what is going on on the client-side using the metrics provided by Gatling.

Among all the metrics we are going to focus on one step of the test, since we have more data points about it, the creation of the change:

We can already see from the overall report the reduction of the response time in the latest Gerrit version:

If we look in-depth to all the response times, we can see that the distribution of the response times is pretty much the same, but the scale is different….again we confirmed the result we previously encountered.

What can we say…Good job Gerrit community!

Wrapping up

(You can see my presentation about Gatling in the last Gerrit User summit in Sunnyvale here <- add this when the talk will be sharable)

I have touched superficially several topics in this blog posts:

simplicity and maintainability provided by the Gatling DSL

Integration with Jenkins

Gatling extensions and reuse of the statistic engine

Example scenarios

I would like to write more in-depth about all these topics in some follow-up blog posts. Feel free to vote for the topic you are more interested in or suggest new ones in the comments section of this post.
If you need any help in setting up your scenario or understand how to run load tests against your Gerrit installation effectively, GerritForge can help you.

Jenkins and Gerrit are the most critical components of the DevOps Pipeline because of their focus on people (the developers), their code and collaboration (the code review) their builds and tests (the Jenkinsfile pipeline) that produce the value stream to the end user.

Accelerate the CI/CD pipeline

DevOps is all about iteration and fast feedback. That can be achieved by automating the build and verification of the code changes into a target environment, by allowing all the stakeholder to have early access to what the feature will look like and validating the results with speed and quality at the same time.

Every development team wants to make the cycle time smaller and spend less time in tedious work by automating it as much as possible. That trend has created a new explosion of fully automated processes called “Bots” that are more and more responsible for performing those tasks that developers are not interested in doing manually over and over again.

As a result, developers are doing more creative and design work, are more motivated and productive, can address technical debt a lot sooner and allow the business to go faster in more innovative directions.

As more and more companies are adopting DevOps, it becomes more important to be better and faster than your competitors. The most effective way to accelerate is to extract your data, understand where your bottlenecks are, experiment changes and measure progress.

By looking at the protocol and code statistics, we founded out that bots are much more hard worker than humans on GerritHub.io, which hosts, apart from the Gerrit Code Review mirrored projects, also many other popular OpenSource.

That should not come as a surprise if you think of how many activities could potentially happen whenever a PatchSet is submitted in Gerrit: style checking, static code analysis, unit and integration testing, etc.

We also noticed that most of the activities of the bots are over SSH. We started to analyze what the Bots are doing and see what the impact is on our service and possibly see if there are any improvements we can do.

Build integration, the wrong way

GerritHub has an active site with multiple nodes serving read/write traffic and a disaster recovery site ready to take over whenever the active one has any problem.

Whenever we roll out a new version of Gerrit, using the so-called ping-pong technique, we swap the roles of the two sites (see here for more details). Within the same site, also, the traffic can jump from one to the other in the same cluster using active failover, based on health, load and availability. The issue is that we end up in a situation like the following:

The “old” instance still served SSH traffic after the switch. We noticed we had loads of long-lived SSH connections. These are mostly integration tools keeping SSH connections open listening to Gerrit events.

Long-lived SSH connections have several issues:

SSH traffic doesn’t allow smart routing. Hence we end up with HTTP traffic going on the currently active node and most of the SSH one still on the old one

There is no resource pooling since the connections are not released

There is the potential loss of events when restarting the connections

That impacts the overall stability of the software delivery lifecycle, extending the feedback loop and slowing your DevOps pipeline down.

Then we started to drill down into the stateful connections to understand why they exist, where are they coming from and, most importantly, which part of the pipeline they belong to.

Jenkins Integration use-case

The Gerrit Trigger plugin for Jenkins is one of the integration tools that has historically been suffering from those problems, and unfortunately, the initial tight integration has become over the years less effective, slow and complex to use.

Configuring the Gerrit Trigger Plugin is more error-prone because requires a lot of parameters and settings.

Systems dependencies

Tightly Coupled with Gerrit versions and plugins.

Uses Gerrit as a generic Git server, loosely coupled.

Upgrade of Gerrit might break the Gerrit Trigger Plugin integration.

Gerrit knowledge

Admin: You need to know a lot of Gerrit-specific settings to integrate with Jenkins.

User. You only need to know Gerrit clone URL and credentials.

The Gerrit Trigger plugin requires special user and permissions to listen to Gerrit stream events.

Fault tolerance to Jenkins restart

Missed events: unless you install a server-side DB to capture and replay the events.

Transparent: all events are sent as soon as Jenkins is back.

Gerrit webhook automatically tracks and retries events transparently.

Tolerance to Gerrit rolling restart

Events stuck: Gerrit events are stuck until the connection is reset.

Transparent: any of the Gerrit nodes active continue to send events.

Gerrit trigger plugin is forced to terminate stream with a watchdog, but will still miss events.

Differentiate actions per stage

No flexibility to tailor the Gerrit labels to each stage of the Jenkinsfile pipeline.

Full availability to Gerrit labels and comments in the Jenkinsfile pipeline

Multi-branch support

Custom: you need to use the Gerrit Trigger Plugin environment variables to checkout the right branch.

Native: integrates with the multi-branch projects and Jenkinsfile pipelines, without having to setup anything special.

Gerrit and Jenkins friends again

After so many years of adoption, evolution and also struggles of using them together, finally Gerrit Code Review has the first-class integration with Jenkins, liberating the Development Team from the tedious configuration and BAU management of triggering a build from a change under review.

Jenkins users truly love using Gerrit and the other way around, friends and productive together, again.

Conclusion

Thanks to Gerrit DevOps Analytics (GDA) we managed to find one of the bottlenecks of the Gerrit DevOps Pipeline and making changes to make it faster, more effective and reliable than ever before.

In this case, by just picking the right Jenkins integration plugin, your Gerrit Code Review Master Server would run faster, with less resource utilization. Your Jenkins pipeline is going to be simpler and more reliable with the validation of each change under review, without delays or hiccups.

The Gerrit Code Review plugin for Jenkins is definitively the first-class integration to Gerrit. Give it a try yourself, you won’t believe how easy it is to set up.

Accelerating your time to market while delivering high-quality products is vital for any company of any size. This fast pacing and always evolving world relies on getting quicker and better in the production pipeline of the products. The whole DevOps and Lean methodologies help to achieve the speed and quality needed by continuously improving the process in a so-called feedback loop. The faster the cycle, the quicker is the ability to achieve the competitive advantage to outperform and beat the competition.

It is fundamental to have a scientific approach and put metrics in place to measure and monitor the progress of the different actors in the whole software lifecycle and delivery pipeline.

Gerrit DevOps Analytics (GDA) to the rescue

We need data to build metrics to design our continuous improvement lifecycle around it. We need to juice information from all the components we use, directly or indirectly, on a daily basis:

SCM/VCS (Source and Configuration Management, Version Control System)how many commits are going through the pipeline?

Code Review
what’s the lead time for a piece of code to get validated?
How are people interacting and cooperating around the code?

Issue tracker (e.g. Jira)
how long does it take the end-to-end lifecycle outside the development, from idea to production?

Getting logs from these sources and understanding what they are telling us is fundamental to anticipate delays in deliveries, evaluate the risk of a product release and make changes in the organization to accelerate the teams’ productivity. That is not an easy task.

Gerrit DevOps Analytics (aka GDA) is an OpenSource solution for collecting data, aggregating them based on different dimensions and expose meaningful metrics in a timely fashion.

GDA is part of the Gerrit Code Review ecosystem and has been presented during the last Gerrit User Summit 2018 at Cloudera HQ in Palo Alto. However, GDA is not limited to Gerrit and is aiming at integrating and processing any information coming from other version control and code-review systems, including GitLab, GitHub and BitBucket.

Case study: GDA applied to the Gerrit Code Review project

One of the golden rules of Lean and DevOps is continuous improvement: “eating your dog food” is the perfect way to measure the progress of the solution by using its outcome in our daily life of developing GDA.

As part of the Gerrit project, I have been working with GerritForge to create Open Source tools to develop the GDA dashboards. These are based on events coming from Gerrit and Git, but we also extract data coming from the CI system, the Issue tracker. These tools include the ETL, for the data extraction and the presentation of the data.

As you will see in the examples Gerrit is not just the code review tool itself, but also its plugins ecosystem, hence you might want to include them as well into any collection and processing of analytics data.

Wanna try GDA? You are just one click away.

We made the GDA more accessible to everybody, so more people can play with it and understand its potentials. We create the Gerrit Analytics Wizardplugin so you can have some insights in your data with just one click.

What you can do

With the Gerrit Analytics Wizard you can get started quickly and with only one click you can get:

Initial setup with an Analytics playground with some defaults charts

Populate the Dashboard with data coming from one or more projects of your choice

The full GDA experience

When using the full GDA experience, you have the full control of your data:

Schedule recurring data imports. It is just meant to run a one-off import of the data

Create a production ready environment. It is meant to build a playground to explore the potentials of GDA

One click to crunch loads of data

Once you have Gerrit and the GDA Analytics and Wizard plugins installed, chose the top menu item Analytics Wizard > Configure Dashboard.

You land on the Analytics Wizard and can configure the following parameters:

Dashboard name (mandatory): name of the dashboard to create

Projects prefix (optional): prefix of the projects to import, i.e.: “gerrit” will match all the projects that are starting with the prefix “gerrit”. NOTE: The prefix does not support wildcards or regular expressions.

Date time-frame (optional): date and time interval of the data to import. If not specified the whole history will be imported without restrictions of date or time.

Username/Password (optional): credentials for Gerrit API, if basic auth is needed to access the project’s data.

Sample dashboard analytics wizard page:

Once you are done with the configuration, press the “Create Dashboard” button and wait for the Dashboard, tailored to your data, to be created (beware this operation will take a while since it requires to download several Docker images and run an ETL job to collect and aggregate the data).

At the end of the data crunching you will be presented with a Dashboard with some initial Analytics graphs like the one below:

You can now navigate among the different charts from different dimensions, through time, projects, people and Teams, uncovering the potentials of your data thanks to GDA!

What has just happened behind the scenes?

When you press the “Create Dashboard” button, loads of magic happens behind the scenes. Several Docker images will be downloaded to run an ElasticSearch and Kibana instance locally, to set up the Dashboard and run the ETL job to import the data. Here a sequence workflow to illustrate the chain of events is happening:

Conclusion

Getting insights into your data is so important and has never been so simple. GDA is an OpenSource and SaaS (Software as a Service) solution designed, implemented and operated by GerritForge. GDA allows setting up the extraction flows and gives you the “out-of-the-box” solution for accelerating your company’s business right now.

Contact usif you need any help with setting upa Data Analytics pipeline or if you have any feedback about Gerrit DevOps Analytics.

Time has come to migrate gerrithub.io to the latest Gerrit v2.16, from the outdated v2.15 we had so far. The big change between the two is the full adoption of NoteDB: the internal Gerrit groups were still kept in ReviewDb on v2.15, which forced us to keep a PostgreSQL instance active in production. This means we can finally say goodbye to the ReviewDb 👋 and eliminated yet another SPoF (Single-Point-of-Failure) from the GerritHub high-availability infrastructure.

Migrating to Gerrit v2.16 implies:

Gerrit WAR upgrade

GIT repos upgrade because of a change in the NoteDb format

Change in the database used, from PostgreSQL to H2 (for the schema_version)

Introduction of the new Projects index

The above is a quite complex process and, here at GerritForge, we executed the migration on a running GerritHub.io with 15k of active users avoiding any downtime during the migration.

Architecture

This is the initial architecture we are starting the GerritHub.io v2.15 migration from:

In this setup, we have 2 sites, one in Canada (active) and one in Germany (active for analytics and disaster recovery). The latter is aligned with the active master via replication plugin.

The HA Plugin used between the 2 Canadian nodes is a GerritForge fork enhanced with the ability to align the Lucene Indexes, Caches and Events when sharing repositories via NFS with caching enabled.

NOTE: The original High-Availability plugin is certified and tested on Gerrit v2.14 / ReviewDb only and requires the use of NFS without caching, which requires a direct fiber-channel connection between the Gerrit nodes the disks.

The traffic is routed with HAProxy to the active node. This allows us easy code migrations with no downtimes, using what we call the “ping-pong” technique between the Canadian and the German site, which is inspired by the classical Blue/Green deployment with some adjustments for the peculiarities of the Gerrit multi-site setup.

The migration pattern, in a nutshell, is composed of the following phases:

Upgrade code in Germany
The Gerrit site in Germany is used for Analytics and thus can be upgraded first with low risk associated.
German site -> passive, Canadian site -> active

Redirect traffic in GermanyOnce the site in Germany is ready and warmed up, the GerritHub users are redirected to it. GerritHub is technically serving the v2.16 user-experience to all users.
German site -> active, Canadian site -> passive

Upgrade code in CanadaThe site in Canada is put offline and upgraded as well.
German site -> active, Canadian site -> passive

Redirect traffic back to CanadaOnce the site in Canada is fully ready and warmed up, the entire user-base is redirected back.
German site -> passive, Canadian site -> active

Each HAProxy has the same configuration with a primary and 2 backups as follow:

Timeline of events – 2nd of Jan 2019

2/1/2019 – 8:00 GMT: Starting point of the GerritHub configuration

Review-1 – Gerrit 2.15 – active node

Review-2 – Gerrit 2.15 – failover node

Review-DE – Gerrit 2.15 – analytics node, used for disaster recovery

2/1/2019 – 10:10 GMT: Upgrade disaster recovery server

Stopped all services using Gerrit on review-de (we use the disaster recovery to crunch and serve the analytics dashboard)

Disabled replication plugin

Stopped Gerrit 2.15 and upgraded to Gerrit 2.16

Restarted Gerrit

2/1/2019 – 10:44 GMT: Re-enabled disaster recovery server

Re-Enabled replication from review 1…boom!

First issue: mirror option of the replication plugin was set to true, hence all the branches containing the groups on the All-Users repo been dropped from the recovery server. All the Groups were suddenly gone from the disaster recovery server

Remove mirror option in replication plugin

Re-Enabled replication from review-1…this time everything was ok!

Migration re-executed and everything was fine

2/1/2019 – 11:00 GMT: Removed ReviewDB

Once we were happy with the replication of the Groups we could remove PostgreSQL

The only information left outside NoteDB is the schema_version table, which contains only one row and it is static. We moved it into H2 by copying the DB from a vanilla 2.16 installation and changing Gerrit Config to use it.

Before the next step, we had to wait for the online reindexing on review-de to finish (~2 hours).

Note: we didn’t consider offline reindexing since it is basically sequential, and it would have been way slower compared to the concurrent online one. Additionally, it does not compute all the Prolog rules in full.

2/1/2019 – 15:15 GMT: Reduce delta between masters

Reducing the delta of data between the 2 sites (Canada and Germany) will allow having a shorter read-only window when upgrading the currently active master

Manually replicate and reindex misaligned repositories on review-de (see below the effect on the system load)

Pro tip: if you want to check queue status to see, for example, if the replication is still ongoing this command can be used:

2/1/2019 – 15:50 GMT: Final master catchup

Service degraded for few minutes (i.e.: Gerrit was read-only), but most of the operations were available, i.e.: Gerrit index/query/plugin/version, git-upload-pack, replication

Waited for review-de to catch up with the latest changes that come in review-1 (we monitored it using the above “gerrit show-queue” command)

2/1/2019 – 15:54 GMT: Made disaster recovery active

Changed HAProxy configuration, and reloaded, to re-direct all the traffic to review-de, which become the active node in the cluster

See the transition of the traffic to review-de

Left review-de the whole night as the primary site. This way we also tested the disaster recovery site stability

2/1/2019 – 19:47 GMT: Upgrade review-1 and review-2 to Gerrit 2.16

Stopped Gerrit 2.15 and upgraded to Gerrit 2.16

Wait for offline reindexing of Projects, Accounts and Groups

Started with Gerrit 2.16 with online reindexing of the changes

It was possible to see an expected increase in the system load due to the reindexing, lasted for about 2 hours:

Furthermore, despite review-1 not being the active node, the HTTP workload grew disproportionately:

This was due to a well-known issue of the high-availability plugin, where the reindexing are forwarded to the passive nodes, creating an excessive workload on them.

3/1/2019 – 10:14 GMT: Made review 1 active

We used the same pattern used when upgrading review-de to align the data between masters

Changed HAProxy configuration, and reloaded, to re-direct back all the traffic to review-1

Conclusions

Migration was completed and production is back to stable again with the latest and greatest Gerrit v2.16.2 and the full PolyGerrit UI. With the migration of the Groups in NoteDB, ReviewDB leaves the stage completely to NoteDB. PostgreSQL is no more needed, simplifying the overall architecture.

The migration itself was quite smooth, the only issue was due to a plugin misconfiguration, nothing to have with Gerrit core. With the good monitoring we have in place, we managed to spot the issues straight away. Still, we will further automate our release process to avoid these issues from happening again.