Factory Efficiency in Citrix XenServer

Every British school kid studies the Industrial Revolution. Those who stay awake long enough learn about the principle of Division of Labour. My daughter recently studied this, and complained to me about sitting through a tedious lesson on the famous example of pin manufacturing as recounted by the 18th-century economist Adam Smith [1] (a native of that fine place, Kirkcaldy in the Kingdom of Fife). Division of labour is of course just one of many practices and techniques refined over the centuries that make up the body of knowledge behind manufacturing science and engineering. Much of this is geared towards optimising production lines, keeping throughput high and the cost of production low.

Strict division of labour in software production is, in my experience, not necessarily a good idea – product managers, architects, developers and testers all have different disciplines and distinct roles to play, but building silos between them and allowing (or worse, demanding) strict hand-offs between them is demonstrably inefficient. Much of the agile manifesto, for example, is aimed at breaking down silos such as these, which cause untold damage to the efficiency and quality of software production.

Our strategy in Citrix XenServer Engineering has been to characterise our software production pipeline, to identify the aspects of it that matter most to us, to measure our performance in those aspects and to use the insights this affords us to drive continuous improvements.

We refer to this as our “factory efficiency”.

Software Pipeline

The XenServer software pipeline is fairly standard in most ways – we have requirements (both for new features and for system characteristics like performance); we translate these into designs; we develop and test new code, and so on.

One interesting aspect of this pipeline is our Continuous Integration (CI) methodology. Three years ago, during the development of XenServer 6.0, this comprised a 2-hour automated Branch Verification Test (BVT) on every build and a 16-hour automated “Nightly” system test run several times a week. This scheme had served us well over a number of years, but as XenServer grew in complexity and the demands on the team escalated, this CI model began to break down. There was insufficient short-cycle unit testing by developers, with everyone instead relying on the long-cycle automated CI system tests. However, the increasing complexity of XenServer meant that triaging the CI runs was becoming ever more difficult – the time it took to analyse test failures and raise bugs against development teams was growing, and the quality of those bug reports was decreasing. This situation led to a near-catastrophic breakdown in factory efficiency:

the mean time between developers unwittingly injecting a bug and getting a bug report back from CI grew to over a week, meaning developers had in the meantime ample opportunity to further modify the code with the bug in it, and/or had lost the environment or context in which they’d injected it in the first place.

the time spent by developers chasing spurious bugs was disproportionate and kept them from fixing the bugs that mattered.

The first remedial action we took was to introduce more short-cycle testing closer to the point where new code was submitted. We also required the developers (not QA) to own this. So between the 2-hour BVT and the 16-hour Nightly, we introduced 4-hour automated Branch Safety Test (BST) runs, tailored to the needs of each individual team and run on their own private branches before merging to the mainline.
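To picture the resulting three-tier arrangement, here is a minimal Python sketch that simply records the tiers described above as data. The class name, field names, event labels and the `tiers_for` helper are all hypothetical, chosen for illustration; they are not taken from our actual CI system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CiTier:
    """One tier of automated testing (hypothetical structure, for illustration)."""
    name: str
    duration_hours: int
    event: str        # hypothetical label for what triggers the run
    description: str


# Durations and triggers as described in the text; the labels are assumptions.
TIERS = [
    CiTier("BVT", 2, "build", "run on every build"),
    CiTier("BST", 4, "pre_merge",
           "run on a team's private branch before merging to the mainline"),
    CiTier("Nightly", 16, "scheduled", "system test run several times a week"),
]


def tiers_for(event):
    """Return the names of the tiers triggered by a given (hypothetical) event."""
    return [tier.name for tier in TIERS if tier.event == event]


for tier in sorted(TIERS, key=lambda t: t.duration_hours):
    print(f"{tier.name}: {tier.duration_hours}h, {tier.description}")
```

The point of the ordering is that the shorter a test cycle is, the closer it sits to the moment the code is written.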

But in a sense this was just a band-aid. What we really needed to do was attack the root cause problem.

Efficiency Metrics

To do this we identified two key efficiency metrics:

Nightly Signal-Noise Ratio (SNR) – the proportion of defects arising from the Nightly that resulted in production code fixes

Nightly Triage Turnaround Time (TTR) – the elapsed time between a test case failure in the Nightly and a bug report being sent to a developer.
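As a rough illustration of how these two metrics could be computed from a list of bug reports, here is a minimal Python sketch. The `NightlyBugReport` record, its field names and the choice of the median as the TTR summary are assumptions made for this example; they do not describe our actual tooling or bug database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class NightlyBugReport:
    """Hypothetical record of one bug raised from a Nightly run."""
    failed_at: datetime     # when the test case failed in the Nightly
    filed_at: datetime      # when the bug report was sent to a developer
    led_to_code_fix: bool   # True if it resulted in a production code fix


def nightly_snr(reports):
    """SNR: proportion of Nightly bug reports that led to production code fixes."""
    if not reports:
        return None
    return sum(r.led_to_code_fix for r in reports) / len(reports)


def ttr_hours(report):
    """TTR for one report: hours elapsed from test failure to bug report."""
    return (report.filed_at - report.failed_at).total_seconds() / 3600


def median_ttr_hours(reports):
    """Summarise TTR over many reports (the median is an assumed choice)."""
    if not reports:
        return None
    return median(ttr_hours(r) for r in reports)


# Made-up example data: one failure time, four reports filed at different delays.
failure = datetime(2014, 1, 1, 2, 0)
sample = [
    NightlyBugReport(failure, failure + timedelta(hours=6), True),
    NightlyBugReport(failure, failure + timedelta(hours=30), False),
]
print(nightly_snr(sample), median_ttr_hours(sample))
```

A higher SNR means fewer rejected reports reaching developers; a lower TTR means the feedback arrives while the change is still fresh.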

This chart, produced by Dan from the XenServer QA team, shows the trends of both of these metrics over the last year:

The SNR is represented by the green and red dots and the wavy blue line. Each red dot is a bug report rejected by development as a duplicate or “not a bug”, each green dot is one accepted as a genuine problem. As you can see, the SNR has been steadily increasing (more green dots than red, and wavy blue line trending upwards). This improvement has been due to various quality initiatives that we introduced to improve triage practices and to train our staff.

The TTR is represented by the distance of each dot from the x-axis. As you can see, the TTR has been reducing steadily. Most bug reports are now filed within 1 day of a Nightly test run.
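A trend line like the wavy blue one can be approximated by taking the SNR over a sliding window of the most recent bug reports. The sketch below is one plausible way to do that in Python; the windowing approach, the function name and the default window size are assumptions, and the chart itself may have been produced differently.

```python
from collections import deque


def rolling_snr(outcomes, window=20):
    """Rolling SNR over the last `window` bug reports.

    `outcomes` is a sequence of booleans in filing order: True for a report
    accepted as a genuine problem (a green dot), False for one rejected as a
    duplicate or "not a bug" (a red dot).
    """
    recent = deque(maxlen=window)   # old outcomes fall off automatically
    trend = []
    for accepted in outcomes:
        recent.append(accepted)
        trend.append(sum(recent) / len(recent))
    return trend


# Made-up outcomes: rejections early on, acceptances later.
print(rolling_snr([False, False, True, False, True, True, True], window=3))
```

A rising sequence corresponds to the upward trend described above: more of the recent reports are being accepted as genuine.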

Taken together, this means that XenServer factory efficiency has improved – developers get quick feedback on newly introduced bugs and regressions and can therefore more easily fix them, and developers waste less time on bad bug reports. The overall Nightly pass rates are now maintained at a consistently high level. If these drop, then the production line stops and everyone knows it is their first duty to restore the pass rate and get the belt moving again.
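The "stop the line" rule amounts to a simple threshold check on the Nightly pass rate. Here is a minimal Python sketch of that idea; the 95% threshold and the function names are hypothetical, picked only to make the example concrete.

```python
def nightly_pass_rate(passed, total):
    """Fraction of Nightly test cases that passed."""
    if total <= 0:
        raise ValueError("total must be a positive number of test cases")
    return passed / total


def line_is_stopped(passed, total, threshold=0.95):
    """True if the pass rate has dropped below the (hypothetical) threshold,
    meaning restoring the pass rate takes priority over new work."""
    return nightly_pass_rate(passed, total) < threshold


# Made-up numbers for two Nightly runs.
for passed, total in [(980, 1000), (940, 1000)]:
    state = "STOP" if line_is_stopped(passed, total) else "running"
    print(f"{passed}/{total} passed: line {state}")
```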

This is very different from the situation during the development of XenServer 6.0, and it is no coincidence that we recently delivered the highest quality XenServer release ever. By making the factory more efficient we both saved many engineer-hours of effort, which were instead put to more valuable use in improving product quality, AND ensured that we got more value from system test by allowing it to focus on gnarly and complex system-level problems rather than being buried under a mountain of simpler issues that should have been fixed further ‘upstream’.

What next?

You can’t control what you can’t measure.

Our strategy in Citrix XenServer Engineering has been to characterise our software production pipeline, to identify the aspects of it that matter most to us, to measure our performance in those aspects and to use the insights this affords us to drive continuous improvements.

I hope this blog has demonstrated the efficacy of this strategy.

In the interests of brevity, this blog has focused on one particular set of metrics. We track others, including notably various metrics relating to our ability to respond to support escalations and to provide patches and hotfixes. These have also led to organisational and process changes which have improved our performance in these areas.

XenServer serves as a platform for Citrix XenDesktop, XenApp and CloudPlatform, as well as being a core component of NetScaler SDX. It would be particularly fruitful to extend the definition of the “factory” to cover more than one Citrix product group. The good news is that there are already various activities underway to do precisely this.

Head of QA for Citrix's Converged Infrastructure group, which covers XenServer, CloudPlatform and Citrix Lifecycle Management. I believe that a prime responsibility of everyone involved in software production is to make the process as efficient as possible and to ensure high quality output. Based in Cambridge, UK, I ponder improvement opportunities (and ways to blog about them) while walking Juno the fluffy rottweiler, and while driving to and fro to watch Wolverhampton Wanderers...