Quantifying the Impact of Agile Software Development Practices

Rally Software and Carnegie Mellon's Software Engineering Institute (SEI) are researching the impact of agile software development practices using data from Rally’s Agile Application Lifecycle Management (ALM) platform. Reports from this study are published here.

InfoQ interviewed Larry Maccherone, Director of Analytics at Rally, and Jim McCurley, Senior Researcher at the SEI, about the collaboration between the SEI and Rally, the measurements included in the study, conclusions from the analysis and their plans for further research.

InfoQ: Why did you do this study on quantifying the impacts of agile?

Larry: First let me answer as a researcher. Much of the information available today about the effectiveness of agile practices is anecdotal or of limited scope. Because Rally is delivered as a multi-tenant SaaS product, we are uniquely enabled to do a much more comprehensive study like this. Access to non-attributable data from 13,000 teams using agile practices in a wide variety of contexts means that we can achieve a level of statistical significance, general applicability, and trade-off quantification that the existing literature lacks.

Now, let me answer from the perspective of those who just want to deliver better software faster and for lower cost. Most organizations find it very hard to justify an organization-wide transformation without more concrete evidence -- without the numbers to plug into the company's business model that show how much the (costly and potentially painful) transformation will impact the bottom line. This study and our ongoing research address these issues.

Jim: Studies that compare the efficacy of agile methods and practices do not exist. Although many case studies and survey data suggest improvements by the use of agile, we wanted to go beyond that to establish what practices and combinations of practices work best in what contexts.

InfoQ: What made Rally and the SEI decide to work together to conduct this study?

Larry: I had been working as a researcher at Carnegie Mellon (where the SEI is located) when Watts Humphrey (an SEI fellow at the time) first proposed a way to reintroduce the use of quantitative insight back into an agile development world that had largely rejected metrics as evil. I gave a talk on this idea at the Agile conference in 2009 and got three job offers. I picked Rally to go work for because they are the industry leader and the data is in a multi-tenant SaaS stack, which was very enticing to the researcher in me. The basic structure of the metrics framework that we eventually came up with, the Software Development Performance Index (SDPI), overlaps significantly with various metrics frameworks that have a long history of research at the SEI. So, it was only natural that we work together. Rally has access to the data. The SEI has access to the world's best thought leaders in software engineering measurement.

Jim: When Larry contacted us about collaborating on measurement research, we immediately recognized the opportunity to empirically verify and reinforce some of the claims being made regarding the effectiveness of agile software development. Rally’s co-operative approach to this research is much appreciated. Organizations often find obstacles to sharing data, so we welcomed this opportunity from Rally.

InfoQ: Which data did you use to conduct this study, and how did you analyze it?

Larry: Rally software is used to manage agile development projects. We can look at the history in Rally to extract performance metrics, like how long it takes for the average work item to make it from "in-development" to "accepted". We can extract other performance metrics that we consider good indicators for Productivity, Predictability, Quality, and Responsiveness. Further, we can extract behavior metrics like how many work items are "in-development" simultaneously or how much work is done by people dedicated to one team as opposed to people who work on several teams simultaneously.
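To make the two kinds of metric concrete, here is a minimal sketch of how a time-in-process figure and a work-in-progress count might be derived from a work-item history. The record layout, item ids, dates, and function names are all hypothetical; this is not Rally's data model or the study's actual method, and the history is assumed to be in chronological order.

```python
from datetime import datetime
from statistics import median

# Hypothetical state-change history: (work item id, new state, date).
# Assumed to be listed in chronological order.
HISTORY = [
    ("S1", "in-development", "2014-01-06"),
    ("S1", "accepted", "2014-01-10"),
    ("S2", "in-development", "2014-01-07"),
    ("S2", "accepted", "2014-01-15"),
    ("S3", "in-development", "2014-01-08"),
    ("S3", "accepted", "2014-01-10"),
    ("S4", "in-development", "2014-01-09"),  # not yet accepted
]

def first_transitions(history):
    """Return the first 'in-development' and first 'accepted' date per item."""
    started, accepted = {}, {}
    for item, state, stamp in history:
        when = datetime.strptime(stamp, "%Y-%m-%d")
        if state == "in-development":
            started.setdefault(item, when)
        elif state == "accepted":
            accepted.setdefault(item, when)
    return started, accepted

def time_in_process_days(history):
    """Performance metric: days from 'in-development' to 'accepted' per finished item."""
    started, accepted = first_transitions(history)
    return {item: (accepted[item] - started[item]).days
            for item in accepted if item in started}

def wip_on(history, day):
    """Behavior metric: items started on or before `day` and not yet accepted by it."""
    started, accepted = first_transitions(history)
    when = datetime.strptime(day, "%Y-%m-%d")
    return sum(1 for item, start in started.items()
               if start <= when and (item not in accepted or accepted[item] > when))

durations = time_in_process_days(HISTORY)
print(durations)                    # per-item time in process, in days
print(median(durations.values()))   # a typical time in process
print(wip_on(HISTORY, "2014-01-09"))
```

The median is used here rather than the mean only because a single slow item would otherwise dominate a small sample; the interview does not say which summary statistic the study used.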

Jim: While Rally collects raw data of software production, we also wanted various attribute data to evaluate practices. To do a thorough analysis, we need information about the organizational development context matched to the production data. Although we received permission to collect team information as to the practices employed, type of application, etc., the required responses were not forthcoming in time for this study. Such information is currently being collected and will hopefully form the basis for future research.

We did run into data issues; the most impactful had to do with the amount of variation exhibited by the customer base. For virtually all the production data, such variation precluded the establishment of statistically rigorous models for the relationships between variables. That is, with very large n, everything proved statistically significant, but provided very little predictability due to highly skewed distributions and extremely wide variance. Although we explored various data transformations to address these issues, we could not make much progress without the corresponding attribute data needed to segment the population.
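The point about large n can be illustrated with a short worked calculation. The sketch below is not the study's analysis: the effect size and group size are made-up numbers, the groups are assumed equal in size, and a normal approximation stands in for whatever tests were actually run. It also does not model skew; it only shows how a tiny difference becomes "significant" while explaining almost none of the variance.

```python
import math

def two_group_summary(cohens_d, n_per_group):
    """Summarise a comparison of two equal-size groups.

    cohens_d is the difference in group means divided by the pooled
    standard deviation (an assumed, illustrative input).
    """
    # Large-sample test statistic for a difference in means.
    z = cohens_d * math.sqrt(n_per_group / 2)
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Share of variance explained by group membership (equal group sizes).
    r_squared = cohens_d ** 2 / (cohens_d ** 2 + 4)
    return z, p_value, r_squared

# Hypothetical inputs: a very small effect measured on very many observations.
z, p_value, r_squared = two_group_summary(cohens_d=0.05, n_per_group=100_000)
print(f"z = {z:.2f}, p = {p_value:.1e}, variance explained = {r_squared:.4%}")
```

With these assumed inputs the test statistic is far beyond any conventional significance threshold, yet knowing which group an observation belongs to accounts for well under a tenth of a percent of the variance, which is why significance alone gives little predictability.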

The flexibility designed into the tool permits integration into a variety of workflows – rather than forcing the user to conform their workflow to the tool design. This flexibility turns out to be a significant source of variation for a research effort like ours because it limits the number of assumptions we can make about the workflow generating the data. As a result, the level of variation present in the data was an obstacle – one that cannot be overcome without information on patterns of practice (workflow design) that allow us to make valid inferences about patterns of team behavior that correspond to patterns of performance in the data.

InfoQ: In the report you use a metric framework called Software Development Performance Index to provide insight into productivity, predictability, quality and responsiveness. What makes these aspects so important that they need to be measured?

Larry: Those are only four of the total seven dimensions that we believe make up a complete and balanced metrics regime. We are currently adding customer/stakeholder satisfaction and employee engagement. Later we plan to add something I refer to as the build-the-right-thing metric. You can preview this year's publications in Impact of Agile Quantified - A De-Mystery Thriller.

InfoQ: One of the conclusions from the study is that the productivity of teams increases when people are dedicated to 1 team. Can you elaborate on that?

Jim: When we looked at productivity, quality, or predictability we did see improvement in the medians and averages with teams that are relatively more stable. But the differences are obscured by the wide variation – in data that come from a vast array of workflow patterns.

Larry: The productivity of teams where almost 100 percent of the work is done by people dedicated to that one team is roughly double that of teams where only 50 percent of the work is done by people dedicated to that one team.

Pre-agile thinking is that if you have people with specialized skills, you want to maximize the use of those skills so you split those people across several projects. However, for reasons of task-switching and accountability, you diminish the person's total effectiveness exponentially with each additional project that person is working on. Two projects = 80 percent total. Three = 60 percent total... something like that. Further, agile thinking is that you diminish the team's ability to respond to changing requirements (the definition of agile) if they don't "own" all of the skills that they need to get the job done. Also, by dedicating a person to one team, the team gels to high performance much more rapidly. Even if your "specialists" only get to do their specialty 50 percent of the time, it's still better to dedicate them to one team.
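Taking the rough figures quoted above at face value, a quick calculation shows what they imply for each individual project. The 100 percent baseline for a single project is an assumption added here, and the numbers are the interviewee's ballpark ("something like that"), not measured results.

```python
# Total effectiveness (percent) by number of concurrent projects.
# 80 and 60 are the rough figures quoted in the interview; the
# one-project baseline of 100 is an assumption.
TOTAL_EFFECTIVENESS_PCT = {1: 100, 2: 80, 3: 60}

def per_project_share(projects):
    """Effective percent of one person's capacity that each project receives."""
    return TOTAL_EFFECTIVENESS_PCT[projects] / projects

for projects in sorted(TOTAL_EFFECTIVENESS_PCT):
    print(projects, "project(s):", per_project_share(projects), "percent each")
```

Under these figures a person split across three projects gives each one about a fifth of what a dedicated person would give a single team.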

InfoQ: Related to this is the conclusion that having unstable teams hurts performance. What can organizations do to avoid having to change team composition?

Larry: We found that on average, stable teams have up to 60 percent better productivity, 40 percent better predictability, and 60 percent better responsiveness.

Pre-agile thinking would have you break up the team when the project was over and form a new team when a new project started. It takes a while for a team to gel and start operating at peak performance. Also, because starting and stopping projects never happens simultaneously, you are almost guaranteed to have some non-dedicated folks at the beginning and end of projects, exactly when your performance is most critical.

So, the #1 thing you can do is to flow new projects through already high-performing, gelled teams rather than re-form a team for each project, which triggers another forming, storming, norming, performing cycle. However, not all stability issues are a company decision. When employees leave, you have to replace them. For this reason, it's critical to follow the traditional HR insight that it's better to keep your current employees happy than to acquire new ones.

Jim: We see the fractionalizing of teams in all sorts of software development environments, not just agile. Organizations constantly endeavor to identify, recruit, and retain expertise and there have been notable software development successes in the last decade. Agile, as a philosophy, can be seen as a way to address some of the inefficiencies of software development and is designed to empower teams to effectively utilize talent. So anything which distracts from that team focus is something I see as being at odds with agile tenets.

InfoQ: Some teams use story points to estimate, some use task hours, and there are teams who use both. What are the benefits that teams can get when they use both?

Larry: It appears that teams using both have 40 percent better released defect density on average. The theory is that the act of breaking the work down into tasks and estimating them causes you to think harder about the design and needed functionality. You also talk more about the work and create more shared understanding within the team. This additional thinking and shared understanding prevents defects from leaking into the product.

It is worthwhile to mention that our study also concluded that teams who skip the task breakdown and hour-based estimates have better overall performance, when you also consider productivity, responsiveness, and predictability. However, it may be that more mature teams use this approach and the effect is a function of that maturity.

InfoQ: Your study also investigated the effects of reducing work in progress (WiP). A lower WiP can give a significant increase in quality, i.e. fewer defects in the product. Can you elaborate on that?

Jim: If the amount of WIP can be normalized, we may find that changes in WIP can be indicators of quality. But I’m also interested in the ratio of the type of work – story or defect. As that ratio changes, it may be a more useful team measurement. Part of this research effort has been to investigate suitable ways of “normalizing” the data in meaningful and useful ways. This effort continues. For example, how would you compare the “size” of software products? The debate goes on.

Larry: We were actually very surprised by the magnitude of the effect controlling WiP has on quality. We expected a big effect on responsiveness and maybe even productivity and some effect on quality, but we see a bigger effect than expected on released defect density just by controlling WiP. Part of the explanation is that the lack of distraction and task switching that accompanies focusing on a few things leads to better quality. However, there is an additional theory. Teams that aggressively control their WiP have slack time in their process due to the fact that they have nothing to pivot to when work gets blocked or a queue goes empty. During this slack time, people can think more about their work and do that little extra needed to assure high quality.

InfoQ: Which practices would you like to recommend to agile teams based upon the results from your study? Are there also agile practices that you would like to discourage? Why?

Larry: The first study looked at 5 behaviors, but we currently have underway the follow-on study with over 55 variables of behaviors, attitudes, practices, and context. What we are finding now that we are starting to look at more context variables is that most "recommendations" are context-sensitive. I like to say... THERE ARE NO BEST PRACTICES, ONLY GOOD PRACTICES IN CONTEXT.

That said, there are a few things that seem to be good in a wide variety of cases. Keep your WiP low. Keep your teams stable. Dedicate people to one team and make sure the team is a "whole team" with all the skills it needs to deliver. There are a few things that only seem to be effective in limited circumstances or which provide benefits in one dimension, say quality, with a corresponding trade-off in another, say productivity. We’re not quite ready to publish yet, but several of the XP engineering practices fall into this bucket. It's also possible that merely exhibiting a practice is insufficient to receive the targeted benefit. It might require that you do the practice well to receive its benefits or to mitigate its associated trade-offs.

As for practices that I'd like to discourage, they mostly have to do with culture and human factors. If you roll out a measurement regime the traditional way, you risk harming the progress of your agile transformation and going to the "dark side" of metrics. I have put together some content that I call the "Seven Deadly Sins of Agile Measurement". Sin #1 is trying to use metrics as a lever to directly drive behavior. The alternative heavenly virtue is that you need to use it as a feedback mechanism for self-improvement.

Jim: I agree with Larry that WIP and stability can be important, but we have yet to do an analysis that matches up the various combinations of practices to performance. Generally, we’ve seen the medians and averages trending in the expected direction for relationships such as throughput per full-time equivalent and productivity or defect density but again, the variances were overwhelming. Given that agile is a system of principles and values which allow teams and organizations to choose how to build software, knowledge that comes from metrics that assume practices are comparable may have difficulty being applied.

InfoQ: Will there be more studies in the future on quantifying development practices?

Jim: We certainly want to continue the research into the effectiveness of agile practices by looking at production and quality data matched with the development context. We are currently creating a proposal to do further investigation along these lines. But a background question, common to all software development data, is how to make such data comparable across applications, systems, and organizations? This focus on agile can contribute to our larger understanding of how to better judge software products.

Larry: As I hinted above, we are currently looking at 55 other variables. A few of these will soon be ready for publication including: the ideal iteration length for your context, which part of the world has the highest performing software development teams, the optimal ratio of testers to developers for your context, and several more. I will be sharing findings at the RallyON conference in June, as well as the Lean Kanban conference in May, Agile Australia in June, and the Agile 2014 conference in July.

About the Interviewees

Larry Maccherone is Rally Software's Director of Analytics and Research. Before coming to Rally Software, Larry worked at Carnegie Mellon with the Software Engineering Institute for seven years conducting research on software engineering metrics with a particular focus on reintroducing quantitative insight back into the agile world. He now leads a team at Rally using big data techniques to draw interesting insights and Agile performance metrics, and provide products that allow Rally customers to make better decisions.

Jim McCurley is a Senior Member of the Technical Staff with the Software Engineering Measurement & Analysis (SEMA) program at the Software Engineering Institute (SEI). During his 17 years at the SEI, his areas of expertise have included data analysis, statistical modeling, and empirical research methods. For the last several years, he has worked with various DoD agencies involved with the acquisition of large scale systems.

Thank you for sharing this. I like seeing some attempt at more rigor in our field, and I'm glad to let someone else take the lead on that. :)

The lack of surprising conclusions might cause some people to doubt the validity of the data, the conclusions, or both, while it might cause others to double up on drinking the Purple Juice. I hope that those people will take a moment to think a little less superficially about it.

I worry about the implication that teams that estimate more see better results when delivering. Potentially lost in this article is the point that estimating often leads to breaking down tasks which then leads to thinking more about the design, and the rest. I would love to see what happens when we do this work without using estimates as a motivation; I expect the results to remain more-or-less the same. I believe that breaking down the work on its own, and not estimating, leads to the corresponding benefit stated in the interviewee's answer.

Finally, I find my interest piqued by the result that some teams that don't break down work still deliver better. Do those teams have smaller deliverable units of work (stories, features, tasks, whatever) in general? I would guess that they do, and so they have already "pre-broken down" the work.