Putting technical debt into financial terms also allows consultants and vendors to try to scare business executives into buying their tools or their help – like Gartner calculating that world wide “IT debt” costs will exceed $1.5 in a couple of more years, or CAST software’s assessment that the average enterprise is carrying millions of dollars of technical debt.

Businesses understand debt. Businesses make a decision to take on debt on and they track it, account for it and manage it. The business always knows how much debt they have, why they took it on, and when they need to pay it off. Businesses don’t accidentally take on debt – debt doesn't just show up on the books one day.

We don't know when we're taking technical debt on

But developers accidentally take on debt all of the time – what Martin Fowler calls “inadvertent debt”, due to inexperience and misunderstandings, everything from “What’s Layering?” to “Now we know how we should have done it” looking at the design a year or two later.

"The point is that while you're programming, you are learning. It's often the case that it can take a year of programming on a project before you understand what the best design approach should have been."

Taking on this kind of debt is inevitable – and you’ll never know when you’re taking it on or how much, because you don’t know what you don’t know.

Even when developers take on debt consciously, they don’t understand the costs at the time – the principal or the interest. Most teams don’t record when they make a trade-off in design or a shortcut in coding or test automation, never mind try to put a value on paying off their choice.

We don’t understand (or often even see) technical debt costs until long after we've taken the costs on. When you’re dealing with quality and stability problems; or when you're estimating out a change and you recognize that you made mistakes in the past or that you took shortcuts that you didn't realize before or shortcuts that you did know about but that turned out to be much more expensive than you expected; or once you understand that you chose the wrong architecture or the wrong technical platform. Or maybe you've just run a static analysis tool like CAST or SONAR which tells you that you have thousands of dollars of technical debt in your code base that you didn't know about until now.

Now try and explain to a business executive that you just realized or just remembered that you have put the company into debt for tens or hundreds of thousands of dollars. Businesses don’t and can’t run this way.

We don't know how much technical debt is really costing us

By expressing everything in financial terms, we’re also pretending that technical debt costs are all hard costs to the business and that we actually know how much the principal and interest costs are: we’re $100,000 in debt and the interest rate is 3% per year. Assigning a monetary value to technical debt costs give them a false sense of precision and accuracy.

Let's be honest. There aren't clear and consistent criteria for costing technical debt and modelling technical debt repayment – we don’t even have a definition of what technical debt is that we can all agree on. Two people can come up with a different technical debt assessment for the same system, because what I think technical debt is and what you think technical debt is aren't the same. And just because a tool says that technical debt costs are $100,000.00 for a code base, doesn't make the number true.

Any principal and interest that you calculate (or some tool calculates for you) are made-up numbers and the business will know this when you try to defend them – which you are going to have to do, if you want to talk in financial terms with someone who does finance for a living. You’re going to be on shaky ground at best – at worse, they’ll understand that you’re not talking about real business debt and wonder what you’re trying to pull off.

The other problem that I see is “debt fatigue”. Everyone is overwhelmed by the global government debt crisis and the real estate debt crisis and the consumer debt crisis and the fiscal cliff and whatever comes next. Your business may be already fighting its own problems with managing its financial debt. Technical debt is one more argument about debt that nobody is looking forward to hearing.

We don’t need to talk about debt with the business

We don’t use the term “technical debt” with the business, or try to explain it in financial debt terms. If we need to rewrite code because it is unstable, we treat this like any other problem that needs to be solved – we cost it out, explain the risks, and prioritize this work with everything else. If we need to rewrite or restructure code in order to make upcoming changes easier, cheaper and less risky, we explain this as part of the work that needs to be done, and justify the costs. If we need to replace or upgrade a platform technology because we are getting poor support from the supplier, we consider this a business risk that needs to be understood and managed. And if code should be refactored or tests filled in, we don’t explain it, we just do it as part of day-to-day engineering work.

We’re dealing with technical debt in terms that the business understands without using a phony financial model. We’re not pretending that we’re carrying off-balance sheet debt that the company needs to rely on technologists to understand and manage. We’re leaving debt valuation and payment amortization arguments to the experts in finance and accounting where they belong, and focusing on solving problems in software, which is where we belong.

In this model, defects that have been found but haven't been fixed yet aren't technical debt - they are just part of the work that everyone can see and that has to be prioritized with the rest of the backlog.

But what about defects that you've found, but that you've decided that you’re not going to fix – bugs that you think you can get away without fixing (or that the people before you thought they could get away without fixing)?

These bugs are technical debt – because you’re pretending that the bugs are invisible. You’re betting that you don’t have to take on the cost of fixing these problems in the short term at least, just like other kinds of technical debt: copy and paste code and conscious short-cuts in design and compounding complexity and automated tests that should have been written but weren't and refactoring that should have been done that wasn't and code that you wrote that you wished you hadn't because you didn't know the language well enough back then and anything else that you may have done or not done in the past that has a chance of slowing down your work and making it harder in the future.

The same rule applies to results from static analysis bug finding tools like Findbugs
and Klocwork and Fortify which point out coding problems that could be real bugs and security vulnerabilities. After you filter out the false positives and motherhood, and the code that works but really should be cleaned up, you’re left with code that is wrong or broken – code that should be fixed.

Keep in mind that these are problems that haven’t been found yet in testing, or found in production by customers – or at least nobody knows that they've run across these bugs. These are problems that the team knows about, but that aren't visible to anybody else. Until they are fixed or at least added to the backlog of work that will be done soon, they are another part of the debt load that the team has taken on and will have to worry about paying back some day. This is why tools like SONAR include static analysis coding violations when calculating technical debt costs.

I agree that bugs aren't technical debt – unless you’re trying to pretend that the bugs aren't there and that you won’t have to fix them. Then it’s like any other technical debt trade off – you’ll need to see if your bet pays off over time.

Tuesday, December 4, 2012

A question that constantly comes up from people that care about writing good code, is: what’s the right size for a method or function, or a class, or a package or any other chunk of code? At some point any piece of code can be too big to understand properly – but how big is too big?

In Code Complete, Steve McConnell says that the theoretical best maximum limit for a method or function is the number of lines that can fit on one screen (i.e., that a developer can see at one time). He then goes on to reference studies from the 1980s and 1990s which found that the sweet spot for functions is somewhere between 65 lines and 200 lines: routines this size are cheaper to develop and have fewer errors per line of code. However, at some point beyond 200 lines you cross into a danger zone where code quality and understandability will fall apart: code that can’t be tested and can’t be changed safely. Eventually you end up with what Michael Feathers calls “runaway methods”: routines that are several hundreds or thousands of lines long and that are constantly being changed and that continuously get bigger and scarier.

Patrick Duboy looks deeper into this analysis on method length, and points to a more modern study from 2002 that shows that code with shorter routines has fewer defects overall, which matches with most people’s intuition and experience.

Smaller must be better

Bob Martin takes the idea that “if small is good, then smaller must be better” to an extreme in Clean Code:

The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that. Functions should not be 100 lines long. Functions should hardly ever be 20 lines long.

Martin admits that “This is not an assertion that I can justify. I can’t produce any references to research that shows that very small functions are better.” So like many other rules or best practices in the software development community, this is a qualitative judgement made by someone based on their personal experience writing code – more of an aesthetic argument – or even an ethical one – than an empirical one. Style over substance.

Rule of 30

If an element consists of more than 30 subelements, it is highly probable that there is a serious problem:

a) Methods should not have more than an average of 30 code lines (not counting line spaces and comments).

b) A class should contain an average of less than 30 methods, resulting in up to 900 lines of code.

c) A package shouldn’t contain more than 30 classes, thus comprising up to 27,000 code lines.

d) Subsystems with more than 30 packages should be avoided. Such a subsystem would count up to 900 classes with up to 810,000 lines of code.

e) A system with 30 subsystems would thus possess 27,000 classes and 24.3 million code lines.

What does this look like? Take a biggish system of 1 million NCLOC. This should break down into:

30,000+ methods

1,000+ classes

30+ packages

Hopefully more than 1 subsystem

How many systems in the real world look like this, or close to this – especially big systems that have been around for a few years?

Are these rules useful? How should you use them?

Using code size as the basis for rules like this is simple: easy to see and understand. Too simple, many people would argue: a better indicator of when code is too big is cyclomatic complexity or some other measure of code quality. But some recent studies show that code size actually is a strong predictor of complexity and quality – that

“complexity metrics are highly correlated with lines of code, and therefore the more complex metrics provide no further information that could not be measured simplify with lines of code”.

In "Beyond Lines of Code: Do we Need more Complexity Metrics" in Making Software, the authors go so far as to say that lines of code should be considered always as the "first and only metric" for defect prediction, development and maintenance models.

Recognizing that simple sizing rules are arbitrary, should you use them, and if so how?

I like the idea of rough and easy-to-understand rules of thumb that you can keep in the back of your mind when writing code or looking at code and deciding whether it should be refactored. The real value of a guideline like the Rule of 30 is when you're reviewing code and identifying risks and costs.

But enforcing these rules in a heavy handed way on every piece of code as it is being written is foolish. You don’t want to stop when you’re about to write the 31st line in a method – it would slow down work to a crawl. And forcing everyone to break code up to fit arbitrary size limits will make the code worse, not better – the structure will be dominated by short-term decisions.

“Our goal is to keep our overall system small while we are also keeping our functions and classes small. Remember however that this rule is the lowest priority of the four rules of Simple Design. So, although it’s important to keep class and function count low, it’s more important to have tests, eliminate duplication, and express yourself.”

Sometimes it will take more than 30 lines (or 20 or 5 or whatever the cut-off is) to get a coherent piece of work done. It’s more important to be careful in coming up with the right abstractions and algorithms and to write clean clear code – if a cut-off guideline on size helps to do that, use it. If it doesn't, then don’t bother.

Scrum is easy for developers to understand too and easy for them to follow. Unlike XP or some of the other more well-defined Agile methods, Scrum is not prescriptive and doesn't demand a lot of technical discipline. It lets the team decide what they should do and how they should do it. They can get up to speed and start “doing Agile” quickly and cheaply.

And now the PMI has got involved with its PMI-ACP Certified Agile Practitioner designation which basically ensures that people understand Scrum, with a bit of XP, Lean and Kanban thrown in to be comprehensive.

Certification is a win win win…

Managers like standardization and certification – especially busy, risk-adverse managers in large mainstream organizations. If they are going to “do Agile”, they want to make sure that they do it right. By paying for standardized certified training and coaching on a standardized method, they can be reassured that they should get the same results as everyone else. Because of standardization and certification, getting started with Scrum is low risk: it’s easy to find registered certified trainers and coaches offering good quality professional training programs and implementation assistance. Scrum has become a product – everyone knows what it looks like and what to expect.

Certification also makes it easier for managers to hire new people (demand a certification upfront and you know that new hires will understand the fundamentals of Scrum and be able to fit in right away) and it’s easier to move people between teams and projects that are all following the same standardized approach.

Developers like this too, because certification (even the modest CSM) helps to make them more employable, and it doesn't take a lot of time, money or work to get certified.

But most importantly, certification has created a small army of consultants and trainers who are constantly busy training and coaching a bigger army of certified Scrum practitioners. There is serious competition between these providers, pushing each other to do something to get noticed in the crowd, saturating the Internet with books and articles and videos and webinars and blogs on Scrum and Scrumness, effectively drowning out everything else about Agile development.

And the standardization of Scrum has also helped create an industry of companies selling software tools to help manage Scrum projects, another thing that managers in large organizations like, because these tools help them to get some control over what teams are doing and give them even more confidence that Scrum is real. The tool vendors are happy to sponsor studies and presentations and conferences about Agile (er, Scrum), adding to the noise and momentum behind Scrum.

It looks like David Anderson may be trying to do a similar thing with Kanban certification. It’s hard to see Kanban taking over the world of software development – while it’s great for managing support and maintenance teams, and helps to control work flow at a micro-level, Kanban doesn't fit for larger project work. But then neither does Scrum. And who would have picked Scrum as the winner 10 years ago?

Tuesday, November 20, 2012

Speed – being first to market, rapid innovation and conducting fast cheap experiments – is critically important to startups and many high tech firms. This is where Lean Startup ideas and Continuous Deployment come in. And this is why many companies are following Agile development, to design and deliver software quickly and flexibly, incorporating feedback and responding to change.

But what happens when software – and the companies that build software – grow(s) up? What matters at the enterprise level? Speed isn't enough for the enterprise. You have to balance speed and cost and quality. And stability and reliability. And security. And maintainability and sustainability over the long term. And you have to integrate with all of the other systems and programs in your company and those of your customers and partners.

Last week, a group of smart people who manage software development at scale got together to look at all of these problems, at Construx Software’s Executive Summit in Seattle.
What became clear is that for most companies, the most important factor isn’t speed, or productivity or efficiency – although everyone is of course doing everything they can to cut costs and eliminate waste. And it isn’t flexibility, trying to keep up with too much change. What people are focused on most, what their customers and sponsors are demanding, is predictability – delivering working software when the business needs it, being a reliable and trusted supplier to the rest of the business, to customers and partners.

Enterprise Agile Development and Predictability

Steve McConnell’s keynote on estimation in Agile projects kicked this off. A lot of large companies are adopting Agile methods because they’ve heard and seen that these approaches work. But they’re not using Agile out of the box because they’re not using it for the same reasons as smaller companies.

Large companies are adapting Agile and hybrid plan-based/Waterfall approaches combining upfront scope definition, estimating and planning, with delivering the project incrementally in Agile time boxes. This is not about discovery and flexibility, defining and designing something as you go along – the problems are too big, they involve too many people, too many parts of the business, there are too many dependencies and constraints that need to be satisfied. Emergent design and iterative product definition don’t apply here.

Enterprise-level Agile development isn’t about speed either, or “early delivery of value”. It’s about reducing risk and uncertainty. Increasing control and visibility. Using story points and velocity and release Burn Up reporting and evidence of working software to get early confidence about progress on the project and when the work will be done.

The key is to do enough work upfront so that you can make long-term commitments to the business – to understand what the business needs, at least at a high level, and estimate and plan this out first. Then you can follow Agile approaches to deliver working software in small steps, and to deal with changes as they come in. As McConnell says, it’s not about “responding to change over following a plan”. It’s having a plan that includes the ability to respond to change.

By continuously delivering small pieces of working software, and calibrating their estimates with real project data using velocity, a team working this way will be able to narrow the “cone of uncertainty” much faster – they’ll learn quickly about their ability to deliver and about the quality of their estimates, as much as 2x faster than teams following a fully sequential Waterfall approach.

There are still opportunities to respond to change and reprioritize. But this is more about working incrementally than iteratively.

Kanban and Predictability

Enterprise development managers are also looking at Kanban and Lean Development to manage waste and to maintain focus. But here too the value is in improving predictability, to smooth work out and reduce variability by finding and eliminating delays and bottlenecks. It’s not about optimization and Just-in-Time planning.

You do this by keeping the team focused on the business of making software, and trying to drive out everything else: eliminating delays and idle time, cutting back administrative overhead, not doing work that will end up getting thrown away, minimizing time wasted in multi-tasking and context-switching, and not starting work before you’ve finished the work that you’ve already committed to. Anderson says that managers like to start on new work as it comes in because “starting gives your customer a warm comfortable feeling” – until they find out you’ve lied to them, because “we’re working on it” doesn’t mean that the work is actually getting done, or will ever get done. This includes fixing bugs – you don’t just fix bugs right away because you should, you fix bugs as they’re found because the work involved is smaller and more predictable than trying to come back and fixing them later.

Teams can use Kanban to dynamically prioritize and control the work in front of them, to balance support and maintenance requirements against development work and fixed date commitments with different classes of service, and limit Work in Progress (WIP) to shorten lead times and improve throughput, following the Theory of Constraints. This lets you control variability and improve predictability at a micro-level. But you can also use actual throughput data and Cumulative Flow reporting to project forward on a project level and understand how much work the team can do and when they will be done.

What’s interesting to me is seeing how the same ideas and methods are being used in very different ways by very different organizations – big and small – to achieve success, whether they are focused on fast iterations and responding to rapid change, or managing large programs towards a predictable outcome.

Friday, November 9, 2012

After going live, we started building health checks into the system – run-time checks on operational dependencies and status to ensure that the system is setup and running correctly. Over time we have continued to add more run-time checks and tests as we have run into problems, to help make sure that these problems don’t happen again.

This is more than pings and Nagios alerts. This is testing that we installed the right code and configuration across systems. Checking code build version numbers and database schema versions. Checking signatures and checksums on files. That flags and switches that are supposed to be turned on or off are actually on or off. Checking in advance for expiry dates on licenses and keys and certs.

Sending test messages through the system. Checking alert and notification services, make sure that they are running and that other services that are supposed to be running are running, and that services that aren't supposed to be running aren't running. That ports that are supposed to be open are open and ports that are supposed to be closed are closed. Checks to make sure that files and directories that are supposed to be there are there, that files and directories that aren't supposed to be there aren't, that tables that are supposed to be empty are empty. That permissions are set correctly on control files and directories. Checks on database status and configuration.

Checks to make sure that production and test settings are production and test, not test and production. Checking that diagnostics and debugging code has been disabled.
Checks for starting and ending record counts and sequence numbers. Checking artefacts from “jobs” – result files, control records, log file entries – and ensuring that cleanup and setup tasks completed successfully. Checks for run-time storage space.

We run these health checks at startup, or sometimes early before startup, after a release or upgrade, after a failover – to catch mistakes, operational problems and environmental problems. These are tests that need to run quickly and return unambiguous results (things are ok or they’re not). They can be simple scripts that run in production or internal checks and diagnostics in the application code – although scripts are easier to adapt and extend. Some require hooks to be added to the application, like JMX.

Run-time Asserts

Other companies like Etsy do something similar with run-time asserts, using a unit test approach to check for conditions that must be in place for the system to work properly.

These tests can (and should) be run on development and test systems too, to make sure that the run-time environments are correct. The idea is to get away from checks being done by hand, operational checklists and calendar reminders and manual tests. Anything that has a dependency, anything that needs a manual check or test, anything in an operational checklist should have an automated run-time check instead.

Monkey Armies

The same ideas are behind Netflix’s over-hyped (though not always by Netflix) Simian Army, a set of robots that not only check for run-time conditions, but that also sometimes take automatic action when run-time conditions are violated – or even violate run-time conditions to test that the system will still run correctly.

The army includes Security Monkey, which checks for improperly configured security groups, firewall rules, expiring certs and so on; and Exploit Monkey, which automatically scans new instances for vulnerabilities when they are brought up. Run-time checking is taken to an extreme in Conformity Monkey, which shuts down services that don’t adhere to established policies, and the famous Chaos Monkey, which automatically forces random failures on systems, in test and in production.

It’s surprising how much attention Chaos Monkey gets – maybe it’s the cool name, or because Netflix has Open Sourced it along with some of their other monkeys. Sure it’s ballsy to test failover in production by actually killing off systems during the day, even if they are stateless VM instances which by design should failover without problems (although this is the point, to make sure that they really do failover without problems like they are supposed to).

There's more to Netflix's success than run-time fault injection and the other monkeys. Still, automatically double-checking as much as you can at run-time is especially important in an engineering-driven, rapidly-changing Devops or Noops environment where developers are pushing code into production too fast to properly understand and verify in advance. But whether you are continuously deploying changes to production (like Etsy and Netflix) or not, getting developers and ops and infosec together to write automated health checks and run-time tests is an important part of getting control over what's actually happening in the system and keeping it running reliably.

Monday, November 5, 2012

An interesting interview with Dan Cornell of Denim Group on the work that they are doing to understand the cost of remediating security vulnerabilities, here on the SANS Application Street Fighter blog.

Monday, October 29, 2012

OWASP held its annual USA conference on application security last week in Austin. The conference was well run and well attended: more than 800 people, lots of developers and infosec experts. Here’s what I learned:

The Web Started off Broken

Javascript, the browser Same Origin Policy, and what eventually became the DOM were all created in about 10 days by 1 guy back in 1995 while Netscape was fighting for its life against Microsoft and IE. This explains a lot.

Fundamental decisions were made without enough time and thought – the results are design weaknesses that make CSRF and XSS attacks possible, as well as other problems and threats that we still face today.

The Web is Broken Today

We’re continuing to live with these problems because the brutal zero sum competitive game between browser vendors stops any of them from going back and trying to undo early decisions – nobody can afford to risk “breaking the web” and losing customers to a competitor. Browser vendors have to keep moving forward, adding new features and capabilities and more complexity on a broken foundation.

But it’s worse than just this. Apps are being hacked because of simple, stupid and sloppy mistakes that people keep making today, not just naive mistakes that were made more than 10 years ago. Most web apps (and now mobile apps) are being broken through SQL Injection even though this is easy for everyone to understand and prevent. Secure password storage is another fundamental thing that people keep doing wrong. We have to get together, make sure developers understand these basic problems, and get them fixed. Now.

The Web is going to stay Broken

Even crypto, done properly (which is asking a lot of most people), won’t stay safe for long. It’s becoming cheap enough and easy enough to rent enough Cloud resources to break long-established crypto algorithms. We need to keep aware of the shifting threats against crypto algorithms, and become what Michael Howard calls “crypto agile”.

HD Moore showed that the Cloud can also be used for Internet-wide reconnaissance. Not just scanning for vulnerabilities or inconsistencies in different parts of the web. This is opening up the entire web to researchers and attackers to find new correlations and patterns and attack points. With the resources of the Cloud, “$7 own the Internet hacks” are now possible.

Then there are new languages and development platforms like HTML5, which provides a rich set of capabilities to web and mobile developers, including audio and video, threads in Javascript, client-side SQL, client-side virtual file systems, and WebSockets – making the browser into a “a mini-OS”. This also means that the attack surface of HTML 5 apps has exploded, opening up new attack vectors for XSS and CSRF, and lots of other new types of attacks. From a security view point, HTML 5 is really, truly, deeply scary.

But some things are Getting Fixed

There are some things being done to make the web ecosystem safer like improvements to Javascript, ADsafe and the like.

Content Security Policy

But the most important and useful thing that I learned about is Content Security Policy, an attempt to fix fundamental problems in the Same Origin Policy, by letting companies define a white list of domains that browsers should consider to be valid sources of executable scripts.

CSP was mentioned in a lot of the talks as a new way of dealing with problems as well as finding out what is happening out there in the web, even helping to make HTML 5 safer. Twitter is already using Content Security Policy to detect violations, so, with some caveats, it works. CSP won’t work for every situation (consumer mashups for example), it doesn’t support inline Javascript, it only works in the latest browsers and then only if developers and administrators know about it and use it properly, and I am sure that people will find attacks to get around it. But it is a simple, understandable and practical way for enterprise app developers to close off, or at least significantly reduce the risk of, XSS attacks – something that developers could actually figure out and might actually use.

Devops

Another key theme at the conference was how Devops – getting developers, operations and infosec working together, using common tools to deliver software faster and in smaller batches – can help make applications more secure. Through better communication, collaboration and automation, better configuration management and deployment and run-time checking and monitoring, and faster feedback to and from developers, you can prevent more problems from happening and find and fix problems much faster.

There was a Devops keynote, a Devops panel, and several case studies: Twitter, Etsy, Netflix and Mozilla showed what was possible if you have the discipline, talent, culture, tools, money and management commitment to do this. Of course for the rest of us who don’t work at Web 2.0 firms with more money than God, the results and constraints and approach will be different, but Devops can definitely help break down the cultural and organizational and information walls between Appsec and development. If you’re in AppSec, this is something to get behind of and help with, don’t get in the way.

Other Things Seen and Overheard

You apparently need a beard to work on the manly and furry AppSec team at Twitter.

If you’re asking executives for funding, don’t say “Defence in Depth” – they know this just means wasting more money. Talk about “compensating controls” instead.

“The most reliable, effective way of injecting evil code is buying an ad”.

There was a lot more. Austin was an excellent venue: friendly people, great restaurants, cool bats (I like bats), lots of good places to chill out and enjoy the nice weather. And the conference was great. I'm already looking forward to next year.

Tuesday, October 23, 2012

Refactoring is a disciplined way to clarify, retain or restore the design of a system as you make changes, and to help cleanup and correct the mistakes and mess that we all make as we work, to clear away the evidence of false starts and changes in direction and back tracking and to help fill in gaps and misunderstandings.

As a colleague of mine has pointed out, you can get a lot out of even the most simple and obvious refactoring changes: eliminating duplication, changing variable and method names to be more meaningful, extracting methods, simplifying conditional logic, replacing a magic number with a named constant. These are easy things to do, and will give you a big return in understandability and maintainability.

But refactoring has limitations – there are some problems that refactoring won’t solve.

But it can take a long time, usually only once the system is being used in the real world by real customers to do real things, before you learn how wrong you actually were, how much you missed and misunderstood, exceptions and edge cases and defects piling up before you finally understand (or accept) that no, the design doesn't hold up, you can’t just keep on extending it and patching what you have – you need a different set of abstractions or a different architecture entirely.

Refactoring helps you make course corrections. But what if you find out that you've been driving the entire time in the wrong direction, or in circles?

“Experience to date also indicates that low-cost refactoring cannot be depended upon as projects scale up. The most serious problems that arise with simple design are problems known as “architecture breakers”. These highly expensive problems can occur when early, simple design decisions result in forseeable changes that cause breakage in design beyond the ability of refactoring to handle.”

This is another argument in the “Refactor or Design” holy war over how much design should be / needs to be done upfront and how much can be filled in as you go through incremental change and refactoring.

Deep Decisions

Many design ideas can be refined, elaborated, iterated and improved over time, and refactoring will help you with this. But some early decisions on approach, packaging, architecture, and technology platform are too fundamental and too deep to change or correct with refactoring.

You can use refactoring to replace in-house code with standard library calls, or to swap one library for another – doing the same thing in a different way. Making small design changes and cleaning things up as you go with refactoring can be used to extend or fill in gaps in the design and to implement cross-cutting features like logging and auditing, even access control and internationalization – this is what the XP approach to incremental design is all about.

But making small-scale design changes and improvements to code structure, extracting and moving methods, simplifying conditional logic and getting rid of case statements isn’t going to help you if your architecture won’t scale, or if you chose the wrong approach (like SOA) or the wrong application framework (J2EE with Enterprise Java Beans, any multi-platform UI framework or any of the early O/R mapping frameworks – remember the first release of TopLink?, or something that you rolled yourself before you understood how the language actually worked), or the wrong language (if you found out that Ruby or PHP won’t scale), or a core platform middleware technology that proves to be unreliable or that doesn't hold up under load or that has been abandoned, or if you designed the system for the wrong kind of customer and need to change pretty much everything.

Refactoring to Patterns and Large Refactorings

Joshua Kerievsky’s work on Refactoring to Patterns provides higher-level composite refactorings to improve – or introduce – structure in a system, by properly implementing well-understood design patterns such as factories and composites and observers, replacing conditional logic with strategies and so on.

Refactoring to Patterns helps with cleaning up and correcting problems like

Lippert and Roock’s work on Large Refactorings explains how to take care of common architectural problems in and between classes, packages, subsystems and layers, doing makeovers of ugly inheritance hierarchies and reducing coupling between modules and cleaning up dependency tangles and correcting violations between architectural layers – the kind of things that tools like Structure 101 help you to see and understand.

They have identified a set of architectural smells and refactorings to correct them:

But refactoring to patterns or even large-scale refactoring still isn't enough to unmake or remake deep decisions or change the assumptions underlying the design and architecture of the system. Or to salvage code that isn't safe to refactor, or worth refactoring.

Sometimes you need to rewrite, not refactor

The best answer seems to be that refactoring should always be your first choice, even for legacy code that you didn’t write and don’t understand and can’t test (there is an entire book written on how and where to start refactoring legacy spps).

But if the code isn’t working, or is so unstable and so dangerous that trying to refactor it only introduces more problems, if you can’t refactor or even patch it without creating new bugs, or if you need to refactor too much of the code to get it into acceptable shape (I’ve read somewhere than 20% is a good cut-off, but I can’t find the reference), then it’s time to declare technical bankruptcy and start again. Rewriting the code from scratch is sometimes your only choice. Some code shouldn't be – or can’t be – saved.

"Sometimes code doesn't need small changes—it needs to be tossed out so that you can start over. If you find yourself in a major refactoring session, ask yourself whether instead you should be redesigning and reimplementing that section of code from the ground up."
Steve McConnell, Code Complete

But refactoring isn’t enough if you have to reframe the system – if you need to do something fundamentally different, or in a fundamentally different way – or if the code isn’t worth salvaging. Don’t get stuck believing that refactoring is always the right thing to do, or that you can refactor yourself out of every problem.

Wednesday, October 17, 2012

“organizations which design systems (in the broad sense used here) are constrained to produce designs which are copies of the communication structures of these organizations.” [emphasis mine]

This was an assertion made in the 1960s based on a small study which has now become a truism in software development (it’s fascinating how much of what we do and think today is based on data that is 50 or more years old). There are lots of questions to ask about Conway’s Law. Should we believe it – is there evidence to support it? How important is the influence of the structure of the team that designed and built the system compared to the structure of the team that continued to change and maintain it for several years – are initial decisions more or less important? What happens as the organization structure changes over time – are these changes reflected in the structure of the code? What organization structures result in better code, or is it better to have no organization structure at all?

Conway's Law and Collective Code Ownership

Conway’s Law is sometimes used as an argument for a “Whole Team” approach and “Collective Code Ownership” in Agile development. The position taken is that systems that are designed by teams structured around different specializations are of lower quality (because they are artificially constrained) than systems built by teams of “specialized generalists” or “generalizing specialists” who share responsibilities and the code (in small Scrum teams for example).

Communications Structure and Seams

First it is important to understand that the argument in Conway’s Law is not necessarily about how organizations are structured. It is about how people inside an organization communicate with each other – whether and how they talk to each other and share information, the freedom and frequency and form, is communication low-bandwidth and formal/structured, or high-bandwidth and informal. It’s about the “social structure” of an organization.

There are natural seams that occur in any application architecture, as part of decomposition and assignment of responsibilities (which is what application architecture is all about). Client and server separation (UI and UX work is quite different from what needs to be done on the server, and is often done with completely different technology), API boundaries with outside systems, data management, reporting, transaction management, workflow. Different kinds of problems that require different skills to solve them.

Conway’s Corollary

Much more interesting is what Conway’s Law means to how you should structure your development organization. This is Conway’s Corollary:

“A software system whose structure closely matches its organization’s communication structure works better (defined broadly) than a subsystem who structure differs from its organization’s communication structure.”

“Better” means higher productivity for the people developing and maintaining the system, through more efficient communication and coordination, and higher quality.

In Making Software, Christian Bird at Microsoft Research (no relation) explains how important it is that an organization’s “social structure” mirrors the architecture of the system that they are building or working on. He walks through a study on the relationship between development team organization structure and post-release defects, in this case the organization that built Microsoft Windows Vista. This was a very large project, with thousands of developers, working on tens of millions of LOC. The study found that organization structure was a better indicator of software quality than any attributes of the software itself. The more complex the organization, the more coordination required, the more chances for bugs (obvious, but worth verifying). What is most important is “geographic and structural congruence” – work that is related should be done by people who are working closely together (also obvious, and now we have data to prove it).

Conway's Corollary and Collective Code Ownership

Conway’s Corollary argues against the “Collective Code Ownership” principle in XP where everyone can and should work on any part of the code at any time. The Microsoft study found that where developers from different parts of the organization worked on the same code, there were more bugs. It was better to have a team own a piece of code, or at the very least act as a gatekeeper and review all changes. Work is best done by the people (or person) who understand the code the most.

Making Organizational Decisions

A second study of 5 OSS projects was also interesting, because it showed that even in Open Source projects, people naturally form teams to work together on logically related parts of a code base.

The lessons from Conway's Corollary are that you should delay making decisions on organization until you understand the architectural relationships in a system; and that you need to reorganize the team to fit as the architecture changes over time. Dan Pritchett even suggests that that if you want to change the architectural structure of a system, you should start by changing the organization structure of the team to fit the target design – forcing the team to work togeter to “draw the new architecture out of the code”.

Conway’s Law is less important and meaningful than people believe. Applying the argument to small teams, especially co-located Agile teams where people are all working closely together and talking constantly, is effectively irrelevant.

Conway’s Corollary however is valuable, especially for large, distributed development organizations. It’s important for managers to ensure that the structure of the team is aligned with the architectural structure of the system – the way it is today, or the way you want it to be.

Thursday, October 11, 2012

We need to understand what happens to code over time and why, and what a healthy, long-lived code base looks like. What architectural decisions have the most lasting impact, and what decisions made early will make the most difference over the life of a system.

Forces of Compromise

Most of the discussion around technical debt assumes that code degrades over time because of sloppiness and lazy coding practices and poor management decisions, by programmers who don’t know or don’t care about what they are doing or who are forced to take short-cuts under pressure. But it’s not that simple. Code is subject to all kinds of pressures and design compromises, big and small, in the real world.

Performance optimization trade-offs can force you to bend the design and code in ways that were never expected. Dealing with operational dependencies and platform quirks and run-time edge cases also adds complexity. Then there are regulatory requirements – things that don’t fit the design and don’t necessarily make sense but you have to do anyways. And customization: customer-specific interfaces and options and custom workflow variants and custom rules and custom reporting, all to make someone important happy.

Integration with other systems and API lock-in and especially maintaining backwards compatibility with previous versions can all make for ugly code. Michael Feathers, who I think is doing the most interesting and valuable work today in understanding what happens to code and what should happen to code over time, has found that code around APIs and other hard boundaries becomes especially messy – because some interfaces are so hard to change, this forces programmers to do extra work (and workarounds) behind the scenes.

All of these forces contribute to making a system more complex, harder to understand, harder to change and harder to test over time – and harder to love.

Iterative Development is Erosive

In Technical Debt, Process and Culture, Feathers explains that “generalized entropy in software systems” is inevitable, the result of constant and normal wear and tear in an organization. As more people work on the same code, the design will naturally deteriorate as each person interprets the design in their own way and makes their own decisions on how to do something. What’s interesting is that the people working with this code can’t see how much of the design has been lost because their familiarity with the code makes it appear to be simpler and clearer than it really is. It’s only when somebody new joins the team that it becomes apparent how bad things have become.

Feathers also suggests that highly iterative development accelerates entropy, and that code which is written iteratively is qualitatively different than code in systems where the team spent more time in upfront design. Iterative development and maintenance tend to bias towards the existing structure of the system, meaning that more compromises will end up being made.

Iterative design and development involves making a lot of small mistakes, detours and false starts as you work towards the right solution. Testing out new ideas in production through A/B split testing amplifies this effect, creating more options and complexity. As you work this way some of the mistakes and decisions that you make won’t get unmade – you either don’t notice them, or it’s not worth the cost. So you end up with dead abstractions and dead ends, design ideas that aren't meaningful any more or are harder to work with than they should be. Some of this will be caught and corrected later in the course of refactoring, but the rest of it becomes too pervasive and expensive to justify ripping out.

Dealing with Software Sprawl

Software, at least software that gets used, gets bigger and more complicated over time – it has to, as you add more features and interfaces and deal with more exceptions and alternatives and respond to changing laws and regulations. Capers Jones analysis shows that the size of the code base for a system under maintenance will increase between 5-10% per year. Our own experience bears this out - the code base for our systems has doubled in size in the last 5 years.

As the code gets bigger it also gets more complex – code complexity tends to increase an average of between 1% and 3% per year. Some of this is real, essential complexity – not something that you can wish your way out of. But the rest is due to how changes and fixes are done.

Feathers has confirmed by mining code check-in history (Discovering Startling Things from your Version Control System) that most systems have a common shape or “power curve”. Most code is changed only infrequently or not at all, but the small percentage of methods and classes in the system that are changed a lot tend to get bigger and more complex over time. This is because it is

easier to add code to an existing method than to add a new method and easier to add another method to an existing class than to add a new class.

The key to keeping a code base healthy is disciplined refactoring of this code, taking the time to come up with new and better abstractions, and preventing the code from festering and becoming malignant.

There is also one decision upfront that has a critical impact on the future health of a code base. Capers Jones has found that the most important factor in how well a system ages is, not surprisingly, how complex the design was in the beginning:

The rate of entropy increase, or the increase in cyclomatic complexity, seems to be proportional to the starting value. Applications that are high in complexity when released will experience much faster rates or entropy or structural decay than applications put into production with low complexity levels.
The Economics of Software Quality

Systems that were poorly designed only get worse – but Jones has found that systems that were well-designed can actually get better over time.

Tuesday, September 18, 2012

The SANS Institute is surveying companies to understand what tools and practices they are using to build security into applications, what their greatest risks and challenges are and how they are managing them. You can find the survey here.

When you are building a system and making trade-off decisions between what can be done now and what will need to be done “sometime in the future”.

“Sometime in the future”, when have to deal with those decisions, when you need to pay off that debt.

What happens when “sometime in the future” is now? How much debt is too much to carry? When do you have to pay if off?

How much debt is too much?

Every system carries some debt. There is always code that isn’t as clean or clear as it should be. Methods and classes that are too big. Third party libraries that have fallen out of date. Changes that you started in order to solve problems that went away. Design and technology choices that you regret making and would do differently if you had the chance.

Complexity by itself isn’t enough. Some code is essentially complex, or accidentally complex but it doesn’t need to be changed, so it doesn’t add to the real cost of development. Tools like Sonar look at complexity as well as other variables to assess the technical risk of a code base:

This gives you some idea of technical debt costs that you can track over time or compare between systems.But when do you have to fix technical debt? When do you cross the line?

Deciding on whether you need to pay off debt depends on two factors:

Safety / risk. Is the code too difficult or too dangerous to change? Does it have too many bugs? Capers Jones says that every system, especially big systems, has a small number of routines where bugs concentrate (the 20% of code that has 80% of problems), and that cleaning up or rewriting this code is the most important thing that you can do to improve reliability as well as to reduce long the term costs of running a system.

Cost – real evidence that it is getting more expensive to make changes over time, because you’ve taken on too much debt. Is it taking longer to make changes or to fix bugs because the code is too hard to understand, or because it is too hard to change, or too hard to test?

While apparently for some teams it’s obvious that if you are slowing down it must be because of technical debt, I don’t believe it is that simple.

There are lots of reasons for a team to slow down over time, as systems get bigger and older, reasons that don’t have anything to do with technical debt. As systems get bigger and are used by more customers in more ways, with more features and customization, the code will take longer to understand, changes will take longer to test, you will have more operational dependencies, more things to worry about and more things that could break, more constraints on what you can do and what risks you can take on. All of this has to slow you down.

How do you know that it is technical risk that is slowing you down?

A team will slow down when people have to spend too much time debugging and fixing things – especially fixing things in the same part of the system, or fixing the same things in different parts of the system. When you see the same bugs or the same kind of bugs happening over and over, you know that you have a debt problem. When you start to see more problems in production, especially problems caused by regressions or manual mistakes, you know that you are over your head in debt. When you see maintenance and support costs going up – when everyone is spending more time on upgrades and bug fixing and tuning than they are on adding new features, you're running in circles.

The 80:20 rule for paying off Technical Debt

Without careful attention, all code will get worse over time, but whatever problems you do have are going to be worse in some places than others. When it comes to paying back debt, what you care about most are the hot spots:

Code that is complex and

Code that changes a lot and

Code that is hard to test and

Code that has a history of bugs and problems.

You can identify these problem areas by reviewing check-in history, mining your version control system (the work that Michael Feathers is doing on this is really cool) and your bug database, through static analysis checks, and by talking with developers and testers.

This is the code that you have to focus on. This is where you get your best return on investment from paying down technical debt. Everything else is good hygiene – it doesn't hurt, but it won’t win the game either. If you’re going to pay down technical debt, pay it down smart.

Thursday, September 13, 2012

Estimating remains one of the hardest problems in software development. So hard in fact that more people lately are advocating that we shouldn’t bother estimating at all.

David Anderson, the man behind Kanban, says that we should stop estimating, and that estimates are a waste of time. In his case study about introducing Kanban ideas at Microsoft, one of the first steps that they took to improve a team’s productivity was to get them to stop estimating and start focusing instead on prioritizing work and getting the important work done.

I believe that most estimation is waste and that it is more common to use estimation as a replacement for proper steering, and to use it as a whip on the developers, than it is to use it for its only valid purpose in Release Planning, which is more like "decide whether to do this project" than "decide just how long this thing we just thought of is going to take, according to people who don't as yet understand it or know how they'll do it”

Estimation is clearly "waste". It's not software…If estimation IS doing you some good, maybe you should think about it as a kind of waste, and try to get rid of it.

And, from others on the “If you do bother estimating, there’s no point in putting a lot of effort into it” theme:

Spending effort beyond some minutes to make an estimate "less wrong" is wasted time.
Spending effort calculating the delta between estimates and actuals is wasted time.
Spending effort training, working and berating people to get "less wrong" estimates is wasted time and damaging to team performance.

In “Software estimation considered harmful?” Peter Seibel talks about a friend running a startup, who found that it was more important to keep people focused and motivated on delivering software as quickly as possible. He goes on to say

If the goal is simply to develop as much software as we can per unit time, estimates (and thus targets), may be a bad idea.

He bases this on a 1985 study in Peopleware which showed that programmers were more productive when working against their own estimates than estimates from somebody else, but that people were most productive on projects where no estimates were done at all.

Seibel then admits that maybe “estimates are needed to coordinate work with others” – so he looks at estimating as a “tool for communication”. But from this point of view, estimates are an expensive and inefficient way to communicate information that is of low-quality – because of the cone of uncertainty all estimates contain variability and error anyways.

What’s behind all of this?

Most of this thinking seems to come out of the current fashion of applying Lean to everything, treating anything that you do as potential waste and eliminating waste wherever you find it. It runs something like: Estimating takes time and slows you down. You can’t estimate perfectly anyways, so why bother trying?

A lot of this talk and examples focus on startups and other small-team environments where predictability isn’t as important as delivering. Where it’s more important to get something done than to know when everything will be done or how much it will cost.

Do you need to estimate or not?

I can accept that estimates aren’t always important in a startup – once you’ve convinced somebody to fund your work anyways.

If you’re firefighting, or in some kind of other emergency, there’s not much point in stopping and estimating either – when it doesn’t matter how much something costs, when all you care about is getting whatever it is that you have to do done as soon as possible.

Estimating isn’t always important in maintenance – the examples where Kanban is being followed without estimating are in maintenance teams. This is because most maintenance changes are small by definition - maintenance is usually considered to be fixing bugs and making changes that take less than 5 days to complete. In order to really know how long a change is going to take, you need to review the code to know what and where to make changes. This can take up to half of the total time of making the change – and if you’re already half way there, you might as well finish the job rather than stopping and estimating the rest of the work. Most of the time, a rule of thumb or placeholder is a good enough estimate.

In my job, we have an experienced development team that has been working on the same system for several years. Almost all of the people were involved in originally designing and coding the system and they all know it inside-out.

The development managers triage work as it comes in. They have a good enough feel for the system to recognize when something looks big or scary, when we need to get some people involved upfront and qualify what needs to get done, work up a design or a proof of concept before going further.

Most of the time, developers can look at what’s in front of them, and know what will fit in the time box and what won’t. That’s because they know the system and the domain and they usually understand what needs to be done right away – and if they don’t understand it, they know that right away too. The same goes for the testers – most of the time they have a good idea of how much work testing a change or fix will take, and whether they can take it on.

Sure sometimes people will make mistakes, and can’t get done what they thought they could and we have to delay something or back it out. But spending a little more time on analysis and estimating upfront probably wouldn't have changed this. It’s only when they get deep into a problem, when they’ve opened the patient up and there’s blood everywhere, it’s only then that they realize that the problem is a lot worse than they expected.

We’re not getting away without estimates. What we’re doing is taking advantage of the team’s experience and knowledge to make decisions quickly and efficiently, without unnecessary formality.

This doesn't scale of course. It doesn’t work for large projects and programs with lots of inter-dependencies and interfaces, where a lot of people need to know when certain things will be ready. It doesn’t work for large teams where people don’t know the system, the platform, the domain or each other well enough to make good quick decisions. And it’s not good enough when something absolutely must be done by a drop dead date – hard industry deadlines and compliance mandates. In all these cases, you have to spend the time upfront to understand and estimate what needs to get done, and probably re-estimate again later as you understand the problem better. Sometimes you can get along without estimates. But don’t bet on it.

Tuesday, September 11, 2012

Developers need to know a lot in order to build secure applications. Some of this is good software engineering and defensive design and programming – using (safe) APIs properly, carefully checking for errors and exceptions, adding diagnostics and logging, and never trusting anything from outside of your code (including data and other people’s code). But there are also lots of technical details about security weaknesses and vulnerabilities in different architectures and platforms and technology-specific risks that you have to understand and that you have to make sure that you deal with properly. Even appsec specialists have trouble keeping up with all of it.

This is where OWASP’s Cheat Sheets come in. They provide a clear explanation of security problems, and tools and patterns and practical steps that you can follow to prevent them or solve them.

Some of the cheat sheets are easy for developers to understand and use right away. For example, the cheat sheets on common security problems like SQL injection and CSRF explain what these vulnerabilities are, and what works and what doesn’t to protect from them. Simple and practical advice from people who know.

There are also cheat sheets on basic development problems and requirements that you might think that you already understand – things that seem straightforward, but that need to be done carefully and correctly to make sure that your system is secure. Cheat sheets on how to do logging securely and the right way to use parameterized queries (prepared statements) and how to properly implement a Forgot Password feature, and on Session Management. Make sure that you read the cheat sheet on Input Validation - there’s a lot more to doing it right than you think.

Then there are cheat sheets on harder, uglier technical problems like secure cryptographic storage or what you have to do to avoid XSS. XSS is so ugly that there is also a second cheat sheet that tries to explain the problem and solutions in a simpler way; and another cheat sheet just on DOM-based XSS prevention; and a technical cheat sheet on XSS filter evasion to help test for XSS vulnerabilities.

The OWASP Cheat Sheets are shortcuts that take you straight to the explanation of specific problems and how to solve them, checklists that you can follow without demanding that you understand everything about appsec. It’s OK. Go ahead and cheat.

Tuesday, September 4, 2012

One of the things I like about devops is that it takes on important but neglected problems in the full lifecycle of a system: making sure that the software is really ready to go into production, getting it into production, and keeping it running in production.

Most of what you read and hear about devops is in online startups – about getting to market faster and building tight feedback loops with Continuous Delivery and Continuous Deployment.

But devops is even more important in keeping systems running – in maintenance and sustaining engineering and support. Project teams working on the next new new thing can gloss over the details of how the software will actually run in production, how it will be deployed and how it should be hardened. If they miss something the problems won’t show up until the system starts to get used by real customers for real business under real load – which can be months after the system is launched, by which time the system might already be handed over to a sustaining engineering team to keep things turning.

This is when priorities change. The system always has to work. You can’t ignore production – you’re dragged down into the mucky details of what it takes to keep a system running. The reality is that you can’t maintain a system effectively without understanding operational issues and without understanding and working with the people who operate and support the system and its infrastructure.

Developers on maintenance teams and Ops are both measured on

System reliability and availability

Cycle time / turnaround on changes and fixes

System operations costs

Security and compliance

Devops tools and practices and ideas are the same tools and practices and ideas that people maintaining a system also need:

Version control and configuration management to track everything that you need to build and test and deploy and run the system

Fast and simple and repeatable build and deployment to make changes safe and cheap

Monitoring and alerting and logging to make support and troubleshooting more effective

Developers and operations working together to investigate and solve problems and to understand and learn together in blameless postmortems, building and sharing a culture of trust and transparency

Devops isn’t just for online startups

Devops describes the reality that maintenance and sustaining engineering teams would be if they could be working in. An alternative to late nights trying to get another software release out and hoping that this one will work; and to fire fighting in the dark; and to ass covering and finger pointing; and to filling out ops tickets and other bullshit paperwork. A reason to get up in the morning.

Time-and-materials of course is a no-brainer, regardless of how the team works – do the work, track the time and other costs, and charge the customer as you do it. But it’s especially challenging for people who have to work within contract structures such as fixed price / fixed scope, which is the way that many government contracts are awarded and the way that a number of large businesses still contract development work.

The advice for Agile teams usually runs something like: it’s up to you to convince the purchaser to change the rules and accept a fuzzier, more balanced way of contracting, with more give-and-take. Something that fits the basic assumptions of Agile development: that costs (mostly the people on the team) and schedule can be fixed, but the scope needs to be flexible and worked out as the project goes on.

But in many business situations the people paying for the work aren’t interested in changing how they think and plan – it’s their money and they want what they want when they want it. They are calling the shots. If you don’t comply with the terms of the bidding process, you don’t get the opportunity to work with the customer at all. And the people paying you (your management) also need to know how much it is going to cost and when it is going to be done and what the risks are so they know if they can afford to take the project on. This puts the developers in a difficult (maybe impossible) situation.

Money for Nothing and Change for Free

Jeff Sutherland, one of the creators of Scrum, proposes a contract structure called “Money for Nothing and your Change for Free”. The development team delivers software incrementally – if they are following Scrum properly, they should start with the work that is most important to the customer first, and deliver what the customer needs the most as early as possible. The customer can terminate the contract at any point (because they’ve already got what they really need), and pay some percentage of the remainder of the contract to compensate the developer for losing the revenue that they planned to get for completing the entire project. So obviously, the payment schedule for the contract can’t be weighted towards the end of the project (no large payments on “final acceptance” since it may never happen). That’s the “money for nothing” part.

“Change for free” means that the customer can’t add scope to the project, but can make changes as long as they substitute work still to be done in the backlog with work that is the same size or smaller. So new work can come up, the customer can change their mind, but the overall size of the project remains the same, which means that the team should still be able to deliver the project by the scheduled end date.

To do this you have to define, understand and size all of the work that needs to be done upfront – which doesn’t fit well with the iterative, incremental way that Agile teams work. And it ignores the fact that changes still carry a price: the developers have to throw away the time that they spent upfront understanding what needed to be done enough to estimate it and the work that they went in to planning it, and they have to do more work to review and understand the change, estimate it and replan. Change is cheap in Agile development, but it’s not free. If the customer needs to make a handful of changes, the cost isn’t great. But it can become a real drag to delivery and add significant cost if a customer does this dozens or hundreds of times over a project.

Fixed Price and Fixed Everything Contracts

Fixed Price contracts, and especially what Alistair Cockburn calls Fixed-Everything contracts (fixed-price, fixed-scope and fixed-time too) are a nasty fact of business. Cockburn says that these contracts are usually created out of lack of trust – the people paying for the system to be developed don’t trust the people building the software to do what they need, and try to push the risk onto the development team. Even if people started out trusting each other, these contracts often create an environment where trust breaks down – the customer doesn’t trust the developers, the developers hide things from the customer, and the people who are paying the developers don’t trust anybody.

But it’s still a common way to contract work because for many customers it is easier for them to plan around and it makes sense for organizations that think of software development projects as engineering projects and that want to treat software engineering projects the same way as they do building a road or a bridge. This is what we told you we want, this is when we need it, that’s how much you said it was going to cost (including your risk and profit margin), we agree that’s what we’re willing to pay, now go build it and we’ll pay you when you get it done.

Cockburn does talk about a case where a team was successful in changing a fixed-everything contract into a time-and-materials contract over time, by working closely with the customer and proving that they could give the customer what they needed. After each delivery, the team would meet with the customer and discuss whether to continue with the contract as written or work on something that customer really needed instead, renegotiating the contract as they went on. I’ve seen this happen, but it’s rare, unless both companies do a lot of work together and the stakes of failure on a project are low.

Martin Fowler says that you can’t deliver a fixed price, fixed time and fixed scope contract without detailed, stable and accurate requirements – which he believes can’t be done. His solution is to fix the price and time, and then work with the customer to deliver what you can by the agreed end date, and hope that this will be enough.

Their advice: avoid fixed-priced, fixed-scope (FPFS) contracts, because they are a lose-lose for both customer and supplier. The customer is less likely to get what they need because the supplier will at some point panic over delivery and be forced to cut quality; and if the supplier is able to deliver, the customer has to pay more than they should because of the risk premium that the supplier has to add. And working this way leads to a lack of transparency and to game playing on both sides.

But, if you have to do it:

Obviously it’s going to require up-front planning and design work to understand and estimate everything that has to get done – which means you have to bend Agile methods a lot.

You don’t have to allow changes – you can just work incrementally from the backlog that is defined upfront. Or you can restrict the customer to only changing their mind on priority of work to be done (which gives them transparency and some control), or allow them to substitute a new requirement for an existing requirement of the same size (Sutherland’s “Change for Free”).

To succeed in this kind of contract you have to:

Invest a lot to do detailed, upfront requirements analysis, some design work, thorough acceptance test definition and estimation – by experienced people who are going to do the work

Make sure that you understand the problem you are working on – the domain and technology

Deliver important things early and hope that the customer will be flexible with you towards the end if you still can’t deliver everything.

PMI-ACP on Agile Contracting?

For all of the projects that have been delivered using Agile methods, contracting seems to be still a work in progress. There are lots of good ideas and suggestions, but no solid answers.

I’ve gone through the study guide materials for the PMI-ACP certification to see what PMI has to say about contracting in Agile projects. There is the same stuff about Sutherland’s “Money for Nothing and your Change for Free” and a few other options. It’s clear that the PMI didn’t take contracting in Agile projects on as a serious problem. This means that they missed another opportunity to help large organizations and people working with large organizations (the kind of people who are going to care about the PMI-ACP certification) to understand how to work with Agile methods in real-life situations.

Subscribe to this blog

About Me

I am an experienced software development manager, project manager and CTO focused on hard problems in software development and maintenance, software quality and security. For the last 15 years I have managed teams building and operating high-performance financial systems.
My special interest is how small teams can be most effective in building real software: high-quality, secure systems at the extreme limits of reliability, performance, and adaptability. Software that has to work, that is built right, and built to last.
I use this blog to explore ideas and problems in software development that are important to me. To reflect and to find new answers.