Monday, July 30, 2012

There are a lot of different factors that impact how long it could take to find and fix a bug. Some of them we’ve already gone over. How good the bug report is – can you understand it, does it include steps to reproduce the problem. And how old the report is – how much could have changed since then, how much will people remember if you need to ask them for details. How much work is involved in setting up a test environment to reproduce the bug. Whether you can reproduce the bug or not.

There are other factors too. The kind of bug. The size of the code base, how old it is, how ugly it is, and how brittle it is - will it break if you try to change it. How much experience you have with the language, and how well you understand the code, did you write it or at least work on it before. How good your tools are for debugging and profiling and navigating the code and refactoring. What kind of tests you have in place to catch regressions. And how good you are at problem solving, at narrowing in on the problem, and then coming up with a safe and clean fix - especially in code that you didn’t write and don’t understand.

Gaps and Blind Spots

Cause/Effect gap. How far is the failure removed from the actual defect in the code? Sometimes the code fails exactly where the problem is. Other times, you can see that something is broken, but you can’t trace it back to where the problem occurred.

How hard or expensive it is to duplicate the bug. Does it involve other systems? How much work is involved in setting up a test system? Does the bug only show up after running the system for a long time under heavy load, or with a lot of concurrent sessions? Is the problem intermittent? Is it configuration-specific – and, if so, do you have access to that configuration? Are you familiar with the tools, and do your debugging or tracing tools show you anything useful? When you enable the debugger, does the problem go away (the Heisenbug problem)?

Faulty assumptions. What is wrong and what you think is wrong are very different, or you don’t really know enough about the platform or language to understand what’s wrong or where to look so you are starting off wrong. You’ve got a blind spot, and unless you get help from somebody else to see it, you’re not going to be able to fix this problem. Knowledge and stubbornness are both important factors here. You have to know enough to know where to start looking, and you have to be stubborn enough not to give up. But stubbornly sticking with a hypothesis for too long, even – especially – if it’s a good hypothesis, will keep you from moving forward and finding the bug.

Some bugs are easier to fix than others

How long it takes to fix a bug can depend on what kind of bug it is. The average time to fix a security bug: 10 hours. Design bug: 8.5 hours. Data bug: 6.5 hours. Coding bug: only 3 hours. Invalid bugs: 4.75 hours – it takes 4.75 hours on average to figure out that a bug is in fact not a bug, but only 3 hours to fix a coding bug! Wrap your head around that. Duplicate: 1 hour (to figure out that this bug has already been reported so you can ignore it too).

But there’s still the long tail that we talked about earlier: the maximum time to fix any of these kinds of bugs can be >10 times the average. Bugs that cannot be reproduced are the hardest – on average these kinds of bugs take up to 40 hours to fix, if they can be fixed at all.

Costs by Severity

How severe the bug is can also affect how long it takes to find and fix. Critical Severity 1 bugs (the system is down or your biggest customer - or your CEO - is screaming at you because something isn't right or your database has been compromised by an attacker) are fixed in an average of 6 hours. Major bugs (Severity 2) take 9 hours. Minor (Severity 3) bugs take 3 hours, and trivial bugs (Severity 4) only 1 hour, if you bother to fix them at all.

What’s interesting is that the most severe bugs (Severity 1) take less time to fix than other major bugs – probably because when the system is down it gets immediate attention from your best people to contain the damage. But fixing a bug like this fast doesn't mean that it is cheap. There are other costs indirectly associated with a Severity 1 emergency, including operations support costs for incident management and escalation, and Root Cause Analysis to figure out what went wrong in the first place, and whatever follow-up actions you need to take to ensure that a problem like this doesn't happen again. Critical bugs are never cheap to fix, at least not if you fix them properly.

It depends on when you find the bug

All of this data assumes that you are fixing bugs found in production. Everyone knows that the earlier that you find a bug in the development cycle, the cheaper it is to fix – if an automated test or static analysis check reports a bug in code that you just changed, of course you can fix it immediately. The famous rule “finding and fixing a software problem after delivery is often 100 times more expensive than finding and fixing it early in development” applies. Or does it?

In “What we have Learned about Fighting Defects”, different studies show that the 100:1 rule of thumb applies for severe and critical defects (because of the direct and indirect costs involved). But for non-severe bugs, the effort multiplier is much lower: as low as only 2:1. This is especially the case for teams working in a Continuous Deployment model, where the boundaries between development, testing and production are blurred, and where the costs and time required to push a fix out to production are minimal.

How old and how big the code base is

The cost to fix bugs also depends on how old the system is, and how old the bugs are. In the first few months of operation bugs will be found quickly and usually fixed quickly, often by the same programmer who wrote the code. As time goes on it gets harder to find and harder to fix bugs, partly because the no-brainers, the more common and obvious and easier-to-reproduce problems, have already been reported and fixed, and now you’re left with more difficult edge cases or timing problems. And partly because over time there’s less chance that the programmer fixing the code is the same programmer who wrote it, so it takes longer simply for whoever has to fix the problem to get their head into the game.

And it depends on how big the system is. Bigger systems have more bugs, and it costs more to fix bugs in big systems, especially really big systems. Severity 2 bugs (the hardest to fix on average) take an average of 9 hours or less to fix in systems up to 1,000 function points in size (around 50,000 lines of Java code give or take). But in much bigger systems (500,000 or more lines of code) the average time goes up to 12 hours. In the biggest systems (another order of magnitude bigger) it can take an average of 24 hours to fix the same kind of bug.

Next, I want to look at the value of programmer experience when it comes to fixing bugs.

It’s hard to know how long it’s going to take to fix a bug, especially if you don’t know the code. James Shore points out in The Art of Agile that obviously before you can fix something, you have to figure out what’s wrong. The problem is that you can’t estimate accurately how long it will take to find out what’s wrong. It’s only after you know what’s wrong that you can reasonably estimate how long it will take to fix it. But by then it’s too late. According to Steve McConnell:

“finding the defect – and understanding it – is usually 90 percent of the work.”

A lot of bug fixes are only one line changes. What takes the time is figuring out the right line to change – like knowing where to tap the hammer, or when and where the fish will be biting. Some bugs are easy to find and easy to fix. Some bugs are hard to find, but easy to fix. Other bugs are easy to find and hard to fix. And some bugs can’t be found at all, so they probably can’t be fixed. Unless you wrote the code recently, you probably have no idea which kind of bug you’re being asked to work on.

Finding and Fixing a Bug

Let’s look at what’s involved in finding and fixing a bug. In Debug It! Paul Butcher does a good job of describing the steps that you need to go through, in a structured and disciplined way that will be familiar to experienced programmers:

Make sure that you know what you’re looking for. Review the bug report, see if it makes sense, make sure it really is a bug and that you have enough information to figure the problem out and to reproduce it. Check if it has already been reported as a duplicate, and if so, what the guy before you did about it, if anything.

Clear the decks – find and check out the right code, clean up your workspace.

Set up your test environment to match. This can be trivial, or impossible, if the customer is running on a configuration that you don’t have access to.

Make sure that you understand what the code is supposed to do, and that your existing test suite passes.

Now it’s time to go fishing. Reproduce and diagnose the bug. If you can’t reproduce it, you can’t prove that you fixed it.

Write new (failing) developer tests or fix existing tests to catch the bug.

Make the fix – and make sure that you didn’t break anything else. This may include some refactoring work to understand the code better before you make the fix, so that you can do it safely. And regression testing afterwards to make sure that you didn’t introduce any new bugs.

Try to make the code safer and cleaner if you can for the next guy, with some more step-by-step refactoring. At least make sure that you don’t make the code more brittle and harder to understand with your fix.

Get the fix reviewed by somebody else to make sure that you didn’t do something stupid.

Check the fix in.

Check to see if this bug needs to be fixed in any other branches if you aren’t working from the mainline. Merge the change in, deal with differences in the code, and go through all of the same reviews and tests and other work again.

Stop and think. Do you understand what went wrong, and why? Do you understand why your fix worked? Where else should you look for this kind of bug? In The Pragmatic Programmer, Andy Hunt and Dave Thomas also ask “If it took a long time to fix this bug, ask yourself why”, and what can you do to make debugging problems like this easier in the future? How can you improve the approach that you took, or the tools that you used? How deep you go depends on the impact and severity of the bug and how much time you have.
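The "write a failing test, then make the fix" steps above can be sketched in a few lines of Python. This is only an illustration, not something from Debug It!: the function `percent_complete`, the imagined bug report (a crash when a project has no tasks) and both test names are made up for the example.

```python
def percent_complete(done: int, total: int) -> float:
    """Hypothetical function under repair.

    Imagined bug report: ZeroDivisionError when a project has no tasks.
    The fix treats an empty project as 0% complete.
    """
    if total == 0:  # the one-line fix; remove it and the first test fails
        return 0.0
    return 100.0 * done / total


def test_empty_project_does_not_crash():
    # Written first, to reproduce the reported failure before fixing it.
    assert percent_complete(0, 0) == 0.0


def test_existing_behaviour_is_unchanged():
    # Regression check: the fix must not break the normal case.
    assert percent_complete(1, 4) == 25.0


if __name__ == "__main__":
    test_empty_project_does_not_crash()
    test_existing_behaviour_is_unchanged()
    print("regression tests passed")
```

The point of writing the first test before the fix is that you see it fail for the reason in the bug report, so a later pass actually proves something. The second test is the cheap insurance that the fix didn't break anything else.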

What takes longer, finding a bug, or fixing it?

The time needed to set up a test environment, reproduce the problem or test it may far outweigh the amount of time that it takes to find the problem in the code and fix it. But for a small number of bugs, it’s not how long it takes to find it – it’s what’s involved in fixing it.

In Making Software, the chapter Where Do Most Software Flaws Come From?, Dewayne Perry analyzed how hard it was to find a bug (understand it and reproduce it) compared to how long it took to fix it. The study found that most bugs (almost 3/4) were easy to understand and find and didn’t take long to fix: 5 days or less (this was on a large-scale real-time system with a heavyweight SDLC, lots of reviews and testing). But there’s a long tail of bugs that can take much longer to fix, even bugs that were trivial to find:

| Find/Fix Effort | <=5 Days to Fix | >5 Days to Fix |
|---|---|---|
| Problem can be reproduced | 72.5% | 18.4% |
| Hard to Reproduce or Can't be Reproduced | 5.9% | 3.2% |

So you can bet when you find a bug that it’s going to be easy to fix. And most of the time you’ll be right. But when you’re wrong, you can be a lot wrong.
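To put rough numbers on that bet, here is a small sketch that only re-adds the four percentages from the table above. The dictionary layout and the helper names `share` and `slow_fix_rate` are my own, not from the study.

```python
# Percentages copied from the find/fix table above.
effort = {
    ("reproducible", "<=5 days"): 72.5,
    ("reproducible", ">5 days"): 18.4,
    ("hard to reproduce", "<=5 days"): 5.9,
    ("hard to reproduce", ">5 days"): 3.2,
}


def share(find=None, fix=None):
    """Sum the cells that match the given row and/or column label."""
    return sum(
        pct
        for (row, col), pct in effort.items()
        if (find is None or row == find) and (fix is None or col == fix)
    )


def slow_fix_rate(find):
    """Of the bugs in one row, the percentage that took more than 5 days."""
    return 100.0 * share(find, ">5 days") / share(find)


if __name__ == "__main__":
    print(f"took more than 5 days overall: {share(fix='>5 days'):.1f}%")
    print(f"slow fixes among reproducible bugs: {slow_fix_rate('reproducible'):.1f}%")
    print(f"slow fixes among hard-to-reproduce bugs: {slow_fix_rate('hard to reproduce'):.1f}%")
```

Read this way, about one bug in five (21.6%) took more than 5 days. Among bugs that could be reproduced, roughly 20% were slow fixes, against roughly 35% of the ones that were hard or impossible to reproduce.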

In subsequent posts, I am going to talk more about the issues and costs involved in reproducing, finding and fixing bugs, and how (or whether) to estimate bug fixes.

Monday, July 23, 2012

Everyone knows the C-I-A triad for information security: security is about protecting the Confidentiality, Integrity and Availability of systems and data.

In a recent post, Warren Axelrod argues that Availability is the most important of these factors for security, more important than Integrity and Confidentiality – that C-I-A should be A-I-C.

I don't agree.

Protecting the Confidentiality of customer data and sensitive business data and system resources is a critical priority for information security. It’s what you should first think about in security.

And protecting the Integrity of data and systems, through firewalls and network engineering and operational hardening, and access control, data validation and encoding, auditing, digital signatures and so on is the other critical priority for information security.

Availability is a devops problem, not a security problem

Axelrod makes the point that it doesn’t matter if the Confidentiality or Integrity of data is protected if the system isn’t available, which is true. But Availability is already the responsibility of application architects and operations engineers. Their job is designing and building scalable applications that can handle load surges and failures, and architecting and operating technical infrastructure that is equally resilient to load and failures. Availability of systems is one of the key ways that their success is measured and one of the main things that they get paid to take care of.

Availability of systems and data is a devops problem that requires application developers and architects and operations engineers to work together. I don’t see where security experts add value in ensuring Availability – with the possible exception of helping architects and engineers understand how to protect themselves from DDOS attacks.

Thursday, July 19, 2012

Rohit Sethi talks about how to take a pragmatic and lightweight approach to application security threat modeling, in the most recent interview in the SANS AppSec "Ask the Expert" series.
You can read the interview with Rohit here.

Monday, July 16, 2012

In Agile Estimating and Planning, Mike Cohn explains the different factors that go into prioritizing work on a software development project: financial value, cost, knowledge and risk. He then works through a couple of examples to show how these decisions are made. One of these examples is whether or not to build a security framework for an application – an example that I want to explore in more detail here.

Financial value / return

The first factor to consider when prioritizing work is the financial return that the customer or business will get from a feature: how much money will the customer/business earn or save from this feature over a period of time. For some features the business case is clear. In other cases, you have to do some creative accounting work to invent a business case, or decide based on other subjective business factors – often, how badly somebody important wants something.

Based on financial return, there is no clear business case for having a better/simpler security capability in the application by implementing a security framework. Security, and how it is implemented, is usually hidden from the business – it’s not something that they can see or use or make money from. You could try to argue that by improving the security of an application you could save on direct and indirect future costs of security breaches, and try to build a financial model around this. But unless the company or a close competitor has recently suffered from an expensive and embarrassing security problem, it is difficult to make a convincing case based on some future possibility.

Cost of development

How long will it take, how much will it cost to build it – and does this cost change over time? Is it cheaper to build it in early, or is it better to wait until later when you understand more about the problem and how to solve it properly?

Will it cost more to build security in later? Maybe. But probably not – because unless there’s a management or compliance requirement to make sure that everything is done correctly, there’s a chance that some of the security work won’t get done at all.

If you do decide to do the work early, there’s a greater chance that you will have to come back and change this code again later because of new requirements. So this adds to the future cost of development, rather than saving money and time.

Knowledge – how much do we learn by working on this

Another important factor to the development team is how much they might learn about the problem space or design, or how much they might learn about their ability to deliver the project, by working on something sooner rather than later. This is the point of technical spikes and prototyping – to learn and to reduce uncertainty.

As Mike Cohn points out, implementing a security framework doesn’t add much if anything to the team’s knowledge about the problem space or the design of the product. And it doesn’t answer any important questions for the team about their ability to deliver the project, about the technology platform or their tools and practices. It’s unlikely that whatever the team might learn from implementing a security framework will be material to the project’s success.

Risk

Of course there are risks to not implementing security correctly from the beginning. But the risks that we’re most concerned about when making prioritization decisions like this are schedule risks (how long is it going to take to deliver the key features, can we hit deadlines) and fundamental technical risks (is the architecture sound, will the system perform and stay up under load).

“Is there a risk to the project’s success that can be reduced or eliminated by implementing security earlier rather than later?”

The answer in this case is: probably not.

The project will still succeed – you’ll be able to deliver the project without a security framework. Bad things may happen later when the system is live, assuming that without a framework the team won’t do as good a job securing the app, but that won’t stop the project from getting delivered.

Security isn’t a feature

Using these decision factors (financial value, cost, knowledge and risk), there’s no compelling reason to build a security framework into the application. It doesn’t make the business any money, it doesn’t save time, it doesn’t tell us anything useful about the project or the problem domain, it doesn’t reduce project risk or technical risk in a significant way.

There’s nothing wrong with prioritizing features this way. What’s wrong is treating a security framework as a feature.

The decision on whether to implement a security framework shouldn’t be made based on any of these factors. Security isn’t a feature that can be traded off with other features, or cut because of cost. In the same way that data integrity isn’t a feature. Making sure that the system writes data correctly and consistently to the database or files, that data isn’t lost or partially updated or duplicated if something goes wrong, is a necessary part of writing a system. You don’t decide to do this now or never based on “customer value” or cost or time or whether you will learn something by doing this.
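As a concrete picture of what "isn’t lost or partially updated" means, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` and `balances` helpers and the account names are all hypothetical, chosen just to show an all-or-nothing update.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` between two accounts, all-or-nothing."""
    # Using the connection as a context manager commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


# Hypothetical in-memory data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

transfer(conn, "alice", "bob", 30)
after_ok = balances(conn)

try:
    transfer(conn, "alice", "bob", 500)
except ValueError:
    pass  # the failed transfer must leave no trace
after_failed = balances(conn)

print(after_ok)
print(after_failed)
```

Both lines print the same balances: the second transfer debits the source account, fails the check, and the rollback undoes the debit. Nobody asks the customer whether this is worth doing, which is the point of the comparison with security.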

Whatever security is required by the type of system and the business needs to be architected in. It's not a product management decision. It’s part of the development team’s job to do this right. They can’t depend on other people, on the customer or a product manager to decide whether they should implement security sooner or later or not at all.

Thursday, July 12, 2012

At Devopsdays I listened to a lot of smart people saying smart things. And to some people saying things that sounded smart, but really weren’t. It was especially confusing when you heard both of these kinds of things from the same person.

Sussman is a smart guy who knows about testing. He wanted to get some important messages across about how to do testing and QA in a Continuous Deployment environment. If you’re releasing small changes to production 20 or 30 or 40 times a day, there’s no time for manual testing and QA. So you have to push a lot of responsibility back to the development team to review and test their own work. Testers can still look for problems through exploratory testing and system testing, but they won’t be able to find bugs in time – the code may already be out by the time that they get a chance. So you need to be prepared for problems in production, and to respond to them.

The Smart Things

“The fewer assumptions that developers make, the better everything works”.

A developer’s job is to keep things as simple as they can, and understand and validate their assumptions by reviewing and testing and profiling their code. A tester’s job is to invalidate assumptions – through boundary testing and exploratory testing and things like fuzzing and penetration testing.

“There are a finite number of detectable bugs”.

Everyone should be prepared for bugs in production – it’s irresponsible not to.

“QA should be focusing on risk, not on giving management false confidence that the site won’t go down”.

“It’s more important to focus on resilience than “quality”: readable code, reasonable test coverage, a sane architecture, good tools, an engineering culture that values refactoring.”

The Things that Sounded Smart, but Weren’t

“The whole idea of preventive QA or preventive testing is a fraud”.

Try telling that to the people testing flight control software or people testing medical equipment, or critical infrastructure systems, or….

“Assurance is a terrible word. Let’s discard it”.

Ummmm. Assurance probably means zip to online web startups. But what about industries where assurance is a regulatory requirement, for good reasons?

“There’s no such thing as a roll-back. You just have to deal with what’s deployed right now”.

This was repeating something presented at Velocity, that roll-back is a myth, because you can’t go back in time or whatever. I’ve responded to this crap before, so I won’t bother going into it in detail again here, except to say that people need to understand that there are times when rolling forward is the right answer, and there are other times when rolling back is a better answer, and that if you’re doing a responsible job of managing an online business, you had better understand this and be prepared to do both.

“Real-time monitoring is the new face of testing” and “monitoring is more important than unit testing”.

I get the point that Monitoring Sucks, that developers have to put more thought and attention into monitoring and metrics – this is something that we’re still learning at my shop after putting several years of hard work into it. But there is NFW that a developer should put monitoring before unit testing, unless maybe you are trying to launch an online consumer web startup that doesn’t handle any money or private data and that doesn’t connect with any systems that do, and you need to get 1.0 up before you run out of money – and as long as you recognize that what you are doing is stupid but necessary and that you will need to go back and do it again right later.

If you’re in an environment where you depend on developers to do most or all of the testing, and then tell them they should put monitoring in front of testing, then you aren’t going to get any testing done. How is this possibly supposed to be a good thing?

Monitoring as Testing Sucks

Unfortunately I missed most of a follow-up Open Space session on Continuous Deployment and quality. I did come in time to hear another smart person say:

“The things that fail are the things that you didn’t test”

By which I think he meant that you wasted your time testing the things that worked when you should have been testing the things that didn’t. But of course, if the things that fail are the things that you didn’t test, and you didn’t test anything, then everything will fail. Maybe he meant both of these things at the same time, a kind of Zen koan.

Gene Kim promised to write up some notes from this session. Maybe something came out of this that will make some sense out of the “Monitoring as Testing” idea.

I like a lot of what I heard at Devopsdays, it made me re-think how we work and where we put our attention and time. There’s some smart, challenging thinking coming out of devops. But there’s also some dangerously foolish and irresponsible smart-sounding noise coming out at the same time – and I hope that people will be able to tell the difference.

Monday, July 9, 2012

Appsec and Devops are trying to solve many of the same kinds of problems, trying to get developers and operations working together to build safer and more secure and more reliable and more resilient systems. But the way that devops is doing this is very different.

Friday, July 6, 2012

At the Devopsdays conference in Mountain View, Spike Morelli led an Open Space discussion on the importance of culture. He was concerned that when people think and talk about devops they think and talk too much about tools and practices, and not enough about culture and collaboration and communication, not enough about getting people to work closely together and caring about working closely together – and that this is getting worse as more vendors tie themselves to devops.

At the Open Space we talked about the problem of defining culture and communicating it and other “soft and squishy” things. What was important and why. What got people excited about devops and how to communicate this and get it out to more people, and how to try to hold onto this in companies that aren’t agile and transparent.

Culture isn’t something that you build by talking about it

Culture is like quality. You don’t build culture, or quality, by talking about it. You build it by doing things, by acting, by making things happen and making things change, and reinforcing these actions patiently and continually over time.

It’s important to talk – but to talk about the right things. To tell good stories, stories about people and organizations making things better and how they did it. What they did that worked, and what they did that didn’t work – how they failed, and what they learned by failing and how they got through it, so that others can learn from them. This transparency and honesty is one of the things that makes the devops community so compelling and so convincing – organizations that compete against each other for customers and talent and funding openly share their operational failures and lessons learned, as well as sharing some of the technology that they used to build their success.

You need tools to make Devops work

Devops needs good tools to succeed. So does anything that involves working with software. Take Agile development. Agile is about “people over process and tools”. You can't get agile by putting in an Agile tool of some kind. Developers can, and often prefer to, use informal and low-tech methods, organize their work on a card wall for example. But I don’t know any successful agile development teams that don’t rely on a good continuous integration platform and automated testing tools at least. Without these tools, Agile development wouldn’t scale and Agile teams couldn’t deliver software successfully.

The same is even more true for devops. Tools like Puppet and Chef and cfengine, and statsd and Graphite and the different log management toolsets are all a necessary part of devops. Without tools like these devops can’t work. Tools can’t make change, but people can’t change the way that they work without the right tools.

Devops doesn’t have a culture problem – not yet

From what I can see, devops doesn’t have a culture problem – at least not yet.
Everyone who has talked to me about devops (except for maybe a handful of vendors who are just looking for an angle), at this conference or at Velocity or at other conferences or meetups or in forums, all seem to share the same commitment to making operations better, and are all working earnestly and honestly with other people to make this happen.

I hear people talking about deployment and configuration and monitoring and metrics and self-service APIs and about automated operations testing frameworks, and about creating catalogs of operations architecture patterns. About tools and practices. But everyone is doing this in a serious and open and transparent and collaborative way. They are living the culture.

As Devops ideas catch on and eventually go mainstream – and I think it will happen – devops will change, just like Agile development has changed as it has gone mainstream. Agile in a big, bureaucratic enterprise with thousands of developers is heavier and more formal and not as much fun. But it is still agile. Developers are still able to deliver working software faster. They talk more with each other and with the customer to understand what they are working on and to solve problems and to find new ways to work better and faster. They focus more on getting things done, and less on paper pushing and process. This is a good thing.

As bigger companies learn more about devops and adopt it and adapt, it will become something different. As devops is taken up in different industries outside of online web services, with different tolerances for risk and different constraints, it will change. And as time goes on and the devops community gets bigger and the people who are involved change and more people find more ways to make more money from devops, devops will become more about training and certification and coaching and consulting and more about commercial tools or about making money from open source tools.

Devops will change – not always for the better. That will suck. But at its core I think devops will still be about getting dev and ops together to solve operational problems, about getting feedback and responding to change, about making operations simpler and more transparent. And that’s going to be a good thing.


About Me

I am an experienced software development manager, project manager and CTO focused on hard problems in software development and maintenance, software quality and security. For the last 15 years I have managed teams building and operating high-performance financial systems.
My special interest is how small teams can be most effective in building real software: high-quality, secure systems at the extreme limits of reliability, performance, and adaptability. Software that has to work, that is built right, and built to last.
I use this blog to explore ideas and problems in software development that are important to me. To reflect and to find new answers.