Monday, November 22, 2010

Given that most of us will spend most of our careers maintaining and supporting software and not building something new from scratch, there is a surprising and disappointing lack of useful information on software maintenance and on the problems that we all face in it. I’ve tried to put together here an index to useful books, articles, blog posts and white papers on software maintenance.

Last updated: November 13, 2013

Blog Posts and Articles on Software Maintenance

Jeff Atwood’s The Noble Art of Maintenance Programming is a thoughtful post on how maintenance programmers are unjustifiably unappreciated: how maintenance is much harder and more important than most people, even the people doing the work, understand.

Steve Kilner has an interesting blog called Notes on Software Maintenance. His technical focus is on maintaining RPG code on IBM iSeries computers, but he looks into a lot of basic problems and open questions in software maintenance and software maintenance management, including how to handle estimates and how to effectively manage risk and complexity in maintenance. You can get a good introduction to these problems in an article that he wrote for IBM Systems Journal, Why Good Software Goes Bad and How to Stop It, and in this slideshow on Unsolved Problems in Software Maintenance.

In Sustained engineering Idiocy, Microsoft’s Eric Brechner explores how to organize your maintenance team, and the pros and cons of juggling maintenance and support responsibilities with new development work. Although some of his writing is internally Microsoft-focused, his blog is worth following for general good ideas on software development management and maintenance.

There are some excellent books on writing and maintaining code, starting with Code Complete 2 by Steve McConnell: the definitive book on how to write good code, maintain it and debug it.

Legacy code means different things to different people. For some programmers, legacy code is any code that they didn't write. For others, it is code that still runs on abandoned or out-of-date platforms and technologies. But everyone agrees that it is code that is difficult to understand, difficult to support and difficult to change. And in maintenance you are going to spend a lot of time dealing with it. Working Effectively with Legacy Code by Michael Feathers lays out how to get a legacy code base under control, and the steps to make changes safely and properly. You can follow Michael Feathers' latest work on mining source code and "brutal refactoring" on his blog.

The Passionate Programmer by Chad Fowler offers some good advice to programmers (and managers) looking for meaning in software development and maintenance.

If you are confronted with the problem of reengineering a legacy system, check out Object Oriented Reengineering Patterns, which is available free online. The pattern approach is awkward at times, but this book offers some clear guidance on when and how to reengineer a system.

More on Legacy Code

There are also a couple of good online presentations from the smart people at ThoughtWorks on how they have dealt with legacy code problems. A presentation by Matt Simons and Jonny LeRoy covers how to get a handle on existing code, mapping the code and identifying risk areas, creating dashboards to visualize the state of the code base using open source tools like Panopticode and Codecity (which offers a cool 3D cityscape view of your code, making metrics fun for everyone), or Sonar (actually, they didn’t mention Sonar, but they should have). Then they outline different strategies for incrementally re-writing large applications, emphasizing Martin Fowler’s Strangler pattern, and some examples of how they have followed these ideas with different customers.

Another good presentation from Josh Graham at ThoughtWorks on Brownfield Software walks through an interesting case study where his team came in to help with a major overhaul of a large, 10-year-old core business system.

A Team, A System, Some Legacy... And You is an intelligent and practical presentation by Eoin Woods on how an architect can make a legacy system better. It covers how to do an architectural assessment of a legacy system, and how to improve the reliability, scalability and maintainability of the system through just enough software architecture, safe step evolution, improving testing and deployment, engaging in production, working with stakeholders and understanding the capability of the team.

Refactoring is a foundational discipline for maintenance – Martin Fowler put a name and a disciplined structure to the work that we all do, or should do, when changing code. Seriously, is there anybody who hasn’t read this book by now, who doesn’t follow some/most of these ideas in maintaining code?

Debug It! by Paul Butcher is a narrowly-focused but useful book, about debugging, problem-solving techniques, handling bugs, and writing code that is easy to debug. Especially useful for programmers new to maintenance.

Managing Maintenance

There are some great books on managing software development projects, like Steve McConnell's Rapid Development, and Scott Berkun's Making Things Happen. Many of the ideas in these books will help you in managing software maintenance work.

There are only a couple of books that take on managing software maintenance:

The best of these by far is Software Maintenance Success Recipes by Donald Reifer, a thoughtful analysis of current software maintenance practices (2012) based on the study of several large software maintenance organizations. It highlights the importance of testing in maintenance (up to 60-70% of time is spent testing and retesting changes to check for regressions, testing to reproduce bugs then testing to verify that bugs were fixed correctly...), and examines models and metrics for effectively managing maintenance.

Alain April and some other researchers have put together a Software Maintenance Maturity Model, a la the CMMI. It's obviously process-heavy and bureaucratic, but it does a thorough job of mapping out all of the work involved in software maintenance, and it's interesting in an abstract way. If you find this model useful, you can go through it in detail in the book Software Maintenance Management, although it is written by and for academics, and focuses more on defining the model than on how to deal with practical problems.

If you're managing a large enterprise maintenance team and working through a lot of structure and standards, then Thomas Pigoski's Practical Software Maintenance offers advice based on real experience - the author ran a large maintenance organization for the US Navy, and is now the CEO of TechSoft, an outsourcing firm that handles maintenance for other companies. Unfortunately this book is out of date (published 1996) and predates modern Agile and Lean methods and practices and new technologies. Pigoski also wrote the chapter on software maintenance in the IEEE SWEBOK, an academic definition of software engineering.

Another academic book which I can't recommend for practitioners is Software Maintenance: Concepts and Practice by Penny Grubb and Armstrong Takang. If you're already working in maintenance, you won't find anything here that you don't already know or that will change the way that you think or work. It's more of an introductory text for undergrad Comp Sci courses.

If I want any kind of data on software development or software maintenance, the first person that I go to is Capers Jones. His book on Estimating Software Costs was last updated in 2007, and includes extensive data on software cost information and statistics and trends on software maintenance and support. Or you can read Applied Software Measurement or Software Engineering Best Practices - all of these books are built on the same comprehensive research. You can spend days going through this stuff - I have, and I keep finding interesting things.

Some of his research specific to software maintenance and legacy software is captured in an excellent white paper called Geriatric Issues of Aging Software. A couple of different versions of this paper can be found on the web; the best of them is in CrossTalk: The Journal of Defense Software Engineering, dated December 2007.

One of the foundational studies on software maintenance is more than 30 years old now: Lientz and Swanson's study of software maintenance at 487 IT organizations. This study is widely referenced, and still the most comprehensive analysis of what work makes up maintenance (or at least what did in 1978). You can read a follow-up study on software maintenance management problems (1995) here. A newer (1998) but smaller study by Janice Singer looks at bugs and bug tracking, and how unimportant documentation actually is in maintenance.

Building, releasing and deploying code

And there are some good books on building, deploying and delivering code, on Continuous Integration and source code management, all of which are important in maintaining and releasing code. The best of these is the new book on Continuous Delivery by Jez Humble and David Farley. It’s full of practical advice on how to improve your code build and deployment practices, on data migration and test automation strategies, and more far-out ideas on how far you can push automation in build and deployment.

Lean and Kanban and Maintenance

If you are interested in how Lean Manufacturing and Kanban ideas can be applied to software maintenance, and quite a few people are, a place to start is the XIT case study which explains how Kanban was first used to speed up and simplify management of maintenance work by a small cross-application team at Microsoft. I think there’s too much hype and religious debate around what is a conceptually simple idea (visualizing and limiting work in progress), but Kanban offers a natural framework for getting control over maintenance and support work, helping to deal with interruptions and constantly changing priorities.
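The core idea mentioned above, visualizing and limiting work in progress, can be sketched in a few lines. This is only an illustration and not taken from the XIT case study: the `Column` class, the `render` helper and the limit of 2 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """One column on a hypothetical Kanban board."""
    name: str
    wip_limit: int                      # maximum items allowed at once
    items: list = field(default_factory=list)

    def can_pull(self) -> bool:
        # New work is accepted only while the column is under its limit.
        return len(self.items) < self.wip_limit

    def pull(self, item: str) -> bool:
        """Try to pull an item into this column; refuse if it is full."""
        if not self.can_pull():
            return False
        self.items.append(item)
        return True


def render(columns) -> str:
    """Return a one-line-per-column text view of the board."""
    return "\n".join(
        f"{c.name} ({len(c.items)}/{c.wip_limit}): {', '.join(c.items)}"
        for c in columns
    )


if __name__ == "__main__":
    doing = Column("In progress", wip_limit=2)
    doing.pull("fix login bug")
    doing.pull("patch report export")
    # The third request is refused until something else is finished.
    print(doing.pull("urgent support request"))
    print(render([doing]))
```

The refusal is the point: an interruption has to displace something already in progress, which makes the cost of changing priorities visible.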

Operations and DevOps

Developers need to understand how their code runs in production, need to care about and know about operations and performance and how to handle failures. This becomes even more important, more necessary, in maintenance: it’s a reality that maintenance teams face all day each day, rather than just another set of requirements for the future. Some smart people in the DevOps community are trying to come up with a more collaborative and agile way for operations and development teams to work together. Most of their work focuses on large-scale Web Operations problems and operations toolchains, but there is some good thinking here on how to improve deployment, on metrics and monitoring, and on dealing with architectural problems like scaling, and failure handling and recovery.

While helpful, most of the thinking in DevOps is being done from the operations point of view (maybe it should be called OpsDev instead…). The best book on operations problems by and for developers and software architects is still Release It! by Michael Nygard. He makes clear how important it is to, and how to, design scalable and resilient software architectures, and how and why to build an operations view into your system.

If you are working with Ops (especially in an enterprise shop) you should understand the basics of ITIL, and the cost and risks of change in production, and how to get a production system under control. The best place to start on this is The Visible Ops Handbook, a tiny primer on intelligent IT change management.

Dealing with Data

One of the biggest problems in software maintenance is handling changes to data and data models. Scott Ambler's book Refactoring Databases walks through an incremental approach to making structural changes to database models, and how to test and deploy these changes.

Secure Software Maintenance

Dealing with security risks for legacy systems is another can of worms, and you're not going to find a lot of information to help you understand where to start, other than this set of white papers on legacy systems and COTS from Cigital, available on the US Department of Homeland Security’s Build Security In portal.

Forums and Discussion Groups

I can't find any discussion groups on software maintenance or sustaining engineering.

Stack Overflow has some interesting threads on maintenance and refactoring and other maintenance problems. It’s one of the mainstay resources for developers looking for advice on maintenance issues or anything else for that matter.

I would like to hear from people about any other useful books, blogs, articles or forums that focus on software maintenance and support problems - what resources you trust, what you have found helpful.

Monday, November 15, 2010

I spent a few days last week in Seattle at the Software Executive Summit hosted by Construx Software Builders. This is a small conference for senior software development managers: well organized, world class keynote speakers, and the opportunity to share ideas in supportive and open roundtables with smart and experienced people who are all focused on similar challenges, some (much) bigger, some smaller. It was good to reflect and reset, and a unique chance to spend time with some of the top thinkers in the field, swapping stories over lunch with Fred Brooks, chatting over drinks with Tim Lister and Tom DeMarco. This was my second time at this conference, and based on these experiences, I would highly recommend it to other managers looking for new ideas.

Key takeaways for me:

A lot of people are concerned that we (software developers in general) continue to do a poor job upfront, on requirements and architecture and design. And that the push to Agile methods, on collaborative and emergent design, isn’t helping: if anything, the problem is getting worse. We are getting better and faster at developing and delivering mediocre software.

The push for offshoring of programming and testing work continues to come mostly from the top. Even with the economic downturn, companies are having problems filling positions locally; and having problems keeping resources overseas, with turnover rates as high as 35% annually in India. The companies who are successful with offshoring make long-term commitments with outsourcing firms; invest in a lot of back-and-forth travel; rely on strong and hard-working onshore technical managers from their outsourcing partners to coordinate with their offshore teams; spend a lot of time on documentation and reviews; and spend more time and money on building and maintaining relationships. There can be real advantages at scale: for smaller companies I don’t see the return.

Some companies have backed away from adopting Agile methods, or are running into problems with Agile projects, because the team is unwilling or unable to commit to a product roadmap or rough project definition. Customers and sponsors are frustrated, because they don’t understand what they can expect or when they can expect it. Executives don’t want to see demos: they want to know when the project will be delivered. Working in an Agile way can’t prevent you from giving the customer what they need to make business decisions and meet business commitments. This is another argument for adapting Agile methods, for doing at least some high-level design and roadmap planning upfront, agreeing on the shape of what needs to be done and on priorities before getting to work.

Productivity: there is no clear way to define productivity or compare it across teams doing different kinds of work. Data on velocity or earned value is useful of course on bigger projects to see trends or catch problems: you want to know if you are getting faster or slower over time. Tracking time on maintenance work (bug fixes, small changes) can help to highlight technical debt or other problems: but what if most of the work that you do is maintenance? We need better ways to answer simple management questions: Am I getting my money’s worth? Is this group well-managed?

Two of the keynote presentations looked at innovation, and two others explored related problems in design: design integrity, dealing with complexity in design. Better decisions are made by small teams, the smaller the better: the ideal team size may be just two people (Holmes and Watson). It's important to give people a safe place and time to think – uninterrupted, unstructured time, preferably time blocked off together so that people can collaborate and play off of each other's ideas. If people are always on the clock and always focused on delivery, always dealing with short-term issues, they won’t have the opportunity to come up with new ways of looking at things and better ways of working, and worse they may lose track of the bigger picture, forget to look up, forget what’s important.

Technical debt: it’s our fault, not the business’s fault. It’s our house, so keep it clean. We have to find time to do the job right. One company uses the idea of a “tech tax”, a small (10%) overhead amount that’s included in all work for upgrades and refactoring and re-engineering.
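As a rough illustration of the "tech tax" arithmetic, here is a minimal sketch. It assumes the tax is added on top of each estimate, which the talk did not specify; the function name and the 40-hour figure are made up for the example.

```python
def apply_tech_tax(estimate_hours: float, tax_rate: float = 0.10) -> dict:
    """Add a tech-tax allowance on top of a work estimate.

    Assumption: the tax is a surcharge on the estimate, not a share
    carved out of it. The 10% default mirrors the figure quoted above.
    """
    tax_hours = round(estimate_hours * tax_rate, 2)
    return {
        "feature_hours": estimate_hours,
        "tech_tax_hours": tax_hours,   # reserved for upgrades and refactoring
        "total_hours": round(estimate_hours + tax_hours, 2),
    }


if __name__ == "__main__":
    # A 40-hour change carries 4 hours of allowance for cleanup work.
    print(apply_tech_tax(40))
```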

The importance of building business cases for technical changes like re-architecture. Remember that business cases don’t have to be solid; they just need to be reasonable. Make sure that you have some data on cost and risk, and tie your case into the larger business strategy.

As a leader, you need to take responsibility. You have to deal with ambiguity: that's your job. Don't be a victim. Make a decision and execute.

Monday, November 1, 2010

Mike Rothman’s recent post on pen testing was interesting to me, since I come from the other side of the fence: the side of the software developers who wrote the code and the testers who test it and the project managers who are responsible for taking care of risk and the business managers who have to decide how much to spend on things like pen testing.

I’ve learned that pen testing costs money and takes time to do properly, especially at the start. The time to find a good pen testing team (and there are good pen testers out there, I’ve worked with some), to understand what they do and what they need from you to do a good job. The time to set up the tests, and to set up the test system and test accounts and test data, and to harden the test system to match production-quality so that you aren’t paying an expert to report basic patching and configuration problems that you can find yourself with a Nessus scan. Sure, they are going to scan anyway, but at least save yourself and them the trouble of going through those kinds of findings.

I’ve learned that it is important to work through the testing process together, to walk through your architecture and how the important features work, to ask and answer questions so that both sides clearly understand what is going on. To make sure that the test team has enough information to focus in on what’s important, and that they are not wasting their time and your money.

To make sure that the pen tester reviews any findings with you as they go along. That if they find something important, they tell you immediately so that you can act immediately, and so that they can help verify your fix. And if they get stuck or off track, that you help them with whatever information they need to keep going.

That you will learn what it really means to think like an attacker. You might think that you are thinking like an attacker, until you see an attacker at work. The way that they probe the system, the errors and information that they find useful, the soft spots that they look for – and what they can do if they find one.

And you will learn which problems are real, exploitable. One of the arguments that you will get from a developer is whether a fault or weakness can actually be exploited. It's a valid argument – risk assessment needs to take into account how easily any vulnerability can be exploited. Pen testing helps to answer this question: if they found it, the bad guys can and probably will too. And other vulnerabilities like it.

And to work through the findings together. To understand what vulnerabilities need to be addressed and how – what’s important, what’s not. And why. What to patch now, or soon, and how to do it properly.

And finally that it’s more important to look at pen testing for what it tells you about your software and your team, than as a compliance check-mark or a quality gate. To take it seriously and really learn from anything that’s found, rather than to just fix a few bugs and move on. That you need to stop and think and look more closely at your design, and at how you build software, and consider what’s important from a risk perspective to you and your customers. And understand what you need to do to do a better job. And then do it.

It would be foolish to expect too much from pen testing in your software security program. Just like it would be foolish to expect too much from static analysis or any other technique or technology. But from my experience at least, you will get out of it what you put into it.

Wednesday, October 20, 2010

I'm deeply interested in anything that will help me and help my team do a better job of building good software. But I’ve been around long enough to learn that there isn’t one way to build good software – that XP, or Scrum, or XBreed, or RUP, or Crystal, or CMMI, or TSP/PSP, or Scrumban, or any of yesterday’s or today's or tomorrow’s new better ways of making software, that none of these methods will give me all of the answers, solve all of my problems: or yours either. I’m a pragmatist. I’ve learned from experience and from thoughtful people like Steve McConnell a simple truth: that there are smart ideas and smart ways of doing things; and there are stupid ideas and stupid ways of doing things. That if you do more smart things and fewer stupid things, you will have a better chance of success. And that the hard part is recognizing that you are doing stupid things, and then having the discipline to stop doing these stupid things and do smart things instead.

I like to work with other pragmatists, people who are open-minded, who want to understand, who keep looking for what works. I long ago ran out of patience for orthodoxy and for people without the experience, creativity, patience and courage to question and learn, to find their own answers, and who believe that all you need to do is follow the One Right Way. And that if you do anything less, you will fail: that any house you build will be a house of straw (…that article still burns me up).

The people who put together Scrum, or XP, RUP, TSP/PSP, Software Kanban or these other ways to manage software development are smart, and they spent a lot of time thinking through the problems of how to build software, experimenting and learning. I look closely into how other people build software, I try to understand the decisions that they made and why they made them. But that doesn’t mean that what worked for them will be a perfect fit for my organization or your organization, my circumstances or yours.

And it’s clear that I’m not alone in thinking this way. A study published earlier this year by Forrester Research on Agile Development: Mainstream Adoption Has Changed Agility reported some interesting statistics on the state of the practice for software development methods, what development methods companies are following and how they are doing adopting new ideas:

Roughly 1/3 of developers and development managers are following Waterfall or Iterative methods such as RUP or Spiral

Another 1/3 are following Agile methods (Scrum, XP, Lean, FDD, …)

And 1/3 aren’t following any method at all, or if they are, they don’t know what it is.

Of the organizations adopting Agile methods, less than 1/3 stick to a particular Agile methodology as closely as possible. More than 2/3 deliberately mix different Agile practices; or incorporate Agile practices and ideas into “non-Agile” approaches. The report explains that:

“Perhaps the clearest sign of the mainstreaming of Agile is the abandonment of orthodoxy: Teams are puzzling out the mix of methodologies and combining them to fit within their organizational realities…”

The reasons for this are clear: most organizations find that it is better to learn from these methods and adapt them to their specific circumstances, their culture, their needs, their constraints. The report calls this “cherry-picking”, a “mix-and-match” approach to take the best bits from different methods and create a customized solution.

But I think that “cherry picking” is the wrong metaphor here. It’s more than just taking the “best bits”: it’s understanding why these practices work, and how they work together, how they support each other. Then focusing in on where you are experiencing problems, where you need to save costs, where you need to be faster or where you need to create some other advantage: where you will get the highest return. It’s taking what makes the most sense to you, what fits the best for your organization, what’s important. Understanding what you need and which practices (alone or together) can be applied to your needs; seeing what works, and then building on this to improve.

Some practices make plain good sense no matter where you work and how you work. Frequent builds, or better continuous build and continuous integration, are a no-brainer. Shortening development and delivery cycles, with more frequent checkpoints. Automated developer testing. But it doesn’t matter if you don’t do pair programming, if the developers in your organization don’t like it, won’t do it – as long as you understand why mainline XPers think that pair programming is useful and important, and that you find your own way to write good code, maybe through peer code reviews and static analysis (note I said “and”, not “or”: static analysis is not a substitute for code reviews). It doesn’t matter if what you are doing is XP or Scrum or Scrumban or that you can find a label for it at all. What matters is that it works, and that you are committed to making it work, and to learning, and to making it better.

Monday, October 4, 2010

Software security is a problem of managing risks. But most of the information on software risk management is about how to manage generic business, project and technical risks - not software security risks. Even the section on risk analysis in Gary McGraw’s book Software Security: Building Security In describes a generic risk management framework that can be used in a software security program. This is good, but most people involved in software development, and hopefully anyone who is responsible for software security, already understand the fundamentals of risk management. What people need to understand better is how to manage risks from a software security perspective, and how this is different from managing project risks or business risks.

This post is based on feedback that I shared with some of the team working on the next OWASP Development Guide, in my review of an early draft of the introductory section on risk management. I’m not sure if or how my feedback will be reflected in the new guide, since the project has recently undergone a change in leadership and there has been a lot of rethinking and resetting of expectations. So I have taken some time to put my thoughts together here and think through the issues some more.

Risk Management in Software Development

As software development managers or project managers, our focus (and our training) is on how to identify and manage risks that could impact the delivery of a project. We learn to avoid project failure through project management structure and controls, by following good practices, and by actively and continuously managing risks inside and outside of the project team, including:

Schedule and estimation risks

Requirements and scope risks – the problem of managing change

Cost and budget and financial/funding risks

Staffing and personnel risks: team skills and availability, turnover, subcontractors, dependencies on partners

Business strategy risks, portfolio risks, ROI/business case

Stakeholder and sponsorship risks – the basic politics of business

Program risks – interfaces and dependencies with other projects

Legal and contracting risks

Technical risks: architecture, platforms, tooling: how well do we understand them, are we on the bleeding edge (will it work).

Risk Management in Software Security

Discovering and managing risks from a security perspective is different: the perspective that you need to take is different, and the issues that you need to manage are different.

To find software security risks, you need to think beyond the risks of delivery and consider the operational and business context of the system. You need to look at the design of the system and its infrastructure, IT and business operations, your organization and its security posture, and your organization’s business assets.

Assets: What does the business have that is worth stealing or damaging?

Think like an attacker. Put yourself in the position of a motivated and skilled attacker: most of us can ignore the possibility of being attacked by script kiddies or amateurs who want to show off how smart they are. The bigger, more serious threat now comes from either disgruntled insiders or former employees, or from motivated professional criminals or even nation states.

Every business has something worth stealing or damaging. You need to understand what somebody would want, and how badly they might want it:

Are you handling financial payments, or other high-value transactions?

Information about buying patterns or other business activities that would be valuable to competitors or other outside parties.

Information about your financials, investments and spending, or your business plans and strategy.

Intellectual Property: research data, design work, or information not about what you are doing, but how you are doing it – your operational systems, supply chain management. Or the designs and algorithms that drive the technology platform – it may not be the data behind the system that is valuable, the target could be the system itself, your technical knowledge.

Start with data and other assets, then look at the systems that you are building, that you support.

Is the system a critical online service? In rare cases, the system could be part of critical infrastructure (electrical power transmission, or a core financial system such as the NYSE, or maybe an emergency notification system). Or you may be running a completely online business – if the system is down, your business stops. Such systems may be vulnerable to Denial of Service attacks or other attacks that could affect service over an extended period of time. Although DDOS attacks are so 2006, it’s worth remembering what happened to those Internet offshore betting systems held to ransom by distributed DOS attacks a few years ago…

Or attackers could use your system as a launch platform to attack other more valuable systems – by compromising and exploiting your connectivity and trust relationships with other, more strategic or valuable systems or organizations.

Starting early with the idea of identifying assets under threat naturally supports threat modeling later.

Attack Surface: Finding the open doors and windows

Once you know what’s valuable to bad guys, you need to consider how they could get it. Look at your systems, identify and understand the paths into and out of the system, and what opportunities are offered to an attacker. Do this through high-level attack surface analysis: walk around the house, check to see how many doors and windows there are, and how easy they are to force open. The idea behind attack surface analysis is straightforward: the fewer entry points, and the harder they are to access, the safer you are.

Most enterprise systems, especially e-commerce systems, have a remarkable number of interfaces to clients and to other systems, including shared services and shared data. Focus on public and remote interfaces. What is reachable or potentially reachable by outside parties, especially unauthenticated / anonymous access?

What other systems are we sharing information with and how? What do we know about these other systems, can we trust them, rely on them?

Is the application client-facing across a private network, or public-facing across the Internet? Do you offer an application API to partners or customers to build their own interfaces? A desktop client? A browser-based web client? How much dynamic content? Online ordering, queries? How rich an experience does your client provide, using what technologies: Ajax, Flash, Java, Silverlight, …? What about mobile clients?

How much personalization and customization do you offer to your customers: more options and more combinations to design, test and review, more chances to miss something or get something wrong.

Then you look behind the attack surface to the implementation details for each interface, at the technology stack, the trust boundaries, look at the authorization and authentication controls, trace the flow of control and flow of data. And do this each time you make changes or create a new type of interface. This is the basis of threat modeling.

Capabilities and Countermeasures: How hard / expensive is it for them to get in?

What is the state of your defenses, your protective controls?

How secure is the network? Is the server OS, database, and the rest of the technology stack hardened? Are your patches up to date – are you sure?

Was the application built in a secure way? Has it been maintained in a secure way? Is it deployed and operated in a secure way? How do you know? When was your last pen test? Your last security design review? How many outstanding security vulnerabilities or bugs do you have in your development backlog?

Are there any regulatory or legal requirements that need to be satisfied? PCI, HIPAA, GLBA, … Do you understand your obligations, are you satisfying them?

Do you know enough to know how much trouble you might be in? What is the security posture/capability of the organization, of your team? Do you have someone (or some group) in the company responsible for setting security policies and helping with security-based decisions? Has the development team been trained in security awareness, and in defensive coding and security testing? Is the team following a secure SDLC? Are you prepared to deal with security incidents – do you have an incident response team, do they know what to do?

It is important for the stakeholders to understand upfront what we know and what we don’t know: how confident they should be in the team’s ability to deliver a secure solution, and in the team’s ability to understand and anticipate security risks in the first place. There are a couple of good, freely-available frameworks for assessing your organization’s software security capabilities. OWASP’s SAMM framework is one that I have used before with some success. Another comprehensive organizational assessment framework is Cigital’s BSIMM, which has been built up using data from 30 different software security programs, generally at larger companies.

Back to Managing the Risks

Now you can assess your risk exposure: the likelihood of a successful attack, and the impact, the cost to your company if an attack was successful. With this information you can decide how much more to spend on defenses, and put into place a defensive action plan.

Risk management is done at the strategic level first: working with business stakeholders, managers, the people who make business decisions, who decide where and how to spend money. You need to describe risks in business terms, spend some time selling and educating. The point here is to secure a mandate for security: an agreement with the people who run the business on security priorities and goals and budgets.

Then you move to tactical implementation of your security mandate: figuring out where and how much to focus, what technical and project-based trade-off decisions to make within your mandate.

Start at a high level, with a rough H/M/L rule-of-thumb rating of the probability and impact of each risk or risk area. Use this initial assessment, and your understanding of the team’s security capabilities, to determine where to dig deeper and how deep to dig, where you will need to focus more analysis, reviews, testing.
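A first pass like this can be as simple as a small lookup table. The sketch below is only an illustration: the numeric weights for H/M/L, the multiplication rule and the example risk areas are all assumptions made up for this example, not part of any standard method.

```python
# Hypothetical weights for the H/M/L scale (an assumption for illustration).
LEVELS = {"L": 1, "M": 2, "H": 3}


def rate(probability: str, impact: str) -> int:
    """Combine an H/M/L probability and impact into a rough 1-9 exposure score."""
    return LEVELS[probability] * LEVELS[impact]


def triage(risks: dict[str, tuple[str, str]]) -> list[tuple[str, int]]:
    """Return risk areas sorted from highest to lowest exposure."""
    scored = [(name, rate(p, i)) for name, (p, i) in risks.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up risk areas, each rated as (probability, impact).
risks = {
    "public web API": ("H", "H"),
    "partner file feed": ("M", "H"),
    "internal admin console": ("L", "M"),
}

ranked = triage(risks)
for name, score in ranked:
    print(f"{score}: {name}")
```

The areas at the top of the list are where you would dig deeper first; the numbers themselves mean nothing beyond the ordering.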

As you find more, and more specific risks and weaknesses through threat modeling or architectural risk analysis, through code reviews, or pen testing or fuzzing or whatever, consider using Microsoft’s DREAD risk assessment model. I like this model because it asks you to evaluate each risk from different perspectives, forces you to answer more questions:

D: Damage potential – what would happen to the business?

R: Reproducibility – how often does the problem come up, does it happen every time?

E: Exploitability – how easy is it to take advantage of, how expensive, how skilled does the bad guy need to be, what kind of tools does he need?

A: Affected users – who (how important) and how many?

D: Discoverability – how easy is it to find?
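One common way to turn these five questions into a single number is to rate each one on a fixed scale and take the average. The sketch below assumes a 1 to 10 scale and a plain average; both are conventions chosen for this example, and the finding and its ratings are made up.

```python
from dataclasses import dataclass


@dataclass
class DreadRating:
    """One rating per DREAD question, each assumed to be on a 1-10 scale."""
    damage: int
    reproducibility: int
    exploitability: int
    affected_users: int
    discoverability: int

    def score(self) -> float:
        """Average the five ratings; reject values outside the assumed scale."""
        parts = (
            self.damage,
            self.reproducibility,
            self.exploitability,
            self.affected_users,
            self.discoverability,
        )
        for value in parts:
            if not 1 <= value <= 10:
                raise ValueError(f"rating {value} is outside the 1-10 scale")
        return sum(parts) / len(parts)


# A made-up finding, rated by the team during a review.
finding = DreadRating(
    damage=8,
    reproducibility=10,
    exploitability=6,
    affected_users=9,
    discoverability=7,
)
print(finding.score())
```

The value of the exercise is less in the final number than in being forced to answer each question separately.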

As you identify risks, you go back and apply generic risk management methods to determine the cost trade-offs and your response:

accept (plan for the cost, or choose to ignore it because the likelihood is small and/or the cost to fix is high)

avoid (do something else, if you can)

prevent (plug the hole, fix the problem)

reduce (take steps to reduce the likelihood or impact: put in early warning tools to detect, like an IDS, so that you can react quickly; or contain the risk through firewalls, isolation and partitioning, layered defenses, ...).

In the end, risk management comes down to the same delicate balance of the same decision-making factors:

informed judgment

legal requirements and contractual commitments

the business’ general tolerance for any kind of risk, and the company’s willingness to face risks – a startup for example has a much higher tolerance for risk than an established brand

politics: some issues aren’t fundamentally important to the business, but they are important to somebody important – and some issues are important to the business (or should be), but unfortunately aren’t important to anybody important

And then do this over and over again as your circumstances change, as you get new information, as the company's risk profile changes, as the threat landscape changes. The decisions are the same: you just take a different path to get there.

Wednesday, September 15, 2010

Limits on the money that you have to spend, on the talent that you can get.

The limits of your technology platform, on its performance and reliability and cost of operations.

Limits imposed by decisions that have already been made, decisions on architecture and design that limit how much you can change, how safely and how quickly.

And you must maintain compatibility with legacy systems and interfaces to partners.

You may be forced to work with certain suppliers and partners, even if they can’t meet all of your needs.

You will have to meet hard, unmovable deadlines.

You will have to deal with regulatory oversight and legal compliance dictates, and follow change control and change management controls that will slow you down and make your work more expensive.

And in maintenance you have a running system, people rely on it to get their jobs done, and they expect it and need it to work a certain way.

All of these constraints box you in. They make your work more difficult and expensive. Project Managers are taught to identify and understand constraints and manage them as sources of risk and cost.

But there’s a wonderful paradox in constraints and limits that I enjoy thinking about:

Constraints take control away from you: by dictating, by forcing you to think and work a certain way, by limiting your options.

But…

Constraints help you to take control: by dictating, by forcing you to think and work a certain way, by limiting your options.

The story of Chandler in Dreaming in Code made a strong case for building software under constraints, as you follow the agonizing unraveling of a project that set out to solve too much, with too much money, too much time, and too much talent. By contrast, the book also profiled 37signals, the small team behind Ruby on Rails, which set out to build simple software with almost no resources. Their ethos is to “Embrace Constraints”:

“Constraints drive innovation and force focus. Instead of trying to remove them, use them to your advantage.”

This is the point of software development management frameworks like Scrum, Extreme Programming and Kanban: to place practical limits on the way that the team works, to force the team into a box, and so control the way that work is done.

Constraints in Scrum and XP

Scrum and XP force the team to work in fixed, short time boxes. Time boxes limit how much work you can consider and take on at one time. You have to come up with a solution in a short time period, and while it may be a sub-optimal solution (you probably would have thought of something smarter if you had more time), at least you get something done and can test the results. And there is a good chance that you will be able to build on what you have done so far, fill it in some more in later steps.

Time boxes serve as a hedge against perfectionism and gold plating and procrastination. Forcing people to work under time constraints helps fight off Parkinson’s Law: that work expands to fill the time available for its completion. It blows the whistle on people who don’t want to let go, who continue to redesign, rewrite, refactor, tune, polish and fiddle with code longer than they need to. And being pushed forward from one deadline to the next keeps you from over-thinking, from letting the problem space or solution space expand.

Doing work in small batches with fixed time limits helps to manage delivery risk. You can only get so far off track in a short time, and if it turns out that you got it wrong and you have to throw it all away, you aren’t wasting too much. And it shouldn’t take too long to understand why you failed, how to get back on track. Small decisions are easier and safer to make, and to unmake. And frequent planning gives you the flexibility to adapt to changing circumstances and take new information into account, to reset and re-prioritize.

With time boxes you are forced to work “in the small”, to think, really think, about how to get work done. It’s all about execution: who, what, when, what happens first, second. It creates a sense of urgency. And a sense of satisfaction, in seeing work done, in the feedback that you get.

In his book Succeeding with Agile Development Using Scrum, Mike Cohn makes the point that the pressure to deliver working software on a regular basis drives teams to get better at their work, to become more efficient, to adopt better technical practices and disciplines like automated testing and Continuous Integration, to get or build better tools.

And the time box becomes a commitment to management and to the customer, it forces your work to be more predictable and transparent. Your customers can plan around a regular delivery cycle, and make their own commitments.

Lean and Kanban: Constraining Work in Progress

I have spent the last few months looking into Kanban for software development, to see what we can learn from and use. Kanban, based on Lean Manufacturing principles, also constrains the amount of work that the team does, but in different ways. Rather than limiting the amount of time that the team has available to get their work done, you explicitly limit the number of things that the team can do at any point in time: the amount of Work in Progress. Each person works on one thing at a time (with maybe some small allowance for issues that could get blocked), following the idea that if small batches are good, then a batch size of one is optimal.

The team uses input queues and buffers between steps (for example: analysis, design, coding, testing, deployment) to smooth out work as it is being done:

to minimize slack (over capacity) and stalls (under capacity);

to account for variability in the work to be done (a fundamental problem with applying Kanban to software development: work that is easy to code, but hard to test; or hard to code but easy to test, and so on); and

to adjust as the workload or the team changes, and as you learn more about how you are doing and how to do it better.

The goal is to keep the input queues and the buffers as small as possible, just big enough to keep the work constantly moving (a constant state of flow).

In Kanban, like in Scrum and XP, planning and prioritization is done just-in-time. The difference is that in a time boxed method, planning is done at the beginning of each time box; in Kanban, planning and prioritization are decoupled from the rest of the work. Prioritization can be scheduled at a regular interval, or done on demand, just before the work starts – again with a small amount of buffering to make sure that team members don’t stall waiting for work. You decide with the customer what to do next and qualify it, put it into the queue, and then work is pulled based on priority and the team member’s ability to do the job. This forces you and the customer to always be ready, to focus on what is important.

Planning further out is considered waste, since the situation may change and make your plans (like most plans) outdated.

You don’t accept more work than you can take on immediately. If you run into problems downstream, the queues will fill up, and the pipeline of work upstream will stall – there’s no point taking on more analysis and estimation work if the team is falling behind on building and testing and deploying code. This shines a bright light on inefficiencies and bottlenecks. As your work changes, and the bottleneck moves, you recognize this and adapt. And you constantly look for ways to eliminate unnecessary work, rework, delays and blocks – anything that stops you from getting work done as efficiently as possible.
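The way a full downstream queue stalls the work upstream can be shown with a very small model. This is a sketch only: the `Stage` class, the `pull` function, the stage names and the limits are all hypothetical, and a real board has more steps and buffers than this.

```python
from collections import deque


class Stage:
    """One step in the pipeline with an explicit work-in-progress limit."""

    def __init__(self, name: str, wip_limit: int):
        self.name = name
        self.wip_limit = wip_limit
        self.items = deque()

    def has_capacity(self) -> bool:
        return len(self.items) < self.wip_limit


def pull(upstream: Stage, downstream: Stage) -> bool:
    """Move one item downstream only if there is room; report whether it moved."""
    if upstream.items and downstream.has_capacity():
        downstream.items.append(upstream.items.popleft())
        return True
    return False


coding = Stage("coding", wip_limit=2)
testing = Stage("testing", wip_limit=1)
coding.items.extend(["story-1", "story-2"])

moved_first = pull(coding, testing)   # testing has room, so story-1 moves
moved_second = pull(coding, testing)  # testing is full, so story-2 waits in coding
print(moved_first, moved_second)
```

Once testing is full, nothing else can leave coding, which is exactly the signal that the bottleneck is downstream.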

Constraints as a Management Tool

Limiting the amount of work that the team can do at one time, either through time boxing or setting limits on work in progress queues, is a coarse management tool but an effective one. I’ve worked with time boxing methods at different companies and come to rely heavily on this approach. Time boxing works, and today there are good tools and supporting practices that make time boxing easier to implement and sustain.

If you find it difficult to break your work into time boxes, if your work is highly interrupt-driven like in many maintenance and support shops, then Kanban offers an alternative control structure. I am not sure that I buy into its manufacturing control-line metaphor, and that all of these ideas map well from manufacturing to the highly variable work involved in building and supporting software. I think that there is more work that still needs to be done by the community, to make the fit cleaner and clearer. More people need to try it out, see what works, and whether it sticks. But I understand why Kanban could work. It makes explicit what many well-run teams already do: track and control and buffer work demands across different people on the team.

Whether you follow any of these methods is not important. What is important is that you understand constraints. Understand how constraints take control away, and how they can be used to take control back - how to use them to your advantage.

Wednesday, August 18, 2010

HP’s acquisition of Fortify this week (which I am sure will make some people at Kleiner Perkins happy) has made me think some more about static analysis and the state of the technology.

I'm not a technical expert in static analysis, and I have only a superficial understanding of the computer science behind static analysis and the ongoing research in this area. So this is based more on a customer’s view of developments in static analysis over the past few years.

One of the static analysis tools that we rely on is Findbugs, an open source static analyzer for Java. I’ve said it before and I’ll say it again: if you are writing code in Java and not using Findbugs, you are a fool. It’s free, it’s easy to install and set up, and it works. However, the Findbugs detection engine was last updated almost a year ago (Aug 2009) and the updates leading up to that were mostly minor improvements to checkers. Bill Pugh, the University of Maryland professor who leads the development of Findbugs, admitted back in 2009 at JavaOne that the analysis engine “continues to improve, but only incrementally”. At that time, he said that the focus for future development work was to provide improved management and workflow capabilities and tools integration for large-scale software development using Findbugs. This was driven by work that Professor Pugh did while a visiting scientist at Google, a major user of the technology.

Anyways, these improvements look to be about a year late (originally to be available in the fall of 2009) and there haven’t been any updates on product plans on the findbugs mailing list for a while. But the direction (less work on the engine, more on tools) seems to be consistent with other static analysis suppliers. In the meantime, Findbugs continues to do the job – and hey, it’s free, who am I going to complain to even if I wanted to?

Coverity, another tool that we have used for a while, has reduced its investment in the core static analysis engine over the past couple of years. As I noted in a post last year, after receiving venture capital investment, Coverity went on a buying spree to fill out its product portfolio, and since then has focused on integrating these acquisitions and repackaging and re-branding everything. In the last major release (5.0) of its technology earlier this year, Coverity put a lot of effort into packaging, improving management and workflow capabilities and documentation – but there were only minor tweaks to the core static analysis engine. All nice and ready for the enterprise, all dressed up and ready for somebody to make them an offer. But with IBM and HP already matched up, who’s left to buy them? Symantec maybe?

Over the last year, Fortify has been filling out its product portfolio and working on integration with HP’s product suite leading up to the acquisition (much foreshadowing) and building an On Demand / Cloud / SaaS offering. Over the past year, Klocwork has integrated its static analysis tool with collaborative peer review and code refactoring tools (Klocwork Insight Pro) and improved its reporting – like some of the other tool providers, mapping problems against the CWE and so on. And Parasoft has been about providing an integrated toolset for a long time now – static analysis is only one small part of what they offer (if you can get it to work).

Now that Ounce Labs has become Rational Appscan something-or-other at IBM it will be difficult for anyone except IBM product specialists to keep up with its development, and most of IBM’s time and money for the next while will likely be spent on assimilating Ounce and getting its sales teams skilled up. IBM Research labs have been doing some advanced work on static analysis, mostly in the Java world, although it’s not clear how or if this work will make it into IBM’s product line, especially after the Ounce acquisition.

The most interesting development in the static analysis area recently is the O2 work that OWASP’s Dinis Cruz started while at Ounce Labs, to build more sophisticated security analysis, scripting and tracing capabilities on top of Ounce and other static (and dynamic) analysis engines. I haven’t had a chance to look at O2 yet in any depth - I am still not sure I entirely understand what it does and how it does what it does, and I am not sure it is intended to be used by mere mortals, but it looks like it is almost ready for prime time.

Even O2 isn’t about offering new advances in static analysis – at least from what I understand so far. It is about leveraging existing underlying static analysis technology and getting more out of it, building a technical security analyst’s workbench and tool platform.

All of this points the same way: the absence of innovative product announcements in static analysis, the investments made instead to fill out integrated product suites and create better management tools for larger companies, and the emphasis on selling value-add consulting services are clear signs that the market has matured, and that the underlying technology has reached, or is reaching, its limits.

Our experience has been that there are benefits to using static analysis tools, once you get through the initial adoption – which admittedly can be confusing and harder than it should be, and which I guess is one of the drivers for vendors to offer better management tools and reporting, to make it easier for the next guy. The tools that we use today continue to catch a handful of problems each week: a little sloppy coding here, a lack of defensive coding there, and a couple of simple, small mistakes. And every once in a while a real honest-to-goodness slap-yourself-in-the-head bug that made it through unit testing and maybe even code reviews. So as I have said before, the returns from using static analysis tools are real, but modest. And given where suppliers are spending their time and money now, it looks like we won’t be able to expect much more than modest returns in the future.

Tuesday, July 13, 2010

OWASP has just announced their 2010 US Appsec conference. It looks like an interesting opportunity to explore the state of the art in software security. Last year, some of the leaders of this conference were concerned about how few software developers showed up for the sessions. I expect the same will happen this year: the audience will be made up of a self-referencing group of security specialists and consultants, and a handful of developers and managers who are looking to understand more about software security. And that, I think, is as it should be.

Travel and education budgets are tight. Developers and managers need to choose carefully where to spend their company’s money and time – or their own. Where can they get the most information for their own work, where can they meet people who will help them solve problems or move their careers forward?

Security experts like Jeremiah Grossman are right: developers don’t understand security, they aren’t taking ownership for building secure software, it’s not important to them.

What’s important to software developers? Delivery: if we don’t deliver, we fail. Check out the major software development conferences, the events that attract senior developers, architects, development managers, test managers, project managers. They are all about delivery, how to deliver faster, better, cheaper: Agile methods, understanding Lean/Kanban principles applied to software development, leadership and collaboration and communication and effecting organizational change, managing distributed teams and global development, getting requirements right, getting design right, tracking whether the project is on target, metrics, continuous integration, continuous delivery, continuous deployment, TDD and BDD, refactoring and improving code quality, improving the user experience, newer and better development platforms and languages and tooling.

With the notable exception of the recent NFJS UberConf, there isn’t any serious coverage of secure software development, secure SDLCs, software security problems at software development conferences. The question shouldn’t be why there are so few developers attending a software security conference. The question should be why there is so little coverage of software security at software development conferences and in the other places where developers and managers get their information: in the development-focused books and blogs and seminars.

Building Bridges

I’ve posted before about my concern about the gap between the software development and software security communities. But there is a way forward. Around 10 years ago, the development and testing community were far apart in values and goals; testing was inefficient and was seen as “somebody else’s problem”. But Agile development made it important for developers to test early and often, made it important for developers to understand testing and code quality, to find better and more efficient ways to test. Testing is cool now – and more than that, it’s expected. Developers look to professional testers for help in improving their own testing, and to find bugs that they don’t understand how to find, through exploratory testing and integration testing and more advanced testing techniques. The development community is taking responsibility for testing their own work, for the quality of their work. And I believe that software development, and software, are both better for it.

But software security is still “somebody else’s problem”. This needs to change. The solution isn’t to try to entice developers to attend security conferences. It's not to force certification in secure development through SANS or ISC. It’s not passive attempts to “infect” development managers with vague ideas about being "Rugged" that are supposed to somehow change how developers build software. And holding software producers liable for their mistakes, while clearly showing the frustration of security specialists, and while making for a provocative sound bite, is not likely to happen either.

The solution is to make secure software development a problem for software developers, a problem that we need to solve ourselves. Engage leaders in the wider software development community: the people who spend their lives thinking about and writing about better ways to build better software; the people who help shape the values and priorities of the development community; the people who help developers and managers decide where to spend time and effort and money. And help them to understand the problems and how serious these problems are, convince them that they need to be involved, convince them that we need to include security in software development, and ask them to help convince the rest of us.

Engage people like Steve McConnell, and Martin Fowler, and Kent Beck, Uncle Bob Martin, Michael Feathers, the NFJS uber geeks, Joel Spolsky and Jeff Atwood, Scrum advocates like Mike Cohn and Ken Schwaber, Lean evangelists like Tom and Mary Poppendieck, David Anderson on Kanban, agile project managers like Johanna Rothman, and leaders from the software testing community like James Bach and Jonathan Kohl. And ask them who else they think should be engaged because they’ll know better who can make a difference.

Invite them to come in and work with the best of the software security community, to understand the challenges and issues in building secure software, ask them to consider how to “Build Security In” to software development. And with their help, maybe we will see software security problems owned by the people responsible for building software, and those problems solved with the help and guidance of experts in the security community in a supportive and collaborative way. Otherwise I am afraid that we will continue to see the communities drift apart, the gap between our priorities growing ever wider. And software, and software development, will not be the better for it.

Wednesday, June 30, 2010

I attended a long webinar earlier today, sponsored by SD Times: Kent Beck’s Principles of Agility. The other speakers were Jez Humble from ThoughtWorks, a proponent of Continuous Delivery; and Timothy Fitz at IMVU, the leading evangelist for Continuous Deployment.

The arguments in support of Continuous Deployment

Kent Beck explored a fundamental mismatch between rapid cycling in design and construction, and then getting stuck when we are ready to deploy. He argues that queuing theory and experience show that there is more value in a system when all of the pipes are the same size, and follow the same cycle times. Ideally, there should be a smooth flow from ideas to design and development and to deployment, and then information from real use fed back as soon as possible to ideas. Instead we have a choke point at deployment.

Then there is the ROI argument that we can get faster return on money spent if we deploy something that we have done as soon as it is ready.

Kent Beck also explained that based on his experience at one company the constraints of deploying immediately make people more careful and thoughtful: that the practice becomes self-reinforcing, that developers stop taking risks because they don’t have time to. Essentially problems become simpler because they have to be.

Timothy Fitz presented a Deployment Equation:

If Information Value + Direct Value > Deployment Risk then Deploy

The idea is that Continuous Deployment increases information value by giving us information earlier. He talked about ways to reduce risk:

- Rolling out larger changes slowly to customers, through dark launching (hiding the changes from the front-end until ready: not exactly a new idea) and enabling features for different sets of users.

- Extensive automated testing, supplemented with manual exploratory testing before exposing dark-launched features.

- Ensuring that you can detect problems quickly and correct them through production monitoring, looking for leading indicators of problems, and instant production roll back.

- An architecture that supports stability through isolation. Follow the patterns in Release It! to minimize the chance of “stupid take the cluster out” errors.

- Locking down core infrastructure, preventing changes from certain parts of the system without additional checks.
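Enabling a feature for only some users is usually done with a feature flag and a stable way of assigning users to buckets. The sketch below shows one way to do that, not what IMVU actually runs: the `bucket` and `is_enabled` functions, the feature name and the user IDs are all hypothetical.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Enable the feature for users whose bucket falls below the rollout percentage."""
    return bucket(user_id, feature) < rollout_percent


# Usage: expose a dark-launched feature to roughly 20% of users.
users = [f"user-{n}" for n in range(1000)]
exposed = [u for u in users if is_enabled(u, "new-checkout", 20)]
print(f"{len(exposed)} of {len(users)} users see the feature")
```

Because the bucket is derived from a hash rather than a random draw, a user stays in the same group from one request to the next, and raising the percentage only adds users. Setting it to 0 turns the feature off without a redeploy.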

Jez Humble at ThoughtWorks presented on Continuous Delivery: building on top of Continuous Integration to automate and optimize further downstream packaging and deployment activities. Continuous Deployment is effectively an extension of Continuous Delivery. It was mostly a re-hash of another presentation that I had already seen from ThoughtWorks, and of course there will be a book coming out soon on all of this.

Some questions on Continuous Delivery and Continuous Deployment

Me: Continuous Delivery is based on the assumption that you can get immediate feedback: from automated tests, from post-deployment checks, from customers. How do you account for problems that don't show up immediately, by which time you have deployed 50 or 100 or more changes?

Answer from Timothy Fitz: The first time, you revert and re-push. Then you post-mortem and figure out how to catch faster by looking for a leading indicator. Performance issues can be caught by dark launching, in which case turning off or reverting the functionality will have 0 visible effect. Frontend issues are usually caught by A/B tests, where you can mitigate risk by not running them at 100% of all traffic (have 80% control, 20% hypothesis, etc)

Me: Followup on my question about handling problems that show after 50 or 100 changes. The answer was to revert and re-push - but revert what? A problem may not show itself immediately. How do you know which change or changes to roll back?

Answer from Timothy Fitz: If it took 50-100 changes, then you'll be finding the change manually. It turns out to be fairly easy even if it's been 48-96 hours, you're only looking through a few hundred very small commits most of which are in isolated areas unrelated to your problem.

Me: How do you handle changes to data (contents and/or schema) on a continuous basis?

Answer: not answered. Jez Humble talked about writing code that could work with multiple different database versions (which would make design and testing nasty of course), and how to automate some database migration tasks with tools like DBDeploy, but admitted that “databases were not optimized for Continuous Delivery”. There were no good answers on how to handle expensive data conversions.

Me: My team has obligations to ensure that the software we deliver is secure, so we follow secure SDLC checks and controls before we release. In Continuous Delivery I can see how this can be done along the pipeline. But secure Continuous Delivery?

Answer from Jez Humble: Ideally you'd want to run those checks against every version. If you can't do that, do it as often as you can. [I didn’t expect a meaningful answer on this one, and I didn’t get one.]

Somebody else’s question: Do you find users struggling to keep up and adapt to the constant changes?

Answer from Kent Beck: In practice it doesn't seem to be a problem usually because each change is small--a new widget, a new menu item, a new property page that's similar to existing pages. A wholesale change to the UI would be a different story. I would try to use social processes to support such a change--have a few leaders try the new UI first, then teach others.

Somebody else’s question: Without solid continuous testing in place, CD is [a] fast track to continuous complaints from end users

Answer from Timothy Fitz: Not always, but usually. For the cases where it makes sense (small startup, or isolated segment that opts-in to alpha) you can find user segments who value features 100% over stability, and will gladly sign up for Continuous Deployment.

So what do I really think about Continuous Deployment

OK I can see how Continuous Deployment can work,

If: your architecture supports isolation, that it is horizontal and shallow, offering features that are clearly independent;

If: you don’t follow the all-or-none approach – that you recognize that some kinds of changes can be deployed continuously and some parts of the system are too important and require additional checks, tests, reviews, and more time;

If: you build up enough trust across the company;

If: your customers are willing to put up with more mistakes in return for faster delivery, if at least some of them are willing to help you do your testing for you;

If: you invest enough in tools and technology for automated layered testing and deployment and post-deployment checking and roll-back capabilities.

Continuous Deployment is still an immature approach and there are too many holes in it. And as Kent Beck has pointed out, there aren’t enough tools yet to support a lot of the ideas and requirements: you have to roll your own, which comes with its own costs and risks.

And finally, I have to question the fundamental importance of immediate feedback to a company. I can see that waiting a year, or even a month, for feedback can be too long. I fully understand and agree that sometimes changes need to be made quickly, that sometimes the windows of opportunity are small and we need to be ready immediately. And there’s first mover advantage, of course. But I have a hard time believing that any kind of changes need to be continuously made 50 times per day: that there are any changes that can be made that quickly that will make any real difference to customers or to the business. And I will go further and say that such rapid changes are not in the interests of customers, that they don’t need or even want this much change this fast. And that I don’t believe that it’s really about reducing waste, or maximizing velocity or increasing information value.

No, I suspect it is more about a need for immediate satisfaction – for programmers, and the people who drive them. Their desire to see what they’ve done get into production, and to see it right away, to get that little rush. The simple inability to delay gratification. And that’s not a good reason to adopt a model for change.

Monday, June 28, 2010

I spent an interesting few days last week at the Velocity 2010 conference in Santa Clara. The focus of the conference was on performance and application operations for large-scale web apps. Here are my take-aways:

Performance

Fundamentally a problem of scale-out, of handling online communities of millions of users and the massive amounts of information that they want at hand. As Theo Schlossnagle pointed out in an excellent workshop on Scalable Internet Architectures (or you can read the book…), the players in this space approach performance problems with similar technologies (LAMP or something similar like Ruby on Rails as the principal stack, and commodity servers) and architectural strategies:

1. Data partitioning – sharding datasets across commodity servers, required because MySQL does not scale vertically. Theo’s advice on sharding: “Avoid sharding as long as possible, it is painful. If you have to shard, follow these steps. Step 1: Shard your data. Step 2: Shoot yourself”. Consider duplicating data if you need the same information available in different partitioning schemes.

2. Non-ACID key-value data stores and NoSQL distributed data managers like Cassandra, MongoDB, Voldemort, Redis or CouchDB for handling high volumes of write-intensive data. Fast and simple, but these technologies are still immature, they are not hardened or reliable, and they lack the kinds of management capabilities and tools that Oracle DBAs have been accustomed to for years.

3. Strategies for effective caching of high-volume data, basically ways of extending and optimizing the use of memcached, and different schemes for effective cache consistency and cache coherency.
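
For the data partitioning point in the list above, a rough sketch of hash-based sharding, plus duplicating the same record under a second partitioning scheme, might look like this. Everything here is an assumption for illustration: in-memory dictionaries stand in for separate database servers, and `ShardedStore`, `shard_for` and `record_order` are made-up names, not anything presented in the workshop.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of commodity servers


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy key-value store where each dict stands in for one server."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


def record_order(by_customer: ShardedStore, by_product: ShardedStore, order: dict) -> None:
    """Write the same order twice, once per partitioning scheme.

    Duplicating the data lets "orders for customer X" and "orders for
    product Y" each be answered from a single shard.
    """
    for store, key in ((by_customer, order["customer"]), (by_product, order["product"])):
        existing = store.get(key) or []
        store.put(key, existing + [order])
```

Even this toy version hints at why sharding is painful: changing `NUM_SHARDS` remaps most keys, and the duplicated copies have to be kept consistent by the application.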

Some other advice from Theo: Planning for more than a 10-fold increase in workload is a waste of time – you won’t understand the type of problems that you are going to face until you get closer. On architecture and design: don’t simplify simple problems.

Coming from a financial trading background, I was surprised to see that the argument still needed to be made that performance was an important business factor: that speed could improve business opportunities. Seems obvious.

According to one of the keynote speakers, Urs Hölzle at Google, the average time for a page to load is 4.9 seconds, while the goal should be around 100 ms – the time that it takes a reader to turn a page in a book. Google presented some interesting research work that they are leading to improve the front-end response time of the web experience, including proposals to improve DNS and TCP, work done in Chrome to improve browser performance, and advanced performance profiling tools made available to the community.

Operations and DevOps

Provisioning and deployment (a real management problem when you need to deploy to thousands or tens of thousands of servers); change management and the rate of change; version control and other disciplines; instrumentation and logging; metrics and more metrics; and failure handling and incident management.

Log and measure as much as you can about the application and infrastructure – establish baselines, understand Normal, understand how the system looks when it is working correctly and is healthy.

Configuration management and deployment. Advice from Theo: version control everything – not just code and application configuration, and server configs, but also the configs for firewalls and load balancers and switches/routers and the database schemas and…

Several companies were using Chef or Puppet for managing configuration changes. Facebook and Twitter were both using BitTorrent to stream code updates across thousands of servers.

Change management. The consensus is that ITIL is very uncool – it is all about being slow and bureaucratic. This is a shame – I think that everyone in an operations role could learn from the basics of ITIL and Visible Ops, the disciplines and frameworks.

The emphasis was on how to effect rapid change, how to get feedback as quickly as possible, time to market, continuous prototyping, A/B split testing to understand customer needs, the need to make decisions quickly and see results quickly. At the same time, different speakers stressed the need for discipline and responsibility and accountability: that the person who is responsible for making a change should make sure it gets deployed properly, and that it works.

Continuous Deployment came up several times, although “Continuous” means different things to different people. For Facebook this means pushing out small changes and patches every day and features once per week.

You can’t make changes without taking on the risk of failure. This was especially clear to an audience where so many people had experience in startups.

Lenny Rachitsky’s session, The Upside of Downtime, covered the need for transparency in the event of failure, and showed how being transparent and honest in the event of a failure can help build customer confidence. His blog, Transparent Uptime, includes an interesting collection of Public Health Dashboards for web communities.

To succeed you need to learn from failures of course – use postmortems and Root Cause Analysis to understand what happened and implement changes so that you don’t keep making the same mistakes. Another quote from Theo: “Good judgment comes from experience. Experience comes from bad judgment. Allow people to make mistakes – but limit the liability. Measure the poise and integrity with which someone deals with the problem and its remediation.”

So failure can scorch you, make you afraid, and this fear can affect your decision making, slow you down, stop you from taking on necessary and manageable risks. You need to know how much risk you can take on, whether you are going too slow or too fast, and how to move forward.

John Allspaw at Etsy, one of the rock stars of the devops community, made a clear and compelling (and entertaining) case for meta-metrics, data to provide confidence in your operational decisions: “How do we get confidence in [what we are doing]? We make f*&^ing graphs!”

First track all changes: who made the change, when, what type, and how much was changed. Track all incidents: severity, time started, time to detect, time to recover/resolve, and the cause (determined by RCA). Then correlate changes with incidents: by type, size, frequency. With this you can answer questions like: What types of incidents have a high Time to Recover? What types of changes have high / low success rates?
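
As a rough sketch of what this tracking and correlation could look like in code (not Etsy's actual tooling, which I haven't seen): the record fields below follow the list above, while the class names, the `cause_change_id` link produced by RCA, and the two report functions are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Change:
    change_id: str
    author: str            # who made the change
    deployed_at: datetime  # when
    change_type: str       # what type, e.g. "code", "config", "schema"
    lines_changed: int     # how much was changed


@dataclass
class Incident:
    incident_id: str
    severity: int
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime
    cause_change_id: Optional[str]  # set by RCA; None if no change was to blame


def success_rate_by_type(changes, incidents):
    """Fraction of changes of each type that did not cause an incident."""
    failed = {i.cause_change_id for i in incidents if i.cause_change_id}
    totals, ok = defaultdict(int), defaultdict(int)
    for c in changes:
        totals[c.change_type] += 1
        if c.change_id not in failed:
            ok[c.change_type] += 1
    return {t: ok[t] / totals[t] for t in totals}


def mean_time_to_recover_by_type(changes, incidents):
    """Average (resolved - started) per type of change that caused the incident."""
    type_of = {c.change_id: c.change_type for c in changes}
    durations = defaultdict(list)
    for i in incidents:
        change_type = type_of.get(i.cause_change_id, "unknown")
        durations[change_type].append(i.resolved_at - i.started_at)
    return {t: sum(ds, timedelta()) / len(ds) for t, ds in durations.items()}
```

Feed the results into whatever graphing tool you already use, and the two questions above become a matter of reading a chart.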

Unfortunately the video and slide deck for this presentation are not available on the Velocity site yet.

There was some macho bullshit from one of the speakers about “failing forward” – that essentially rolling back was for cowards. I think this statement was made tongue-in-cheek and I hope that it was taken as such by the audience.

The Rest

I also followed up some more on Cloud Computing. Sure, the Cloud gives you cheap access to massive resources but the consensus at the conference was that it is still not reliable and it is definitely not safe, and it doesn’t look like it will get that way soon. Any data that you need to be safe or confidential needs to be kept out of the Cloud or at minimum encrypted and signed with the keys and other secrets stored out of the Cloud, following a public/private data architecture.

The conference was fun and thought-provoking, and I met a lot of smart and thoughtful people. The crowd was mostly young and attention-deficit: iPhones, iPads, notebooks and laptops in constant use throughout the sessions.

Maybe it was the California sunshine, but the atmosphere was more open, more sharing, and less proprietary than I am used to – there was a refreshing amount of transparency into the technology and operations at many of the companies. The vendor representation was small and low key, but recruitment was blatant and pervasive: everyone was hunting for talent.

I am an uptight enterprise guy. It would be fun to work on large-scale consumer problems, with more freedom to make changes. I regret missing the follow-on DevOps Days event last Friday, but I had to get home. And finally, I am looking forward to getting my copy of the new WebOps book, which was premiered at the conference, and to next year’s Velocity conference.

Monday, May 24, 2010

There is a lot of excitement in the software development community around Lean Software Development: applying Japanese manufacturing principles to software development to reduce waste and eliminate overhead, to streamline production, to get software out to customers and get their feedback as quickly as possible.

Some people are going so far as to eliminate review gates and release overhead as waste: “iterations are muda”. The idea of Continuous Deployment takes this to the extreme: developers push software changes and fixes out immediately to production, with all the good and bad that you would expect from doing this.

“The other day we passed product release number 25,000 for WordPress.com. That means we’ve averaged about 16 product releases a day, every day for the last four and a half years!”

I am sure that he is not proud of their history of security problems however, which you can read about here, here, here, here, here and elsewhere.

And Facebook? You can read about how they use Continuous Deployment practices to push code out to production several times a day. As for their security posture, Facebook has "faced" a series of severe security and privacy problems and continues to run into them, as recently as last week.

I’ve ranted before about the risks that Continuous Deployment forces on customers. Continuous Deployment is based on the rather naïve assumption that if something is broken, if you did something wrong, you will know right away: either through your automated tests or by monitoring the state of production, errors and changes in usage patterns, or from direct customer feedback. If it doesn’t look like it’s working, you roll it back as soon as you can, before the next change is pushed out. It’s all tied to a direct feedback loop.

Of course it’s not always that simple. Security problems don’t show up like that, they show up later as successful exploits and attacks and bad press and a damaged brand and upset customers and the kind of mess that Facebook is in again. I can’t believe that the CEO of Facebook appreciates getting this kind of feedback on his company's latest security and privacy problems.

“Facebook are not fools of course. You don't build a business that engages every tenth adult on the planet without honing a pretty good sense for which way the wind is blowing. The company realizes that it is under no obligation to provide any real security controls to its users.”

Maybe to be truly open and collaborative, you are obliged to make compromises on security and data integrity and confidentiality. Some of these Web 2.0 sites like Facebook are phenomenally successful, and it seems that most of their customers don’t care that much about security and privacy, and as long as you haven’t been foolish enough to use tools like Facebook to support your business in a major way, maybe that’s fine.

And I also don’t care how a startup manages to get software out the door. If Continuous Deployment helps you get software out faster to your customers, and your customers are willing to help you test and put up with whatever problems they find, if it gives you a higher chance of getting your business launched, then by all means consider giving it a try.

Just keep in mind that some day you may need to grow up and take a serious look at how you build and release software – that the approach that served you well as a startup may not cut it any more.

But let’s not pretend that this approach can be used for online mission-critical or business-critical enterprise or B2B systems, where your system may be hooked up to dozens or hundreds of other systems, where you are managing critical business transactions. Enterprise systems are not a game:

“I understand why people would think that a consumer internet service like IMVU isn't really mission critical. I would posit that those same people have never been on the receiving end of a phone call from a sixteen-year-old girl complaining that your new release ruined their birthday party. That's where I learned a whole new appreciation for the idea that mission critical is in the eye of the beholder.”

This is a joke, right?

But seriously, I get concerned when thoughtful people in the development community, people like Kent Beck and Michael Feathers, start to explore Continuous Deployment Immersion and zero-length iterations. These aren’t kids looking to launch a Web 2.0 site, they are leaders who the development community looks to for insight, for what is important and right in building software.

There is a clear risk here of widening the already wide disconnect between the software development community and the security community.

On one side we have Lean and Continuous Deployment evangelists pushing us to get software out faster and cheaper, reducing the batch size, eliminating overhead, optimizing for speed, optimizing the feedback loop.

On the other side we have the security community pleading with us to do more upfront, to be more careful and disciplined and thoughtful, to invest more in training and tools and design and reviews and testing and good engineering, all of which adds to the cost and time of building software.

Our job in software development is to balance these two opposing pressures: to find a way to build software securely and efficiently, to take the good ideas from Lean, and from Continuous Deployment (yes, there are some good ideas there in how to make deployment more automated and streamlined and reliable), and marry them with disciplined secure development and engineering practices. There is an answer to be found, but we need to start working on it together.

About Me

I am an experienced software development manager, project manager and CTO focused on hard problems in software development, software quality and security. For the last 20 years I have managed teams building and operating high-performance financial platforms.
My special interest is how small teams can be most effective in building real software: high-quality, secure systems at the extreme limits of reliability, performance, and adaptability. Software that has to work, that is built right, and built to last.
I use this blog to explore ideas and problems in software development that are important to me. To reflect and to find new answers.