Thursday, December 12, 2013

There are beautiful, simple ideas in today’s Agile development methods that work really well. And some that don’t. Like defining all of your requirements as User Stories.

I don’t like the name. Stories are what you tell children before putting them to bed, not valuable information that you use to build complex systems. I don’t like the format that most teams use to write stories. And I don’t like how they use them.

Sometimes you need Stories, Sometimes you need Requirements

One of the “rules” of Agile is that stories have to be small – small enough to fit on an index card or a sticky note. They are too short on purpose, because they are supposed to be placeholders, reminders to have conversations with the customer when you are ready to work on them:

They're not requirements. They're not use cases. They're not even narratives. They're much simpler than that.

Stories are for planning. They're simple, one-or-two line descriptions of work the team should produce.

This isn't enough detail for the team to implement and release working software, nor is that the intent of stories. A story is a placeholder for a detailed discussion about requirements. Customers are responsible for having the requirements details available when the rest of the team needs them.
James Shore, The Art of Agile - Stories

According to Mike Cohn in his book Succeeding With Agile, making stories short forces the team to shift their focus from writing about features to talking about them. Teams want to do this because these discussions are more important than what gets written down.

But this idea can be – and often is – taken too far. Sure, most people have learned that it’s not possible to write correct, complete, comprehensive requirements specs for everything upfront. But there are lots of times when it doesn't make sense to limit yourself to 1- or 2-line placeholders for something that you hope to fill in later.

Some requirements aren't high-level expressions of customer intent that can be fleshed out in a conversation and demoed to verify that you got it right. They are specs which need to be followed line by line, or rules or tolerances that constrain your design and implementation in ways that are important and necessary for you to understand as early as possible.

Some requirements, especially in technical or scientific domains, are fundamentally difficult to understand and expensive to get wrong. You want to get as much information as you can upfront, so developers – and customers – have the chance to study the problem and think things through, share ideas, ask questions and get answers, explore options and come up with experiments and scenarios. You want and need to write these things down and get things as straight as you can before you start trying to solve the wrong problem.

And there are other times when you've already had the conversation – you were granted a brief window with people who understood very well what they needed and why. You might not get the same chance with the same people again. So you better write it down while you still remember what they said.

Short summary stories to be detailed later or detailed requirements worked out early – different problems and different situations need different approaches.

The Connextra Template: As a {type of user} I want {something}…

Stories started off as simple, free-form descriptions of work that the development team needed to do, like “Warehouse Inventory Report”. But now the Role-Feature-Reason template, also known as the Connextra template or format (because somebody working at Connextra came up with it back in 2001), is the way that we are told we should all write user requirements.

Something significant and I'm tempted to say magical happens when requirements are put in the first person…

Reason 2

Having a structure to the stories actually helps the product owner prioritize. If the product backlog is a jumble of things like:

Fix exception handing

Let users make reservations

Users want to see photos

Show room size options

… and so on, the Product Owner has to work harder to understand what the feature is, who benefits from it, and what the value of it is.

Reason 3

I've heard an argument that writing stories with this template actually suppresses the information content of the story because there is so much boilerplate in the text. If you find that true, then correct it in how you present the story… [which is not a Reason to use this template, but a workaround if you do use it].

Trying to fit every requirement into this template comes with its own set of problems:

User story format is awkward and heavy-handed. “As a ___, I want ___, so I can ____.” The concept is good – there should always be an explanation of “why” the task is desired, to make sure the end result fulfills the actual need. But the amount of verbal gymnastics I've seen people go through to try to make a simple and obvious requirement into a “User Story” proves that despite what Agile says, it’s not always the best way to go.
Talia Fukuroe, 6 Reasons Why Agile Doesn't Work

But where they excel is making sure every story is expressed in the format As a ____ I can ____ So That ____. No matter what the story is, they find a way to shoehorn it into that template. And this is where things start to fall apart. The need to fit the story into the template becomes more important than the content of the actual story. …

The Template has become a formalized gate. My understanding of stories when I first learned about them was that they were to bring natural language back to the conversation around what the software is intended to do. How are we to move away from formal “shall lists” and requirements documents if we are just replacing them with Story Templates?

Gojko Adzic says that robotically following a standardized story template leads to stories that “are grammatically correct but completely false”:

Stories like this are fake, misleading, and only hurt. They don’t provide a context for a good discussion or prioritisation. They are nothing more than functional task breakdowns, wrapped into a different form to pass the scrutiny of someone who spent two days on some silly certification course, and to provide the organisation some fake comfort that they are now, in fact, agile.

The development team must always be clearly delivering Customer Value. From the start of the project, the team is supposed to deliver working features that customers can see, touch, explore and respond to.

The Customer/Product Owner has to be able to understand every requirement, which means that every requirement must be in their language and something that they care about. (This is another example of what’s wrong with the Customer/Product Owner idea in Agile development – that a single person can be responsible for defining everything that is done in a project.)

Focusing exclusively on delivering Customer Value leaves little room for non-functional requirements and constraints, which are critically important in building any real system, and de-emphasizes important design problems and technical work that development teams need to do in order to deliver a high-quality system and to minimize technical risks.

It is especially a problem with under-the-covers technical requirements like security, maintainability and supportability – cross-cutting concerns that don't make sense to a customer, but are fundamental constraints that apply across all of the work that the team does and how they do it, not just work done in one Sprint.

Like the idea of using a common template, putting the customer first in requirements was well-intentioned: a way to bring the development team and customers closer together, and a way to make sure that the people paying for the work actually got what they asked for. But insisting that this is the only way that every requirement has to be framed creates unnecessary problems and risks and makes the team’s work harder than it has to be.

Don’t Get Hung Up on User Stories – Do Whatever Works

Insisting that all of the work that every development team needs to do has to be defined in the same way is arbitrary, unnecessary, and wrong. It's like back in the UML days when we learned that requirements had to be captured through Use Cases. Anyone remember how silly - and pointless - it was trying to model functional specifications using stick-men and bubbles?

Teams should use whatever format makes sense for their situation, with as little or as much detail as the problem and situation requires. There are times when I would much rather work with well-defined, detailed, documented rules and scenarios that have been carefully thought through and reviewed, or a prototype that has been shown to customers and improved iteratively, than be limited to a long list of 2-line User Stories and hope that I will be able to get someone who understands each requirement to fill the details in when the team needs them.

At Agile 2013, Jeff Patton in his talk on Agile Requirements and Product Management made it clear that the Story Template is a tool that should be used for beginners in order to learn how to ask questions. Like the “snow plow” technique for beginning skiers – something to be thrown aside once you know what you’re doing. His recommendation is to “use whatever you need to capture requirements: pictures, slides, notes, acceptance tests”. At the same conference, Scott Ambler reiterated that “stories are not enough. They are just one tool, a usage view. There are so many more options”.

Some product owners and teams are so fond of user stories that everything is expressed as a story. This either results in some rather odd stories – stories that capture the user interface design, complex user interactions, and technical requirements; or these aspects are simply overlooked.

Like any technique, user story writing has its strengths and limitations. I find stories particularly well suited to capture product functionality, and when applied properly, nonfunctional requirements. But user interface design and complex user interactions are better described by other techniques including design sketches, mock-ups, scenarios, and storyboards. Complement your user stories therefore with other techniques, and don’t feel obliged to only use stories.

Capture requirements in whatever format works best. Keep requirements as light and clear and natural as possible. If you have important technical requirements that need to be worked on, write them down and don’t worry about trying to warp them into some kind of arbitrary customer stories because you've been told that’s what you have to do in order to “be Agile”.

Thursday, December 5, 2013

But one of the reasons for this is that Appsec has a serious Agile problem. Most security experts don’t understand Agile development and haven’t come to terms with the way that Agile teams design and build software; with the way that Agile teams think and work; and especially with the speed at which Agile teams deliver software and make decisions.

The CSSLP and Agile = Epic Fail

You can see this problem in (ISC)2’s Certified Secure Software Lifecycle Professional (CSSLP), which is supposed to help bridge between security and software development. The Official Guide to the CSSLP is 572 pages long. Of this, only 2 pages are spent on Agile development: ½ page each on Scrum and XP, and a couple of pictures. Otherwise, ISC2 pretends that software development is done in big formal Waterfall steps (requirements, design, coding, testing, deployment) with lots of documents to review and clear hand-offs at each of these steps where somebody from Security can step in and insert a big formal review/test before the next step can start. Most developers don’t work this way anymore, if they ever did.

Appsec’s Agile Challenges

It’s not clear how and when security should engage with Agile teams that are following Lean, lightweight Agile methods.

Where does Security fit in Scrum, or a Scrum of Scrums? What meetings do security engineers need to attend, and what roles are they supposed to play in these meetings? How much input can they / should they have on decisions? Is Security a Chicken or a Pig?

How can Security know when they need to do a security review, if requirements are all captured in 1-sentence User Stories which are “too short on purpose”?

How can Security catch and correct design and implementation decisions before it is too late if they aren't in the same room as the development team, when developers are learning and deciding on the fly what work needs to be done and how it needs to be done?

When do you schedule security reviews and tests if the design and the code are always changing? When the team is continuously experimenting and trying out new ideas, new programming models, new languages and frameworks and libraries and toolchains?

How do you do threat modeling on a design that is never finished? And how can you assess the design of a system for security risks if “the design is the code” and “the code is the documentation” without having to go through all of the code by hand after it has already been written?

Security and compliance require a security review for every major software release. But what if there is never a “major release” – what if the development team is releasing small changes to production 20 or 50 or 500 or 5000 times a year?

It Has Already Been Decided

Appsec isn’t prepared for the rapid pace that Agile teams deliver working software, often from the start of a project. Or for the fierce autonomy and independence of self-managing Whole Teams in which developers are free to decide who will do the work and how it will get done. Or for the speed at which these decisions are made.

This is a different way of thinking and working from top-down, plan-driven projects.

Responsibility and accountability for decisions are pushed down to the team and from there to individuals. Lots of people making lots of small decisions, quickly and often – and changing or unmaking these decisions just as quickly and just as often.
The ground is always shifting, as people continuously seek out and respond to feedback and new ideas and information, adjusting and backtracking and making course corrections. Constantly changing and tuning how they work through frequent retrospection. A culture and working approach where people are encouraged to fire first and then aim, to make mistakes and embrace failure, to fail early, fail fast and fail often, as long as they keep learning.

The software – and the process that the team follows to design and build and test it – is never done, never stable and therefore “never secure”.

Agile Appsec: Case Studies

Microsoft has taken on the problem of how to do secure Agile development with its SDL-Agile process framework. Unfortunately, it only works for Microsoft: the SDL-Agile is expensive, heavyweight, and draws extensively on the scale and capabilities of Microsoft’s massive internal organization.

Two “From the Trenches” case studies at this year’s OWASP Appsec USA conference in NYC showed how other organizations are taking on the same challenges.

The first case study, by Chris Eng and Ryan Boyle at Veracode, a software security as a service provider (couldn't find the link at OWASP), showed how difficult it can be for Appsec to keep up with Agile development teams, even in an organization that does Appsec for a living and has deep security engineering capabilities.

Veracode’s internal Appsec engineering program has continued to learn and adapt as their development organization grew to more than 100 application developers working in a dozen Scrum teams. In the early pre-Agile days, their program relied on static analysis checking (essentially eating their own dog food as they used the same platform technology that the development team was building for customers), staged manual pen testing and ad hoc consultation from the security engineering team.

As the development organization grew and adopted Scrum, Security had to find new ways to work closer with development without slowing the developers down or stretching their security engineering resources too thin. Security engineers got involved in Sprint planning meetings to discover risks, identify which stories needed security reviews, and do some threat modeling. But they found that planning meetings were not the best place for technical security reviews – by then the security engineers had missed a lot of design and implementation decisions that developers had already made, which forced the teams to backtrack or add work after the Sprint had started, making them miss their commitments. Now security engineers work earlier with the Product Owner to look for risks and to proactively review the team’s backlog and identify candidate stories that Security will need to review and sign off on or help the team with.

In the second case study, Yair Rovek explained how at LivePerson, 200+ developers in more than 20 Scrum teams build secure software using a common set of technologies, tools and practices. Security engineering works with a central architecture team to build security into the technology platform that all of the development teams share, including custom-built developer-friendly wrappers around ESAPI and other security libraries.

Security reviews and other controls are added at different points in the development cycle: Release planning (identify risks, high-level design, compliance issues), Sprint planning, coding, testing, release. LivePerson uses static analysis tools with custom rules to check that architecture conventions are followed and to alert when a developer integrates new Open Source code so that this code can be reviewed for vulnerabilities. They schedule pen tests for every major release of their software and open up their service to customer pen testing – as a result their systems are almost continuously pen tested throughout the year.
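
LivePerson’s actual tooling isn’t described in detail, but the dependency-alerting idea is simple enough to sketch. This is a hypothetical minimal version – the file contents and package names below are made up – that flags any declared dependency that hasn’t been through a security review:

```python
# Hypothetical sketch: alert when a dependency appears that is not on
# the security team's approved list. Names and versions are made up.
approved = {"requests==2.0.1", "flask==0.10.1"}

# What the build currently declares (in practice, parsed from a
# manifest such as requirements.txt or pom.xml).
declared = [
    "requests==2.0.1",
    "flask==0.10.1",
    "leftpad==0.1.0",  # newly added, not yet reviewed
]

unreviewed = [dep for dep in declared if dep not in approved]
for dep in unreviewed:
    print(f"SECURITY REVIEW NEEDED: {dep}")
```

A check like this can run in CI on every commit, which is what makes the review “almost continuous” rather than a once-a-release gate.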

In order for Appsec to “push left” into the SDLC, Appsec has to change its role from assurance/auditing and compliance to proactively enabling self-service secure development.

We have to stop pretending that big security reviews and stage gates at major project milestones still work (if they ever did). They need to be replaced by lightweight, in-phase, iterative and incremental preventative controls – simple cheap things that make sense to developers and that they can do as part of designing and building software.

There’s still a role for pen testing and other security reviews. But not as a once-a-year annual release certification/assurance step to “prove that the system is secure” or some other fantasy. Pen tests and other reviews are just another source of feedback to the team, information that they can use to learn and adapt and improve. Security reviews need to be cheaper and scaled down, so that they fit into time boxes and so that they can be done earlier and more often.

There are a handful of organizations that are pushing Appsec further into the rapidly blurring lines between development and operations: Etsy, Netflix, and Twitter are already doing Appsec at “DevOps Speed” today, inventing new tools and ideas.

BTW: If you are involved in security for your organization’s software, the SANS Institute would appreciate your insight. Please participate in the SANS Application Security Survey. The survey closes December 20.

Thursday, November 14, 2013

Managers don’t want to think harder than they have to. They like simple rules of thumb, quick and straightforward ways of looking at problems and getting pointed in the right direction. The simpler, the better.

One of the most popular of these rules of thumb is the 80:20 rule: 80% of effects come from 20% of causes, and 80% of results come from 20% of effort.

It’s the flip side of diminishing returns: instead of getting less out of doing more, you can get more from doing less, by working smarter, not harder.

You can see obvious cases where the 80:20 rule applies in software without looking too hard. For example, 80% of performance improvements are found by optimizing 20% of the code – although the actual ratio is probably much closer to 90:10 or even 99:1 when it comes to performance optimization. But whether it's 80:20 or 90:10 or 70:30, the rule works essentially the same.
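
As a quick illustration (not from the original research – the function names and workload here are made up), a profiler makes this concentration easy to see: a single naive function dominates the runtime of a toy workload.

```python
import cProfile
import io
import pstats

def slow_lookup(items, key):
    # Deliberately naive linear scan -- the kind of hot spot that
    # profiling tends to surface in a small fraction of the code.
    for k, v in items:
        if k == key:
            return v
    return None

def workload():
    items = [(i, i * i) for i in range(2000)]
    total = 0
    for key in range(0, 2000, 7):
        total += slow_lookup(items, key)
    return total

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Rank functions by cumulative time; in real systems the top few
# entries usually account for the vast majority of the runtime.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The point is not the toy code, but the shape of the report: almost all of the time lands in one or two functions, which is why optimizing the other 80–90% of the code is usually wasted effort.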

80:20 Who uses What, What do you Really have to Deliver

Another well-known 80:20 rule in software is that 80% of users only use 20% of features. This came out of research from the Standish Group back in 2002, where they found that:

45% of features were never used;

19% used rarely;

16% sometimes;

only 20% were used frequently or always.

Like the cost of change curve, this is another example of a widely-held “truth” in software development which is based on limited evidence – it would be good to see more research that backs this claim up.

Standish Group’s latest research shows that thinking smaller and delivering less is a key to improving the success of software projects:
While more than 70% of small projects are delivered successfully, large projects have “virtually no chance of success: … more than twice the chance of being late, over budget, and missing critical features”.

“In summary, there is no doubt that focusing on the 20% of the features that give you 80% of the value will maximize the investment in software development and improve overall user satisfaction. After all, there is never enough time or money to do everything. The natural expectation is for executives and stakeholders to want it all and want it all now. Therefore, reducing scope and not doing 100% of the features and functions is not only a valid strategy, but a prudent one.”

80% of bugs are found in 20% of the code

Each time that you find a bug in this code, chances are that it means there are still more bugs left to find and fix. The more bugs you find, the more chances there are that there are still more bugs to be found, in a downward spiral.

Each time that you touch this code, even when you’re trying to fix it, there is a good chance that you are making it worse, not better: there is more than a 20% chance that a developer trying to fix a bug in error-prone code will accidentally introduce a new bug as a side-effect.
Most of the effort put into trying to understand this code and fix it and understand it and fix it over and over, is wasted:

“Most error-prone modules are so complex and so difficult to understand that they cannot be repaired once they are created.”

When code gets this bad, it needs extensive and “brutal refactoring” to make it understandable and safer to work with, or it needs to be “surgically removed and replaced” with new code written from scratch by somebody who knows what they are doing.

It’s not hard to identify what parts of the code are bad if you have the same people working on the same code for a while – ask anyone on the team and they’ll know where that nasty stink is coming from. In big systems and big organizations with lots of turnover, you’ll probably need to track bugs over time and mine defect data for bug clusters, rather than just fixing bugs and moving on.
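
Mining defect data for bug clusters can be as simple as counting bug-fix changes per file. A minimal sketch, assuming you can export something like the output of git log --grep="fix" --name-only (the log excerpt below is made up):

```python
from collections import Counter

# Hypothetical excerpt of file paths touched by bug-fix commits,
# standing in for real output of: git log --grep="fix" --name-only
bugfix_log = """
src/billing/invoice.py
src/billing/invoice.py
src/ui/report.py
src/billing/tax.py
src/billing/invoice.py
src/billing/tax.py
"""

counts = Counter(line.strip() for line in bugfix_log.splitlines() if line.strip())

# Files sorted by bug-fix frequency: the top of this list is where
# bug clusters -- and candidates for brutal refactoring -- live.
for path, n in counts.most_common():
    print(f"{n:3d}  {path}")
```

Even a crude count like this tends to show the 80:20 skew quickly: one or two files soak up most of the fixes.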

80% of time spent fixing bugs is on 20% of the bugs

Some bugs are much harder to fix than others. Sometimes because the code is so bad (see the rule above). Sometimes because the problems are so hard to reproduce and debug. Sometimes because they are much deeper than they appear to be – fundamental bugs in design, bugs that you can’t code your way out of. Be prepared for those times when even your best developers won’t be able to tell you when – or even if – some bugs will be fixed.

80% of changes are made to 20% of the code

A lot of code is written once, and never changed: static and standardized interfaces, basic wiring and config, back office functions. Then there’s other code that changes all of the time: the 20% of features which are used 80% of the time and need to be tweaked and tuned and occasionally overhauled as needs change; core code that needs to be optimized; and other code that needs to be fixed a lot because it contains too many bugs (back again to the 80:20 bug cluster rule above).

Feathers has found that code that gets changed a lot also tends to get bigger as time goes on, because of a simple, built-in bias:

it is easier to add code to an existing method than to add a new method and easier to add another method to an existing class than to add a new class.

Hot spots in code are easy to find by reviewing check-in history for areas with high churn and through simple static analysis of the code base. This is where you get the most value out of refactoring, where you can do the most to keep the code from losing structure and becoming dangerously unmaintainable – and it is also the code that naturally should get refactored most often as part of making changes (changed more often = refactored more often if you’re refactoring properly).
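
A rough sketch of measuring churn from check-in history, assuming input in the shape of git log --numstat output (added, deleted, path per file per commit – the data below is hypothetical):

```python
from collections import defaultdict

# Hypothetical excerpt of `git log --numstat --pretty=format:` output:
# lines-added <TAB> lines-deleted <TAB> path, one line per file per commit.
numstat = """\
12\t4\tsrc/pricing/engine.py
3\t1\tsrc/util/dates.py
40\t22\tsrc/pricing/engine.py
7\t7\tsrc/pricing/rules.py
18\t9\tsrc/pricing/engine.py
"""

churn = defaultdict(int)
for line in numstat.splitlines():
    added, deleted, path = line.split("\t")
    churn[path] += int(added) + int(deleted)

# Highest-churn files first: prime candidates for ongoing refactoring.
for path, lines_changed in sorted(churn.items(), key=lambda kv: -kv[1]):
    print(f"{lines_changed:4d}  {path}")
```

Running something like this over a few months of history is usually enough to find the handful of files where structure is eroding fastest.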

80:20 and Programming Time

It usually doesn't take long to get something almost working, or something that looks like it works, especially if you’re working iteratively and incrementally, delivering frequently and fast.

But there’s a lot of work that still needs to be done “behind the scenes” to finish things up, to catch the edge cases and handle errors, make sure that the system performs and scales, find and fix all of the little bugs, get the code into shape before it can be deployed. Product Owners/Customers (and managers) often don’t understand why it takes so long to get the “last 20%” of the work done. And programmers often forget how long this takes too, and don’t include this work in their estimates. This is why a developer’s estimates are so often wrong. And why prototyping can be so dangerous in setting unrealistic expectations.

80:20 and Managing Software Development

Keeping the 80:20 rough rule in mind can save you money and time, and improve your chance of success by keeping you focused on what’s important: the features that really matter, the parts of the code where most of your most serious bugs are (and the bugs that take the most time to fix); the parts of the code that are changing the most; and how and where your team really spends their time.

Friday, November 8, 2013

Last night I presented to the Calgary Agile Methods Users Group on "Agile Appsec: Why we Suck at Building Secure Software, and what we can do about it". This is an outline of the problems that we have as an industry building secure software - why we fail at it, why Agile development is blamed for insecure software - and what we can do to build more secure software while still being Agile. I look at different approaches to injecting application security into Agile development: security stories, evil user stories, abuse cases and abuse stories; security sprints; and building security into development, using Microsoft's SDL Agile as a guide.

Wednesday, October 30, 2013

Because Agile development teams work from a backlog of stories, one way to inject application security into software development is by writing up application security risks and activities as stories, making them explicit and adding them to the backlog so that application security work can be managed, estimated, prioritized and done like everything else that the team has to do.

SAFECode also includes a list of secure development practices (operational tasks) for the team that includes making sure that you’re using the latest compiler, patching the run-time and libraries, static analysis, vulnerability scanning, code reviews of high-risk code, tracking and fixing security bugs; and more advanced practices that require help from security experts like fuzzing, threat modeling, pen tests, environmental hardening.

Altogether this is a good list of problems that need to be watched out for and things that should be done on most projects. But although SAFECode’s stories look like stories, they can’t be used as stories by the team.

Security Stories can’t be pulled from the backlog and delivered like other stories and removed from the backlog when they are done, because they are never “done”. The team has to keep worrying about them throughout the life of the project and of the system.

As Rohit Sethi points out, asking developers to juggle long lists of technical constraints like this is not practical:

If you start adding in other NFR constraints, such as accessibility, the list of constraints can quickly grow overwhelming to developers. Once the list grows unwieldy, our experience is that developers tend to ignore the list entirely. They instead rely on their own memories to apply NFR constraints. Since the number of NFRs continues to grow in increasingly specialized domains such as application security, the cognitive burden on developers’ memories is substantial.

OWASP Evil User Stories – Hacking the Backlog

Someone at OWASP has suggested an alternative, much smaller set of non-functional Evil User Stories that can be "hacked" into the backlog:

A way for a security guy to get security on the agenda of the development team is by “hacking the backlog”. The way to do this is by crafting Evil User Stories, a few general negative cases that the team needs to consider when they implement other stories.

Example #1. "As a hacker, I can send bad data in URLs, so I can access data and functions for which I'm not authorized."

Example #2. "As a hacker, I can send bad data in the content of requests, so I can access data and functions for which I'm not authorized."

Example #3. "As a hacker, I can send bad data in HTTP headers, so I can access data and functions for which I'm not authorized."

Example #4. "As a hacker, I can read and even modify all data that is input and output by your application."
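
One way to make an Evil User Story like Example #1 concrete is to turn it into testable validation rules. A minimal, hypothetical sketch – the whitelist pattern and function name below are illustrative assumptions, not a recipe for every application:

```python
import re

# Hypothetical whitelist validator for a URL path parameter: accept
# only plain identifiers, reject everything else by default.
SAFE_ID = re.compile(r"^[A-Za-z0-9_-]{1,32}$")

def is_safe_resource_id(raw: str) -> bool:
    # Anything that is not a simple identifier is rejected: path
    # traversal, SQL metacharacters, URL-encoded separators, etc.
    return bool(SAFE_ID.fullmatch(raw))

# The evil story gives us concrete negative cases to assert against:
attacks = ["../../etc/passwd", "1; DROP TABLE users", "%2e%2e%2f", ""]
assert all(not is_safe_resource_id(a) for a in attacks)
assert is_safe_resource_id("order-42_A")
```

The value of the evil story is exactly these negative cases: they become regression tests that fail loudly if someone later loosens the validation.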

Thinking like a Bad Guy – Abuse Cases and Abuser Stories

Another way to beef up security in software development is to get the team to carefully look at the system they are building from the bad guy's perspective.

In “Misuse and Abuse Cases: Getting Past the Positive”, Dr. Gary McGraw at Cigital talks about the importance of anticipating things going wrong, and thinking about behaviour that the system needs to prevent. Assume that the customer/user is not going to behave, or is actively out to attack the application. Question all of the assumptions in the design (the can’ts and won’ts), especially trust conditions – what if the bad guy can be anywhere along the path of an action (for example, using an attack proxy between the client and the server)?

Abuse Cases are created by security experts working with the team as part of a critical review – either of the design or of an existing application. The goal of a review like this is to understand how the system behaves under attack/failure conditions, and document any weaknesses or gaps that need to be addressed.

At Agile 2013 Judy Neher presented a hands-on workshop on how to write Abuser Stories, a lighter-weight, Agile practice which makes “thinking like a bad guy” part of the team’s job of defining and refining user requirements.

Take a story, and as part of elaborating the story and listing the scenarios, step back and look at the story through a security lens. Don’t just think of what the user wants to do and can do – think about what they don’t want to do and can’t do. Get the same people who are working on the story to “put their black hats on” and think evil for a little while, and brainstorm to come up with negative cases.

As {some kind of bad guy} I want to {do some bad thing}…

The {bad guy} doesn’t have to be a hacker. They could be an insider with a grudge or a selfish customer who is willing to take advantage of other users, or an admin user who needs to be protected from making expensive mistakes, or an external system that may not always function correctly.

Ask questions like: How do I know who the user is and that I can trust them? Who is allowed to do what, and where are the authorization checks applied? Look for holes in multi-step workflows – what happens if somebody bypasses a check or tries to skip a step or do something out of sequence? What happens if an action or a check times-out or blocks or fails – what access should be allowed, what kind of information should be shown, what kind shouldn’t be? Are we interacting with children? Are we dealing with money? With dangerous command-and-control/admin functions? With confidential or private data?

Look closer at the data. Where is it coming from? Can I trust it? Is the source authenticated? Where is it validated – do I have to check it myself?
Where is it stored (does it have to be stored)?
If it has to be stored, should it be encrypted or masked (including in log files)?
Who should be able to see it? Who shouldn’t be able to see it?
Who can change it, and do the changes need to be audited?
Do we need to make sure the data hasn't been tampered with (checksum, HMAC, digital signature)?
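The tamper-detection question can be made concrete with a small sketch. Here is a minimal Python example using the standard library’s hmac module (the key handling and record format are assumptions for illustration, not a production design – in a real system the key would live in a secrets manager, not in the code):

```python
import hmac
import hashlib

SECRET_KEY = b"server-side-secret"  # hypothetical key, for illustration only

def sign(record: bytes) -> str:
    """Compute an HMAC-SHA256 tag to store alongside the record."""
    return hmac.new(SECRET_KEY, record, hashlib.sha256).hexdigest()

def verify(record: bytes, tag: str) -> bool:
    """Re-compute the tag and compare in constant time."""
    return hmac.compare_digest(sign(record), tag)
```

The same pattern applies whether the data sits in a database row, a cookie or a log entry: sign it when you write it, verify it when you read it, and treat a failed check as a security event, not a bug.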

Use this exercise to come up with refutation criteria (the user can do this, but can’t do that; they can see this, but they can’t see that), instead of, or as part of, the conditions of acceptance for the story.
Prioritize these cases based on risk, add the cases that you agree need to be taken care of as scenarios to the current story, or as new stories to the backlog if they are big enough.
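Refutation criteria translate naturally into executable negative tests. A sketch in Python, with a made-up role/permission table (the roles and actions here are hypothetical, purely for illustration):

```python
# Hypothetical permission model: refutation criteria become assertions
# about what a role must NOT be able to do.
PERMISSIONS = {
    "admin": {"view_report", "edit_report", "delete_user"},
    "customer": {"view_report"},
}

def can(role: str, action: str) -> bool:
    """Allow an action only if the role is explicitly granted it (default deny)."""
    return action in PERMISSIONS.get(role, set())

# Acceptance: customers can view reports.
assert can("customer", "view_report")
# Refutation: customers can't edit reports or delete users.
assert not can("customer", "edit_report")
assert not can("customer", "delete_user")
# Refutation: unknown roles can't do anything.
assert not can("intruder", "view_report")
```

The default-deny lookup is the design choice that matters here: anything not explicitly granted fails the check, which is exactly the “can’t do that” half of the criteria.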

“Thinking like a bad guy” as you are working on a story seems more useful and practical than other story-based approaches.

It doesn’t take a lot of time, and it’s not expensive. You don’t need to write Abuser Stories for every User Story, and the more Abuser Stories you write, the easier it gets – you’ll get better at it, and you’ll keep running into the same kinds of problems that can be solved with the same patterns.

You end up with something concrete and functional and actionable, work that has to be done and can be tested. Concrete, actionable cases like this are easier for the team to understand and appreciate – including the Product Owner, which is critical in Scrum, because the Product Owner decides what is important and what gets done.
And because Abuser Stories are done in-phase, by the people who are working on the stories already (rather than as a separate activity that needs to be set up and scheduled), they are more likely to get done.

Simple, quick, informal threat modeling
like this isn’t enough to make a system secure – the team won’t be able to find and plug all of the security holes in the system this way, even if the developers are well-trained in secure software development and take their work seriously.
But Abuser Stories are good for identifying business logic vulnerabilities, reviewing security features (authentication, access control, auditing, password management, licensing), improving error handling and basic validation, and staying onside of privacy regulations.

Thursday, October 24, 2013

I've spent the last 3 years or so learning more about devops. I went to Velocity and Devopsdays and a bunch of other conferences that included devops stuff (like the last couple of OWASP USA conferences and this year's Agile conference). I've been following the devops forums and news and reading devops books and trying out devops tools and Continuous Delivery, talking to smart people who work at successful devops shops, encouraging people in my organization to adopt some of these ideas and tools where they make sense. Looking for practices and patterns and technology we can take and apply to the work that we do, which is in an enterprise, B2B environment.

A problem for us is that devops today is still mostly where it started, rooted in Web Ops with a focus on building and scaling online shops and communities and Cloud services – except maybe where some enterprise technology vendors have jumped on the devops bandwagon to re-brand their tools.

Is there really that much that a well-run highly-regulated enterprise IT organization hooked into hundreds or thousands of other enterprises can learn from a technology startup trying to launch a new online social community or a multi-player online game, or even from larger, more mature devops shops like Etsy or Netflix? Do the same rules and ideas apply?

The answer is: yes sometimes, but no, not always.

There are some important factors that separate enterprises from most devops shops today.

Platform heterogeneity and the need to support legacy systems and all of the operational inter-dependencies between systems is one – you can’t pretend that you can take care of your configuration management problems by using Puppet or Chef in an enterprise that has been built up over many years through mergers and acquisitions and that has to support thousands of different applications on dozens of different technology platforms. Many of those apps are third party apps that you don’t have control over. Some of those platforms are legacy systems that aren't supported any more.
Some of those configs are one-off snowflakes because that’s the only way that somebody could get things to work.

Governance and regulatory compliance (and all the paperwork and hassle that goes with this) is another. Even devops shops don’t handle their highly-regulated core business functions the same as they do the rest of their code (a good example is how Etsy meets PCI compliance).

There are two other important factors that separate many enterprises such as large financial institutions from the way that a devops shop works: the need for speed in change, and the cost of failure.

The Need for Speed

If “How can I change things faster?” is the question, devops looks like the answer.

Devops enables – and emphasizes – rapid, continuous change, through relentless automation, breaking down walls between developers and operations, and through practices like Continuous Deployment,
where developers push out code changes to production several times a day.

Being able to move this quickly is important in early stages of iterative design and development, for online startups that need to build a critical mass of customers before they run out of money, and other organizations experiencing hyper growth. Every organization has some systems that need to be changed often, and can be changed with minimal impact: CRM systems, analytics and internal management reporting for example. And as James Urquhart explains, optimizing for change makes sense when you need to change technology often.

But there are other systems that you don’t need to or you can’t change every day or every week or every month: ERP and accounting systems, payment handling, B2B transactional systems, industrial control. Where you don’t need to and don’t want to run experiments on your customers to try out new features or constantly refine the details of their user experience because the system already works and lots of people are depending on it to work a certain way and what’s really important is keeping the system working properly and keeping operational costs down. Or where change is risky and expensive because of regulatory and compliance requirements and operational interdependencies with other systems and other organizations. And where the cost of failure is high.

Change, even when you optimize for it, always comes with the risk of failure.
The 2013 State of Devops Report
found that high performing devops shops deploy code 30x more frequently, with “double the change success rate”. By themselves these figures are impressive. Taken together – they aren’t good enough. Changing more often still means failing more often than an organization which moves more slowly and more cautiously, and not every organization can afford to fail more often.
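Some illustrative arithmetic shows why the two figures taken together aren’t reassuring. The baseline numbers below are assumed for illustration; only the 30x and one-half ratios come from the report (reading “double the change success rate” as half as many failed changes):

```python
# Assumed baseline: a cautious shop pushes 1 change per month,
# and 10% of changes fail (illustrative numbers only).
slow_deploys = 1.0
slow_failure_rate = 0.10

# Per the report's ratios: 30x the deploys, half the failure rate.
fast_deploys = 30 * slow_deploys
fast_failure_rate = slow_failure_rate / 2

slow_failures = slow_deploys * slow_failure_rate  # 0.1 failed changes/month
fast_failures = fast_deploys * fast_failure_rate  # 1.5 failed changes/month

# The fast shop still fails ~15x more often in absolute terms.
print(round(fast_failures / slow_failures))  # → 15
```

Whatever baseline you plug in, halving the failure rate while multiplying the change count by 30 leaves you with roughly fifteen times as many failed changes.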

The Cost of Failure

Most online businesses exist in a simpler, more innocent world where change is straightforward – it’s your code and your Cloud so you can make a change and push it out without worrying about dependencies on other systems and the people who use them or how to coordinate a roll-out globally across different companies and business lines – and where the consequences of failure are really not that high.

If you’re not charging anything (Facebook, Twitter) or next to nothing (Netflix) for customers to use your service, and if the cost of failure to customers is not that much (they have to wait a little bit to tell people that their kitty just sneezed or to post a picture of their goldfish or watch a movie)
then nobody has the right to expect too much when something goes wrong.

I’ve been told by a tech exec at a bank that Etsy (or any web company) wasn't a “serious” endeavor, that his bank works with “serious money” which means that they can’t “screw around” like web companies do. I've also seen web companies pooh-pooh the enterprise because they're "spoiled" with their small user base and non-24x7 working environments.

Until there is a shared understanding between those groups, the healthy and mature swapping of ideas and concepts is going to be slow.

The cost and risk involved in a failure is several orders of magnitude different between a bank and an online consumer web business, even something as large as Etsy. In a presentation at the end of 2012,
Etsy's CTO boasted that they are now handling “real money”, as much as “$1k per minute” at that time. While that’s real money to real customers at Etsy (and I am sure that this number is higher by now), it’s negligible compared to the value of transactions that any major financial institution handles.

There aren’t any mistakes that a company like Etsy or even Facebook could make that could compare with the impact of a big system failure at a major bank, or a credit card processor or a major stock exchange
or financial clearing house or brokerage, or some other large financial institution.

This is not just because of the high value of transactions that are moving through these systems. It is also because of the chain reaction that such failures have on other institutions in the national and international system-of-systems that these organizations operate in – the impact on partner and customers’ systems, and on their customers and partners and so on. The costs of these failures can run into the millions or hundreds of millions of dollars, much more if you include lost opportunity costs and the downstream costs of responding to the failure (including IT costs for upgrading or replacing the entire system, which is a common response to a major failure), never mind the follow-on costs of increased regulatory oversight that is often demanded across an entire industry after a high-profile system failure.

It’s not enough for many enterprises and even smaller B2B platforms to optimize for MTTR and try harder next time or to accept that roll-back is a myth and that “real men only roll forward” –
and from the continuing stories of high-profile failures at online organizations this isn't enough for devops organizations once they reach a certain size either.

But you can still learn a lot from Devops

It’s not that devops won’t work in the enterprise. Just not devops as it is mostly described until now. Devops overplays the “everything needs to change faster and more often” card, and oversimplifies some of the other problems that many organizations face if they don’t or can’t run everything in the Cloud.
But there is still a lot to learn from devops leaders, even if their stories and their priorities and constraints and their business situations don’t match up with yours.

We can certainly learn from them about how to run a scalable Web presence and about how to move stuff to the Cloud.

We can learn how to take more of the risk and cost out of change management by simplifying and standardizing configurations where possible and simplifying and automating testing and deployment steps as much as possible –
even if we aren't going to change things every day.

But for now, probably the most valuable thing that devops brings to those of us who don't work in online Web shops isn't tools or practices. It’s that devops is creating new reasons and more opportunities for dev and Ops and management to engage with each other, as they try to figure out what this devops thing is and whether and how it makes sense in our organizations.

When you’re trying to do something that you've never done before – or nobody has ever done before. Or when you've done it before but you sure as hell aren't going to make the same mistakes again and you need time to think your way to a better way. Or when you’re trying to understand code that somebody else wrote so you can change it, or when you’re hunting down an ugly bug. All of this can take a lot of time, but in the end you won’t have a lot of code to show for it.

Then there’s all the other work in development – work that requires a lot of typing, and not as much thinking. When it’s clear what you need to do and how you need to do it, but you have an awful lot of code to pound out before the job is done. You've done it before and just have to do it again: another script, another screen, another report, and another after that. Or where most of the thinking has already been done for you: somebody has prepared the wireframes and told you exactly how the app will look and feel and flow, or spec’d out the API in detail, so all you need to do is type it in and try not to make too many mistakes.

Debugging is thinking. Fixing the bug and getting the fix tested and pushed out is mostly typing. Early stage design and development, technical spikes to check out technology and laying out the architecture, is hard thinking. Doing the 3rd or 4th or 100th screen or report is typing. UX design and prototyping: thinking. Pounding out CRUD maintenance and config screens: typing. Coming up with a cool idea for a mobile app is thinking. Getting it to work is typing. Solving common business problems requires a lot of typing. Optimizing business processes through software requires hard thinking.

Someone who is mostly thinking and someone who is just typing are doing very different kinds of work and need to be managed in different ways.

Sometimes Programming is just Typing

Many business applications are essentially shallow. Lots of database tables and files with lots of data elements and lots of data, and lots of CRUD screens and lots of reports that are a lot like a lot of other screens and reports, and lots of integration work with lots of fields to be mapped between different points and then there are compliance constraints and operational dependencies to take care of. Long lists of functional requirements, lots of questions to ask to make sure that everyone understands the requirements, lots of details to remember and keep track of.
Banking, insurance, government, accounting, financial reporting and billing, inventory management and ERP systems, CRM systems, and back-office applications and other book-keeping and record-keeping systems are like this. So are a lot of web portals and online shops. Some maintenance work – platform upgrades and system integration work and compliance and tax changes – is like this too.

You’re building a house or a bridge or a shopping mall, or maybe renovating one.
Big, often sprawling problems that may be expensive to solve. A lot of typing that needs to be done. But it’s something that’s been done many times before, and the work mostly involves familiar problems that you can solve with familiar patterns and proven tools and ways of working.

"I saw the code for your computer program yesterday. It looked easy. It’s just a bunch of typing. And half of the words were spelled wrong. And don’t get me started on your over-use of colons."

Once the design is in place most of the work is in understanding and dealing with all of the details, and managing and coordinating the people to get all of that code out of the door. This is classic project/program management: budgeting and planning, tracking costs and changes, and managing handoffs. It’s about logistics and scale and consistency and efficiency, keeping the work from going off the rails.

Think Think Think

Other problems, like designing a gaming engine or a trading algorithm, or logistics or online risk management or optimizing real-time control systems, require a lot more thinking than typing. These systems have highly-demanding non-functional requirements (scalability, real-time performance, reliability, data integrity and accuracy) and complex logic, but they are focused on solving a tight set of problems. A few smart people can get their head around these problems and figure most of them out. There’s still typing that needs to be done, especially around the outside, the framing and plumbing and wiring, but the core of the work is often done in a surprisingly small amount of code – especially after you throw away the failed experiments and prototypes.

This is where a lot of the magic in software comes from – the proprietary or patented algorithms and the design insight that lies at the heart of a successful system. The kind of work that takes a lot of research and a lot of prototyping, problem solving ability, and real technical chops or deep domain knowledge, or both.

Typing and Thinking are different kinds of work

Whether you need to do a lot of typing or mostly thinking dictates how many people you need and how many people you want on the team, and what kind of people you want doing the work. It changes how people work together and how you have to manage them. Typing can be outsourced. Thinking can’t. You need to recognize what problems can be solved with typing and what can’t, and when thinking turns to typing.

Thinking work can and should be done by small teams of specialists working closely together – or by one genius working on their own. You don’t need, or want, a lot of people while you are trying to come up with the design or think through a hard problem, run experiments and iterate. Whoever is working on the problem needs to be immersed in the problem space, with time to explore alternatives, chances to make mistakes and learn and to just stare out the window when they get stuck.

This is where fundamental mistakes can be made: architecture-breaking, project-killing, career-ending errors.
Picking the wrong technology platform. Getting real-time tolerances wrong. Taking too long to find (or never finding) nasty reliability problems. Picking the wrong people or trying to solve the wrong problem. Missing the landing spot.

Managing this kind of work involves getting the best people you can find, making sure that they have the right information and tools, keeping them focused, looking out for risks from outside, and keeping problems out of their way.

Thinking isn’t predictable. There’s no copy-and-paste because there’s nothing to copy and paste from. You can’t estimate it, because you don’t know what you don’t know. But you can put limits on it – try to come up with the best solution in the time available.

Typing is predictable. You can estimate it – and you have to. The trick is including all of the things that need to be typed in, and accounting for all of the little mistakes and variances along the way – because they will add up quickly. Sloppiness and short cuts, misunderstanding the requirements, skipping testing, copy-and-paste, the kinds of things which add to costs now and in the future.

Typing is journeyman work. You don’t need experts, just people who are competent, who understand the fundamentals of the language and tools and who will be careful and follow directions and who can patiently pound out all the code that’s needed – although a few senior developers can out-perform a much larger team, at least until they get bored. Managing a bunch of typists requires a different approach and different skills: you need to be a politician and diplomat, a logistician, a standards setter, an administrator and an economist. You’re managing project risks and people risks, not technical risks.

Over time, projects change from thinking to typing – once most of the hard “I am not sure what we need to do or how we’re going to do it” problems are solved, once the unknowns are mostly known, it’s about filling in the details and getting things running.

The amount of typing that you need to do expands as you get more customers and have to deal with more interfaces to more places and more customizations and more administrivia and support and compliance issues. The system keeps growing, but most of the problems are familiar and solvable. There’s lots of other code to look at and learn from and copy from. You need people who can pick up what’s going on and who can type fast.

Thinking and Typing

Thinking and typing are both important parts of software development.

In “Programming is Not Just Typing”, Brendan Enrick explains that the reason that Pair Programming works is because it lets people focus on typing and thinking at the same time:

Both guys are thinking, but about different things. One developer has the keyboard at any given time and keeps in his head the path he is on. (This guy is concerned with typing speed.) He types along the path thinking about the code he is currently writing not the structure of the app, but the code he is typing right now. For a short time his typing speed matters.

The programmer in the pair who is not actively typing is spending all of his time thinking. He keeps in his head the path that the typist is taking, not concerned with the syntax of the programming language. That is the other guy thinking about the language syntax. The one sitting back without the keyboard is the guide. He must make sure that the pair stays on the right path using the most efficient route to success.

There’s more to being a good developer than typing - and there's more to typing than just being able to press some keys. It means being good at the fundamentals: knowing the language well enough, knowing your tools and how to use them, knowing how to navigate through code, knowing how to write code – as well as being fast at the keyboard.
Mastering the mechanics, knowing the tools and being able to type fast, so that you're fluent and fluid, are all essential for succeeding as a developer. Don’t diminish the importance of typing. And don’t let typing – not being able to type – get in the way of thinking.

Thursday, October 10, 2013

Agile development – because you are building working software faster and delivering it incrementally –
forces development teams to face a common, fundamental problem: how to balance the work of developing new software with the need to support a system that is already being used in production, whether it’s the legacy system that you’re replacing, or the system that you are still building – and sometimes both.

This is especially a problem for Agile teams following Scrum. On the one hand, in order for the team to meet Sprint goals and commitments and to establish a velocity for future planning, the team is not supposed to be interrupted while they are doing their work.
On the other hand, the point of working iteratively and incrementally in Scrum is to deliver working software early and frequently to the customer, who will want to use this software as soon as they can, and who will then need support and help using the software – help and support that needs to come from the people who wrote the software.

It’s not easy to balance two completely different kinds of work with directly opposed goals and incentives and metrics.
As Don Schueler explains in “The Fragile Balance between Agile Development and Customer Support”, development teams – even Agile teams working closely with their Customer – are mostly inward-looking, internally focused on delivery and velocity and cost and code quality and technical concerns. Support teams are outward-looking, focused on customer relationships and customer experience and completeness and minimizing operational risk.

Development is about being predictable and efficient: deliver to schedule and keep development costs down. Support is about being responsive and effective: listen to the customer, answer questions, fit in unplanned work, figure out problems and fix things right away. Development work is about flow, continuity, predictability, velocity, and, if managed correctly, is mostly under control of the team. Support and maintenance work is interrupt-driven, immediate, inconsistent and unpredictable – a completely different way of working and thinking. Development work requires the team to be drawn together so that they can collaborate on common goals and the design. Most maintenance and support work is disjointed and disconnected, smaller tasks that can be done by people working independently.
Development, even in high pressure projects, is measured in weeks or months. Support and maintenance work needs to be done in days or hours or sometimes minutes.

Agile Support Models: Maintenance Victims

One way that teams try to handle support and maintenance is by sacrificing someone from the team: offering up a “maintenance victim” who takes on the support burden for the rest of the team temporarily, letting the others focus on design and development work. This includes taking calls from Ops or directly from customers, looking at logs, solving problems, fixing bugs. This could mean staying after hours to help troubleshoot or repairing a production problem or putting out a fix, and being on call after hours and on weekends.

The rest of the team tries to pretend that this victim doesn’t exist. If the victim isn’t busy working on support issues or fixing bugs found in production, they might work on fixing other bugs or maybe some other low-priority development work, but they are subtracted from the team’s velocity – nobody depends on them to deliver anything important.

Teams generally rotate someone through support and triage responsibilities for one or two Sprints. This way everyone at some point “shares the pain” and gets some familiarity with support problems and operational issues. There are also positive sides to being sacrificed to support. Developers get a chance to learn more about the system and stretch some of their technical skills, and get off of the hamster wheel of Sprint-after-Sprint delivery for a bit. And they get a chance to play the hero, step in and fix something important and make the customer happy.

Kent Beck and Martin Fowler in Planning Extreme Programming extend this idea to larger organizations by creating a small production support team: 2-4 developers who volunteer to focus on fixing bugs and dealing with production problems. Developers spend a couple of Sprints in production support, then rotate back to development work. Beck and Fowler recommend staggering rotations, making sure that at least one developer is in the first rotation and another in the second so that at least one member of the support team always knows about what is going on and what problems are being worked on.

Sacrificing a maintenance victim or a team makes it possible for most of the rest of the team to move forward on development, while still meeting support commitments.
This approach assumes that anyone on the team is capable of figuring out and fixing any problem in the system – that everyone is a cross-functional generalist. And this means that whoever is on this support rotation has to be good enough and experienced enough that they can deal with most issues without bringing in the rest of the team - you can’t rotate newbies through support and maintenance work, at least not without someone senior backing them up.

And you also have to be prepared for problems that are too big or too urgent for your maintenance victim to take care of on their own. Even with a dedicated team you may still need to build in some kind of slack or buffer to deal with emergencies and general helping out, so that you don’t keep blowing up Sprints. You can come up with a reasonable allowance based on “yesterday’s weather”: how much support work the team has had to do over the last few weeks or months. If you can't make this work – if the entire team is spending too much time on support and fire fighting and pushing hot fixes – then you are doing something wrong, and you have to get things under control before you build more software anyway.
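A “yesterday’s weather” allowance is just simple arithmetic over the team’s recent history. A sketch with made-up numbers (the hours, sprint count and team size here are all assumptions):

```python
# Support hours logged by the team over the last six sprints (illustrative).
support_hours = [32, 18, 40, 25, 30, 35]

# "Yesterday's weather": reserve next sprint's support buffer based on
# the recent average, instead of hoping for zero interruptions.
buffer_hours = sum(support_hours) / len(support_hours)

# Assumed capacity: 5 developers x 60 plannable hours per sprint.
team_capacity = 5 * 60
plannable_hours = team_capacity - buffer_hours

print(round(buffer_hours), round(plannable_hours))  # → 30 270
```

The point isn’t the precision – it’s that the buffer comes from measured history, so the team stops committing capacity that support work is going to eat anyway.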

Agile Support Models: Kanban

Kanban’s queuing model and use of task boards makes it easy to see what work needs to be done, what work is being done, who is doing it, what’s getting in the way, and when anything changes.

Kanban makes it easier to track and manage different kinds of work that require different kinds of skills and that don’t always fit nicely into a 1-week or 2-week time-box.

Kanban doesn’t pretend that you won’t be or can’t be interrupted – instead it helps you to manage interruptions and minimize their impact on the team.
First, in Kanban you set limits on how much of different kinds of work the team can deal with at a time.
This lets the team get control over work coming in, and stay focused on getting things done.
Kanban’s queue-and-task model allows emergencies to pre-empt whatever work is in progress through escalation/priority lanes.
And priorities can keep changing right up until the last minute – team members just pull the highest priority work item from the ready queue when they are free to take on more work,
whether this is designing and developing a new feature, or fixing a bug, or dealing with a support issue.

Kanban helps teams focus more on immediate, tactical issues. It’s a better model to follow when you have more maintenance and support work than new design and development, or when you have to assert control over a major problem or manage something with a lot of moving pieces like the launch of a new system.
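The pull model described above is easy to sketch. Here is a toy version in Python, using a heap as the ready queue and a WIP limit (the work items, priorities and limit are made up for illustration):

```python
import heapq

# Ready queue: (priority, description) – lower number = more urgent.
# Priorities can be reshuffled right up until an item is pulled.
ready = [
    (3, "new feature: export report"),
    (1, "production bug: login fails"),
    (2, "support: customer data question"),
]
heapq.heapify(ready)

WIP_LIMIT = 2   # Kanban's limit on concurrent work-in-progress
in_progress = []

# When team members free up, they pull the highest-priority item –
# whether it's a feature, a bug fix or a support issue – until the
# WIP limit is reached.
while ready and len(in_progress) < WIP_LIMIT:
    priority, item = heapq.heappop(ready)
    in_progress.append(item)

print(in_progress)
# → ['production bug: login fails', 'support: customer data question']
```

The WIP limit is what gives the team control over incoming work: the new feature stays in the ready queue until something in progress is finished, no matter how much is piled up behind it.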

Devops Changes Everything

Devops, as followed by organizations like
Etsy
and Facebook
and Netflix (where they go so far as to call it NoOps)
tries to completely break down the boundaries between development, maintenance, support and operations. Devops engages developers directly and closely into the support, maintenance and operations of the systems that they build. Developers who work in these organizations are not just writing code – they are part of a team running an online service-based business, which means that support work is as important, and sometimes more important, than designing and writing more software.

In these organizations, developers are held personally responsible for their software,
for getting it into production
and making sure that it works. They are on call for problems with software that they worked on. They are actively involved in operations of the system, providing insight into how the system works and how it is running, in testing and configuring it and tuning it and troubleshooting problems.

Devops changes what developers work on and how they do it. They move away from project work and more towards fast feature development, fixing, tuning and hardening. Availability and reliability and performance and security and other operational factors become as important – or more important – than delivery schedules and velocity. Developers spend more time thinking about how to make the system work, how to simplify deployment and setup and about the information that people need to understand what’s going on inside the system, what metrics and tools might be useful, how to handle bad data and infrastructure failures, what could go wrong when they make a change and who they need to check with and what they need to test for.

This is not just because they are usually the only people who can actually figure out and fix many problems.

Putting aside moral hazard arguments about whether it’s ethically acceptable for developers not to take full responsibility for the consequences for their decisions and the quality of their work, there are compelling advantages to developers being directly involved in supporting and maintaining the software that they work on.

The most important is the quality of the feedback that developers get from supporting a real system – feedback that is too valuable for them to ignore.

Real feedback on what you did right in building the system, and what you got wrong.
Feedback on what you thought the customer needed vs. what they really need. What features customers really find useful (and what they don't).
Where the design is weak.
Where most of your problems are coming from (the 20% of the code where 80% of the bugs are hiding),
where the weaknesses are in your testing and reviews,
where you need to focus and where you need to improve.
Valuable information into what you’re building and how you’re building it and how you plan and prioritize, and how you can get better.

When developers are called in to fight production fires and sit in Root Cause Analysis reviews, they can learn an enormous amount about what it takes to build software for the real world. Thinking seriously about how problems happened and how to prevent them can change how you plan, design, build, test and deploy software; and how people work together as a team.

Farming all of this off to someone else, filtering it through a help desk or an offshore maintenance team, breaks these valuable feedback loops, with negative effects for everyone involved.

In a startup, developers take care of problems themselves, well, because there isn’t anybody else to do it. But at some point things change:

“…managers decided that we were spending far too long investigating users’ problems and not long enough building the new features the business wanted. Developers needed to be more productive, and more productive meant developers developing more new features. To get developers to develop they need to be ‘in the zone’. They need headphones and big screens to glue their eyes to. They did not need petty interruptions like stupid users ringing up because they got a pop up saying their details will be resent when they tried to refresh.”

But by doing this, the development team became disconnected from the results of their work, and from their customers…

“A systems thinker would tell you this is wrong. You’ve gone from a system that connected a user to the team responsible with one degree of separation, to one that has three degrees of separation. Or think of it another way: the team producing the product, and responsible for improvements and fixes used to be one degree away from their end users, who use the product and are feeding back the product’s shortcomings and issues, but are now three degrees. And not even three degrees all of the time. The majority of the time the team won’t ever hear about most of the support issues. And most of the time the team won’t even have that much interaction with the team that does hear about most of the support issues.”

The result: Customers don’t get the support that they need. Developers don’t get the information that they need to understand how to make the system work better. A support team stuck in the middle with people just trying to keep things from getting worse and hoping to find a better job someday. It’s a self-reinforcing, negative spiral.

In our shop, support takes priority over development – always. Our senior developers work with operations to support the system: they are on call when we put new software in, and on call if something goes wrong after hours. They can bring in anyone else from any team that they need for help. As a result, we have very few serious problems, and these problems get fixed fast and fixed right. The experience that everyone gets from working in support helps them to design and write better, safer code. This has made the system more resilient, easier and less expensive to support, and safer to set up, run and change. And it has made our organization better too. It’s brought developers and operations closer together, and closer to what’s important to the business.

Whether you call it “Agile” or not, there’s nothing more agile than a team that is working directly with customers, responding immediately to problems and changing requirements in a live system. While some developers and managers think of this as overhead – “sustaining engineering” – and try to push it off to somebody else so that they can focus on “more strategic” work, others recognize that this is really the leading edge of software development: the only way to run a successful software organization, and the only way to make software, and developers, better.
