Featured in
Architecture & Design

Mini-talks: The Machine Intelligence Landscape: A Venture Capital Perspective by David Beyer. The future of global, trustless transactions on the largest graph: blockchain by Olaf Carlson-Wee. Algorithms for Anti-Money Laundering by Richard Minerich.

Agile, Architecture and the 5am Production Problem

Author Michael Nygard counts himself among those who still believe there is such a thing as architecture. In his InfoQ article Agile, Architecture and the 5am Production Problem, Nygard walked the reader through the whodunnit mystery of a real production problem. The surprising conclusion illustrated his message that building applications for the real world, and not just QA, requires a failure-oriented mindset and strong defensive programming tactics. The article poses a challenge to the Agile community's ideas about what constitutes "just enough" architecture.

Nygard's new book "Release It! Design and Deploy Production-Ready Software" from the Pragmatic Programmers has been Amazon's top "Hot New Release" in the Software Design category for the last month. This article expands on a story the author told in the book, explicitly relating it to the Agile approach he has practiced since back when they were called "lightweight methodologies":

Agile methods tell us a lot about how to build functional software that changes easily over time. Programmers created techniques such as unit testing and refactoring for use by other programmers, and they improved the craft as a result. For the most part, though, agile methods focus on the interior of the system boundary. In the agile community, debate continues about how much attention we should pay to the architecture of things outside the application boundary. The most extreme adherents (or should that be "eXtreme" adherents?) say, "Let the architecture emerge from relentless refactoring and vigorous unit testing!"

I am an agile developer and architect, but you should count me ... among those who think architecture must stay grounded in implementation. A good architecture is one that survives contact with the real world. A bad one creaks and groans its way through the day, chewing up people and computers. I have often observed that architects who retreat into abstractions create architecture that cannot be built successfully.

The article told the true story of an interesting and unexpected failure that would only occur in the wee hours of the morning after a quiet period on the website: an application that would hang at 5 AM every day, involving a database that was only ever queried. The guilty parties---simultaneously the victims---were a web server, a database server, and a firewall. For those whose first reaction is to think "there's no way to create a deadlock if you're just querying": you'll be interested to see what Nygard uncovered.
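
One generic defense against this whole class of hangs is to bound every blocking network call. The Python sketch below (hypothetical names, no connection to Nygard's actual code) shows a peer that accepts a connection and then goes silent, and a client whose socket timeout turns an indefinite hang into a recoverable error:

```python
import socket
import threading

def start_silent_server():
    """Accept a connection but never send a byte; from the client's
    side this looks just like a peer whose replies are silently dropped."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve():
        conn, _ = srv.accept()
        threading.Event().wait()  # hold the connection open, say nothing

    threading.Thread(target=serve, daemon=True).start()
    return port

def query_with_timeout(port, timeout_s=1.0):
    """Defensive client: without the timeout, recv() below would block
    forever, which is exactly a 5 AM hang."""
    client = socket.create_connection(("127.0.0.1", port), timeout=timeout_s)
    try:
        client.sendall(b"SELECT 1\n")
        return client.recv(1024)
    except socket.timeout:
        return None  # fail fast; the caller can retry or raise an alert
    finally:
        client.close()
```

Here `query_with_timeout` returns None rather than hanging when the peer never answers; whether to retry, alarm, or degrade is then an explicit design decision.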

Nygard used the story to illustrate what he calls the "failure-oriented mindset", which doesn't mean that he expects his projects to fail, but rather that he builds his systems under the assumption that somehow, someday, every single piece of the architecture will fail. In his book, he recommended building a deviously malicious test harness to simulate many kinds of failures, from simple wire breaks to responding with the wrong protocol.
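
The kind of harness Nygard describes can start small. A Python sketch of a deliberately misbehaving endpoint (the behavior names here are illustrative, not taken from the book):

```python
import socket
import threading

def evil_harness(behavior):
    """Start a deliberately misbehaving server on a free local port."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve():
        conn, _ = srv.accept()
        if behavior == "slam_shut":
            conn.close()  # accept, then close without sending any data
        elif behavior == "wrong_protocol":
            # an SMTP banner where the caller expected something else
            conn.sendall(b"220 smtp.example.com ESMTP\r\n")
            conn.close()
        elif behavior == "hang":
            threading.Event().wait()  # accept and never respond

    threading.Thread(target=serve, daemon=True).start()
    return port
```

Pointing an integration test at `evil_harness("wrong_protocol")` instead of the real dependency is one cheap way to verify that client code fails deliberately rather than mysteriously.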

This article voiced a challenge to the Agile community, where those who espouse the "just-enough-architecture" mantra still don't agree on what this actually means in practice. Certainly, Feature Driven Development and XP, as proposed in their respective canonical articles and books, are far apart on how to deal with this. Agile, Architecture and the 5am Production Problem points to an area in which, Nygard believes, agile methods are conspicuously silent.

Agile has expanded to embrace the Testing discipline, and has more recently been reaching to work out better interactions with disciplines like Technical Writing and Usability. Is the discipline of Architecture another candidate for harmonization with Agile practices, or does Agile already embody adequate principles and practices to build such robust architectures?

I agree. On decent-sized projects, bottom-up design techniques such as TDD are not enough. You also need to do some up-front work and think about quality attributes. I think that architectural requirements, expressed as user stories (so that implementing them also brings business value), should be emphasized in the first few iterations.

Nice article. The same once happened to me, too. See www.epischel.de/wordpress/?p=45. I agree with you that most failure points in the field are integration issues - in particular when practicing TDD. In most cases you would try to mock up external systems. And even if you use the real one, you probably won't test in your production environment. In another, much bigger project, the first failure during load-testing was (against all odds) a network switch that failed under heavy network load only. Should we take that into account when developing software? When human lives depend on it - yes. But otherwise?

Me too. I've never believed (or found) that architecture and agile are mutually exclusive. Overblown and over-engineered architectures, sure, but then these should have no place in the waterfall world either. I do have some sympathy for the "let the architecture emerge" school of thought, and certainly no architecture should be so inflexible as to never change, but properly skilled architects (who are grounded in something more technically relevant than just PowerPoint..) ought to be able to lay down some appropriate guidelines to meet NFRs and shifting requirements that save buckets of time later. The trick is to get 'just enough' in place, and not to get carried away with gold-plating the design when you can't be 100% sure it won't all change in the next iteration.

I tend to think that the author expects to solve all problems with Agile/TDD/XP methods, which is obviously not correct. I would agree that integration problems tend to be very complex ones and not possible to handle in "unit" testing environments, but as usual a very important aspect of agile development is forgotten - and this aspect is "evolution".

We can't fix/test/predict integration problems with our unit tests, whatever methodology we are using. It is indeed very complex, or even impossible, to create fake interfaces that will match 100% the interfaces of the integration point, but what we can do with Agile/TDD/XP/(put your agile method here) is to make system evolution simple!

I would never assume that TDD will replace normal functional testing and will solve 100% of the problems a system will have in the production environment, but I strongly believe that high unit test coverage and a big number of automated end2end functional tests will help us deliver new functionality faster without worrying about breaking existing functionality.

We also have a lot of discussions in our company about the value of automated unit/functional testing, and only one conclusion is feasible for me - all automation tests are by nature regression tests.

Yup. As I posted in my blog, I found you need to take a mixed approach: come up with a basic architecture and evolve it with the project.

1. Set the first one or two iterations as architectural ones. Some of the work in these iterations is to spike technological and architectural risk. Nevertheless, most of the architectural iterations are still about delivering business value and user stories. The difference is that the prioritization of the requirements is also done based on technical risks and not just business ones. By the way, writing quality attribute requirements as scenarios makes them usable as user stories and helps customers understand their business value.

2. Try to draw on prior experience to produce the baseline architecture.

3. One of the quality attributes that you should bring to the table is flexibility - but be wary of putting too much effort into building this flexibility in.

4. Don't try to implement architectural components thoroughly - it is enough to run a thin thread through them and expand them when the need arises. Sometimes it is even enough just to identify them as possible future extensions.

5. Try to postpone architectural decisions to the last responsible moment. However, when that moment comes - make the decision. Try to validate architectural decisions by spiking them out before you introduce them into the project.

PS Here is the correct link to the post I made on quality attributes - the link in my first message is broken.

It's not so much that I expect TDD or XP to solve everything, but I am a strong proponent of agile methods and the agile values.

So, this isn't so much meant to complain that there was a problem we didn't find through unit testing as it is meant to draw a parallel.

We (using the "royal we" for a moment) invented and adopted unit testing to solve our own problem of producing buggy code. Here, I see a similar problem.

I often hear XPers say there should be no architecture up front, that it should all emerge through the practices. On the opposite end of the spectrum, there are the Zachman framework types that want to define the world before any projects can begin. Even on the most pragmatic of agile teams, there's still a kind of connotation that some amount of up-front architecture is probably necessary, but it's a compromise---a necessary evil.

That leaves us wide open to this kind of problem, and myriad others that I've seen. Failures in the white space. Cracks originate in the gaps between boxes.

Is there something analogous we could invent to address architecture issues while remaining consistent with agile values?


As I said above - I think this can be handled within the practices of agile development: express architectural constraints as user stories by demonstrating how the concern is manifested in the application. You can then prioritize and handle them like other user stories (you can look at an architect as a kind of technical product owner).

I'm not sure I agree that there is any conflict at all. Nobody, to my knowledge, has ever said that Agile or TDD is a silver bullet.

But, with respect to this particular example (in the article), I'm not sure it has anything to do with Agile or TDD at all. Production problems happen - Agile or not.

The fair question to ask would be 'could I have avoided this problem?' And if your answer is yes, that you could have foretold this problem, the next question would be 'with what accuracy?'

By and large, we in the technical community are TERRIBLE fortune-tellers. You will miss things. But if you try to foresee all, you will over-design, over-complicate, and increase your code debt.

I think the question should be stated differently - 'What kinds of architectures evolve?' If TDD (via refactoring) is a local improvement - isn't this analogous to a steepest-descent algorithm for design? We know that steepest descent gets stuck in local minima. Does TDD really give us a good architecture? Does it give us good-enough architecture? Or does the local nature of TDD preclude evolving towards an acceptable architecture?
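
The analogy can be made literal with a toy example. In the Python sketch below (function and step size chosen purely for illustration), steepest descent settles into whichever minimum is nearest its starting point and never sees the better one over the ridge:

```python
def steepest_descent(grad, x0, lr=0.01, steps=2000):
    """Purely local search: always step against the gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# f(x) = x**4 - 3*x**2 + x has a shallow local minimum near x = 1.13
# and the true (global) minimum near x = -1.30.
grad = lambda x: 4 * x**3 - 6 * x + 1

from_right = steepest_descent(grad, x0=2.0)   # settles near 1.13
from_left = steepest_descent(grad, x0=-2.0)   # settles near -1.30
```

Whether incremental refactoring behaves like this in practice is exactly the commenter's open question; the sketch only shows why "local improvement" and "globally good" are not the same thing.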

Actually, YAGNI says no fortune-telling because we tend to be wrong more than we tend to be right. You are within YAGNI if you have an architecture in mind and then wait to evolve it in that direction when the requirements ask for it.

When I heard of refactoring, I remember thinking, "Now I know what architecture is. It's the stuff that's hard to refactor!" I guess that's the art of "just-enough" or delaying decisions--knowing when you have to make them. There are some decisions that do have to be made earlier.

One agile value is "don't throw stuff over the wall." I've almost always had to support what I wrote, and that forces a production mindset. I don't want the phone to ring at 5 AM, and if it does, I want the problem to be obvious. So I build in monitoring and logging functionality from the start. I guess I could cover proper behavior of logs and monitors with unit tests. Find a copy of "Writing Solid Code." It's 10 years old, and C-centric, but I learned a ton from that book.

Another agile value is "test early and often," and I guess that can include load testing. I like to build the simplest possible feature that spans all of the components in the architecture, and load test that. As you test, log and graph CPU, memory, network, and disk I/O on all components, and you will start to see patterns long before flames start shooting out. If you have underpowered hardware, all the better. You're trying to see where and how the software breaks.
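
A sampler for that kind of monitoring can be tiny. A Python sketch (the metric sources here are placeholder callables; in real use they would read CPU, memory, or I/O counters from the OS):

```python
import csv
import io
import time

def sample_metrics(sources, interval_s, samples):
    """Poll each metric source on a fixed interval and keep the series.
    `sources` maps a column name to a zero-argument callable."""
    rows = []
    for i in range(samples):
        row = {"t": round(i * interval_s, 3)}
        for name, read in sources.items():
            row[name] = read()
        rows.append(row)
        time.sleep(interval_s)
    return rows

def to_csv(rows):
    """Dump the series as CSV, ready to graph and stare at for patterns."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Run one of these alongside the load test on every box in the architecture, then line the graphs up by timestamp; the interesting failures show up as correlated kinks across components.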

I consider YAGNI to be a good guideline for a good architecture - i.e., one that is only as complicated as it needs to be. Having no architecture at all, and no-one explicitly or implicitly responsible for it, is a sure recipe for failure.

Am I the only one reading this who is amazed that you didn't have a test environment that exactly replicated (down to the firewalls used) the production environment?

Saying that you CAN'T test your production architecture because you don't bother to is not a good enough answer. There are some things you cannot test effectively, but firewall rules should definitely be the same. Where do Agile Development methodologies recommend that functional testing be done in an environment that does not mimic production?

Having said that, I admire your skill at finding the problem, and this is a good write-up of how to do this sort of low-level packet sniffing.

You make a great point, and one that I address in the book. One of my major themes is getting grounded and connecting with the actual deployment environment. It's the only way to have true confidence in what you deliver.

Most companies will not build exact replicas for their test environments, though. They choose to save a bit of money by eliminating expensive network components like firewalls and hardware load balancers. This is a penny-wise, pound-foolish decision. Whatever money they save on network equipment will surely be lost in production outages. Nevertheless, budgetecture happens, particularly in QA.

Sometimes, it's not as much a budget issue as it is a knowledge gap. Development may not know what the enterprise network will be, particularly if development is outsourced. Other times, the network architecture changes late in the game. I've heard, "We can't disrupt the QA environment now! We're too busy getting ready for release to lose a day while you change the network." Of course, what happens then when it does hit the real network?

Anyway, I always fight to have the QA network match the production network. About 50% of the time, I win that fight.

I guess I don't see the mapping between the problem encountered and either Architecture or Agile. I am not sure what the author proposes to do differently.

An underlying concept in Agile is that not everything can be foreseen. I am not sure what could have predicted this problem nor discovered it quicker than actually fielding the software. It is only by fielding quickly that we can discover what we don't know.

There were several references to "no architecture". That's simply not possible. Everything has one whether you gave it any thought or not.

What Agile doesn't do is try to design and build it before doing any other coding, which often involves trying to predict every darned thing the application(s) will ever need. Just build enough to support today's needs, and be mentally and technically prepared to add, change, or remove bits as the app evolves.

To the original post - this was a fascinating story. That detective work would be beyond me and the project teams I know in my company. We do some "unplug the network cable" testing for failures, but this situation would have been way hard to predict and test for.

Have to agree with this one. There's no evidence here that a more "architectural heavy" approach would have inexorably discovered this flaw, nor is there any definitive proof that "Agile" and not "gross developer error" was the root culprit. It's a great case study right up until you start Agile Bashing. At that point, this article becomes FUD, pure and simple. Michael, please stop spreading it.

"Agile Bashing" and "FUD" are both very incendiary terms... not conducive to conversation at all. Neither is asking people to be silent. I will attempt to respond to the substance of your comment rather than the terms you've put it in.

My purpose here is certainly not to bash Agile. I've been a proponent and practitioner since before the moniker existed. I was doing unit testing, pairing, refactoring, and short iterations back when it was all just called "XP" or, more generally, lightweight methods. Several years back, I even quit my job to start a company explicitly built on agile methods. More recently, I spent an intense year in a fully agile Scrum/XP project. In the first 8 months, we delivered what the client had failed to deliver over the previous 2 1/2 years. In the next 4 months of my time on that project, we did six additional releases.

I'm speaking from within the Agile community, not from outside of it.

I can see that several people have misread my intention. I blame myself, as the author, for not being clear enough. I will try to make myself more clear here in the comments.

I don't attribute the failure here to a "failure of agile". Nor do I expect that agile methods, as formulated today, should have prevented this problem.

What I am presenting is a problem that has two very difficult characteristics:

1. It's virtually impossible to predict.

2. It only occurs in the actual production environment.

I'm drawing an analogy to unit testing. In days past, people thought it was impossible to test software within the development environment. Testing was done in a test lab, by testers, using testing tools. We have rewritten those rules. We now understand that unit testing won't catch every bug, but it sure catches a lot of them. (And, yes, unit testing also motivated changes in the way we design the code itself. We don't mind that much, since the design changes needed for unit testing are all "virtues" that we endorse anyway: decoupling, isolation, single-responsibility, and so on.)

Furthermore, we use automation to solve problems once and keep them solved. So, once a bug is discovered, we write a test to verify the bug. Once we fix the bug, the test acts as a barrier to keep the bug from re-emerging. We use our suite of automated tests to "nail down" the functionality. (And, they allow us to retain existing value while incrementally adding more value.)
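
That nail-it-down loop is worth seeing in miniature. A hypothetical Python example of the pattern (not from the article): a bug is found, a test is written to pin it, and the test then stands guard against regression:

```python
import unittest

def parse_quantity(text):
    """Once rejected padded input like '  3 '; the .strip() is the fix."""
    return int(text.strip())

class QuantityRegression(unittest.TestCase):
    def test_whitespace_padded_input(self):
        # Written the day the bug was found; it failed until the fix
        # went in, and it keeps the bug from quietly coming back.
        self.assertEqual(parse_quantity("  3 "), 3)
```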

As a practice, automated unit testing supports many positive virtues. We don't expect it to prevent or solve every problem. There are known challenges---areas that work, sort of, but not very well---mostly around databases and GUIs. Despite those challenges, I would never give up unit testing.

My point in this piece is to ask a question, not to bash anyone or anything. Can we think of a practice, consistent with agile values, that would advance architecture work the same way that unit testing has advanced coding? I am asking this question by using a specific example of a general class of problems to illustrate a difficult, costly situation that I would like to have avoided rather than solved.

I'm asking this question because I see a need for more connection to the actual deployment environment: filesystems, servers, networks, databases, etc. There are times and places for isolation, but we cannot always be isolated from the deployment environment. By the way, this "disconnectedness" is not unique to agile developers. I suspect that the best solution to disconnectedness will come from the agile community. The Ivory Tower architects have already had their whack at it---and they responded with even larger diagrams that got further disconnected from the real environment.

I very much want to avoid problems like this one, but I've got a hundred stories like this. Some come from agile projects, most come from non-agile projects. Some come from projects with heavy "big architecture up front", some come from projects with incrementally developed architecture. I'm certainly not blaming "agile" for these problems. I'm looking for a solution to them, and for that solution, I'm asking the agile community if we can find a practice that fits our values: incremental, automated, expressive, "just-in-time", self-describing, executable documentation, and enabling.

Like software testing, traditional (heavy) approaches to architecture have not moved the needle on the quality gauge. Let's see if there's a way to do for architecture what unit testing did for software quality.

I think the issue is that there will always be unknowns in software development, and any test environment will always be just an approximation of the actual operating environment. We can only make our best guesses at what parts of the physical and operational environments are significant, what needs to be accurately reflected, what needs to be simulated or approximated, and what can be ignored.

Yes, we should do the best we can in testing given real-world constraints. The value, though, that we get from agile development is risk reduction through limiting the costs sunk in an absolute failure. By having multiple, iterative releases, a failed iteration can be rolled back with only the sunk cost of 2-4 weeks of effort; however, we must plan for and be prepared to roll back.

Iterative development moves us past the risks of all-or-nothing one-shot project development. The way to address the risks of unknowns is to push them to the front of the queue and force them to arise as soon as possible. There is no way to plan to avoid the unknown, one can only force it to arise as early as possible allowing time to recover from it.

How would upfront architectural design have prevented this problem, by Steven Gordon

Really nice article! One thing that was nicely illustrated was that discovering and fixing this problem was a team effort, not just a heroic effort by one "architect".

What I do not see is how doing upfront architectural design would have prevented this from occurring (except armed with 20/20 hindsight).

It seems to me that the same problem could easily have occurred in a waterfall project. The lack of unit tests and functional tests (and likely code bloat to handle dozens of other potential problems that never do happen in production) in most waterfall projects would have made the set of possible causes orders of magnitude larger. Being less sure that each unit was working correctly and that the system works correctly under normal conditions, discovering the root cause would have been much more difficult.

Furthermore, once a fix was determined, establishing that the fix did not break anything else would have also been much more difficult without all those automated unit and functional tests.

I do not even see how doing a few iterations of architectural spikes at the beginning of the project would have prevented this anomaly. In my experience, making sure the first story really goes end-to-end forces a slice of each architectural component to be implemented. This gives us the best of both worlds - validating everyone's understanding of the requirements as well as laying out and testing the architectural approach.

I don't think it's a methodology or architecture problem, though I do think asking the question that way moves the course of inquiry in productive ways.

If all methodologies strive for predictable outcomes, and if patterns that worked before are trusted more than unfamiliar patterns, can we arrive at this generalized principle that all software development projects will apply patterns (architectural, methodological, testing practice, and coding practice) based on the dominant makeup of the team and little else?

I believe this to be true. Given a brilliant technologist with incredible people skills and lots of self confidence vs. a more junior technologist with less self confidence but a better solution, which will a business person choose to listen to? How will a business person make budget decisions? Who will they trust their reputation to?

In IT, more than any other industry, people are the dominant success factor. Agile development recognizes this. Traditional waterfall recognizes it much less. The false concept of man-months recognizes it not at all.

In fifteen years, I've never seen a project that failed for purely technical reasons. Political? ROI? Bad management? Yes. If, by failure, we mean over budget and late, we can always tie the failures back to our inability to estimate accurately the coding effort of a feature more than a few weeks out.

Why? There is a cliché: "In the time it took to figure out the requirements and write up the spec, I could have coded the feature." I believe this is the seed that germinated into short iterations in all the modern methodologies. It also speaks to our inability to estimate effort without knowing all the details.

Once you 'know the details', the code flies out of your head and onto the screen. Except for those pesky parts where surprising new details are discovered. We'll save that for the next iteration :-)

Nice to read your writing again, Mike.

On a pragmatic note, I don't think the team or methodology exists that can accomplish what you suggest; namely, covering all production scenarios. Based on my aerospace manufacturing experience, all design refinement is based on feedback loops from production experience.

Where to place the fuel tanks is influenced by what the intended top speed of the aircraft should be, which influences the material of the tanks, which influences their shape and manufacturing method, which influences the type of fuel sensor, which influences which side of the fuel tank its mounted on, which influences where to place the tanks...