The Agile Coach on Production Support

One of the biggest struggles I’ve seen in organizations adopting Agile is in the area of Production Support. Every organization that has a product to support has to manage this; it’s a critical part of the business. Managing a 24/7 Production environment carries a lot of responsibility: middle-of-the-night calls, working on code you didn’t write. Speed, precision, reaction time and problem-solving are crucial.

The struggle typically comes in the form of pulling people away from active Scrum teams to make whatever hot fixes are necessary. This is clearly disruptive to the Scrum development team. Often a team member is pulled away to work on something that’s very unpredictable. Who knows how long they’ll be gone from the team: an hour, a day, a week? This disrupts our rhythm and velocity. It makes planning difficult. Team cohesiveness suffers. The person getting yanked to do the Production work usually isn’t too happy either, and the context switching makes them less productive. In the end, we actually get less done!

Alas, we still need to get these ongoing high-priority bug fixes out the door ASAP. We need the ability to re-prioritize at any time. We need focus. And we know that two-week sprints aren’t a good fit for this kind of work.

So how do we combat this? How can we keep our Production environment running smoothly while also allowing our Scrum teams to deliver the work they’ve committed to for their Product Owner?

My preference, and one that I’ve seen work well in many organizations, is to create a new Production Support team that works these hot fixes as part of a Kanban effort. The only negative feedback I’ve heard on this is that it seems like punishment to some (think of a Gulag in northern Siberia). To combat this feeling, we’ve tried cycling folks on and off the team so they don’t burn out. As any good manager knows, it goes a long way to show this group your appreciation in some way: a really nice work environment, new laptops, snacks and drinks, etc. A sincere pat on the back also works wonders. Get creative. Additionally, there may be some folks who don’t fit well in your new, highly collaborative team culture. They’re very smart folks, but more of the lone-wolf variety. This team could be a natural fit for them.

At the end of the day, technical debt is what keeps us in this death spiral. We want fewer bugs in Production, and to get out of the spiral we need to improve our technical practices, our code quality, our testing, our architecture, and our design. It’s hard to make any progress if we keep patching up something that wasn’t good to begin with. Sometimes it’s easier in the long run to scrap it and start fresh (I know, that’s a hard pill to swallow).

8 Responses to The Agile Coach on Production Support

An interesting article, Mike, and a challenge we’re currently experiencing, having moved from a waterfall style of delivery to Agile (Scrum) over the last 18 months. We currently have 2 development teams of 12 developers each and a permanent support team of 4 support developers.

Five years ago, well before our adoption of Agile, we moved from performing support on a rota basis to creating a permanent support team, primarily covering 2nd- and 3rd-line support tasks. This works up to a point, but as we mature in our Agile fluency and increase our rate of delivery, it puts a different challenge in our support team’s way.

The development teams release changes into production and then perform a managed handover into the support team. The support team then manages all issues on our production systems. Often the support team need to call on our dev teams to help resolve issues.

This feels to me very much like a “hand-off”, over-the-wall mentality, and it contradicts the core concepts of DevOps and of maintaining an effective flow of work.

In our emerging Agile world, the resolution of these issues feels inefficient and ineffective: the time taken and the effort made to resolve issues is more than it should be, and, lacking depth of understanding of the system and the problem, we often don’t address the root cause. If the dev team that has expertise in the system(s) and domain tackled these issues, there would be more of a need to react, and this would impact velocity, but we’d probably get a quicker and more effective resolution. If the team feels the pain and is empowered to address it, they’re more likely to tackle the root cause.

We’re seriously thinking of moving production support for systems back into our dev teams and managing it within our backlogs. No decision has been made on this yet, but I’d be interested in your thoughts.

I’ll throw out a couple of other ideas for handling production issues:

Keeping the maintenance work within the team, but rotating responsibility for production issues. This approach avoids the “maintenance team ghetto” problem, but it won’t work if team members are too specialized in what areas of the code they can fix.

Treating maintenance issues as overhead. Expand the overhead part of planning to include time devoted to production issues. If these issues are too volatile (i.e., they happen too randomly, and the time to resolution is both large and unpredictable), this approach will be tough. However, it will give you an incentive to deal with whatever technical debt is creating this volatility.

Mike – I lived this exact challenge. With shrinking budgets, forming a dedicated production support team with sufficient depth of skills was no longer realistic. With critical production support requirements, sometimes an ops issue simply took priority over the current sprint backlog. So we planned a support story into each sprint and sized it, partly based on history and with consideration for any known upcoming events that might affect its relative size in that sprint. Tasks were created to track what actually had to be performed. If more was needed than had been planned, the Product Owner was fully engaged and made the final decision. Each sprint accounted for all the real work that was done. At the very least, we got traceability into which unplanned issues occurred and how they affected the original sprint plan. Unplanned, truly urgent work does, however, cause all the problems you identified. I agree the best bet is to invest in quality measures to minimize this intrusion. But despite all best efforts, these challenges will still occur, and not every factor is within the team’s control.
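The history-based sizing of a support story can be sketched as a small calculation. This is a minimal sketch under stated assumptions, not the commenter’s actual method: the function names, the three-sprint averaging window and the flat allowance for known upcoming events (such as a planned release) are all hypothetical.

```python
from statistics import mean


def size_support_story(past_support_hours, window=3, known_event_hours=0.0):
    """Hours to reserve for the support story in the next sprint.

    Hypothetical heuristic: average the support hours actually logged in
    the most recent `window` sprints, then add an allowance for known
    upcoming events (e.g. a planned release).
    """
    if not past_support_hours:
        raise ValueError("need at least one sprint of support history")
    recent = past_support_hours[-window:]
    return float(mean(recent)) + known_event_hours


def needs_po_decision(planned_hours, logged_hours):
    """True once logged support work exceeds the planned story, i.e. the
    point at which the Product Owner decides what gives way."""
    return logged_hours > planned_hours


history = [30, 42, 36, 48]  # support hours logged in past sprints (made-up figures)
planned = size_support_story(history, known_event_hours=6.0)
print(planned)  # 48.0
print(needs_po_decision(planned, logged_hours=55))  # True
```

Averaging only the last few sprints lets the estimate follow recent trends, while the explicit overrun check mirrors the point where the comment hands the decision to the Product Owner.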

I have seen the “support story in every sprint” approach, and it has worked in the limited cases I’ve seen. I advise against splitting responsibility for code changes between different teams, particularly if one team is agile (the dev team) and the other is not (the support team). I recommend against doing anything that dilutes accountability for product quality. Until I see a better idea, my current recommendation is that user education/information and “normal” support/admin activities (e.g., password resets) be handled by a service desk. Anything requiring code changes should be escalated to the development team. Critical fixes (i.e., those that must go out immediately) should be done in a separate branch from development (based on the production trunk) and moved into production through a managed hot-fix process. The hot-fix branch and the development branch should be merged at the next appropriate opportunity (e.g., as a story in the next sprint, or as part of a release). Non-critical fixes should be added to the Product Backlog and prioritized by the Product Owner for a future sprint (like other development activity).

Mike, this is an excellent question. What has worked well for my teams is to reduce ideal hours to allow for production support during capacity planning. There is an impact to velocity with this approach, and release forecasting needs to adjust accordingly. It is a good idea to track production support tasks on a Kanban board to provide visibility into prioritized production support tasks per resource. This approach lets you produce metrics showing the cycle time dedicated to production support per resource in a given sprint. These reports enable you to compare burn-up reports (project vs. production support) and perhaps make a case to include production support resources in the next budget.
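Reducing ideal hours for production support, and adjusting the release forecast to match, comes down to simple arithmetic. The sketch below is hypothetical: the `support_fraction` field, the function names and especially the assumption that velocity scales linearly with available project hours are simplifications for illustration, not part of the commenter’s description.

```python
import math
from dataclasses import dataclass


@dataclass
class TeamMember:
    name: str
    ideal_hours_per_day: float
    # Hypothetical: share of this person's ideal hours reserved for production support.
    support_fraction: float


def sprint_capacity(team, sprint_days):
    """Split a sprint's ideal hours into (project_hours, support_hours)."""
    project_hours = 0.0
    support_hours = 0.0
    for member in team:
        total = member.ideal_hours_per_day * sprint_days
        support_hours += total * member.support_fraction
        project_hours += total * (1 - member.support_fraction)
    return project_hours, support_hours


def sprints_to_finish(remaining_points, baseline_velocity,
                      baseline_project_hours, project_hours):
    """Forecast sprints remaining once support time is carved out.

    Assumes (a simplification) that velocity scales linearly with the
    project hours available, relative to a baseline sprint.
    """
    if project_hours <= 0:
        raise ValueError("no project capacity left after support")
    velocity = baseline_velocity * project_hours / baseline_project_hours
    return math.ceil(remaining_points / velocity)


team = [
    TeamMember("dev_a", 6, 0.25),
    TeamMember("dev_b", 6, 0.0),
    TeamMember("dev_c", 6, 0.5),
]
project, support = sprint_capacity(team, sprint_days=10)
print(project, support)  # 135.0 45.0
print(sprints_to_finish(120, 30, 180, project))  # 6
```

With all 180 ideal hours going to project work, the same 120-point backlog would forecast at 4 sprints; reserving 45 hours for support pushes it to 6, which is the kind of forecasting adjustment the comment describes.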

Alex – I agree with you 100%: don’t have your non-agile support team changing code that your Agile team (using XP practices, TDD, etc.) put in place. But here is my question: how do you handle this if the team has finished project X and moved on to project Y, with a new product owner? The new product owner may not have the same understanding of a fix to the old project. How do you work out the prioritization so the fix can get done alongside the new project work?

Steve – I have not had to deal with a change of product owner before, but I’ll tell you how I would try to handle it if it came up. I view the product owner as having full responsibility for the business value of the project. There will be times when the development team believes that work is necessary, but the product owner does not view it as a priority. In such a case, I think the development team should make sure that the product owner is fully informed about the need and the implications if the work is not performed, *but* it is the product owner’s decision at the end of the day. They are the ones who are accountable to the organization’s leaders for realizing the business value of the project.
From a practical perspective, that means there are going to be times when a user submits a request or incident report, and the product owner decides that the request is either not a priority, or will not be added to the product backlog. If the requestor is not satisfied with the decision, they should be able to escalate the issue within the business organization, not within IT – it was a business decision by the product owner, not a technical decision. Of course, this is based on the assumption that the product owner is part of the business organization, not IT.