Change management best practice

Hi all, I was hoping you could offer some best practice advice in terms of change management where multiple services are affected.

We currently have a change management process in place, but where a change encompasses numerous services (i.e. 20+), what we have been doing is creating one overarching request for change (RFC) form. This works fine in the majority of cases, but I'm concerned about the lack of detail in this form in terms of implementation plans, testing, risk assessments, etc. For example, the implementation plan might say that server X needs to be shut down or moved to another DR site, and that's it. Whereas if this change were done on its own, the plan would cover all the steps required to do it.
To overcome this, surely you would need an RFC form for each affected service, which would be time-consuming.

What's the best way of managing these situations? We have had issues arise because correct procedures weren't followed, or testing wasn't completed because it was only part of the overarching RFC.

"To overcome this surely you would need an RFC form for each affected service which would be time consuming."
If indeed it is "surely", then you must do it. Turn your attention to making sure that the required steps and approvals are done quickly. Study the bottlenecks and eliminate them.
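One way to keep per-service detail without the paperwork spiralling is a parent RFC that links a child RFC for each affected service, each carrying its own full plans. Here's a minimal sketch (the field names and CHG-#### ids are hypothetical, not from any particular tool) of a completeness check that flags child RFCs missing a required section before approval, which is exactly the gap that caused the missed-testing issues described above:

```python
# Hypothetical parent/child RFC layout: one overarching record that
# links a per-service child RFC, so each service keeps its full detail.
parent_rfc = {
    "id": "CHG-1000",
    "summary": "DR site failover exercise",
    "children": [
        {"id": "CHG-1001", "service": "mail", "implementation_plan": "...",
         "test_plan": "...", "backout_plan": "..."},
        {"id": "CHG-1002", "service": "web", "implementation_plan": "...",
         "test_plan": "", "backout_plan": "..."},  # test plan left blank
    ],
}

REQUIRED = ("implementation_plan", "test_plan", "backout_plan")

def incomplete_children(parent: dict) -> list:
    """Return ids of child RFCs missing a required section, so gaps are
    caught before approval rather than after the change has gone in."""
    return [c["id"] for c in parent["children"]
            if any(not c.get(k) for k in REQUIRED)]

print(incomplete_children(parent_rfc))  # ['CHG-1002']
```

A check like this can run at submission time, so the overarching RFC cannot be approved until every linked child has its plans filled in.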

chg mgmt is a major pain, and I'm sure you'll find no one does it perfectly. Why? As you've alluded to: imperfect information on which to make a fully informed decision.

With that said, chg mgmt boils down to the following:
1) Identify the change to be made (this involves an IVB, i.e. implementation plan, verification plan, and backout plan).
1b) Part of this is the risk of the change going bad, whether backout is even possible, and how long it would take.
2) Identify what depends on the item being changed.
3) Identify the verification and recovery for those potentially affected services, as well as the risks/time frame associated with recovery.
3b) Part of this is: what is the business impact if an outage is caused? For example, if a non-essential system is taken out for a day, who cares? If your ecommerce system is taken out for even 30 minutes, it can even have customer-image ramifications. The risk is defined purely by mgmt; the possibility by the engineers.
4) Mgmt needs to decide whether the change has enough business value to do, whether there are things that should be done to minimize potential impact, or whether the change can be broken up.
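The four steps above can be sketched as a simple change record. This is my own illustrative structure (field names invented for the example, not ITIL-mandated): step 1's plans, step 1b's backout risk, steps 2–3's dependencies and impact, and a gate for step 4 so mgmt only reviews a complete record:

```python
# Illustrative change record covering the four steps above.
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    description: str             # step 1: what is being changed
    implementation_plan: str     # step 1: how it will be done
    verification_plan: str       # step 1: how success is confirmed
    backout_plan: str            # step 1b: empty if backout is impossible
    backout_hours: float         # step 1b: estimated time to back out
    affected_services: list = field(default_factory=list)  # steps 2-3
    business_impact: str = "low"  # step 3b: set by mgmt, not engineers

def ready_for_review(chg: ChangeRequest) -> bool:
    """Step 4 gate: mgmt only decides on a record with the core
    sections filled in and its dependencies identified."""
    return all([chg.description, chg.implementation_plan,
                chg.verification_plan, chg.affected_services])

rfc = ChangeRequest(
    description="Move server X to DR site",
    implementation_plan="1. Drain traffic 2. Shut down 3. Bring up at DR",
    verification_plan="Reachability check plus app smoke test from DR site",
    backout_plan="Power server back on at primary site",
    backout_hours=2.0,
    affected_services=["ecommerce", "reporting"],
    business_impact="high",
)
print(ready_for_review(rfc))  # True
```

Note the split of responsibilities baked into the fields: engineers fill in the plans and dependencies, while `business_impact` is deliberately mgmt's call, matching point 3b.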

Sorry, but without specifics, this is very much a philosophical/ideological question which is heavily based on experience. ITIL outlines good chg mgmt processes, IMHO.

However, here are a couple of quick examples from my own experience:
1) Upgrading network devices
Even if they are core devices, the risk should be medium to low. The reason is that chances are they are HA, so you can do rolling upgrades. However, this can sometimes mean that traffic in flight gets forcibly reset. While most apps should recover, it's not guaranteed. In this case, there is impact. But should you really try to identify every single app that could possibly be affected? Kind of: only for business-critical systems (systems mgmt have identified as requiring 5 9's uptime). Then mgmt makes the decision. This would typically just be done by notifying the teams, to ensure there aren't multiple changes going on that increase risk; beyond that, notification is good enough. So email, and do it. App teams then know about it and can watch their systems carefully during that time frame.
So yes, systems can be impacted, but upgrading for bug fixes or needed features helps the business more than slowing things down so much that you're crippled by FUD that something might go wrong.

2) Re-architecting dynamic routing over entire company
You are completely redoing how BGP is architected and used. This is high risk, because if things go sideways and you don't have an out-of-band connection, you could take down the entire network without any ability to get it back up quickly. So what do you do? Basically the same as in #1. The major difference is the higher risk. All that means is that you may have to spend more time planning, since if things go bad, the backout plan is extremely undesirable and time-intensive. You also change when the work may be done, and communicate a lot ahead of time so that others don't schedule changes in the meantime. All of this is up to mgmt to help coordinate, though.

In summary, there are three things to remember, again IMHO:
1) Engineers identify technical risks, recovery plans, and the possibility of failure.
2) Mgmt identifies business risks and solutions to minimize those risks.
3) Don't be crippled by fear of breaking stuff. Things break; it's inevitable. Prepare for things breaking so you can quickly fix them. Simply put, even with high-risk changes, at some point you have to pull the trigger.
