During a major incident that affected the availability of one section of a business-critical system, a restart of the entire system was required to restore service.

This restart made the entire platform unavailable for 3 minutes. The technical teams wanted to push ahead with it immediately, but I requested that it go through the Change Management (CM) process as an emergency change.

This happened, everything was fixed, and all was well with the world. Then, the next day, I got the following in an email from "on high":

Quote:

A restart does not constitute a CR having to be raised. This is a Service/Operations call with approval from the business (i.e. me). What we need to establish is what is BAU type activities, and below is a perfect example, and what is an actual change.

Let's work together over the next few days to distinguish these.

I completely disagree with this. A restart of the platform is not "BAU type activities". To me, it changes the state of the service and it impacts availability, so it is a Change that requires Management.

As I have said before (and also in another place), I take the view that a restart is not a change.

However, that is a technical distinction, because a restart is strongly analogous to a change. What you did was entirely correct. If you do not have a formal and rigorous process in place for performing a restart, then use the protocols of Change Management as a reasonable make-do.

The way I suggest you deal with "on high" is to propose the creation of a formal restart procedure. This procedure would have to be followed before any restart takes place, and it would:

a) ensure that no other service is disrupted, unless the business has confirmed that temporary loss of those services (with the risk that such loss may last much longer than three minutes!) matters less than restoring the currently disrupted service.

b) ensure that there are no latent changes residing in the server involved (e.g. patches awaiting a restart), or that such additional risks are understood and contingency is in place.

c) ensure that contingency is in place in the event that the server cannot be restarted (possibly overkill, until it happens to you!).

d) ensure that appropriate staff are on hand to support any issues arising with the server or any of the applications running on it.

e) ensure that everyone in the relevant parts of IT services and of the business is fully aware of the situation and the plan, and that approvals are obtained from all the right people.

f) I'm sure there is more, but it is getting late and I'm up early to drive to Glasgow in the morning.
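
For what it is worth, the checks above could be captured as a simple gate that has to pass before a restart goes ahead. This is only a rough sketch in Python, not a recommendation of any tool: the `RestartChecklist` class, its field names, and the `outstanding_items` and `may_restart` helpers are all made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class RestartChecklist:
    """Hypothetical pre-restart checklist mirroring items a) to e) above."""
    collateral_impact_accepted: bool = False   # a) business accepts loss of other services
    latent_changes_reviewed: bool = False      # b) pending patches checked, or risk understood
    contingency_in_place: bool = False         # c) plan exists if the server does not come back
    support_staff_on_hand: bool = False        # d) server and application support available
    stakeholders_informed_and_approved: bool = False  # e) communication and approvals done


def outstanding_items(checklist: RestartChecklist) -> list[str]:
    """Return the names of checklist items that have not been confirmed yet."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def may_restart(checklist: RestartChecklist) -> bool:
    """The restart may proceed only when nothing is outstanding."""
    return not outstanding_items(checklist)
```

The point of returning the outstanding items, rather than a bare yes or no, is that during an incident people need to see at a glance what is still missing.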

The bottom line is that everything must be done to protect the business. So it has to be done as quickly as possible, but the risks of blundering ahead are considerable, especially when you remember that services rarely have an absolute priority. A normally "run of the mill" application may be in the middle of a vital activity, and interrupting it could have considerable consequences.

It is probable that such a restart procedure will look very similar to your change procedure, but with some of the blanks filled in identifying specific steps and specific staff.

If you swing that, you will have answered the false charge against you and perhaps have opened the eyes of "on high" to the implications of restarting a computer.

There was a reboot thread that went into this at some length. If you can find it, you will get some more useful thoughts. There is also a vast thread on LinkedIn on this very subject, with people of considerable experience hammering out their views, sometimes in a clear way.

I hope this gives you not just a defence of your stance, but a possible way forward as well.

_________________
"Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718

It sounds like you're essentially suggesting what I want to propose, which is the development of a 'service x' restart Change Model. That way a standard template can be used, the procedure and associated contingency plans will already be documented and tested, and the impact and risks will already have been identified and accepted as necessary in certain conditions. All that will be required is the authorisation to implement.

This way it will still go through the Change Management procedure and still be authorised. What I don't want is technical teams reaching for a 'restart procedure' in the middle of a major incident (MI), with the only record of it being an email somewhere.
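
To make the idea concrete, here is a rough sketch in Python of what I mean by a Change Model: the procedure, contingency and accepted risks are fixed in advance, and using the model during an incident creates a record that needs nothing more than authorisation. The `ChangeModel` and `ChangeRecord` classes and all their fields are hypothetical, not taken from any particular ITSM tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ChangeModel:
    """Hypothetical pre-agreed template for restarting one named service."""
    service: str
    procedure: tuple[str, ...]       # documented and tested steps
    contingency: tuple[str, ...]     # what to do if the restart fails
    accepted_risks: tuple[str, ...]  # risks already identified and accepted


@dataclass
class ChangeRecord:
    """One use of the model; the record exists before anything is restarted."""
    model: ChangeModel
    incident_ref: str
    raised_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    authorised_by: Optional[str] = None

    def authorise(self, approver: str) -> None:
        """Record who authorised implementation."""
        self.authorised_by = approver

    @property
    def ready_to_implement(self) -> bool:
        """Implementation is allowed only once someone has authorised it."""
        return self.authorised_by is not None
```

Because the model is frozen, nobody can quietly alter the agreed procedure mid-incident; the only thing that changes per use is the record and who authorised it.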