Overlords, or Big Brother is watching you

An important component in an HA system is an overlord or Big Brother process
(as in Orwell, not the TV show).
This process is responsible for ensuring that all of the other processes in the system are running.
When a process faults, we need to be able to restart it or make a standby process active.

That's the job of the overlord process.
It monitors the processes for basic sanity (the definition of which is fairly
broad — we'll come back to this), and performs an orderly
shutdown, restart, fail-over, or whatever else is required for the failed (or failing) component.
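
As a rough illustration of that job, here is a minimal sketch in Python. It assumes the simplest possible definition of sanity ("the process has not exited") and the simplest recovery policy ("start it again"). The Overlord class, its method names, and the mapping of names to command lines are all hypothetical, invented for this example; a real overlord would also check responsiveness and could fail over to a standby instead of restarting.

```python
import subprocess


class Overlord:
    """Supervise a set of commands and restart any that have exited.

    Assumption for this sketch: "sane" only means "still running".
    """

    def __init__(self, commands):
        # commands: hypothetical mapping of a component name to its argv list
        self.commands = dict(commands)
        self.children = {}
        self.restarts = {name: 0 for name in self.commands}

    def start_all(self):
        """Launch every supervised component."""
        for name, argv in self.commands.items():
            self.children[name] = subprocess.Popen(argv)

    def check_once(self):
        """One monitoring pass: restart every component that has exited."""
        for name, child in list(self.children.items()):
            if child.poll() is not None:  # non-None means the process has exited
                self.children[name] = subprocess.Popen(self.commands[name])
                self.restarts[name] += 1

    def stop_all(self):
        """Orderly shutdown: terminate whatever is still running and reap it."""
        for child in self.children.values():
            if child.poll() is None:
                child.terminate()
            child.wait()
```

In use, the overlord would call start_all() once and then call check_once() periodically (for example, once a second) for as long as the system is up.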

One remaining question is "who watches the watcher?"
What happens when the overlord process faults?
How do we recover from that?
There are two things that you should do with the overlord process
regardless of anything I'll tell you later on:

Since it's a critical part of the system, test it extensively (this maximizes MTBF).

In order to minimize the amount of testing required, keep the overlord as simple as possible.

However, since the overlord is a piece of software that's more complex than "Hello, world",
it will have bugs and it will fail.

It would be a chicken-and-egg problem to simply say that
we need an overlord to watch the overlord; this would result in a never-ending chain of overlords.

What we really need is a standby overlord that is waiting for the primary
overlord to die or become unresponsive.
When the primary fails, the standby takes over (possibly killing the faulty primary), becomes the primary, and
starts up its own standby.
We'll discuss this mechanism next.
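
Before getting into the details, the takeover idea can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the primary is assumed to signal liveness through a periodic heartbeat, and everything runs inside one process so that the logic is easy to follow. The Heartbeat and StandbyOverlord names, the timeout parameter, and the generation counter are hypothetical; in a real system the two overlords would be separate processes, and the step of killing a faulty primary is omitted here.

```python
import time


class Heartbeat:
    """Record of when the primary last reported in.

    Hypothetical stand-in for whatever the overlords really share
    (a message, a file, shared memory).
    """

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Called by the primary to say "I'm still alive"."""
        self.last_beat = self.clock()

    def age(self):
        """Time elapsed since the last heartbeat."""
        return self.clock() - self.last_beat


class StandbyOverlord:
    """Waits for the primary to go silent, then takes over."""

    def __init__(self, heartbeat, timeout, generation=1):
        self.heartbeat = heartbeat
        self.timeout = timeout
        self.generation = generation
        self.role = "standby"
        self.standby = None

    def check_primary(self):
        """Take over if the primary has been silent for longer than the timeout."""
        if self.role == "standby" and self.heartbeat.age() > self.timeout:
            self.role = "primary"
            self.heartbeat.beat()  # the new primary now owns the heartbeat
            # Start our own standby so that the new primary is watched in turn.
            self.standby = StandbyOverlord(
                self.heartbeat, self.timeout, self.generation + 1
            )
        return self.role
```

Note that there are only ever two overlords at a time: whenever the standby is promoted, it creates exactly one replacement standby, which avoids the never-ending chain.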