How to handle cross-training and redundancy in system administration teams?

In a small team managing a fairly diverse environment like ours, how do you handle cross-training and redundancy? I know there are a large number of things right now that no one other than me knows how to do. I think I have the everyday issues well documented, but non-routine problems could still trip us up.

First, let's define "cross-training". A department with many sub-teams wants everyone to be able to handle tasks from the other sub-teams. For example, you have an IT department with three sub-teams: a Linux sub-team, a Networking sub-team, and a Storage sub-team. In an ideal world, the Storage sub-team members should be able to handle 80% of the requests of the Linux sub-team, and vice versa.

Being able to handle 80% of the Storage-related requests probably means knowing about 20% of what someone on the Storage sub-team knows. That's fine. 80% of the requests are probably things like add/delete/change requests (add a new virtual partition, increase the size of an existing one, etc.) and common problems (what to do with an NFS stale file handle, etc.). If everyone in the department could handle those tasks, the individual teams could focus on higher-order issues like scaling, monitoring, and optimization.

The #1 thing you need to be able to do is document those "sharable tasks". But everyone hates writing documentation, right? So don't write documentation: write checklists. You only have to document the steps, and you can use language that assumes the person has basic knowledge. If you keep the checklists on a system that permits anyone to edit the documents (e.g. a wiki or a source code repository), then people can fine-tune your docs as they use them.

Some standard checklists to write are:
1. Things we do for each new hire.
2. Things we do when an employee is terminated.
3. How to: allocate space in the machine room, set up a new server, deploy a workstation, add to the Puppet configuration, etc.
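For example, a new-hire checklist kept on the wiki might look something like this (the specific steps here are hypothetical; adapt them to your environment):

```
New Hire Checklist
1. Create the account and add it to the appropriate groups.
2. Set up email and subscribe to the team mailing lists.
3. Deploy a workstation (see the "deploy a workstation" checklist).
4. Grant VPN and machine-room access as needed.
5. Add to the on-call schedule after the initial ramp-up period.
```

Each step can link to the more detailed checklist that covers it, so the top-level list stays short.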

If you do cross-training right, rather than a pager rotation for each sub-team, you can have one global pager rotation. That means rather than being on call once every 3 weeks, you might be on call once every 12 weeks!

To set up cross-training for a pager rotation, I also suggest checklists.

For each page that you might receive, have a checklist of what actions to take: check this, try rebooting that, look at the logs for messages that say such-and-such. The last step should always be "if all that failed to fix the problem, escalate to so-and-so." If so-and-so feels he/she is getting called in the middle of the night too much, ask them to improve the checklist.
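A per-alert checklist might look like this (the alert and its steps are made up for illustration):

```
Alert: disk full on /var on a web server
1. Run `df -h /var` to confirm.
2. Look in /var/log for runaway log files; compress or rotate them.
3. If log rotation is misconfigured, fix the config and run it manually.
4. If all that failed to fix the problem, escalate to the Linux sub-team on-call lead.
```

Note that step 4 is the escalation step described above: the person being escalated to has an incentive to improve steps 1-3.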

Encourage people to write the checklist when they add the alert rule to the monitoring system. If someone won't or doesn't write the checklist for a particular alert, it just means they have agreed to be called in the middle of the night every time it fires.

These checklists will grow and improve over time. Every time you have an outage, update the checklists so that the same problem is handled, or prevented entirely, in the future.

In my tutorial at Usenix LISA, I'll expand on this and show how this system can be used to coordinate training in general, especially in ways that help bring new hires up to speed. This is material I've never written about or included in any other tutorial, and you can only see it at Usenix LISA in San Jose, Nov 7-12, 2010. Seating is limited. Register today!