Tuesday, April 29, 2014

7 habits of highly successful Unix admins

You can spend 50-60 hours a week managing your
Unix servers and responding to your users' problems and still feel as
if you're not getting much done, or you can adopt some good work habits
that will both make you more successful and prepare you for the next
round of problems.

April 05, 2014, 6:41 PM —
Unix admins generally work a lot of hours, juggle a large set of
priorities, get little credit for their work, come across as arrogant to
admins of other persuasions, tend to prefer elegant solutions to even
the simplest of problems, take great pride in their ability to apply
regular expressions to any challenge that comes their way, and are
inherently lazy -- at least they're constantly on the lookout for ways
to type fewer characters even when they're doing the most routine work.
While skilled and knowledgeable, they could probably get a whole lot
more done and get more credit for their work if they adopted some habits
akin to those popularized in the 1989 book by Stephen R. Covey -- The 7
Habits of Highly Effective People. In that light, here are some habits
for highly successful Unix administration.

Habit 1: Don't wait for problems to find you

One of the best ways to avoid emergencies that can throw your whole
day out of kilter is to be on the alert for problems in their infancy. I
have found that installing scripts on the servers that report unusual
log entries, check performance and disk space statistics, report
application failures or missing processes, and email me reports when
anything looks "off" can be of considerable value. The risks are
getting so much of this kind of email that you don't actually read it or
failing to notice when these messages stop arriving or start landing
in your spam folder. Noticing what messages *aren't* arriving is not
unlike noticing who from your team of 12 or more people hasn't shown up
for a meeting.
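
As a rough illustration of this kind of monitoring, here is a minimal Python sketch that flags nearly full filesystems and also detects when an expected report has gone quiet. The function names, the 90% threshold, and the one-day maximum age are assumptions chosen for the example, not values from the article, and the step of actually mailing the warnings is left out.

```python
import shutil
import time


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(paths, threshold=90.0):
    """Return one warning string per path at or above `threshold` percent.

    The 90% default is an assumed value; pick one that suits each server.
    """
    warnings = []
    for path in paths:
        percent = disk_usage_percent(path)
        if percent >= threshold:
            warnings.append(f"{path} is {percent:.1f}% full")
    return warnings


def report_is_stale(last_seen, max_age_seconds=86400, now=None):
    """Return True if the last report arrived more than `max_age_seconds` ago.

    `last_seen` and `now` are Unix timestamps; `now` defaults to the
    current time and is a parameter only to make the check testable.
    """
    if now is None:
        now = time.time()
    return (now - last_seen) > max_age_seconds
```

Running `check_disk(["/", "/var"])` from cron would cover the "report when something looks off" half of the habit. Running `report_is_stale` on a separate machine against the time each server last reported covers the quieter failure of reports that simply stop arriving.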

By being proactive, you are likely to spot a number of
problems long before they turn into outages and before your users notice
the problems or find that they can no longer get their work done.

It's also extremely beneficial if you have the resources needed to
plan for disaster. Can you fail over a service if one of your primary
servers goes down? Can you rely on your backups to rebuild a server
environment quickly? Do you test your backups periodically to be sure
they are complete and usable? Preparing disaster recovery plans for
critical services (e.g., the mail service could be migrated to the spare
server in the data center and the NIS+ service has been set up with a
replica) can keep you from scrambling and wasting a lot of time when the
pressure is on.
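
One small piece of testing backups can be automated. The sketch below assumes, purely for illustration, that backups are stored as tar archives (optionally compressed). It reads every member from start to finish, which catches truncated or corrupt archives. It does not prove that the archive contains everything it should, so it complements a real restore test rather than replacing one. The function name is hypothetical.

```python
import tarfile


def verify_tar_backup(archive_path):
    """Return (ok, detail) after reading every member of a tar archive.

    Assumes the backup is a tar file; "r:*" lets tarfile detect the
    compression (none, gzip, bzip2, xz) on its own.
    """
    try:
        count = 0
        with tarfile.open(archive_path, "r:*") as archive:
            for member in archive:
                if member.isfile():
                    # Read the file data in chunks so damage anywhere
                    # in the archive surfaces as an exception.
                    data = archive.extractfile(member)
                    while data.read(1024 * 1024):
                        pass
                count += 1
        return True, f"members read: {count}"
    except Exception as exc:
        # Any failure to open or read counts as a failed verification.
        return False, f"unreadable: {exc}"
```

A scheduled job could call this on the most recent archive and report any result where `ok` is false.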

Habit 2: Know your tools and your systems

Probably the best way to recognize that one of your servers is in
trouble is to know how that server looks under normal conditions. If a
server typically uses 50% of its memory and starts using 99%, you're
going to want to know what is different. What process is running now
that wasn't before? What application is using more resources than
usual?
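
Knowing what "normal" looks like can be made concrete by comparing a current reading against recent history. The following is a minimal sketch, assuming you already collect periodic readings such as memory usage percentages. The function name and the rule of thumb of three standard deviations are assumptions for illustration only.

```python
from statistics import mean, stdev


def is_unusual(history, current, tolerance=3.0):
    """Return True if `current` is far from the typical value in `history`.

    "Far" here means more than `tolerance` standard deviations from the
    mean, which is an assumed rule of thumb rather than a fixed standard.
    """
    if len(history) < 2:
        return False  # not enough data to say what normal looks like
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past reading was identical, so any change stands out.
        return current != baseline
    return abs(current - baseline) > tolerance * spread
```

With a history of memory readings hovering around 50%, a reading of 99% is flagged while a reading of 51% is not.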
Be familiar with a set of tools for looking into performance issues,
memory usage, etc. I use and encourage others to use the sar command
routinely, both to see what's happening now on a system and to look back
in time to get an idea when the problems began. One of the scripts
that I run on my most critical servers sends me enough data that I can
get a quick view of the last week or two of performance measures.

It's also a good idea to be practiced with all of the commands that
you might need to run when a problem occurs. Can you construct a find
command that helps you identify suspect files, large files, or files with
permissions problems? Knowing how to use a good debugger can also be a
godsend when you need to analyze a process. Knowing how to check
network connections can also be important when your
systems might be under attack.
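
For readers who want a scripted counterpart to such a find command, here is a minimal Python sketch that walks a directory tree and reports large files and world-writable files. The function name, the 100 MB default, and the choice of "world-writable" as the permissions problem to look for are all assumptions made for this example.

```python
import os
import stat


def find_suspect_files(root, min_size_bytes=100 * 1024 * 1024):
    """Yield (path, reason) for large or world-writable regular files.

    The 100 MB default is an assumed cutoff; adjust it per filesystem.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or cannot be examined; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets, and device files
            if info.st_size >= min_size_bytes:
                yield path, "large"
            if info.st_mode & stat.S_IWOTH:
                yield path, "world-writable"
```

Calling `list(find_suspect_files("/var/log"))` returns the matches as a list. A file that is both large and world-writable appears twice, once per reason.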

Habit 3: Prioritize, prioritize, prioritize

Putting first things first is something of a no-brainer when it comes
to how you organize your work, but sometimes selecting which priority
problem qualifies as "first" may be more difficult than it seems. To
properly prioritize your tasks, you should consider the value to be
derived from the fix. For me, this often involves how many people are
affected by the problem, but it also involves who is affected. Your CEO
might have to be counted as equivalent to 1,000 people in your
development staff. Only you (or your boss) can make this decision. You
also need to consider how much they're affected. Does the problem imply
that they can't get any work done at all or is it just an
inconvenience?

Another critical element in prioritizing your tasks is how long a problem will take to resolve.
Unless the problem that I'm working on is related to an outage, I try to
"whack out" those that are quick to resolve. For me, this is analogous
to the "ten items or fewer" checkout at the supermarket. If I can
resolve a problem in a matter of minutes and then get back to the more
important problem that is likely to take me the rest of the day to
resolve, I'll do it.

You can devise your own numbering system for calculating priorities
if you find this "trick" to be helpful, but don't let it get too
complicated. Maybe your "value" ratings should only go from 1 (low) to 5
(critical), your number of people might go from 1 (an individual) to 5
(everybody), and your time required might be 1 (weeks), 2 (days), 3
(hours) or 4 (minutes). Whatever scale you choose, having some way to
quantify and defend your priorities is always a good idea.
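
The three ratings described above can be turned into a simple score. The text does not say how to combine them, so the sketch below simply adds them together, which is an assumption. The `Task` class, its field names, and the sample tasks are likewise hypothetical. Note that the time rating rises as the fix gets quicker, so fast fixes get a boost.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    value: int   # 1 (low) to 5 (critical)
    people: int  # 1 (an individual) to 5 (everybody)
    speed: int   # 1 (weeks), 2 (days), 3 (hours), 4 (minutes)


def priority_score(task):
    """Add the three ratings; plain addition is an assumed combining rule."""
    return task.value + task.people + task.speed


def prioritize(tasks):
    """Return the tasks ordered from highest score to lowest."""
    return sorted(tasks, key=priority_score, reverse=True)


tasks = [
    Task("rebuild test box", value=2, people=1, speed=2),
    Task("reset one password", value=3, people=1, speed=4),
    Task("mail outage", value=5, people=5, speed=3),
]
ordered = [task.name for task in prioritize(tasks)]
```

Here the outage scores 13, the password reset 8, and the rebuild 5, so `ordered` lists them in that sequence. If one factor matters more in your shop, multiplying it by a weight keeps the scheme simple.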

Habit 4: Perform post mortems, but don't get lost in them

Some Unix admins get far too carried away with post mortems. It's a
good idea to know why you ran into a problem, but maybe not something
that rates too many hours of your time. If a problem you encountered
was a very serious, high-profile problem and could happen again, you
should probably spend the time to understand exactly what happened. Far
less serious problems might not warrant that kind of scrutiny, so you
should probably put a limit on how much time you devote to understanding
the cause of a problem that was fairly easily resolved and had no
serious consequences.

If you do figure out why something broke, not just what happened,
it's a good idea to keep some kind of record that you or someone else
can find if the same thing happens months or years from now. As much as
I'd like to learn from the problems I have run into over the years, I
have too many times found myself facing a problem and saying "I've seen
this before ..." and yet not remembered the cause or what I had done to
resolve the problem. Keeping good notes and putting them in a reliable
place can save you hours of time somewhere down the line.

You should also be careful to make sure your fix really works. You
might find a smoking gun only to learn that what you thought you fixed
still isn't working. Sometimes there's more than one gun. Try to verify
that any problem you address is completely resolved before you write it
off. Sometimes you'll need your end user to help with this. Sometimes you
can su to that user's account and verify the fix yourself (always my
choice).

Habit 5: Document your work

In general, Unix admins don't like to document the things that they
do, but some things really warrant the time and effort. I have built
some complicated tools and enough of them that, without some good notes,
I would have to retrace my steps just to remember how one of these
processes works. For example, I have some processes that involve Visual
Basic scripts that run on a Windows virtual server and send data files
to a Unix server that reformats the files using Perl, preparing them to
be ingested into an Oracle database. If someone else were to take over
responsibility
for this setup, it might take them a long time to understand all the
pieces, where they run, what they're doing, and how they fit together.
In fact, I sometimes have to stop and ask myself "wait a minute; how
does this one work?" Some of the best documentation that I have
prepared for myself outlines the processes and where each piece is run,
displays data samples at each stage in the process and includes details
of how and when each process runs.

Habit 6: Fix the problem AND explain

Good Unix admins will always be responsive to the people they are
supporting, acknowledge the problems that have been reported and let
their users know when they're working on them. If you take the time to
acknowledge a problem when it's reported, inform the person reporting
the problem when you're actually working on the problem, and let the
user know when the problem has been fixed, your users are likely to feel
a lot less frustrated and will be more appreciative of the time you are
spending helping them. If, going further, you take the time to explain
what was wrong and why the problem happened, you may allow them to be
more self-sufficient in the future and they will probably appreciate the
insights that you've provided.

Habit 7: Make time for yourself

As I've said in other postings, you are not your job. Taking care of
yourself is an important part of doing a good job. Don't chain
yourself to your desk. Walk around now and then, take mental breaks, and
keep learning -- especially things that interest you. If you look
after your well-being, renew your energy, and step away from your
workload for brief periods, you're likely to be both happier and more
successful in all aspects of your life.