June 27, 2009

Last post I promised the backstory behind the pretty pictures. As a fair warning, there are no pictures this time around, just lots of discussion of Linux monitoring tools and their integration.

June 20, 2009

(Here are the pretty pictures. The actual discussion is after the break.) (And okay, maybe they aren’t that pretty, but I’d say they’re more aesthetic than you could reasonably expect for monitoring health stats on a Linux server. Further defensive parentheticals will be reserved for later in the post.)

For a while now I’ve been meaning to post something about Calliope, my (Debian) Linux server at home. I’ve got a nice draft somewhere about all the useful services I’ve got running, why I chose the packages I did, and so forth. At the end of the day, though, the basic take-away is “pick some services you want to run, install them, and Google till you get them to work”. (I’ll probably post it someday, though perhaps not after giving it such a heart-warming buildup.)

Anyway, the more I think about it, the more I realize that the most distinctive part of Calliope isn’t the actual services she provides, but rather the tools I’ve assembled for keeping an eye on her. This isn’t like a desktop box, where either I’m sitting at the console or the machine is turned off. Instead, most of my interaction is indirect: listening to music on the PS3, or looking at files shared through Windows networking or Apache, or even just acquiring a network address through DHCP. The common thread in all these cases is that I’ll notice abrupt failure, but I’m not directly logged in to see notifications or status messages.

One traditional solution has always been e-mail. I send myself nightly notifications of backup success, mostly because it gives me a warm fuzzy feeling to know that my Subversion repository is safely in at least two places (and on a recent trip, I found it a surprisingly reassuring touch of home). Fundamentally, though, I’m not prepared to page through long logs or status reports on my phone (where I read most of my e-mail). In fact, I’m not prepared to do that even on my desktop or netbook, unless I already know there’s a problem (and remember, the only readily visible symptoms of a problem are “slow” or “failed to connect”).

So to summarize, I’ve got an uninvolved admin staff (myself), who wants things to Just Work, and doesn’t want to have to explain to his wife that the internet is down right now because he was in the middle of a project when he got bored with it. Fortunately, I’ve got Debian-stable as a pretty damn rock solid baseline to build from, so most problems will be the result of misconfiguration, user error, or mechanical failure. (The last is also a challenge, since the box is out in the hallway with the cats’ litter box — not a lot of foot traffic to notice things like fan failure or hard drive Squeaks of Doom.) Oh, and because I don’t do Linux server admin for a day job, I’m not necessarily going to be able to distinguish between dire-sounding-but-routine conditions and actual symptoms, since I lack a good intuitive baseline.

Anyway, enough exploration of the problem space. (Well, never enough, but at some point requirements analysis turns into plain old griping.) I actually have a solution that’s working pretty well for me so far, and is probably the biggest difference between this box and previous Windows servers I’ve set up.