When things are managed automatically with good tools, they work. When people manage them, they often work. Here we talk about automation, managing systems, monitoring, discovery, open source, and related topics.

eating your own dog food

12 November 2007

In this posting, I show how I used cl_respawn[1] to monitor my system logging and help keep it running, and how I improved cl_respawn a little along the way. I also explain why I couldn't just use the respawn directive in /etc/inittab[5] (and why you probably can't either). I first talked about cl_respawn in one of my first blog posts[6].

The problem

When we run our automated CTS[2] tests for Linux-HA[3] we rely on the guaranteed log entry delivery provided by syslog-ng[4]. Basically, we redirect all our logs in a test cluster to a test overseer machine, and then CTS watches this consolidated log for errors and correct behavior.

This is a nice system and it works pretty well, but it relies on the reliability of syslog-ng. For the most part, that's just fine. But, sometimes syslog-ng just stops running. Then the tests show that Heartbeat has failed, but it's really just syslog-ng that's crashed on me. So, in the past I added some code to CTS to make it test the logging after every error, and then hit the machines over the head with a hammer and restart logging if logging wasn't working.
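A check like the one I added to CTS can be sketched roughly as follows. This is not the CTS code, just a minimal illustration under stated assumptions: the function name logging_works, the send callable, and the log_path argument are all hypothetical. The idea is to push a unique marker through the logging path and see whether it shows up in the consolidated log within a deadline.

```python
import time
import uuid


def logging_works(send, log_path, timeout=2.0, poll=0.1):
    """Return True if a freshly logged marker appears in log_path in time.

    `send` is any callable that submits one message to the logging system
    (hypothetical here; on a real machine it might wrap syslog).
    """
    marker = f"logcheck-{uuid.uuid4().hex}"  # unique, so old entries can't match
    send(marker)
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(log_path, errors="replace") as log:
                if any(marker in line for line in log):
                    return True
        except FileNotFoundError:
            pass  # log not created yet; keep polling until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

If the function returns False, the caller knows the logging path itself is broken and can restart it before trusting any further test results.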

This was sort-of OK, because it meant subsequent tests would run fine, but the one test would still show up as failed - even though it probably succeeded. I could have lived with that, except that one of my machines (my oldest and slowest) had syslog-ng die on it a few times a day. I don't know why, and as long as I can live with it for my testing, I don't much care. I just want it to work. (I know, it's a lousy attitude, but I have way more to do than I can possibly do.)

The solution

Then I had this revolutionary thought - I could use HA software to make my logging highly available!!

Hold the presses, folks, new headline reads "HA guru realizes he can use HA software just like he tells everyone else to do!"

To fix this problem all I had to do was change the init script for syslog to use our cool little cl_respawn tool to babysit the syslog-ng service. Although I could have used Heartbeat to monitor this service, it seemed like overkill and would have conflicted with CTS.

So, I set out to use cl_respawn to restart syslog-ng quickly - minimizing but not eliminating the possibility of losing important log messages.

When I looked at the init scripts (they're from SUSE Linux), they had these statements in them:

My first thought was that all I had to do was insert cl_respawn ahead of the ${BINDIR}/syslog and I'd be done. Well.... not quite...

If I had done that, then the pid file for the service ${syslog_pid} would have pointed not to cl_respawn, but to syslog-ng. So, when I tried to shut down syslog, cl_respawn would have just respawned it. OOPS. Not quite the right effect.

What was necessary was for the syslog pid file to contain the pid of cl_respawn, not the pid of syslog-ng. One minor problem - the author of cl_respawn didn't deal with pidfiles. To fix that, I added support for a -p option to tell it the name of the pid file to use.
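To make the pid file point concrete, here is a minimal sketch in Python. It is not cl_respawn's code (cl_respawn is not written in Python), and the function names are hypothetical. It only shows the idea: the supervisor records its own pid, so a stop script that reads the pid file signals the supervisor rather than the child it would otherwise just restart.

```python
import os
import signal
import tempfile


def write_pidfile(path, pid=None):
    """Record a pid (our own by default) where a stop script can find it."""
    pid = os.getpid() if pid is None else pid
    with open(path, "w") as f:
        f.write(f"{pid}\n")
    return pid


def read_pidfile(path):
    """Return the pid stored in the pid file."""
    with open(path) as f:
        return int(f.read().strip())


def stop_via_pidfile(path, sig=signal.SIGTERM):
    """What a stop script does: signal whichever process the pid file names."""
    pid = read_pidfile(path)
    os.kill(pid, sig)
    return pid


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "supervisor.pid")
    write_pidfile(demo)  # the supervisor writes its own pid, not the child's
    print(read_pidfile(demo) == os.getpid())
```

If the file held the child's pid instead, stop_via_pidfile would kill only the child, and a still-running supervisor would bring it right back.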

So I tried it. Uh-oh... it didn't work. The logs quickly filled with attempts to start ${syslog}, each one failing with a socket-in-use error. What was all that about?

By default, syslog-ng forks itself into the background, and its parent process exits. That makes cl_respawn think it has died - so it starts another copy, which fails because the first one is still running in the background and holding the socket - and so on ad infinitum. So, I read the man page for syslog-ng and discovered the -F option to keep it from forking. Without that, cl_respawn can't tell when it dies.
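The forking problem is easy to see in a stripped-down respawn loop. The sketch below is an assumption-laden illustration in Python, not how cl_respawn is implemented; the respawn function and its max_restarts parameter are hypothetical, and a real supervisor would loop forever instead of stopping after a fixed count.

```python
import subprocess
import sys
import time


def respawn(argv, max_restarts, delay=0.0):
    """Run argv in the foreground and start it again each time it exits.

    Returns the exit codes observed, one per run.
    """
    exit_codes = []
    for _ in range(max_restarts):
        child = subprocess.Popen(argv)
        # wait() returns when the process we started exits. If that process
        # daemonizes (forks, then the parent exits), wait() returns almost
        # immediately even though the real daemon is still alive.
        exit_codes.append(child.wait())
        time.sleep(delay)
    return exit_codes


if __name__ == "__main__":
    # Stand-in for a daemonizing parent: it exits right away with status 0.
    quick_exit = [sys.executable, "-c", "raise SystemExit(0)"]
    print(respawn(quick_exit, max_restarts=3))
```

The loop only ever sees the process it launched directly, which is why the supervised program has to stay in the foreground.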

Along the way, I read the code, find a couple of other minor bugs and fix them. I update my init script and now it looks like this:

Of course, if you don't run SUSE Linux, then your init scripts will look somewhat different, but I'm sure you'll figure it out.

Why not just use respawn in inittab?

Those of you who know UNIX administration to any degree realize that /etc/inittab[5] has a respawn directive you can give it. Why wouldn't that do the trick? The short answer is service dependencies. The longer answer is below:

Logging depends on other /etc/init.d services, so you don't want it to start until after those other services (like the network) are started. The LSB init script system supports these dependencies and starts things in the right order.

Other services depend on logging. A number of other services can't start until after logging starts. If you try to disable the /etc/init.d/syslog service on your machine so you can start it with respawn from /etc/inittab, havoc ensues - because those other services won't start until the /etc/init.d/syslog service has been started.
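Both halves of the dependency argument can be illustrated with a small sketch. The service names and the dependency map below are made up for the example and are not taken from any real init configuration; the code uses Python's standard graphlib module (Python 3.9 or later) rather than anything the LSB init system itself provides.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each service lists the services that must start first.
DEPS = {
    "network": [],
    "syslog": ["network"],
    "cron": ["syslog"],
    "sshd": ["network", "syslog"],
}


def start_order(deps):
    """Return one valid start order, with dependencies before dependents."""
    return list(TopologicalSorter(deps).static_order())


def startable(deps, disabled):
    """Return the services that can start when some are disabled."""
    started = []
    for svc in TopologicalSorter(deps).static_order():
        if svc in disabled:
            continue
        # A service starts only if everything it depends on has started.
        if all(dep in started for dep in deps.get(svc, [])):
            started.append(svc)
    return started


if __name__ == "__main__":
    print(start_order(DEPS))
    print(startable(DEPS, disabled={"syslog"}))
```

With syslog disabled in this toy map, only network can start, which is the havoc described above in miniature.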

What fun would that be? I mean, if we wrote this cl_respawn tool, we probably ought to use it ;-).