incron
is a useful little cron-like utility that lets you run arbitrary jobs
(like cron), but instead of being triggered at certain times, your
jobs are triggered by changes to files or directories.

It uses the linux kernel inotify
facility (hence the name), and so it isn't cross-platform, but on linux
it can be really useful for monitoring file changes or uploads, reporting
or forwarding based on status files, simple synchronisation schemes, etc.

Again like cron, incron supports the notion of job 'tables' where
commands are configured, and users can manage their own tables
using an incrontab command, while root can manage multiple system
tables.

So it's a really useful linux utility, but it's also fairly old (the
last release, v0.5.10, is from 2012), doesn't appear to be under
active development any more, and it has a few frustrating quirks that
can make using it unnecessarily difficult.

So this post is intended to highlight a few of the 'gotchas' I've
experienced using incron:

You can't monitor recursively i.e. if you create a watch on a
directory incron will only be triggered on events in that
directory itself, not in any subdirectories below it. This isn't
really an incron issue since it's a limitation of the underlying
inotify mechanism, but it's definitely something you'll want
to be aware of going in.

The incron interface is enough like cron (incrontab -l,
incrontab -e, man 5 incrontab, etc.) that you might think
that all your nice crontab features are available. Unfortunately
that's not the case - most significantly, you can't have comments
in incron tables (incron will try and parse your comment lines and
fail), and you can't set environment variables to be available for
your commands.

That means that cron's MAILTO support is not available, and in
general there's no easy way of getting access to the stdout or
stderr of your jobs. You can't even use shell redirects in your
command to capture the output (e.g. echo $@/$# >> /tmp/incron.log
doesn't work). If you're debugging, the best you can do is add a
layer of indirection by using a wrapper script that does the
redirection you need (e.g. echo $1 >> /tmp/incron.log 2>&1)
and calling the wrapper script in your incrontab with the incron
arguments (e.g. debug.sh $@/$#). This all makes debugging
misbehaving commands pretty painful. The main place to check if
your commands are running is the cron log (/var/log/cron) on
RHEL/CentOS, and syslog (/var/log/syslog) on Ubuntu/Debian.
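
For example, a minimal version of that debugging setup might look like the
following (paths and the script name are illustrative):

#!/bin/sh
# /usr/local/bin/debug.sh - log whatever incron passes us, since incron itself
# gives us no way to capture command output
echo "$(date) incron fired: $1" >> /tmp/incron.log 2>&1

with a matching incrontab entry (single spaces between fields, and remember
comments aren't allowed):

/home/gavin/tmp IN_CLOSE_WRITE /usr/local/bin/debug.sh $@/$#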

incron is also very picky about whitespace in your incrontab.
If you put more than one space (or a tab) between the inotify
masks and your command, you'll get an error in your cron log
saying cannot exec process: No such file or directory, because
incron will have included everything after the first space as part
of your command e.g. (gavin) CMD ( echo /home/gavin/tmp/foo)
(note the evil space before the echo).

It's often difficult (and non-intuitive) to figure out what inotify
events you want to trigger on in your incrontab masks. For instance,
does 'IN_CREATE' get fired when you replace an existing file with a
new version? What events are fired when you do a mv or a cp?
If you're wanting to trigger on an incoming remote file copy, should
you use 'IN_CREATE' or 'IN_CLOSE_WRITE'? In general, you don't want to guess,
you actually want to test and see what events actually get fired on
the operations you're interested in. The easiest way to do this is
use inotifywait from the inotify-tools package, and run it using
inotifywait -m <dir>, which will report to you all the inotify
events that get triggered on that directory (hit <Ctrl-C> to exit).
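
For example:

# inotify-tools lives in EPEL on RHEL/CentOS
yum install inotify-tools
inotifywait -m /home/gavin/tmp
# then, from another terminal, try the operations you're interested in (cp, mv,
# scp, rsync, etc.) against that directory and watch which events get reported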

The "If you're wanting to trigger on an incoming remote file copy,
should you use 'IN_CREATE' or 'IN_CLOSE_WRITE'?" above was a trick
question - it turns out it depends how you're doing the copy! If
you're just doing a simple copy in-place (e.g. with scp), then
(assuming you want the completed file) you're going to want to trigger
on 'IN_CLOSE_WRITE', since that's signalling all writing is complete and
the full file will be available. If you're using a vanilla rsync,
though, that's not going to work, as rsync does a clever
write-to-a-hidden-file trick, and then moves the hidden file to
the destination name atomically. So in that case you're going to want
to trigger on 'IN_MOVED_TO', which will give you the destination
filename once the rsync is completed. So again, make sure you test
thoroughly before you deploy.
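
In incrontab terms the two cases look something like this (path and script
name are illustrative) - the first entry catches scp-style in-place copies,
the second catches rsync's atomic rename:

/data/incoming IN_CLOSE_WRITE /usr/local/bin/process_upload.sh $@/$#
/data/incoming IN_MOVED_TO /usr/local/bin/process_upload.sh $@/$#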

First, the WiredTiger storage engine (the default since mongodb 3.2)
"strongly" recommends using the xfs filesystem on linux, rather than
ext4 (see https://docs.mongodb.com/manual/administration/production-notes/#prod-notes-linux-file-system
for details). So the first thing to do is reorganise your disk to make
sure you have an xfs filesystem available to hold your upgraded database.
If you have the disk space, this may be reasonably straightforward; if
you don't, it's a serious PITA.
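
The xfs part itself is simple enough once you have somewhere to put it -
something like this, assuming you've carved out a new logical volume for the
database (device and mount point names are illustrative):

yum install xfsprogs
mkfs.xfs /dev/vg00/mongodata
mount /dev/vg00/mongodata /var/lib/mongo
# and add a matching /etc/fstab entry so it comes back after a reboot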

These are the basics anyway. This doesn't cover configuring access control on your
new database, or wrangling SELinux permissions on your database directory, but if
you're already using those you should be able to figure them out.

support for (secure) downloads, ideally via a browser (no special software required)

support for (secure) uploads, ideally via sftp (most of our customers are familiar with ftp)

Our target was RHEL/CentOS 7, but this should transfer to other linuxes pretty
readily.

Here's the schema we ended up settling on, which seems to give us a good mix of
security and flexibility.

use apache with HTTPS and PAM authentication against local accounts, one per
customer, created with a nologin shell

users have their own groups (group=$USER), and also belong to the sftp group

we use the users group for internal company accounts, but NOT for customers

customer data directories live in /data

we use a 3-layer hierarchy for security: /data/chroot_$USER/$USER

the /data/chroot_$USER directory must be owned by root:$USER, with
permissions 750, and is used for an sftp chroot directory (not writeable
by the user)

the next-level /data/chroot_$USER/$USER directory should be owned by $USER:users,
with permissions 2770 (where users is our internal company user group, so both
the customer and our internal users can write here)

we also add an ACL to /data/chroot_$USER to allow the company-internal users
group read/search access (but not write)
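
Putting that together, setting up a new customer looks roughly like this (a
sketch - 'acme' is a made-up customer, and it assumes the sftp and users
groups already exist and that useradd creates a per-user group by default):

U=acme
useradd -s /sbin/nologin -G sftp $U
mkdir -p /data/chroot_$U/$U
chown root:$U /data/chroot_$U
chmod 750 /data/chroot_$U
chown $U:users /data/chroot_$U/$U
chmod 2770 /data/chroot_$U/$U
# give the internal 'users' group read/search (but not write) on the chroot dir
setfacl -m g:users:rx /data/chroot_$U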

We just use openssh internal-sftp to provide sftp access, with the following config:
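
# sketch only - reconstructed from the description below; the -d option to
# internal-sftp and %u expansion in ForceCommand need a reasonably recent
# openssh, so check sshd_config(5) on your version
Subsystem sftp internal-sftp

Match Group sftp
    ChrootDirectory /data/chroot_%u
    ForceCommand internal-sftp -d /%u
    AllowTcpForwarding no
    X11Forwarding no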

So we chroot sftp connections to /data/chroot_$USER and then (via the ForceCommand)
chdir to /data/chroot_$USER/$USER, so they start off in the writeable part of their
tree. (If they bother to pwd, they see that they're in /$USER, and they can chdir
up a level, but there's nothing else there except their $USER directory, and they
can't write to the chroot.)

Since I got bitten by this recently, let me blog a quick warning here:
glibc iconv - a utility for character set conversions, like iso8859-1 or
windows-1252 to utf-8 - has a nasty misfeature/bug: if you give it data on
stdin it will slurp the entire file into memory before it does a single
character conversion.

Which is fine if you're running small input files. If you're trying to
convert a 10G file on a VPS with 2G of RAM, however ... not so good!

This looks to be a
known issue, with
patches submitted to fix it in August 2015, but I'm not sure if they've
been merged, or into which version of glibc. Certainly RHEL/CentOS 7 (with
glibc 2.17) and Ubuntu 14.04 (with glibc 2.19) are both affected.

Once you know about the issue, it's easy enough to work around - there's an
iconv-chunks wrapper on github that
breaks the input into chunks before feeding it to iconv, or you can do much
the same thing using the lovely GNU parallel
e.g.
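
Something like this (the charset names and block size are just examples;
--pipe splits stdin into chunks on line boundaries, and -k keeps the output
in order):

parallel --pipe --block 100M -k iconv -f WINDOWS-1252 -t UTF-8 < huge.txt > huge.utf8.txt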

If you're a modern sysadmin you've probably been sipping at the devops
koolaid and trying out one or more of the current system configuration
management tools like puppet or chef.

These tools are awesome - particularly for homogeneous large-scale
deployments of identical nodes.

In practice in the enterprise, though, things get more messy. You can
have legacy nodes that can't be puppetised due to their sensitivity and
importance; or nodes that are sufficiently unusual that the payoff of
putting them under configuration management doesn't justify the work;
or just systems which you don't have full control over.

We've been using a simple tool called extract in these kinds of
environments, which pulls a given set of files from remote hosts and
stores them under version control in a set of local per-host trees.

You can think of it as the yang to puppet or chef's yin - instead of
pushing configs onto remote nodes, it's about pulling configs off
nodes, and storing them for tracking and change control.

We've been primarily using it in a RedHat/CentOS environment, so we
use it in conjunction with
rpm-find-changes,
which identifies all the config files under /etc that have been
changed from their deployment versions, or are custom files not
belonging to a package.

Extract doesn't care where its list of files to extract comes from, so
it should be easily customised for other environments.

It uses a simple extract.conf shell-variable-style config file,
like this:
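
# a sketch only - the variable names below are illustrative; check the sample
# extract.conf that ships with extract for the real directives
# hosts to pull files from, and where to store the per-host trees
HOSTS="web1 web2 db1"
BASEDIR=/data/extract
# command run per-host to generate the list of files to fetch - here using
# rpm-find-changes to list changed or orphaned files under /etc
FILES_COMMAND="rpm-find-changes /etc"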

Extract also allows arbitrary scripts to be called at the beginning
(setup) and end (teardown) of a run, and before and/or after each host.
Extract ships with some example shell scripts for loading ssh keys, and
checking extracted changes into git or bzr. These hooks are also
configured in the extract.conf config e.g.:

# Pre-process scripts
# PRE_EXTRACT_SETUP - run once only, before any extracts are done
PRE_EXTRACT_SETUP=pre_extract_load_ssh_keys
# PRE_EXTRACT_HOST - run before each host extraction
#PRE_EXTRACT_HOST=pre_extract_noop
# Post process scripts
# POST_EXTRACT_HOST - run after each host extraction
POST_EXTRACT_HOST=post_extract_git
# POST_EXTRACT_TEARDOWN - run once only, after all extracts are completed
#POST_EXTRACT_TEARDOWN=post_extract_touch

Extract is available on github, and
packages for RHEL/CentOS 5 and 6 are available from
my repository.

Be aware that there are multiple ldap configuration files involved now.
All of the following end up with ldap config entries in them and need to
be checked:

/etc/openldap/ldap.conf

/etc/pam_ldap.conf

/etc/nslcd.conf

/etc/sssd/sssd.conf

Note too that /etc/openldap/ldap.conf uses uppercased directives (e.g. URI)
that get lowercased in the other files (URI -> uri). Additionally, some
directives are confusingly renamed as well - e.g. TLS_CACERT in
/etc/openldap/ldap.conf becomes tls_cacertfile in most of the others.
:-(

If you want to do SSL or TLS, you should know that the default behaviour
is for ldap clients to verify certificates, and give misleading bind errors
if they can't validate them. This means:

if you're using CA-signed certificates, and want to verify them, add
your CA PEM certificate to a directory of your choice (e.g.
/etc/openldap/certs, or /etc/pki/tls/certs), and point
to it using TLS_CACERT in /etc/openldap/ldap.conf, and
tls_cacertfile in /etc/ldap.conf.
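
For example (hostname and paths are illustrative):

# /etc/openldap/ldap.conf
URI         ldaps://ldap.example.com
TLS_CACERT  /etc/openldap/certs/ca.pem

# /etc/ldap.conf, /etc/pam_ldap.conf, /etc/nslcd.conf etc. use the lowercased forms
uri             ldaps://ldap.example.com
tls_cacertfile  /etc/openldap/certs/ca.pem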

RHEL6 uses a new-fangled /etc/openldap/slapd.d directory for the old
/etc/openldap/slapd.conf config data, and the
RHEL6 Migration Guide
tells you how to convert from one to the other. But if you simply
rename the default slapd.d directory, slapd will use the old-style
slapd.conf file quite happily, which is much easier to read/modify/debug,
at least while you're getting things working.

If you run into problems on the server, there are lots of helpful utilities
included with the openldap-servers package. Check out the manpages for
slaptest(8), slapcat(8), slapacl(8), slapadd(8), etc.

rpm-find-changes is a little script I wrote a while ago for rpm-based
systems (RedHat, CentOS, Mandriva, etc.). It finds files in a filesystem
tree that are not owned by any rpm package (orphans), or are modified
from the version distributed with their rpm. In other words, any file
that has been introduced or changed from its distributed version.

It's intended to help identify candidates for backup, or just for
tracking interesting changes. I run it nightly on /etc on most of my
machines, producing a list of files that I copy off the machine (using
another tool, which I'll blog about later) and store in a git
repository.

I've also used it for tracking changes to critical configuration trees
across multiple machines, to make sure everything is kept in sync, and
to be able to track changes over time.

When you have more than a handful of hosts on your network, you need to
start keeping track of what services are living where, what roles
particular servers have, etc. This can be documentation-based (say on a
wiki, or offline), or it can be implicit in a configuration management
system. Old-school sysadmins often used dns TXT records for these kind of
notes, on the basis that it was easy to look them up from the command
line from anywhere.

I've been experimenting with the idea of using lightweight tags attached
to hostnames for this kind of data, and it's been working really nicely.
Hosttag is just a couple of ruby command line utilities, one (hosttag
or ht) for doing tag or host lookups, and one (htset/htdel) for
doing adds and deletes. Both are network based, so you can do lookups
from wherever you are, rather than having to go to somewhere centralised.

Hosttag uses a redis server to store the hostname-tag
and tag-hostname mappings as redis sets, which makes queries lightning
fast, and setup straightforward.
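
Usage is along these lines (hostnames and tags here are made up, and the exact
argument syntax may differ a little from this sketch - check the hosttag README):

# tag a couple of hosts
htset web1 centos apache production
htset db1 centos mysql production
# then lookup hosts by tag from anywhere on the network
hosttag production
ht mysql
# and remove tags you no longer want
htdel web1 production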

Here's what I use to take a quick inventory of a machine before a rebuild,
both to act as a reference during the rebuild itself, and in case something
goes pear-shaped. The whole chunk after script up to exit is
cut-and-pastable.
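
A sketch of that kind of inventory capture (the command list is illustrative -
add whatever else you care about):

script /root/inventory-$(hostname)-$(date +%Y%m%d).log
cat /etc/redhat-release
uname -a
ip addr
df -h
fdisk -l
rpm -qa | sort
chkconfig --list | grep :on
exit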

Came across cronologger
(blog post)
recently (via Dean Wilson),
which is a simple wrapper script you use around your cron(8) jobs that
captures any stdout and stderr output and logs it to a couchdb database,
instead of the traditional behaviour of sending it to you as email.

It's a nice idea, particularly for jobs with important output where it
would be nice to be able to look back in time more easily than by trawling
through a noisy inbox, or for sites with lots of cron jobs where the sheer
volume is difficult to handle usefully as email.

Cronologger comes with a simple web interface for displaying your cron jobs,
but so far it's pretty rudimentary. I quickly realised that this was another
place (cf. blosxom4nagios) where
blosxom could be used to provide a pretty
useful gui with very little work.

cronologue(1) is the wrapper, written in perl, which logs job records and
stdout/stderr output via standard HTTP PUTs back to a designated apache
server, as flat text files. Parameters can be used to control whether job
records are always created, or only when there is output produced. There's
also a --passthru mode in which stdout and stderr streams are still output,
allowing both email and cronologue output to be produced.

On the server side a custom blosxom install is used to display the job records,
which can be filtered by hostname or by date. There's also an RSS feed available.

Obligatory screenshot:

Update: I should add that RPMs for CentOS5 (but which will probably work on
most RPM-based distros) are available from
my yum repository.

Been playing with Riak recently, which is
one of the modern dynamo-derived nosql databases (the other main ones being
Cassandra and Voldemort). We're evaluating it for use as a really large
brackup datastore, the primary attraction
being the near linear scalability available by adding (relatively cheap) new
nodes to the cluster, and decent availability options in the face of node
failures.

I've built riak packages for RHEL/CentOS 5, available at my
repository,
and added support for a riak 'target' to the
latest version (1.10) of brackup
(packages also available at my repo).

The first thing to figure out is the maximum number of nodes you expect
your riak cluster to get to. This you use to size the ring_creation_size
setting, which is the number of partitions the hash space is divided into.
It must be a power of 2 (64, 128, 256, etc.), and the reason it's important
is that it cannot be easily changed after the cluster has been created.
The rule of thumb is that for performance you want at least 10 partitions
per node/machine, so the default ring_creation_size of 64 is really only
useful up to about 6 nodes. 128 scales to 10-12, 256 to 20-25, etc. For more
info see the Riak Wiki.

Here's the script I use for configuring a new node on CentOS. The main
things to tweak here are the ring_creation_size you want (here I'm using
512, for a biggish cluster), and the interface to use to get the default ip
address (here eth0, or you could just hardcode 0.0.0.0 instead of $ip).
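
A sketch of what such a configure script looks like (this assumes the riak 1.x
config layout of /etc/riak/app.config and /etc/riak/vm.args, and the sed
expressions are illustrative - check them against your actual config files):

#!/bin/sh
# riak_configure - minimal sketch of per-node riak configuration
RING_SIZE=512
ip=$(ip -4 -o addr show eth0 | awk '{ split($4, a, "/"); print a[1]; exit }')

# name the erlang node after this host's ip address
sed -i "s|^-name .*|-name riak@${ip}|" /etc/riak/vm.args

# listen on the node's ip rather than localhost
sed -i "s|127\.0\.0\.1|${ip}|g" /etc/riak/app.config

# add ring_creation_size to the riak_core section (once only)
grep -q ring_creation_size /etc/riak/app.config ||
  sed -i "s|{riak_core, \[|{riak_core, [\n    {ring_creation_size, ${RING_SIZE}},|" /etc/riak/app.config

service riak start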

Save this to a file called e.g. riak_configure, and then to configure a couple
of nodes you do the following (note that NODE is any old internal hostname you use
to ssh to the host in question, but FIRST_NODE needs to use the actual -name
parameter defined in /etc/riak/vm.args on your first node):
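
# a sketch of that process - riak-admin join is the pre-riak-1.2 way of
# joining a cluster; later versions use the 'riak-admin cluster' subcommands
scp riak_configure $NODE:
ssh $NODE sh ./riak_configure
# join the new node to the cluster, using the first node's -name value
ssh $NODE riak-admin join $FIRST_NODE      # e.g. riak-admin join riak@192.168.1.10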

Problem: you've got a remote server that's significantly hosed, either
through a screwup somewhere or a power outage that did nasty things to
your root filesystem or something. You have no available remote hands,
and/or no boot media anyway.

Preconditions: You have another server you can access on the same
network segment, and remote access to the broken server, either through
a DRAC or iLO type card, or through some kind of serial console server
(like a Cyclades/Avocent box).

Solution: in extremis, you can do a remote rebuild. Here's the simplest
recipe I've come up with. I'm rebuilding using centos5-x86_64 version
5.5; adjust as necessary.

Note: dnsmasq, mrepo and syslinux are not core CentOS packages,
so you need to enable the rpmforge
repository to follow this recipe. This just involves:
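
# install the rpmforge-release package for your distro/arch - the exact
# version and url below will date, so check the repoforge site for the
# current package
rpm -Uvh http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.3-1.el5.rf.x86_64.rpm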

1. On your working box (which you're now going to press into service as a
build server), install and configure dnsmasq
to provide dhcp and tftp services:

# Install dnsmasq
yum install dnsmasq
# Add the following lines to the bottom of your /etc/dnsmasq.conf file
# Note that we don't use the following ip address, but the directive
# itself is required for dnsmasq to turn dhcp functionality on
dhcp-range=ignore,192.168.1.99,192.168.1.99
# Here use the broken server's mac addr, hostname, and ip address
dhcp-host=00:d0:68:09:19:80,broken.example.com,192.168.1.5,net:centos5x
# Point the centos5x tag at the tftpboot environment you're going to setup
dhcp-boot=net:centos5x,/centos5x-x86_64/pxelinux.0
# And enable tftp
enable-tftp
tftp-root = /tftpboot
#log-dhcp
# Then start up dnsmasq
service dnsmasq start

3. Finally, finish setting up your tftp environment. mrepo should have copied
appropriate pxelinux.0, initrd.img, and vmlinuz files into your
/tftpboot/centos5-x86_64 directory, so all you need to supply is an
appropriate pxelinux boot config:
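
Something like this in /tftpboot/centos5-x86_64/pxelinux.cfg/default (the
method url and serial console settings are illustrative - adjust for your
build server and console setup):

default centos5
prompt 0
timeout 30

label centos5
  kernel vmlinuz
  append initrd=initrd.img method=http://192.168.1.1/centos5-x86_64 ip=dhcp console=ttyS0,9600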

Following on from my IPMI explorations, here's the next
chapter in my getting-down-and-dirty-with-dell-hardware-on-linux adventures.
This time I'm setting up Dell's
OpenManage Server Administrator
software, primarily in order to explore being able to configure bios settings
from within the OS. As before, I'm running CentOS 5, but OMSA supports any of
RHEL4, RHEL5, SLES9, and SLES10, and various versions of Fedora Core and
OpenSUSE.

Here's what I did to get up and running:

# Configure the Dell OMSA repository
wget -O bootstrap.sh http://linux.dell.com/repo/hardware/latest/bootstrap.cgi
# Review the script to make sure you trust it, and then run it
sh bootstrap.sh
# OR, for CentOS5/RHEL5 x86_64 you can just install the following:
rpm -Uvh http://linux.dell.com/repo/hardware/latest/platform_independent/rh50_64/prereq/\
dell-omsa-repository-2-5.noarch.rpm
# Install base version of OMSA, without gui (install srvadmin-all for more)
yum install srvadmin-base
# One of the daemons requires /usr/bin/lockfile, so make sure you've got procmail installed
yum install procmail
# If you're running an x86_64 OS, there are a couple of additional 32-bit
# libraries you need that aren't dependencies in the RPMs
yum install compat-libstdc++-33-3.2.3-61.i386 pam.i386
# Start OMSA daemons
for i in instsvcdrv dataeng dsm_om_shrsvc; do service $i start; done
# Finally, you can update your path by doing logout/login, or just run:
. /etc/profile.d/srvadmin-path.sh

Now to check whether you're actually functional you can try a few of the
following (as root):

omconfig about
omreport about
omreport system -?
omreport chassis -?

omreport is the OMSA CLI reporting/query tool, and omconfig is the
equivalent update tool. The main documentation for the current version of
OMSA is here.
I found the CLI User's Guide
the most useful.

omconfig allows setting object attributes using a key=value syntax, which
can get reasonably complex. See the CLI User's Guide above for details, but
here are some examples of messing with various bios settings:
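
# attribute names vary with OMSA version and hardware - check what
# 'omreport chassis biossetup' lists for your box before relying on these
omreport chassis biossetup
omconfig chassis biossetup attribute=numlock setting=on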

After using brackup for a while you find
you have a big list of backups sitting on your server, and start to think
about cleaning up some of the older ones. The standard brackup tool for this
is brackup-target, and the prune and gc (garbage collection)
subcommands.

This simple scheme - "keep the last N backups" - works pretty nicely for
backups you do relatively infrequently. If you do more frequent backups,
however, you might find yourself wanting to be able to implement more
sophisticated retention policies. Traditional backup regimes often involve
policies like this:

keep the last 2 weeks of daily backups

keep the last 8 weekly backups

keep monthly backups forever

It's not necessarily obvious how to do something like this with brackup, but
it's actually pretty straightforward. The trick is to define multiple
'sources' in your brackup.conf, one for each backup 'level' you want to use.
For instance, to implement the regime above, you might define the following:
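
# a sketch - the source names are arbitrary, and the options are the
# standard Brackup::Root ones
[SOURCE:home-daily]
path = /home/gavin
chunk_size = 64MB

[SOURCE:home-weekly]
path = /home/gavin
chunk_size = 64MB

[SOURCE:home-monthly]
path = /home/gavin
chunk_size = 64MB

You then run brackup against each source from cron at the appropriate
frequency, and prune each level separately using brackup-target's
--keep-backups option (e.g. 14 for the daily level, 8 for the weekly, and
simply never pruning the monthly one).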

(Okay, brand new year - must be time to get back on the blogging wagon ...)

Linux Journal recently had a really good article
by Philip Martin on Anycast DNS. It's
well worth a read - I just want to point it out and record a cutdown version of
how I've been setting it up recently.

As the super-quick intro, anycast is the idea of providing a network service
at multiple points in a network, and then routing requests to the 'nearest'
service provider for any particular client. There's a one-to-many relationship
between an ip address and the hosts that are providing services on that address.

In the LJ article above, this means you provide a service on a /32 host address,
and then use an (interior) dynamic routing protocol to advertise that address
to your internal routers. If you're a non-cisco linux shop, that means using
quagga/ospf.

The classic anycast service is dns, since it's stateless and gets the full
high availability and low latency benefits of a distributed anycast service.

So here's my quick-and-dirty notes on setting up an anycast dns server on
CentOS/RHEL using dnsmasq for dns, and quagga zebra/ospfd for the routing.
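
In outline, the per-node setup is something like this (the anycast address
192.168.1.53 is illustrative):

# 1. put the anycast address on a loopback alias
ip addr add 192.168.1.53/32 dev lo
# 2. have dnsmasq listen on it - in /etc/dnsmasq.conf:
#      listen-address=192.168.1.53
# 3. advertise the /32 via ospf - in /etc/quagga/ospfd.conf:
#      router ospf
#        network 192.168.1.53/32 area 0.0.0.0
# then start everything up
for s in dnsmasq zebra ospfd; do service $s start; done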

And then check on your router that the anycast dns address is getting advertised
and picked up by your router. If you're using cisco, you probably know how to
do that; if you're using linux and quagga, the useful vtysh commands are:
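
(using the same illustrative anycast address as in the sketch above)

show ip ospf neighbor
show ip ospf database
show ip route 192.168.1.53/32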

Further to my earlier post, I've spent a good chunk
of time implementing brackup over the last few weeks, both at home for my
personal backups, and at $work on some really large trees. There are a few
gotchas along the way, so thought I'd document some of them here.

Active Filesystems

First, as soon as you start trying to brackup trees of any size you find
that brackup aborts if it finds a file has changed between the time it
initially walks the tree and when it comes to back it up. On an active
filesystem this can happen pretty quickly.

This is arguably reasonable behaviour on brackup's part, but it gets
annoying pretty fast. The cleanest solution is to use some kind of
filesystem snapshot to ensure you're backing up a consistent view of your
data and a quiescent filesystem.

I'm using linux and LVM, so I'm using LVM snapshots for this, using
something like:
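
# snapshot the LV holding ${PART} and mount it read-only
# (the vg00 volume group name and 5G snapshot size are illustrative)
PART=home
lvcreate --snapshot --size 5G --name ${PART}_snap /dev/vg00/${PART}
mkdir -p /${PART}_snap
mount -o ro /dev/vg00/${PART}_snap /${PART}_snap

# ... and when the backup has finished, tear the snapshot down again
umount /${PART}_snap
lvremove -f /dev/vg00/${PART}_snap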

You can then do your backup using the /${PART}_snap tree instead of your
original ${PART} one.

Brackup Digests

So snapshots works nicely. Next wrinkle is that by default brackup writes its
digest cache file to the root of your source tree, which in this case is
readonly. So you want to tell brackup to put that in the original tree, not
the snapshot, which you do in your ~/.brackup.conf file e.g.
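
# a sketch - digestdb_file is the standard Brackup::Root option, and the
# paths here follow the snapshot layout above
[SOURCE:home]
path = /home_snap
digestdb_file = /home/.brackup-digest.db
ignore = \.brackup-digest\.db$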

I've also added an explicit ignore rule for these digest files here. You
don't really need to back these up as they're just caches, and they can get
pretty large. Brackup automatically skips the digestdb_file for you, but it
doesn't skip any others you might have, if for instance you're backing up
the same tree to multiple targets.

Synching Backups Between Targets

Another nice hack you can do with brackup is sync backups on
filesystem-based targets (that is, Target::Filesystem, Target::Ftp, and
Target::Sftp) between systems. For instance, I did my initial home directory
backup of about 10GB onto my laptop, and then carried my laptop into where
my server is located, and then rsync-ed the backup from my laptop to the
server. Much faster than copying 10GB of data over an ADSL line!

Similarly, at $work I'm doing brackups onto a local backup server on the
LAN, and then rsyncing the brackup tree to an offsite server for disaster
recovery purposes.

There are a few gotchas when doing this, though. One is that
Target::Filesystem backups default to using colons in their chunk file names
on Unix-like filesystems (for backwards-compatibility reasons), while
Target::Ftp and Target::Sftp ones don't. The safest thing to do is just to
turn off colons altogether on Filesystem targets:
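
# in the target section of brackup.conf - option name per the
# Brackup::Target::Filesystem docs
[TARGET:server]
type = Filesystem
path = /data/backups/brackup
no_filename_colons = 1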

Second, brackup uses a local inventory database to avoid some remote
filesystem checks and improve performance, so that if you replicate a backup
onto another target you also need to make a copy of the inventory database
so that brackup knows which chunks are already on your new target.

The inventory database defaults to $HOME/.brackup-target-TARGETNAME.invdb
(see perldoc Brackup::InventoryDatabase), so something like the following
is usually sufficient:
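
# e.g. after rsyncing a backup from a 'laptop' target to a 'server' target
# (target names illustrative)
cp ~/.brackup-target-laptop.invdb ~/.brackup-target-server.invdb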

Third, if you want to do a restore using a brackup file (the
SOURCE-DATE.brackup output file brackup produces) from a different
target, you typically need to make a copy and then update the header
portion for the target type and host/path details of your new target.
Assuming you do that and your new target has all the same chunks, though,
restores work just fine.

I've been playing around with Brad Fitzpatrick's brackup for the last couple of weeks.
It's a backup tool that "slices, dices, encrypts, and sprays across the
net" - notably to Amazon S3,
but also to filesystems (local or networked), FTP servers, or SSH/SFTP
servers.

I'm using it to backup my home directories and all my image and music
files both to a linux server I have available in a data centre (via
SFTP) and to Amazon S3.

brackup's a bit rough around the edges and could do with some better
documentation and some optimisation, but it's pretty useful as it stands.
Here are a few notes and tips from my playing so far, to save others a
bit of time.

Version: as I write the latest version on CPAN is 1.06, but that's
pretty old - you really want to use the
current subversion trunk
instead. Installation is the standard perl module incantation e.g.

# Checkout from svn or whatever
cd brackup
perl Makefile.PL
make
make test
sudo make install

Basic usage is as follows:

# First-time through (on linux, in my case):
cd
mkdir brackup
cd brackup
brackup
Error: Your config file needs tweaking. I put a commented-out template at:
/home/gavin/.brackup.conf
# Edit the vanilla .brackup.conf that was created for you.
# You want to setup at least one SOURCE and one TARGET section initially,
# and probably try something smallish i.e. not your 50GB music collection!
# The Filesystem target is probably the best one to try out first.
# See 'perldoc Brackup::Root' and 'perldoc Brackup::Target' for examples
$EDITOR ~/.brackup.conf
# Now run your first backup changing SOURCE and TARGET below to the names
# you used in your .brackup.conf file
brackup -v --from=SOURCE --to=TARGET
# You can also do a dry run to see what brackup's going to do (undocumented)
brackup -v --from=SOURCE --to=TARGET --dry-run

If all goes well you should get some fairly verbose output about all the files
in your SOURCE tree that are being backed up for you, and finally a brackup
output file (typically named SOURCE-DATE.brackup) should be written to your
current directory. You'll need this brackup file to do your restores, but it's
also stored on the target along with your backup, so you can also retrieve it
from there (using brackup-target, below) if your local copy gets lost, or if
you need to restore to somewhere else.

Here's an interesting one: one of my clients has been seeing mysql
db connections from one of their app servers (and only one) being
periodically locked out, with the following error message reported
when attempting to connect:

Host _hostname_ is blocked because of many connection errors.
Unblock with 'mysqladmin flush-hosts'.

There's no indication in any of the database logs of anything
untoward, or any connection errors at all, in fact. As a workaround,
we've bumped up the max_connect_errors setting on the mysql
instance, and haven't really had time to dig much further.
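
For the record, the immediate workaround looks like this (the new limit is
arbitrary):

mysqladmin flush-hosts
mysql -e "SET GLOBAL max_connect_errors = 10000;"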

Till tonight, when I decided to figure out what was going on.

Turns out there's plenty of other people seeing this too, although
MySQL seems to be in "it's not a bug, it's a feature" mode - see
this bug report.

That thread helped clue me in, however. It turns out that mysql counts
any TCP connection to its port, even ones that never attempt to log in,
as a connection error, but it only logs the ones that do attempt a login.
So there's a nice class of silent errors - and in fact, a nice DOS attack
against MySQL - if you make plain TCP connections to mysql without logging
in.

We, being clever and careful, were doing exactly that with
nagios - making a simple TCP connection to
port 3306 - in order to simply and cheaply check that mysql was
listening on that port. Hmmmm.

Easy enough to remedy, of course, once you figure out what's going
on. I even had a nice nagios plugin lying around to let me do more
sophisticated database checks -
check_db_query_rowcount -
so just had to replace the simple check_tcp check with that, and all
is right with the world.

But it's a plain and simple bug, and MySQL need to get it fixed.
Personally I think a simple tcp connection should not count as a
connection error at all without a login attempt (assuming it's not
left half-open etc.). Alternatively, if you do want to count that
as a connection error fine, but at least log some kind of error so
the issue is discoverable and can be handled by someone.