Monthly Archives: January 2012

I spent a few hours today figuring out why postfix was ignoring the /etc/aliases file. Mail to root was being delivered to root@example.tld instead of being rewritten according to /etc/aliases. The solution is to use virtual_alias_maps instead.

In /etc/postfix/main.cf, remove the alias_maps and alias_database settings like so:

We had our first outage since moving to Rackspace on 27 December. I came online to emails saying the site was down. I freaked out. An outage within the first few days on my watch. Crikey.

Looking into the issue, I found that memory usage started spiking around 4:40am. By 12:20pm the server had become unresponsive, with all available memory and swap space filled. It took almost 8 hours for the server to crash. I should have been warned about the problem within that window and fixed it before the server ever went down. I set out to sort that out, and monit appears to be the best tool for the job.

Issues

I hit a couple of issues. The first one had me stumped for quite a while. On Ubuntu, mysql does not create a pid file by default. This led monit to think it wasn’t running, try to start it, fail, and then freak out. The solution turned out to be simple: add “pid-file = /var/run/mysqld/mysqld.pid” to the [mysqld] section of my.cnf, then restart mysql.
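The apache2.mon config further down declares "depends on mysql", so monit also needs a service named mysql, which this post doesn't show. As a hedged sketch only, assuming the pid-file path above, Ubuntu's "service mysql" job name, and MySQL listening on its default 127.0.0.1:3306, a matching /etc/monit/conf.d/mysql.mon might look like this (the timeout and restart limits are illustrative, not our production values):

```
# /etc/monit/conf.d/mysql.mon (sketch; paths and limits are assumptions)
# The service name must match "depends on mysql" in apache2.mon.
check process mysql with pidfile /var/run/mysqld/mysqld.pid
    start program = "/usr/sbin/service mysql start" with timeout 60 seconds
    stop program  = "/usr/sbin/service mysql stop"
    # Restart if MySQL stops answering the MySQL protocol on its TCP port
    if failed host 127.0.0.1 port 3306 protocol mysql then restart
    # Give up (stop monitoring and alert) if it keeps falling over
    if 3 restarts within 8 cycles then timeout
```

Because the check is keyed on the pid file, it only behaves once the pid-file line is in my.cnf.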

Second, I copied the request /index.html from one of the existing configs, but this was a WordPress domain with no /index.html file. The request returned a 404, so monit thought apache was down. Make sure your apache monit config references a URL that exists!

Otherwise, monit was a breeze to set up: sudo apt-get install monit, edit /etc/default/monit, create the config files, and then sudo service monit start. I’d recommend keeping an eye on the web interface for the first few minutes; I went away and came back to find that monit had killed and restarted apache and mysql a few times because of issues in my config.

Config files

I collated a few resources to build our monit config. I had real trouble configuring multiple mail servers, but I got there in the end. Here’s a summary of our monit config files. I named them in the format /etc/monit/conf.d/service.mon, as that was the default on Ubuntu.

###############################################################################
## Monit control file
###############################################################################
##
## Comments begin with a '#' and extend through the end of the line. Keywords
## are case insensitive. All paths MUST BE FULLY QUALIFIED, starting with '/'.
##
## Below you will find examples of some frequently used statements. For
## information about the control file, a complete list of statements and
## options please have a look in the monit manual.
##
##
###############################################################################
## Global section
###############################################################################
##
## Start monit in the background (run as a daemon):
#
set daemon 60
# set daemon 120 # check services at 2-minute intervals
# with start delay 240 # optional: delay the first check by 4 minutes
# # (by default check immediately after monit start)
#
#
## Set syslog logging with the 'daemon' facility. If the FACILITY option is
## omitted, monit will use 'user' facility by default. If you want to log to
## a standalone log file instead, specify the path to a log file
#
set logfile /var/log/monit.log
# set logfile syslog facility log_daemon
#
#
### Set the location of monit id file which saves the unique id specific for
### given monit. The id is generated and stored on first monit start.
### By default the file is placed in $HOME/.monit.id.
#
# set idfile /var/.monit.id
#
### Set the location of monit state file which saves the monitoring state
### on each cycle. By default the file is placed in $HOME/.monit.state. If
### state file is stored on persistent filesystem, monit will recover the
### monitoring state across reboots. If it is on temporary filesystem, the
### state will be lost on reboot.
#
# set statefile /var/.monit.state
#
## Set the list of mail servers for alert delivery. Multiple servers may be
## specified using a comma separator. By default monit uses port 25; this
## can be overridden with the PORT option.
set mailserver
    smtp.sendgrid.net
        port 587
        username "%%%USERNAME%%%"
        password "%%%PASSWORD%%%"
        using tlsv1
    ,
    smtp.gmail.com
        port 587
        username "%%%USERNAME%%%"
        password "%%%PASSWORD%%%"
        using tlsv1
    # The timeout and hostname options go after all mailserver definitions
    # with timeout 30 seconds
    hostname "%%%SERVER.FQDN.COM%%%"
#
# set mailserver mail.bar.baz, # primary mailserver
# backup.bar.baz port 10025, # backup mailserver on port 10025
# localhost # fallback relay
#
#
## By default monit will drop alert events if no mail servers are available.
## If you want to keep the alerts for a later delivery retry, you can use the
## EVENTQUEUE statement. The base directory where undelivered alerts will be
## stored is specified by the BASEDIR option. You can limit the maximal queue
## size using the SLOTS option (if omitted, the queue is limited by space
## available in the back end filesystem).
#
set eventqueue
    basedir /var/monit  # set the base directory where events will be stored
    slots 100           # optionally limit the queue size
#
#
## Send status and events to M/Monit (Monit central management: for more
## information about M/Monit see http://www.tildeslash.com/mmonit).
#
# set mmonit http://monit:monit@192.168.1.10:8080/collector
#
#
## Monit by default uses the following alert mail format:
##
## --8<--
## From: monit@$HOST                         # sender
## Subject: monit alert --  $EVENT $SERVICE  # subject
##
## $EVENT Service $SERVICE                   #
##                                           #
##      Date:        $DATE                   #
##      Action:      $ACTION                 #
##      Host:        $HOST                   # body
##      Description: $DESCRIPTION            #
##                                           #
## Your faithful employee,                   #
## monit                                     #
## --8<--
##
## You can override this message format or parts of it, such as subject
## or sender using the MAIL-FORMAT statement. Macros such as $DATE, etc.
## are expanded at runtime. For example, to override the sender:
#
set mail-format { from: %%%MONIT@FQDN.COM%%% }
# set mail-format { from: monit@foo.bar }
#
#
## You can set alert recipients here who will receive alerts if/when a
## service defined in this file has errors. Alerts may be restricted on
## events by using a filter as in the second example below.
#
set alert %%%USER-EMAIL@FQDN.com%%%
# set alert sysadm@foo.bar # receive all alerts
# set alert manager@foo.bar only on { timeout } # receive just service-
# # timeout alert
#
#
## Monit has an embedded web server which can be used to view status of
## services monitored, the current configuration, actual services parameters
## and manage services from a web interface.
#
set httpd port 2812 and
    use address localhost
    allow %%%USER%%%:%%%PASSWORD%%%
# set httpd port 2812 and
# use address localhost # only accept connection from localhost
# allow localhost # allow localhost to connect to the server and
# allow admin:monit # require user 'admin' with password 'monit'
# allow @monit # allow users of group 'monit' to connect (rw)
# allow @users readonly # allow users of group 'users' to connect readonly
#
#
###############################################################################
## Includes
###############################################################################
##
## It is possible to include additional configuration parts from other files or
## directories.
#
include /etc/monit/conf.d/*.mon

/etc/monit/conf.d/apache2.mon:

# CHECK PROCESS <unique name> <PIDFILE <path> | MATCHING <regex>>
# RHEL httpd, Ubuntu apache2
check process apache2 with pidfile /var/run/apache2.pid
    # New style
    start program = "/usr/sbin/service apache2 start" with timeout 90 seconds
    stop program  = "/usr/sbin/service apache2 stop"
    # Old style
    # start program = "/etc/init.d/apache2 start" with timeout 90 seconds
    # stop program  = "/etc/init.d/apache2 stop"
    # If Apache is using > 80% of the CPU for 5 checks, restart it
    if cpu > 80% for 5 cycles then restart
    # Could be used to limit how many child processes apache spawns
    # if children > 50 then alert
    # if children > 60 then restart
    # Check if apache is responding on port 80
    if failed host %%%PUBLIC_IP%%% port 80 protocol http
        and request "/"  # Some smallish page that should be available when the server is up.
                         # This page has to exist or the check will fail: avoid /index.html on WordPress, for example.
        with timeout 10 seconds
        # Sometimes Apache doesn't respond right away, so give it two chances
        # before forcing a restart.
        for 2 cycles
        then restart
    # Apache requires mysql to be running.
    # Disable this on web-only nodes.
    depends on mysql
    # If apache is restarting all the time, time out.
    # A timeout stops monitoring the service and sends an alert.
    if 3 restarts within 8 cycles then timeout

# As I understand it from Rackspace, CPU is allocated in proportion to server
# size: a 1GB server gets roughly a load average of 1, a 512MB server 0.5,
# and so on. Spare CPU can be used above that allocation, but relying on it
# will cause issues in the long term.
check system localhost
    # This is a 512MB slice, so sustained load above 0.5 will be problematic
    if loadavg (1min) > 6 then alert
    if loadavg (5min) > 4 then alert
    if loadavg (15min) > 0.5 then alert
    # Alert if memory usage goes above 80%
    if memory usage > 80% then alert
    # Don't fully understand these numbers, but they seem sensible
    if cpu usage (user) > 70% for 2 cycles then alert
    if cpu usage (system) > 50% for 2 cycles then alert
    if cpu usage (wait) > 50% for 2 cycles then alert
    # If the machine is under enormous load, reboot
    if loadavg (1min) > 20 for 3 cycles then exec "/sbin/shutdown -r now"
    if loadavg (5min) > 15 for 5 cycles then exec "/sbin/shutdown -r now"
    # If memory usage is sustained above 97%, something is wrong, reboot
    if memory usage > 97% for 3 cycles then exec "/sbin/shutdown -r now"