The ‘production’ stage was configured, but even adding things like raise '' in the config/deploy/production.rb file failed to have an effect. It was like the file wasn’t even being loaded, except that trying to execute against a non-existent stage (‘work_damn_you’ stage, anyone?) did raise an error.

In the end, out of desperation, I ran dos2unix on the files, which weirdly complained about a binary file. Hrm. I re-saved the files as UTF-8 and all started working.

Having migrated to PostgreSQL 9.2 and set up Streaming Replication successfully, the next obvious thing you want to know is, “Is it still working?” Fortunately, PostgreSQL comes out-of-the-box with a few tools that can help.

pg_stat_replication

pg_stat_replication is really only useful on a master database server. It shows the slaves connected to the master, plus some other fairly opaque information about ‘where it’s up to’, for example:
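On the master, that looks something like the following (the column values here are illustrative, and the view has more columns than shown):

```sql
-- Run on the master; the output below is illustrative
SELECT client_addr, state, sent_location, replay_location
FROM pg_stat_replication;

--  client_addr |   state   | sent_location | replay_location
-- -------------+-----------+---------------+-----------------
--  10.10.0.10  | streaming | 0/29000060    | 0/29000060
```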

meaning that slave 10.10.0.10 is 0 bytes behind the master. In my experience it stays pretty low most of the time, but how big is too big? As long as it stays below the size of 1-2 WAL segments (use “SHOW wal_segment_size” to work out what your config is), things are fine. If it falls further behind than that, you might want to find out what’s going on.
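To put an actual number on ‘how far behind’, PostgreSQL 9.2 has pg_xlog_location_diff(), which computes the byte difference between two WAL locations; a sketch, run on the master:

```sql
-- Bytes between the master's current WAL write position and
-- each slave's last replayed position
SELECT client_addr,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS byte_lag
FROM pg_stat_replication;
```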

SELECT pg_last_xact_replay_timestamp()

There are a few functions that only work on the slave too, most notably pg_last_xact_replay_timestamp(), which shows the timestamp of the last transaction applied to the slave. On a reasonably busy server, using something like:
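A minimal version of that query, run on the slave:

```sql
-- Run on the slave: how long ago the last replayed transaction committed
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
```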

is good enough to see how far behind the slave is. However, if there’s not much activity, this number will keep increasing even though the slave is completely caught up, because now() gets further and further away from the timestamp of the last committed transaction. Instead, per this post on the mailing list, compare the location information too:
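A sketch of that combined check, run on the slave; it reports zero whenever the receive and replay locations match, i.e. the slave is fully caught up:

```sql
SELECT CASE
         WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
           THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS lag_seconds;
```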

Finally, for anyone using ServerDensity, I updated reinbach’s plugin to include a simple count check of connected slaves on a master, and the slave lag function above.
See it here: https://github.com/dansketcher/sd-postgresql

Having been running PostgreSQL 8.4 for a number of years in production (if it ain’t broke…), the time came to do an upgrade (8.4 on CentOS 5 was getting a bit dated). We decided, after some discussion, to take up our hosting provider on a managed HA Windows cluster with shared storage and a redundant slave replica. This was mainly to reduce our own internal server management overhead, as they would be taking care of all aspects of the servers’ maintenance.

This all sounded great. After some tests at moderate load, we migrated the databases one by one via a piped pg_dump/pg_restore, which was simplicity itself apart from the outage. But then we started getting reports of slowness, which was worrying. I updated the application to log slow queries (easier to do this at the application level so that they could go into Splunk) and saw SOME instances (not all) of queries that previously took ~3 seconds taking anywhere from 9 to 200(!) seconds. As you can imagine, this was very disconcerting.

Of course, the server had been configured beforehand using Windows ‘best practice’ settings, which are mostly the same as for Linux but with smaller shared_buffers and effective_cache_size values, per the PostgreSQL Wiki page.

The primary approach we started with was to use EXPLAIN ANALYSE extensively, to work out whether it was just a case of the new server not being ‘run in’ effectively and not having enough stats. Some solid gains were made, but the problem was not ‘average’ query time, it was ‘worst case’ query time. Some queries would run fine one day but terribly the next, and of course once the PostgreSQL cache was primed it was as speedy as ever. Many indexes were changed and updated, the DB was VACUUM ANALYSEd, all the usual stuff. In a display of desperation, we even ‘defragged’ the disk (which would have felt very retro if it were not so unpleasant).

At no point were RAM, CPU or disk IO even stretched. The servers were effectively over-specced; we ramped them up to see if that would help, and it did, but not to the extent needed for those worst-case queries.

Another very annoying thing was that the PostgreSQL server apparently could not maintain connections properly. Despite having plenty of headroom in the max_connections setting, we consistently saw these kinds of errors reported by the client (class names from the ruby-pg client):

PGError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
PGError: no connection to the server
PG::Error: could not receive data from server: Connection timed out

The last of these errors was particularly nasty, as usually the statement was executed successfully, but the client had no idea and no record, so while each DB was internally consistent, inter-db communication processes would get out of sync and needed to be manually reconciled.

So, how did we solve these problems?

We didn’t. We moved everything off the cluster to a new Linux PostgreSQL 9.2 master/slave streaming replication setup. Many many thanks to Bartek Ciszkowski for his excellent writeup, which laid a very solid foundation for our new setup. Since the new DB servers went into production, all of the work we’ve done with indexes has meant the sites are performing faster than ever, and we’ve not seen one of the errors above either (touch wood!).

Incidentally, we use (async) streaming replication alongside a separate PITR recovery setup, which uses the old-style backup commands rather than pg_basebackup so that we can rsync-update the base regularly without having to store a new copy every time. Also note that, because of how replication is implemented, you can’t run these backup commands on the slave; they have to be run on the master, which is a shame. Still, we rsync to the slave for the PITR backup, so at least it’s not doing double disk IO.
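A rough sketch of that old-style base backup; the paths, backup host and label here are assumptions, not our actual setup:

```shell
# On the master: mark the start of a base backup
# (the second argument requests an immediate checkpoint)
psql -U postgres -c "SELECT pg_start_backup('base_backup', true);"

# rsync the data directory to the backup host; pg_xlog is excluded
# because WAL is archived separately for PITR
rsync -a --delete --exclude=pg_xlog /var/lib/pgsql/9.2/data/ backuphost:/backups/pgsql/base/

# Mark the end of the backup
psql -U postgres -c "SELECT pg_stop_backup();"
```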

I went onto the machine and confirmed that, using the SSH key there, I could check out the repo properly. Unfuddle support tried to help but could not. Fortunately, I came across this GitHub resource, which explains some of the settings. On about the third re-reading, I noticed the real impact of this:

Agent Forwarding

If you’re using your own private keys for git you might want to tell Capistrano to use agent forwarding with this command. Agent forwarding can make key management much simpler as it uses your local keys instead of keys installed on the server.

ssh_options[:forward_agent] = true

So, I tried removing that line from my cap deploy.rb, and the checkout worked fine.

The important thing to note here is that Rails inserts two line breaks before printing “Processing blah blah”, so we’ll use that to split. The props.conf documentation recommends using the LINE_BREAKER setting to handle multiline events, and that seemed easier to me as it uses just one regexp to break the input stream. The important part for this kind of splitting is to know:

* Wherever the regex matches, the start of the first matching group is considered the end of the previous event, and the end of the first matching group is considered the start of the next event.
* The contents of the first matching group are ignored as event text.

What this means is that the LINE_BREAKER regexp (using PCRE) must have a first match group that can be discarded. If you grep over the default provided files (`grep -Rn "LINE_BREAKER" /opt/splunk/`), you’ll see that most of the options use a line break, or multiples thereof, as the first match group. We’ll do the same. Other than that, we just want to match on the “Processing” text as the start delimiter… which comes out to the below…
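A props.conf stanza along these lines should do it (the sourcetype name is an assumption; note that only the first capture group, the line breaks, is discarded, so “Processing” itself remains at the start of each event):

```
[rails]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)Processing
```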

To increase our application intelligence, I’ve started installing Splunk. However, what was not clear to me was the best way of getting my Web/App Rails logs into Splunk. The Web server is on a different box (as is the DB server) so I wanted to know the best way of getting the log data to the Splunk host.

Posts like this imply that Syslog-ng is a good way of getting data to Splunk, but the hidden detail is that if the Splunk server is down, the data is lost. A better way (thanks to the helpful folk on IRC) is to set up a Splunk forwarder on the Web/App/DB hosts and go from there. I initially set them up as (free-licensed) Standard Forwarders, so that the web interface is still running for configuration. Remember that any firewall (iptables) will need to be opened for the management ports (8000, 8089) and the forwarding port (9997).
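On CentOS, opening those ports might look like the following (the chain and surrounding rules are assumptions about your firewall setup):

```shell
# Splunk web UI and splunkd management port
iptables -A INPUT -p tcp --dport 8000 -j ACCEPT
iptables -A INPUT -p tcp --dport 8089 -j ACCEPT

# Receiving port on the indexer for forwarded data
iptables -A INPUT -p tcp --dport 9997 -j ACCEPT

# Persist the rules across reboots
service iptables save
```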

Something else I discovered (but can no longer find in the docs) is that if you configure data sources using the web interface on a Forwarder, but subsequently change the instance to a LightForwarder, you also have to move the configuration files from $SPLUNK_HOME/etc/apps/search/local/ to $SPLUNK_HOME/etc/system/local/.

Edit 2010-03-17: Harder than it first seemed! That permission only allows Apache to load Passenger. See here for the full requirements

I’ve just started playing with SELinux on CentOS, and while the idea is great, it’s not exactly what I’m used to. Take, for example, adding Phusion Passenger to Apache. When I first restarted, I got this:
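The actual denial is best dug out of the audit log; a sketch using the standard SELinux audit tools (from the audit and policycoreutils packages on CentOS):

```shell
# Show recent AVC (access vector) denials recorded by SELinux
ausearch -m avc -ts recent

# Summarise the httpd denials and show what a policy allowing them would contain
grep httpd /var/log/audit/audit.log | audit2allow
```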

Not much of a title, I know, but something for fellow stumped Google-wanderers…

I’ve been playing around with deprec for automatically provisioning new servers for my Rails applications. However, I hit a strange problem where certain command-line auto-completions did not work; I’d get an error message like this:

$ ls
temp
$ cd [hit tab to get autocomplete]
-sh:

The confusing thing was that only the unprivileged user exhibited this behaviour, not the root user.

I tried reinstalling the bash-completion package to no avail (even an aptitude purge followed by aptitude install did not work). Eventually I came across this rather long post on bugs.debian.org where someone else had the same problem. Essentially it comes down to the unprivileged user using /bin/sh as the default shell; under those circumstances autocompletion is not supposed to function (although presumably without error). This is despite /bin/sh being a symlink to /bin/bash.

The fix, then, is actually quite simple. Change the default shell for all your unprivileged users to bash:

# usermod -s /bin/bash dansketcher

and then when you make new users, make sure you select bash explicitly:
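For example (the username here is hypothetical):

```shell
# -m creates the home directory, -s sets the login shell explicitly
useradd -m -s /bin/bash newdeveloper
```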