Thursday, 20 December 2007

It's been a tough couple of months for Lancaster, with our SE giving us a number of problems.

Our first drama, at the start of the month, was caused by unforseen complications with our upgrade to dcache 1.8. Knowing that we were low on the support list due to being only a Tier 2, but feeling emboldened by the highly useful srm 2.2 workshop in Edinburgh and the good few years we've spent in the dcache trenches we decided to take the plunge. And faced a good few days of downtime beyond the one we had scheduled as we faced down a number of bugs with the early versions of dcache 1.8 (fixed by upgrading to higher patch levels), then faced problems due to changes in the gridftp protocol highlighted inconsisencies with the users on our pnfs node and gridftp door node. Due to a hack long ago several VOs had different user.conf entries and therefore UIDs on our door nodes and pnfs node. This never caused problems before, but after the upgrade the doors were passing the uids to the pnfs node so new files and directories were created with the correct group (as the gids were consistent) but a wrong uid, causing permission troubles whenever a delete was called. This was a classic case of a problem that was hell to the cause behind but once figured out was thankfully easy to solve. Once we fixed that one it was green tests for a while.

Then dcache drama number two came along a week later- a massive postgres db failure on our pnfs node. The postgres database contains all the information that dcache uses to match the fairly anonymously named files on the poolnodes to entries in the pnfs namespace- without it dcache has no idea which files are which, so with it bust the files are almost as good as lost. Which is why it should be backed up regularly. We did this twice daily, as least we thought we did- a cron problem had meant that our backups hadn't been made for a while and a rollback to it would mean a fair amount of data might be lost. So we spent 3 days doing arcane sql rituals to try and bring back the database, but it had too heavily corrupted itself and we had to rollback.

The cause of the database crash and corruption was a "wrap around" error. Postgres requires regular "vacuuming" to clean up after itself, otherwise it essentially starts writing over itself. This crash took us by surprise, as we not only have postgres looking after itself with optimised auto-vacuuming occuring regularly, but during the 1.8 upgrade I took the time to do a manual full vacuum, which was only a week before this one. Also postgres is designed to freeze in the event of being at risk of a wraparound error rather then overwrite itself, and this didn't happen. The first we heard of it pnfs and postgres had stopped responding and there were wraparound error messages in the logs, no warning of the impending disaster.

Luckily the data rollback seems to have not affected the VOs too much. We had one ticket from Atlas, who after we explained our situation to them handily cleaned up their file catalogues. The guys over at dcache hinted at a possible way of rebuilding the lost databases from the pnfs logs, although sadly this isn't simply a case of recreating pnfs related sql entries and they've been too busy with Tier 1 support to look into this further.

Since then we've fixed our backups and applied a nagios test to ensure the backups are less then a day old-the biggest trouble here was that the reluctance to use an old backup meant we wasted over 3 days banging our heads trying to bring back a dead database rather then a few hours it would take to restore from backup and verify things were working. And it appears the experiments were more affected by us being in downtime then by the loss of easily replicatable data. In the end I think I caused more trouble going over the top on my data recovery attempts then if I had been gung ho and used the old backup once things looked a bit bleak for the remains of the postgres database. At least we've now set things up so the likeliness of it happening again is slim, but the circumstances behind the original database errors are still unknown, which leaves me a little worried.

Have a good Winter Festival and Holday Season everyone- but before you head off to your warm fires and cold beers check the age of your backups just in case...

Thursday, 6 December 2007

- core path was set to /tmp/core-various-param in sysctl.conf and was creating a lot of problems to dzero jobs. It was also creating problems to others as they were filling /tmp and consequently maradona errors were looming around. The path has been changed back to the default and also I set core size 0 in limits.conf to prevent any other problem repeating itself with a lesser degree in /scratch.

- dcache doors were open on the wrong nodes. node_config is the correct one but it was copied before stopping dcache-core service and now /etc/init.d/dcache-core stop doesn't have any effect. The doors have also a keep alive script so it is not enough to kill the java proesses one has to kill also the parents.

- cfengine config files are being rewritten to make them less criptic.