Thursday, 28 February 2008

Maintenance log for March 8th 2008

Attending: dwm

Status: Completed at 10:00hrs, March 8th 2008.

Summary:

The rack containing the Tastycake.net server kalimdor.tastycake.net will be briefly powered down so that the rack can be connected to a newly-installed power-distribution board.

As a result, no services will be accessible whilst the switchover is in progress. The at-risk period will last until 0200hrs, though the colo engineers hope to have normal services resumed by 0030hrs.

Works to be carried out:

Shut down kalimdor.tastycake.net. (Completed)

Wait whilst the co-location engineers switch the rack over to the new power-distribution feed. (Completed)

Boot kalimdor.tastycake.net. (Completed)

Verify services are running normally. (Completed)

Transcript, times are in GMT:

March 8th 2008

[0915] Services verified as functioning correctly. (There was a minor issue with the current experimental DNS service for the dwm.me.uk domain, caused by invalid zone configuration data; now corrected. It turns out that you're not allowed a CNAME alongside the SOA at the root of a zone, but A and AAAA records are fine.)
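For anyone wanting to catch this class of zone error before loading a zone, the rule can be sketched as a simple check. This is a minimal illustration only: the flat list of (owner name, record type) pairs and the function name find_cname_conflicts are assumptions made for the example, not the actual zone format or tooling used on kalimdor, and the check ignores the DNSSEC record types that are permitted to sit alongside a CNAME.

```python
from collections import defaultdict


def find_cname_conflicts(records):
    """Return owner names that hold a CNAME alongside any other record type.

    `records` is an iterable of (owner_name, record_type) pairs. This flat
    representation is a hypothetical stand-in for real zone data; it is not
    a zone-file parser. Simplification: DNSSEC types such as RRSIG and NSEC,
    which may legitimately coexist with a CNAME, are not special-cased.
    """
    types_by_name = defaultdict(set)
    for name, rtype in records:
        # Normalise case and the trailing dot so "Dwm.Me.Uk." == "dwm.me.uk".
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Hypothetical record sets: the zone root always carries an SOA, so a CNAME
# there is bound to conflict, whereas A and AAAA records are fine.
broken_zone = [("dwm.me.uk.", "SOA"), ("dwm.me.uk.", "NS"), ("dwm.me.uk.", "CNAME")]
fixed_zone = [
    ("dwm.me.uk.", "SOA"),
    ("dwm.me.uk.", "NS"),
    ("dwm.me.uk.", "A"),
    ("dwm.me.uk.", "AAAA"),
]

if __name__ == "__main__":
    print(find_cname_conflicts(broken_zone))
    print(find_cname_conflicts(fixed_zone))
```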

As a result, some disk / directory accesses are blocking indefinitely.

We suspect that this data corruption is occurring somewhere along the disk channel supporting /dev/hdg.

Works to be carried out:

Remove /dev/hdg from all RAID mirrors to prevent further filesystem corruption. (Complete)

Reboot machine into single-user mode. NOTE: No services will be available whilst in single-user mode. (Complete)

Run filesystem verification utilities on all disk filesystems. (Complete)

Restore any damaged files from backups as required. (Complete)

Reboot machine back into normal production operation. (Complete)

Transcript, times are in GMT:

[2250] Incident closed.

[2247] Summary: All of the recovered files were old transient copies of data that had been deleted deliberately, with the possible exception of some of ~anton's image files, which have been copied to his home directory for review.

[2230] Of the remaining files, all owned by ~jeremy, all but one are old versions of existing mailboxes - probably an artifact of normal mailbox re-writing operation. (Checking unique message ids shows that the mail messages still exist in the live mailboxes.) The remaining file just contains the junk chars "|a:0:{}" and doesn't appear in my filesystem index comparison. Almost certainly junk; deleted.
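The message-id check can be sketched roughly as follows. This is an illustration rather than the commands actually run: it assumes the mailboxes are in mbox format (the log does not say), and the function names message_ids and missing_from_live are made up for the example.

```python
import mailbox


def message_ids(mbox_path):
    """Collect the Message-ID headers found in one mbox file.

    Assumes mbox format; a Maildir store would need mailbox.Maildir instead.
    """
    ids = set()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            mid = msg.get("Message-ID")
            if mid:
                ids.add(str(mid).strip())
    finally:
        box.close()
    return ids


def missing_from_live(recovered_path, live_paths):
    """Return Message-IDs present in the recovered file but in no live mailbox."""
    live_ids = set()
    for path in live_paths:
        live_ids |= message_ids(path)
    return message_ids(recovered_path) - live_ids
```

An empty result means every message in the recovered file still exists in a live mailbox, so the recovered copy can be treated as redundant. Messages lacking a Message-ID header are skipped by this sketch and would need checking by hand.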

[2221] Found that most of the disconnected files owned by ~anton are temporary files generated by gallery; deleted. (Christ, ~anton, you've got over a gigabyte of temporary files in there going back years! Clear it out!) His remaining files appear to be old deleted .jpeg photos, but I have moved them to a RECOVERED_FILES directory in his $HOME to allow for inspection and recovery.

[2210] Generating home directory indexes of affected users on live system and offsite backup for comparison.
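A rough sketch of what such an index comparison might look like is given below. The log does not record how the indexes were actually generated, so the choice of (size, SHA-256) per file and the names build_index and compare_indexes are assumptions for illustration only.

```python
import hashlib
import os


def build_index(root):
    """Map each regular file's path (relative to root) to (size, SHA-256)."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            rel = os.path.relpath(full, root)
            index[rel] = (os.path.getsize(full), digest.hexdigest())
    return index


def compare_indexes(live, backup):
    """Report paths unique to either side and paths whose content differs."""
    live_paths, backup_paths = set(live), set(backup)
    return {
        "only_live": sorted(live_paths - backup_paths),
        "only_backup": sorted(backup_paths - live_paths),
        "changed": sorted(
            p for p in live_paths & backup_paths if live[p] != backup[p]
        ),
    }
```

Since files relocated to lost+found lose their original names, matching them against either index would have to be done on the digest rather than the path.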

[2153] Picking through the disconnected files found in /home:

One mailspool index auto-generated by Dovecot; will be automatically regenerated: deleted.

8.5MB junk mailbox owned by ~jeremy; expendable!

Remaining files owned by ~anton and ~jeremy, no other users affected.

[2152] Running xfs_repair check on /dev/mapper/volume-recover in the background.

[2151] Disconnected inode files in /var are all old Apache logfiles dating to July 2007, which is older than normal retention policy. Deleted.

[2146] Machine back in production. Checking contents of lost+found.

[2144] Reboot in progress.

[2138] All filesystem checks complete, bar /dev/mapper/volume-recover which can be done whilst online. Rebooting to normal production mode.

[2137] xfs_repair completed, no errors found.

[2135] Running full xfs_repair on /dev/mapper/volume-root.

[2134] Approximately 50 disconnected inodes detected on volume-home, relocated to lost+found. These may be real files, or they may simply be historical artifacts.

[2132] Minor error (link count) detected on volume-var. Full repair run also detected some disconnected inodes; running full repair on volume-home for good measure.

[2128] Re-checking volume-home and volume-var with xfs_repair -n for good measure.

[2127] Second filesystem check of /dev/mapper/volume-root complete, no errors. We may have been fortunate and only had the kernel BUG trigger as a result of a read error and not an earlier write error as previously feared.

Friday, 16 November 2007

HostEurope.com, who provide DNS hosting services for Tastycake.net, are offline and are not answering queries for Tastycake.net addresses.

This may result in difficulties accessing Tastycake.net services, and delay the delivery of email to Tastycake.net email addresses.

As the fault lies somewhere within HostEurope.com's facilities, and not with the Kalimdor server itself, there is little we can do to directly address this problem, and we are waiting for HostEurope.com to implement a fix.

In the unlikely event that HostEurope.com do not restore service in a timely fashion, we are preparing contingency plans to move the Tastycake.net domain to another hosting provider.