Tastycake Status: February 2008

Thursday, 28 February 2008

[AT RISK 2008/004] March 8th 2008 - Electrical works planned between 0001-0200hrs

Maintenance log for March 8th 2008
Attending: dwm
Status: Completed at 1000hrs, March 8th 2008.
Summary:

The rack containing the Tastycake.net server kalimdor.tastycake.net will be briefly powered down so that the rack can be connected to a newly-installed power-distribution board.
As a result, no services will be accessible whilst the switchover is in progress. The at-risk period will last until 0200hrs, though the colo engineers hope to have normal services resumed by 0030hrs.
Works to be carried out:
- Shut down kalimdor.tastycake.net. (Completed)
- Wait whilst the co-location engineers switch the rack over to the new power-distribution feed. (Completed)
- Boot kalimdor.tastycake.net. (Completed)
- Verify services are running normally. (Completed)

Transcript, times are in GMT:

March 8th 2008

[0915] Services verified as functioning correctly. (There was a minor issue with the current experimental DNS service for the dwm.me.uk domain, caused by invalid zone configuration data; now corrected. It turns out that you're not allowed a CNAME as well as an SOA at the root of a zone, but A and AAAA records are fine.)
[0015] Power restored, automated reboot in progress. All services running as normal.

March 7th 2008

[2355] Automated scheduled shutdown executed.
[2230] Delayed-effect shutdown instruction executed; shutdown will occur at 2355hrs.
[2105] Reminder of pending works (tonight!) sent by email to all users.

February 28th 2008

[1420] Initial update to off-site status page.
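
The zone error noted in the [0915] entry follows from the DNS rule that a name holding a CNAME record may not hold any other record data, and a zone root always holds at least an SOA. A minimal sketch of a check for this, assuming the zone has already been parsed into (owner name, record type) pairs; the function name and the example.test zone are hypothetical, and DNSSEC record types (which are permitted alongside a CNAME) are not handled:

```python
from collections import defaultdict


def find_cname_conflicts(records):
    """Return {name: [other record types]} for names that mix CNAME with other data.

    `records` is an iterable of (owner_name, record_type) pairs.
    """
    types_by_name = defaultdict(set)
    for name, rtype in records:
        # Normalise case and the trailing dot so equivalent names compare equal.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())

    conflicts = {}
    for name, types in types_by_name.items():
        if "CNAME" in types and len(types) > 1:
            conflicts[name] = sorted(types - {"CNAME"})
    return conflicts


# Hypothetical zone contents, for illustration only.
broken_zone = [
    ("example.test.", "SOA"),
    ("example.test.", "NS"),
    ("example.test.", "CNAME"),      # not allowed next to the SOA and NS
    ("www.example.test.", "CNAME"),  # fine: nothing else at this name
]
fixed_zone = [
    ("example.test.", "SOA"),
    ("example.test.", "NS"),
    ("example.test.", "A"),
    ("example.test.", "AAAA"),
    ("www.example.test.", "CNAME"),
]

print(find_cname_conflicts(broken_zone))
print(find_cname_conflicts(fixed_zone))
```

Replacing the root CNAME with A and AAAA records, as in the second list, leaves nothing for the check to report.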

Saturday, 16 February 2008

[INCIDENT 2008/003] February 16th 2008 - Filesystem corruption, suspect defective IDE channel.

Incident log for February 16th 2008
Attending: dwm, mark, ncm
Status: Completed at 2250hrs GMT.
Summary:

Tastycake.net server kalimdor.tastycake.net has suffered a filesystem corruption problem.
As a result, some disk / directory accesses are blocking indefinitely.
We suspect that this data corruption is occurring somewhere along the disk channel supporting /dev/hdg.
Works to be carried out:
- Remove /dev/hdg from all RAID mirrors to prevent further filesystem corruption. (Complete)
- Reboot machine into single-user mode. NOTE: No services will be available whilst in single-user mode. (Complete)
- Run filesystem verification utilities on all disk filesystems. (Complete)
- Restore any damaged files from backups as required. (Complete)
- Reboot machine back into normal production operation. (Complete)

Transcript, times are in GMT:

[2250] Incident closed.
[2247] Summary: All of the recovered files were old transient copies of data that had been deleted deliberately, with the possible exception of some of ~anton's image files, which have been copied to his home directory for review.
[2230] Of the remaining files, all owned by ~jeremy, all but one are old versions of existing mailboxes - probably an artifact of normal mailbox re-writing operation. (Checking unique message ids shows that the mail messages still exist in the live mailboxes.) The remaining file just contains the junk chars "|a:0:{}" and doesn't appear in my filesystem index comparison. Almost certainly junk, deleted.
[2221] Found that most of the disconnected files owned by ~anton are temporary files generated by gallery; deleted. (Christ, ~anton, you've got over a gigabyte of temporary files in there going back years! Clear it out!) His remaining files appear to be old deleted .jpeg photos, but I have moved them to a RECOVERED_FILES directory in his $HOME to allow for inspection and recovery.
[2217] Filesystem checks on /dev/mapper/volume-recover complete, no errors. Re-mounting /vol/recover.
[2210] Generating home directory indexes of affected users on live system and offsite backup for comparison.
[2153] Picking through the disconnected files found in /home:
- One mailspool index auto-generated by Dovecot; will be automatically regenerated: deleted.
- 8.5MB junk mailbox owned by ~jeremy; expendable!
- Remaining files owned by ~anton and ~jeremy, no other users affected.
[2152] Running xfs_repair check on /dev/mapper/volume-recover in the background.
[2151] Disconnected inode files in /var are all old Apache logfiles dating to July 2007, which is older than normal retention policy. Deleted.
[2146] Machine back in production. Checking contents of lost+found.
[2144] Reboot in progress.
[2138] All filesystem checks complete, bar /dev/mapper/volume-recover which can be done whilst online. Rebooting to normal production mode.
[2137] xfs_repair completed, no errors found.
[2135] Running full xfs_repair on /dev/mapper/volume-root.
[2134] Approximately 50 disconnected inodes detected on volume-home, relocated to lost+found. These may be real files, or they may simply be historical artifacts.
[2132] Minor error (link count) detected on volume-var. Full repair run also detected some disconnected inodes; running full repair on volume-home for good measure.
[2128] Re-checking volume-home and volume-var with xfs_repair -n for good measure.
[2127] Second filesystem check of /dev/mapper/volume-root complete, no errors. We may have been fortunate and only had the kernel BUG trigger as a result of a read error and not an earlier write error as previously feared.
[2124] Filesystem check of /dev/mapper/volume-root complete, no errors. Checking result with xfs_repair -n (as opposed to xfs_check).
[2123] Filesystem check of /dev/mapper/volume-root running, at least minor errors expected.
[2121] Filesystem check of /dev/mapper/volume-home complete, no errors.
[2119] Filesystem check of /dev/mapper/volume-home running.
[2118] Filesystem check of /dev/mapper/volume-var complete, no errors.
[2117] Filesystem check of /dev/md6 (/boot) complete, no errors.
[2116] Machine rebooted into single-user mode. All services unavailable from this point.
[2057] Initial tastycake-status bulletin published.
[2025] Walled all logged-in users to advise that emergency maintenance is in progress.
[2033] /dev/hdg dropped from all RAID mirrors to avoid further disk corruption. The next step is to reboot the machine into single-user mode to conduct full filesystem checks and repairs.
[2033] Incident announcement sent to all admins via http://twitter.com/tastycake.
[2026] Determined that the cause of the fault is a faulty data channel to /dev/hdg, resulting in incorrect data being written to disk. Began dropping /dev/hdg from all RAID mirrors to avoid further corruption.
[2016] Filesystem corruption detected in /root/.wajig/kalimdor
[2014] Kernel BUG (internal error alert) spotted by inspection.
[2002] Host monitoring system generates another Critical warning.
[1007] Monitoring system downgrades previous critical warning to minor severity.
[1002] Critical warning generated by host monitoring system, indicating that a significantly higher than normal number of cron processes are running.
[0632] Minor warning generated by host monitoring system, indicating that a higher-than-normal number of cron processes are running concurrently.
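
The index comparison mentioned in the [2210] entry can be sketched roughly as follows. The log does not record how the indexes were actually generated, so this is an assumption throughout: each tree is walked, every regular file is hashed, and the two resulting indexes are compared by relative path. The function names are hypothetical.

```python
import hashlib
import os


def build_index(root):
    """Map each regular file's path (relative to root) to a SHA-256 hex digest."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(full_path) or not os.path.isfile(full_path):
                continue
            digest = hashlib.sha256()
            with open(full_path, "rb") as handle:
                # Read in 1 MiB chunks so large mailboxes are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            index[os.path.relpath(full_path, root)] = digest.hexdigest()
    return index


def compare_indexes(live, backup):
    """Report paths present on one side only, and paths whose contents differ."""
    live_paths, backup_paths = set(live), set(backup)
    return {
        "only_live": sorted(live_paths - backup_paths),
        "only_backup": sorted(backup_paths - live_paths),
        "changed": sorted(
            path for path in live_paths & backup_paths if live[path] != backup[path]
        ),
    }
```

Usage would be along the lines of compare_indexes(build_index(live_home), build_index(backup_home)), where both directory paths are placeholders. A recovered file whose digest appears in neither index is a candidate for junk, subject to manual inspection.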

Monday, 11 February 2008

[AT RISK 2008/002] February 11th 2008 - emergency kernel upgrade

Maintenance log for February 11th 2008
Attending: dwm
Status: Completed at 1405hrs GMT
Summary:

Tastycake.net server kalimdor.tastycake.net being rebooted (at least once) at approximately 1200noon GMT for an emergency kernel upgrade.
New kernel needed to patch local root escalation vulnerabilities (CVE-2008-0009, CVE-2008-0010).
No Tastycake.net services will be available whilst reboots are occurring.
Works to be carried out:
- Build new Linux kernel (2.6.24.2) to replace existing build (2.6.24). (Complete)
- Install new kernel and set as default. (Complete)
- Reboot machine to start using new kernel. (Complete)

Transcript:

[1405] All tests clear, at-risk period concluded.
[1400] Machine rebooted successfully into new kernel. Running final checks..
[1357] Machine rebooted.
[1353] I believe I have corrected the booting problem (missing /dev/md0 entry in /etc/mdadm/mdadm.conf); rebooting again. (Again, with 1-minute grace.)
[1347] Successfully rebooted using original kernel; will be fixing raid-auto start, then rebooting again.
[1334] Backup kernel not functioning; appears to not be auto-starting /dev/md0; will need to configure manually. This may take a few minutes..
[1331] Failed to boot using new kernel, power-cycled via power-switch interface.
[1328] Machine reboot.
[1326] Reboot triggered with 1-minute grace delay.
[1320] New kernel installed, ready to reboot. Warning sent via wall to all logged-in users.
[1313] Updated kernel package built, installed in Tastycake package repository.
[1107] Initial update of maintenance log.
[1010] Determined that the 2.6.24.1 kernel that had been built overnight has been superseded by 2.6.24.2; building new kernel image.
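
The boot failure in the [1353] entry came down to a missing ARRAY line for /dev/md0 in /etc/mdadm/mdadm.conf. A minimal sketch of a pre-reboot check for that kind of omission, assuming the file uses the usual "ARRAY <device> ..." line form; the sample configuration, its placeholder UUID, and the list of expected devices are hypothetical, and abbreviated keywords are not handled:

```python
def declared_arrays(conf_text):
    """Return the device names found on ARRAY lines of mdadm.conf-style text."""
    devices = []
    for raw_line in conf_text.splitlines():
        # Drop comments, then split the remainder into whitespace-separated words.
        fields = raw_line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0].upper() == "ARRAY":
            devices.append(fields[1])
    return devices


def missing_arrays(conf_text, expected):
    """Return the expected devices that have no ARRAY line, in the order given."""
    declared = set(declared_arrays(conf_text))
    return [device for device in expected if device not in declared]


# Hypothetical configuration, for illustration only: /dev/md0 is absent.
sample_conf = """\
# mdadm.conf
DEVICE partitions
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=00000000:00000000:00000000:00000000
# ARRAY /dev/md0 level=raid1 num-devices=2 (commented out, so not declared)
"""

print(missing_arrays(sample_conf, ["/dev/md0", "/dev/md1"]))
```

Running a check like this against the real file before triggering the reboot would flag the omission while the machine is still up.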

Saturday, 9 February 2008

[AT RISK 2008/001] February 9th 2008 - scheduled maintenance

Maintenance log for February 9th 2008
Attending: dwm
Status: Completed at 1511hrs
Summary:

Tastycake.net server kalimdor.tastycake.net being taken offline at 1200noon GMT for maintenance.
No Tastycake.net services will be available whilst works are in progress.
Works to be carried out:
- Install third 250GB hard-drive into RAID mirror. (Complete)
- Install GRUB bootloader on third drive. (Complete)
- Replace old 127GB hard-drive with new 250GB replacement. (Complete)
- Install GRUB bootloader on replacement drive. (Complete)
- Create new RAID mirror set on as-yet unallocated space. (Complete)
- Expand LVM working set using new RAID mirror set. (Complete)
- Upgrade local kernel to 2.6.24. (Complete)
- Discontinue local NFS server, use read-only bind mount for /vol/recover instead.
  (New feature in 2.6.24.) (Cancelled)