Thursday 28 February 2008

[AT RISK 2008/004] March 8th 2008 - Electrical works planned between 0001-0200hrs

Maintenance log for March 8th 2008
Attending: dwm
Status: Completed at 10:00hrs, March 8th 2008.
Summary:

  • The rack containing the Tastycake.net server kalimdor.tastycake.net will be briefly powered down so that the rack can be connected to a newly-installed power-distribution board.
  • As a result, no services will be accessible whilst the switchover is in progress. The at-risk period will last until 0200hrs, though the colo engineers hope to have normal services resumed by 0030hrs.
  • Works to be carried out:
    • Shut down kalimdor.tastycake.net. (Completed)
    • Wait whilst the co-location engineers switch the rack over to the new power-distribution feed. (Completed)
    • Boot kalimdor.tastycake.net. (Completed)
    • Verify services are running normally. (Completed)
Transcript, times are in GMT:

March 8th 2008
  • [0915] Services verified as functioning correctly. (There was a minor issue with the current experimental DNS service for the dwm.me.uk domain as a result of invalid zone configuration data, corrected. It turns out that you're not allowed a CNAME as well as an SOA for the root of a zone, but A and AAAA records are fine..)
  • [0015] Power restored, automated reboot in progress. All services runningas normal.
March 7th 2008
  • [2355] Automated scheduled shutdown executed.
  • [2230] Delayed-effect shutdown instruction executed; shutdown will occur at 2355hrs.
  • [2105] Reminder of pending works (tonight!) sent by email to all users.
February 28th 2008
  • [1420] Initial update to off-site status page.

Saturday 16 February 2008

[INCIDENT 2008/003] February 16th 2008 - Filesystem corruption, suspect defective IDE channel.

Incident log for February 16th 2008
Attending: dwm, mark, ncm
Status: Completed at 2250hrs GMT.
Summary:

  • Tastycake.net server kalimdor.tastycake.net has suffered a filesystem corruption problem.
  • As a result, some disk / directory accesses are blocking indefinitely.
  • We suspect that this data corruption is occuring somewhere along the disk channel supporting /dev/hdg.
  • Works to be carried out:
    • Remove /dev/hdg from all RAID mirrors to prevent further filesystem corruption. (Complete)
    • Reboot machine into single-user mode. NOTE: No services will be available whilst in single-user mode. (Complete)
    • Run filesystem verification utilities on all disk filesystems. (Complete)
    • Restore any damaged files from backups as required. (Complete)
    • Reboot machine back into normal production operation. (Complete)
Transcript, times are in GMT:
  • [2250] Incident closed.
  • [2247] Summary: All of the recovered files were old transient copies of data that had been deleted deliberately, with the possible exception of some of ~anton's image files, which have been copied to his home directory for review.
  • [2230] Of the remaining files all owned by ~jeremy, all but one are old versions of existing mailboxes - probably an artifact of normal mailbox re-writing operation. (Checking unique message ids shows that the mail messages still exist in the live mailboxes.) The remaining file just contains the junk chars "|a:0:{}" and doesn't appear in my filesystem index comparison. Almost certainly junk, deleted.
  • [2221] Found that most the disconnected files owned by ~anton are temporary files generated by gallery; deleted. (Christ, ~anton, you've got over a gigabyte of temporary files in there going back years! Clear it out!) His remaining files appear to be old deleted .jpeg photos, but moved them to a RECOVERED_FILES directory in his $HOME to allow for inspection and recovery.
  • [2217] Filesystem checks on /dev/mapper/volume-recover complete, no errors. Re-mounting /vol/recover.
  • [2210] Generating home directory indexes of affected users on live system and offsite backup for comparison.
  • [2153] Picking through the disconnected files found in /home:
    • One mailspool index auto-generated by Dovecot; will be automatically regenerated: deleted.
    • 8.5MB junk mailbox owned by ~jeremy; expendable!
    • Remaining files owned by ~anton and ~jeremy, no other users affected.
  • [2152] Running xfs_repair check on /dev/mapper/volume-recover in the background.
  • [2151] Disconnected inode files in /var are all old Apache logfiles dating to July 2007, which is older than normal retention policy. Deleted.
  • [2146] Machine back in production. Checking contents of lost+found.
  • [2144] Reboot in progress.
  • [2138] All filesystem checks complete, bar /dev/mapper/volume-recover which can be done whilst online. Rebooting to normal production mode.
  • [2137] xfs_repair completed, no errors found.
  • [2135] Running full xfs_repair on /dev/mapper/volume-root.
  • [2134] Appoximately 50 disconnected inodes detected on volume-home, relocted to lost+found. These may be real files, or they may simply be historical artifacts.
  • [2132] Minor error (link count) detected on volume-var. Full repair run also detected some disconnected inodes; running full repair on volume-home for good measure.
  • [2128] Re-checking volume-home and volume-var with xfs_repair -n for good measure.
  • [2127] Second filesystem check of /dev/mapper/volume-root complete, no errors. We may have been fortunate and only had the kernel BUG trigger as a result of a read error and not an earlier write error as previously feared.
  • [2124] Filesystem check of /dev/mapper/volume-root complete, no errors. Checking result with xfs_repair -n (as opposed to xfs_check).
  • [2123] Filesystem check of /dev/mapper/volume-root running, at least minor errors expected.
  • [2121] Filesystem check of /dev/mapper/volume-home complete, no errors.
  • [2119] Filesystem check of /dev/mapper/volume-home running.
  • [2118] Filesystem check of /dev/mapper/volume-var complete, no errors.
  • [2117] Filesystem check of /dev/md6 (/boot) complete, no errors.
  • [2116] Machine rebooted into single-user mode. All services unavailable from this point.
  • [2057] Initial tastycake-status bulletin published.
  • [2025] walled all logged-in users to advise that emergency maintenance in progress.
  • [2033] /dev/hdg dropped from all RAID mirrors to avoid further disk corruption. The next step is to reboot the machine into single-user mode to conduct full filesystem checks and repairs.
  • [2033] Incident announcement sent to all admins via http://twitter.com/tastycake.
  • [2026] Determined that cause of fault is a faulty data channel to /dev/hdg resulting in incorrect data being written to disk. Begun dropping /dev/hdg from all RAID mirrors to avoid further corruption.
  • [2016] Filesystem corruption detected in /root/.wajig/kalimdor
  • [2014] Kernel BUG (internal error alert) spotted by inspection.
  • [2002] Host monitoring system generates another Critical warning.
  • [1007] Monitoring system downgrades previous critical warning to minor severity.
  • [1002] Critical warning generated by host monitoring system, indicating that a significantly higher than normal number of cron processes are running.
  • [0632] Minor warning generated by host monitoring system, indicating that a higher-than-normal number of cron processes are running concurrently.

Monday 11 February 2008

[AT RISK 2008/002] February 11th 2008 - emergency kernel upgrade

Maintenance log for February 8th 2008
Attending: dwm
Status: Completed at 1405hrs GMT
Summary:
  • Tastycake.net server kalimdor.tastycake.net being rebooted (at least once) at approximately 1200noon GMT for an emergency kernel upgrade.
  • New kernel needed to patch local root escalation vulnerabilities (CVE-2008-0009, CVE-2008-0010).
  • No Tastycake.net services will be available whilst reboots are occurring.
  • Works to be carried out:
    • Build new linux kernel (2.6.24.2) to replace existing build (2.6.24). (Complete)
    • Install new kernel and set as default. (Complete)
    • Reboot machine to start using new kernel. (Complete)
Transcript:
  • [1405] All tests clear, at-risk period concluded.
  • [1400] Machine rebooted successfully into new kernel. Running final checks..
  • [1357] Machine rebooted.
  • [1353] Believed that I have corrected the booting problem (missing /dev/md0 entry in /etc/mdadm/mdadm.conf) and rebooting again. (Again, with 1-minute grace.)
  • [1347] Successfully rebooted using original kernel; will be fixing raid-auto start, then rebooting again.
  • [1334] Backup kernel not functioning; appears to not be auto-starting /dev/md0; will need to configure manually. This may take a few minutes..
  • [1331] Failed to boot using new kernel, power-cycled via power-switch interface.
  • [1328] Machine reboot.
  • [1326] Reboot triggered with 1-minute grace delay.
  • [1320] New kernel installed, ready to reboot. Warning sent via wall to all logged-in users.
  • [1313] Updated kernel package built, installed in Tastycake package repository.
  • [1107] Initial update of maintenance log.
  • [1010] Determined that 2.6.24.1 kernel that had been built overnight has been superceded by 2.6.24.2, building new kernel image.

Saturday 9 February 2008

[AT RISK 2008/001] February 9th 2008 - scheduled maintenance

Maintenance log for February 8th 2008
Attending: dwm
Status: completed at 15:11hrs
Summary:
  • Tastycake.net server kalimdor.tastycake.net being taken offline at 1200noon GMT for maintenance.
  • No Tastycake.net services will be available whilst works are in progress.
  • Works to be carried out:
    • Install third 250GB hard-drive into RAID mirror. (Complete)
    • Install GRUB bootloader on third drive. (Complete)
    • Replace old 127GB hard-drive with new 250GB replacement. (Complete)
    • Install GRUB bootloader on replacement drive. (Complete)
    • Create new RAID mirror set on as-yet unallocated space. (Complete)
    • Expand LVM working set using new RAID mirror set. (Complete)
    • Upgrade local kernel to 2.6.24. (Complete)
    • Discontinue local NFS server, use read-only bind mount for /vol/recover instead.
      (New feature in 2.6.24.) (Cancelled)
Transcript, times are in GMT:
  • [1511] Final checks complete, at-risk period ends.
  • [1506] Performing final checks prior to announcing end of at-risk period.
  • [1505] Read-only bind mounts don't seem to be functioning, we perhaps need an updated mount-utils. This can be done safely at a later date.
  • [1456] Initialized /dev/md0 as new LVM PV and added PV to existing volume VG; total capacity: 232GB.
  • [1454] Added new second disk to main RAID mirror set /dev/md7.
  • [1451] Created new RAID mirror across previously-unused disk space.
    (NOTE: /dev/md0 is not the RAID mirror containing /boot, /dev/md6 is.)
  • [1440] Kernel installed, rebooting to verify correct operation and to reload DOS partition tables.
  • [1437] Installing updated kernel packages. (And SNMP security updates, whilst we're here.)
  • [1434] Partitioned unallocated space on all three disks. Leaving creation of new RAID mirror array until last, as it would only be interrupted by reboots anyway..
  • [1431] Partitioned drive 2 to match other disks. Added drive 2 partition to /boot RAID mirror volume. Installed bootloader on new drive.
  • [1423] Replacement drive 2 installed, booting.
  • [1415] Re-sync completed. Rebooting to replace drive 2.
  • [1402] Re-sync 90% complete. (Unfortunately, it seems to be slowing down to about 15MB/sec again.)
  • [1357] Installed spare GRUB bootloader on new disk.
  • [1350] Re-sync 80% complete.
  • [1334] Re-sync 66.6% complete.
  • [1318] Re-sync 50% complete. (Now peaking at ~22MB/sec; ETA at present rate: 44mins.)
  • [1301] Re-sync 33.3% complete. (Now peaking at ~19MB/sec; ETA at present rate: 67mins.)
  • [1250] Re-sync 25% complete. (Looks like its speeding up as it proceeds, probably due to disk geometry.)
  • [1230] Re-sync 10% complete.
  • [1219] RAID re-sync in progress; need to wait for it to complete before replacing disk 2. ETA @ ~15MB/sec: 115mins.
  • [1211] Adding new disk 3 partitions to RAID mirror sets.
  • [1209] Disk installed, server rebooted. Partitioned disk 3 to match existing layout.
  • [1201] Serial terminal up; sent reboot instruction with 2-minute grace.
  • [1159] Sent final warning via wall to save all state; disk installed in caddy and ready for reboot.
  • [1150] Readying disk three for hot-insertion. (Though, because we're running on IDE, this will require a reboot..)
  • [1147] Had to abort the transfer drive update; don't have access to the rear of the rack, and the front-side USB port is far too slow. Will just have to do today's work carefully..
  • [1105] Taking full filesystem image backup to spare transfer drive.
  • [1045] Initial update of offsite maintenance log.
  • [1035] Arrived at Telehouse Docklands.