Tuesday 14 July 2009

[INCIDENT 2009/001] July 14th 2009 - Unexpected server failure

Incident log for July 14th 2009
Attending: dwm, ncm
Status: Completed at 11:32hrs, July 14th 2009.
Summary:
  • The server kalimdor.tastycake.net stopped functioning correctly at or shortly after 0900.02hrs for reasons unknown. This was detected at 1045hrs, and normal service was restored at 1132hrs.

Transcript, times are in GMT+1:
  • [1132] Normal services restored.
  • [1130] All filesystems pass checks. Bring server up into normal multi-user mode.
  • [1113] Server booted into single-user mode using secondary kernel image. Checking all local filesystems for errors.
  • [1057] Reboot into single-user mode failed; initial ramdisk for primary kernel image found to be corrupted or truncated. Rebooting into backup kernel image.
  • [1045] Service failure discovered. Emergency reboot triggered after serial console found unresponsive.
  • [0902] Clients running on kalimdor.tastycake.net time-out from remote services.
  • [0900] kalimdor.tastycake.net stops logging to local system log.

Thursday 28 February 2008

[AT RISK 2008/004] March 8th 2008 - Electrical works planned between 0001-0200hrs

Maintenance log for March 8th 2008
Attending: dwm
Status: Completed at 10:00hrs, March 8th 2008.
Summary:

  • The rack containing the Tastycake.net server kalimdor.tastycake.net will be briefly powered down so that the rack can be connected to a newly-installed power-distribution board.
  • As a result, no services will be accessible whilst the switchover is in progress. The at-risk period will last until 0200hrs, though the colo engineers hope to have normal services resumed by 0030hrs.
  • Works to be carried out:
    • Shut down kalimdor.tastycake.net. (Completed)
    • Wait whilst the co-location engineers switch the rack over to the new power-distribution feed. (Completed)
    • Boot kalimdor.tastycake.net. (Completed)
    • Verify services are running normally. (Completed)
Transcript, times are in GMT:

March 8th 2008
  • [0915] Services verified as functioning correctly. (There was a minor issue with the current experimental DNS service for the dwm.me.uk domain as a result of invalid zone configuration data, corrected. It turns out that you're not allowed a CNAME as well as an SOA for the root of a zone, but A and AAAA records are fine..)
  • [0015] Power restored, automated reboot in progress. All services runningas normal.
March 7th 2008
  • [2355] Automated scheduled shutdown executed.
  • [2230] Delayed-effect shutdown instruction executed; shutdown will occur at 2355hrs.
  • [2105] Reminder of pending works (tonight!) sent by email to all users.
February 28th 2008
  • [1420] Initial update to off-site status page.

Saturday 16 February 2008

[INCIDENT 2008/003] February 16th 2008 - Filesystem corruption, suspect defective IDE channel.

Incident log for February 16th 2008
Attending: dwm, mark, ncm
Status: Completed at 2250hrs GMT.
Summary:

  • Tastycake.net server kalimdor.tastycake.net has suffered a filesystem corruption problem.
  • As a result, some disk / directory accesses are blocking indefinitely.
  • We suspect that this data corruption is occuring somewhere along the disk channel supporting /dev/hdg.
  • Works to be carried out:
    • Remove /dev/hdg from all RAID mirrors to prevent further filesystem corruption. (Complete)
    • Reboot machine into single-user mode. NOTE: No services will be available whilst in single-user mode. (Complete)
    • Run filesystem verification utilities on all disk filesystems. (Complete)
    • Restore any damaged files from backups as required. (Complete)
    • Reboot machine back into normal production operation. (Complete)
Transcript, times are in GMT:
  • [2250] Incident closed.
  • [2247] Summary: All of the recovered files were old transient copies of data that had been deleted deliberately, with the possible exception of some of ~anton's image files, which have been copied to his home directory for review.
  • [2230] Of the remaining files all owned by ~jeremy, all but one are old versions of existing mailboxes - probably an artifact of normal mailbox re-writing operation. (Checking unique message ids shows that the mail messages still exist in the live mailboxes.) The remaining file just contains the junk chars "|a:0:{}" and doesn't appear in my filesystem index comparison. Almost certainly junk, deleted.
  • [2221] Found that most the disconnected files owned by ~anton are temporary files generated by gallery; deleted. (Christ, ~anton, you've got over a gigabyte of temporary files in there going back years! Clear it out!) His remaining files appear to be old deleted .jpeg photos, but moved them to a RECOVERED_FILES directory in his $HOME to allow for inspection and recovery.
  • [2217] Filesystem checks on /dev/mapper/volume-recover complete, no errors. Re-mounting /vol/recover.
  • [2210] Generating home directory indexes of affected users on live system and offsite backup for comparison.
  • [2153] Picking through the disconnected files found in /home:
    • One mailspool index auto-generated by Dovecot; will be automatically regenerated: deleted.
    • 8.5MB junk mailbox owned by ~jeremy; expendable!
    • Remaining files owned by ~anton and ~jeremy, no other users affected.
  • [2152] Running xfs_repair check on /dev/mapper/volume-recover in the background.
  • [2151] Disconnected inode files in /var are all old Apache logfiles dating to July 2007, which is older than normal retention policy. Deleted.
  • [2146] Machine back in production. Checking contents of lost+found.
  • [2144] Reboot in progress.
  • [2138] All filesystem checks complete, bar /dev/mapper/volume-recover which can be done whilst online. Rebooting to normal production mode.
  • [2137] xfs_repair completed, no errors found.
  • [2135] Running full xfs_repair on /dev/mapper/volume-root.
  • [2134] Appoximately 50 disconnected inodes detected on volume-home, relocted to lost+found. These may be real files, or they may simply be historical artifacts.
  • [2132] Minor error (link count) detected on volume-var. Full repair run also detected some disconnected inodes; running full repair on volume-home for good measure.
  • [2128] Re-checking volume-home and volume-var with xfs_repair -n for good measure.
  • [2127] Second filesystem check of /dev/mapper/volume-root complete, no errors. We may have been fortunate and only had the kernel BUG trigger as a result of a read error and not an earlier write error as previously feared.
  • [2124] Filesystem check of /dev/mapper/volume-root complete, no errors. Checking result with xfs_repair -n (as opposed to xfs_check).
  • [2123] Filesystem check of /dev/mapper/volume-root running, at least minor errors expected.
  • [2121] Filesystem check of /dev/mapper/volume-home complete, no errors.
  • [2119] Filesystem check of /dev/mapper/volume-home running.
  • [2118] Filesystem check of /dev/mapper/volume-var complete, no errors.
  • [2117] Filesystem check of /dev/md6 (/boot) complete, no errors.
  • [2116] Machine rebooted into single-user mode. All services unavailable from this point.
  • [2057] Initial tastycake-status bulletin published.
  • [2025] walled all logged-in users to advise that emergency maintenance in progress.
  • [2033] /dev/hdg dropped from all RAID mirrors to avoid further disk corruption. The next step is to reboot the machine into single-user mode to conduct full filesystem checks and repairs.
  • [2033] Incident announcement sent to all admins via http://twitter.com/tastycake.
  • [2026] Determined that cause of fault is a faulty data channel to /dev/hdg resulting in incorrect data being written to disk. Begun dropping /dev/hdg from all RAID mirrors to avoid further corruption.
  • [2016] Filesystem corruption detected in /root/.wajig/kalimdor
  • [2014] Kernel BUG (internal error alert) spotted by inspection.
  • [2002] Host monitoring system generates another Critical warning.
  • [1007] Monitoring system downgrades previous critical warning to minor severity.
  • [1002] Critical warning generated by host monitoring system, indicating that a significantly higher than normal number of cron processes are running.
  • [0632] Minor warning generated by host monitoring system, indicating that a higher-than-normal number of cron processes are running concurrently.

Monday 11 February 2008

[AT RISK 2008/002] February 11th 2008 - emergency kernel upgrade

Maintenance log for February 8th 2008
Attending: dwm
Status: Completed at 1405hrs GMT
Summary:
  • Tastycake.net server kalimdor.tastycake.net being rebooted (at least once) at approximately 1200noon GMT for an emergency kernel upgrade.
  • New kernel needed to patch local root escalation vulnerabilities (CVE-2008-0009, CVE-2008-0010).
  • No Tastycake.net services will be available whilst reboots are occurring.
  • Works to be carried out:
    • Build new linux kernel (2.6.24.2) to replace existing build (2.6.24). (Complete)
    • Install new kernel and set as default. (Complete)
    • Reboot machine to start using new kernel. (Complete)
Transcript:
  • [1405] All tests clear, at-risk period concluded.
  • [1400] Machine rebooted successfully into new kernel. Running final checks..
  • [1357] Machine rebooted.
  • [1353] Believed that I have corrected the booting problem (missing /dev/md0 entry in /etc/mdadm/mdadm.conf) and rebooting again. (Again, with 1-minute grace.)
  • [1347] Successfully rebooted using original kernel; will be fixing raid-auto start, then rebooting again.
  • [1334] Backup kernel not functioning; appears to not be auto-starting /dev/md0; will need to configure manually. This may take a few minutes..
  • [1331] Failed to boot using new kernel, power-cycled via power-switch interface.
  • [1328] Machine reboot.
  • [1326] Reboot triggered with 1-minute grace delay.
  • [1320] New kernel installed, ready to reboot. Warning sent via wall to all logged-in users.
  • [1313] Updated kernel package built, installed in Tastycake package repository.
  • [1107] Initial update of maintenance log.
  • [1010] Determined that 2.6.24.1 kernel that had been built overnight has been superceded by 2.6.24.2, building new kernel image.

Saturday 9 February 2008

[AT RISK 2008/001] February 9th 2008 - scheduled maintenance

Maintenance log for February 8th 2008
Attending: dwm
Status: completed at 15:11hrs
Summary:
  • Tastycake.net server kalimdor.tastycake.net being taken offline at 1200noon GMT for maintenance.
  • No Tastycake.net services will be available whilst works are in progress.
  • Works to be carried out:
    • Install third 250GB hard-drive into RAID mirror. (Complete)
    • Install GRUB bootloader on third drive. (Complete)
    • Replace old 127GB hard-drive with new 250GB replacement. (Complete)
    • Install GRUB bootloader on replacement drive. (Complete)
    • Create new RAID mirror set on as-yet unallocated space. (Complete)
    • Expand LVM working set using new RAID mirror set. (Complete)
    • Upgrade local kernel to 2.6.24. (Complete)
    • Discontinue local NFS server, use read-only bind mount for /vol/recover instead.
      (New feature in 2.6.24.) (Cancelled)
Transcript, times are in GMT:
  • [1511] Final checks complete, at-risk period ends.
  • [1506] Performing final checks prior to announcing end of at-risk period.
  • [1505] Read-only bind mounts don't seem to be functioning, we perhaps need an updated mount-utils. This can be done safely at a later date.
  • [1456] Initialized /dev/md0 as new LVM PV and added PV to existing volume VG; total capacity: 232GB.
  • [1454] Added new second disk to main RAID mirror set /dev/md7.
  • [1451] Created new RAID mirror across previously-unused disk space.
    (NOTE: /dev/md0 is not the RAID mirror containing /boot, /dev/md6 is.)
  • [1440] Kernel installed, rebooting to verify correct operation and to reload DOS partition tables.
  • [1437] Installing updated kernel packages. (And SNMP security updates, whilst we're here.)
  • [1434] Partitioned unallocated space on all three disks. Leaving creation of new RAID mirror array until last, as it would only be interrupted by reboots anyway..
  • [1431] Partitioned drive 2 to match other disks. Added drive 2 partition to /boot RAID mirror volume. Installed bootloader on new drive.
  • [1423] Replacement drive 2 installed, booting.
  • [1415] Re-sync completed. Rebooting to replace drive 2.
  • [1402] Re-sync 90% complete. (Unfortunately, it seems to be slowing down to about 15MB/sec again.)
  • [1357] Installed spare GRUB bootloader on new disk.
  • [1350] Re-sync 80% complete.
  • [1334] Re-sync 66.6% complete.
  • [1318] Re-sync 50% complete. (Now peaking at ~22MB/sec; ETA at present rate: 44mins.)
  • [1301] Re-sync 33.3% complete. (Now peaking at ~19MB/sec; ETA at present rate: 67mins.)
  • [1250] Re-sync 25% complete. (Looks like its speeding up as it proceeds, probably due to disk geometry.)
  • [1230] Re-sync 10% complete.
  • [1219] RAID re-sync in progress; need to wait for it to complete before replacing disk 2. ETA @ ~15MB/sec: 115mins.
  • [1211] Adding new disk 3 partitions to RAID mirror sets.
  • [1209] Disk installed, server rebooted. Partitioned disk 3 to match existing layout.
  • [1201] Serial terminal up; sent reboot instruction with 2-minute grace.
  • [1159] Sent final warning via wall to save all state; disk installed in caddy and ready for reboot.
  • [1150] Readying disk three for hot-insertion. (Though, because we're running on IDE, this will require a reboot..)
  • [1147] Had to abort the transfer drive update; don't have access to the rear of the rack, and the front-side USB port is far too slow. Will just have to do today's work carefully..
  • [1105] Taking full filesystem image backup to spare transfer drive.
  • [1045] Initial update of offsite maintenance log.
  • [1035] Arrived at Telehouse Docklands.

Friday 16 November 2007

[INCIDENT 2007/005] November 16th 2007 - Upstream DNS servers offline

Incident log for November 16th 2007
Attending: dwm, mark
Status: Completed
Summary:
  • HostEurope.com, who provide DNS hosting services for Tastycake.net, are offline and are not answering queries for Tastycake.net addresses.
  • This may result in difficulties accessing Tastycake.net services, and delay the delivery of email to Tastycake.net email addresses.
  • As the fault lies somewhere within HostEurope.com's facillities, and not with the Kalimdor server itself, there is little we can do to directly address this problem, and are waiting for HostEurope.com to implement a fix.
  • In the unlikely event that HostEurope.com do not restore service in a timely fashion, we are preparing contingency plans to move the Tastycake.net domain to another hosting provider.
Transcript, times are in GMT:
Sunday Nov 18th
  • [1200] Upstream DNS problems resolved.
Friday Nov 16th
  • [2045] Incident report posted to http://tastycake-status.blogspot.net/.
  • [2021] HostEurope Support facility identified as offline.
  • [2015] Both authoritive DNS servers (provided by HostEurope.com) identified as failed.
  • [2010] DNS resolution problems first reported for tastycake.net domain.

Wednesday 19 September 2007

[INCIDENT 2007/004] September 19th 2007 - /var full

Incident log for September 19th 2007
Attending: dwm
Status: Completed at 1230hrs GMT+1
Transcript, times are in GMT+1:

  • [1230] Incident closed, all services appear to be running normally.
  • [1158] Upgraded clamav to latest-stable version to address database retrieval problem. Continuing to monitor status of system.
  • [1130] Increased size of /var filesystem online. Restarted failed services: sysklogd, exim, clamav.
  • [0726] /var filesystem on Kalimdor goes full resulting in service failures. Flash messages dispatched from service monitoring agent indicating that SMTP is unavailable.