Saturday 16 February 2008

[INCIDENT 2008/003] February 16th 2008 - Filesystem corruption, suspect defective IDE channel.

Incident log for February 16th 2008
Attending: dwm, mark, ncm
Status: Completed at 2250hrs GMT.
Summary:

  • Tastycake.net server kalimdor.tastycake.net has suffered a filesystem corruption problem.
  • As a result, some disk / directory accesses are blocking indefinitely.
  • We suspect that this data corruption is occurring somewhere along the disk channel supporting /dev/hdg.
  • Works to be carried out:
    • Remove /dev/hdg from all RAID mirrors to prevent further filesystem corruption. (Complete)
    • Reboot machine into single-user mode. NOTE: No services will be available whilst in single-user mode. (Complete)
    • Run filesystem verification utilities on all disk filesystems. (Complete)
    • Restore any damaged files from backups as required. (Complete)
    • Reboot machine back into normal production operation. (Complete)
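
The first work item, dropping /dev/hdg from every mirror, starts with finding which arrays hold one of its partitions. The following is a minimal sketch only, assuming Linux software RAID (md) with the usual /proc/mdstat layout. The sample text, the partner disks, the array names other than md6, and the helper names arrays_using and removal_commands are all hypothetical and not taken from kalimdor. The mdadm command lines are built as strings and printed, never executed.

```python
import re

# Hypothetical /proc/mdstat excerpt; the real arrays and members will differ.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md6 : active raid1 hdg1[1] hde1[0]
      104320 blocks [2/2] [UU]

md7 : active raid1 hdg2[1] hde2[0]
      312464128 blocks [2/2] [UU]

md8 : active raid1 hdc1[1] hda1[0]
      104320 blocks [2/2] [UU]

unused devices: <none>
"""


def arrays_using(mdstat_text, disk):
    """Return (array, member) pairs for each md array holding a partition of disk."""
    pairs = []
    for line in mdstat_text.splitlines():
        match = re.match(r"^(md\d+) : (.*)$", line)
        if not match:
            continue
        array, rest = match.groups()
        for token in rest.split():
            # Member tokens look like "hdg1[1]" or "hdg1[1](F)".
            member = re.match(r"^([a-z]+\d*)\[\d+\]", token)
            # Simple prefix test; adequate for short names such as "hdg".
            if member and member.group(1).startswith(disk):
                pairs.append((array, member.group(1)))
    return pairs


def removal_commands(pairs):
    """Build (but do not run) one mdadm fail-and-remove command per member."""
    return [
        f"mdadm /dev/{array} --fail /dev/{member} --remove /dev/{member}"
        for array, member in pairs
    ]


if __name__ == "__main__":
    for command in removal_commands(arrays_using(SAMPLE_MDSTAT, "hdg")):
        print(command)
```

Printing the commands rather than running them leaves a human in the loop, which seems prudent when the disk channel itself is under suspicion.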
Transcript, times are in GMT:
  • [2250] Incident closed.
  • [2247] Summary: All of the recovered files were old transient copies of data that had been deleted deliberately, with the possible exception of some of ~anton's image files, which have been copied to his home directory for review.
  • [2230] Of the remaining files, all owned by ~jeremy, all but one are old versions of existing mailboxes, probably an artifact of normal mailbox re-writing. (Checking unique message IDs shows that the mail messages still exist in the live mailboxes.) The remaining file contains only the junk characters "|a:0:{}" and doesn't appear in my filesystem index comparison. Almost certainly junk; deleted.
  • [2221] Found that most of the disconnected files owned by ~anton are temporary files generated by gallery; deleted. (Christ, ~anton, you've got over a gigabyte of temporary files in there going back years! Clear it out!) His remaining files appear to be old deleted .jpeg photos, but I have moved them to a RECOVERED_FILES directory in his $HOME to allow for inspection and recovery.
  • [2217] Filesystem checks on /dev/mapper/volume-recover complete, no errors. Re-mounting /vol/recover.
  • [2210] Generating home directory indexes of affected users on live system and offsite backup for comparison.
  • [2153] Picking through the disconnected files found in /home:
    • One mailspool index auto-generated by Dovecot; will be automatically regenerated: deleted.
    • 8.5MB junk mailbox owned by ~jeremy; expendable!
    • Remaining files owned by ~anton and ~jeremy, no other users affected.
  • [2152] Running xfs_repair check on /dev/mapper/volume-recover in the background.
  • [2151] Disconnected inode files in /var are all old Apache logfiles dating to July 2007, which is older than normal retention policy. Deleted.
  • [2146] Machine back in production. Checking contents of lost+found.
  • [2144] Reboot in progress.
  • [2138] All filesystem checks complete, bar /dev/mapper/volume-recover which can be done whilst online. Rebooting to normal production mode.
  • [2137] xfs_repair completed, no errors found.
  • [2135] Running full xfs_repair on /dev/mapper/volume-root.
  • [2134] Approximately 50 disconnected inodes detected on volume-home, relocated to lost+found. These may be real files, or they may simply be historical artifacts.
  • [2132] Minor error (link count) detected on volume-var. Full repair run also detected some disconnected inodes; running full repair on volume-home for good measure.
  • [2128] Re-checking volume-home and volume-var with xfs_repair -n for good measure.
  • [2127] Second filesystem check of /dev/mapper/volume-root complete, no errors. We may have been fortunate and only had the kernel BUG trigger as a result of a read error and not an earlier write error as previously feared.
  • [2124] Filesystem check of /dev/mapper/volume-root complete, no errors. Checking result with xfs_repair -n (as opposed to xfs_check).
  • [2123] Filesystem check of /dev/mapper/volume-root running, at least minor errors expected.
  • [2121] Filesystem check of /dev/mapper/volume-home complete, no errors.
  • [2119] Filesystem check of /dev/mapper/volume-home running.
  • [2118] Filesystem check of /dev/mapper/volume-var complete, no errors.
  • [2117] Filesystem check of /dev/md6 (/boot) complete, no errors.
  • [2116] Machine rebooted into single-user mode. All services unavailable from this point.
  • [2057] Initial tastycake-status bulletin published.
  • [2025] walled all logged-in users to advise that emergency maintenance is in progress.
  • [2033] /dev/hdg dropped from all RAID mirrors to avoid further disk corruption. The next step is to reboot the machine into single-user mode to conduct full filesystem checks and repairs.
  • [2033] Incident announcement sent to all admins via http://twitter.com/tastycake.
  • [2026] Determined that the cause of the fault is a faulty data channel to /dev/hdg, resulting in incorrect data being written to disk. Began dropping /dev/hdg from all RAID mirrors to avoid further corruption.
  • [2016] Filesystem corruption detected in /root/.wajig/kalimdor
  • [2014] Kernel BUG (internal error alert) spotted by inspection.
  • [2002] Host monitoring system generates another Critical warning.
  • [1007] Monitoring system downgrades previous critical warning to minor severity.
  • [1002] Critical warning generated by host monitoring system, indicating that a significantly higher than normal number of cron processes are running.
  • [0632] Minor warning generated by host monitoring system, indicating that a higher-than-normal number of cron processes are running concurrently.
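
The home directory index comparison mentioned at [2210] and [2230] could be approximated along the following lines. The log does not record how the indexes were actually generated, so this is a sketch under assumptions: each index maps a relative path to a size and SHA-256 digest, and the function names build_index, compare_indexes and known_digests are hypothetical.

```python
import hashlib
import os


def build_index(root):
    """Map each regular file's path relative to root to (size, SHA-256 hex digest)."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large mailboxes do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            relative = os.path.relpath(path, root)
            index[relative] = (os.path.getsize(path), digest.hexdigest())
    return index


def compare_indexes(live, backup):
    """Return paths only in live, only in backup, and in both but differing."""
    only_live = sorted(set(live) - set(backup))
    only_backup = sorted(set(backup) - set(live))
    changed = sorted(p for p in set(live) & set(backup) if live[p] != backup[p])
    return only_live, only_backup, changed


def known_digests(index):
    """Set of content digests, for matching nameless files from lost+found."""
    return {digest for _size, digest in index.values()}
```

Files relocated to lost+found lose their original names, so matching by content digest against known_digests of the live and backup indexes is one way to tell a stale copy of an existing file from something that exists nowhere else.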