Friday, 27 July 2007

[INCIDENT 2007/002] July 27th 2007 - Filesystem failures

Incident log for July 27th 2007
Attending: dwm, nick, mark
Status: Completed at 22:24hrs, GMT+1
Transcript, times are in GMT+1:
  • [2224] Restored to full multi-user mode with all services. (MySQL proved a little problematic - the automatic table management tooling didn't fix up absolutely everything - but that should be fine now; see the command sketches below.) Hopefully that's the last we'll see of this particular class of problem, at least for a while - but we'll be keeping a close eye on things just to make sure.
  • [2201] Rebooting into full multi-user now.
  • [2144] New kernel is installed. Rebooting again into single-user to test.
  • [2136] Booted in single-user mode. Next step: download and install a more up-to-date kernel.
  • [2114] Okay, rebooting into single-user mode. We're going to move the UDMA-6-capable disk to /dev/hda, the first disk slot that's not attached to the Promise controller, in case it's the Promise IDE controller itself not liking UDMA-6 operation (sketch below).
  • [2050] Quick SMART checks completed with no errors. Quick checks writing test files to temporary filesystems also passed.
  • [2047] Running SMART self-tests on both remaining IDE disks (sketch below).
  • [2015] Hmm, re-checking the newly reconstructed XFS filesystems showed minor errors again on /, easily fixed - but it may be that we haven't fully eliminated the cause of any underlying IO problems.
  • [2012] All of the filesystems are back and intact. Next steps will be to reboot the machine into single-user and check that everything boots up that's supposed to. Also, I'm going to try some additional block-level tests to check that data being written to the drives is being recorded and read back correctly (sketch below). First, however, the curry Nick and I ordered has arrived at the Security desk..
  • [1956] rsync is showing some files which are absent without leave; these are most likely the files relegated to lost+found. Rather than manually re-sort these files, we're going to restore those that are affected from the backup image - if anything proves to be missing, we still have the file fragments in lost+found that we can dig through.
  • [1947] Okay, the /home filesystem has been checked and had its errors fixed, with a few files getting relocated to lost+found. We're going to do some whole-filesystem comparisons between our backup image and the repaired /home to try to make sure that nothing vital is missing (sketch below).
  • [1935] Root filesystem has been restored. /boot filesystem passes all checks. /var contained some errors, fixed. Now running xfs_repair on /home to see how well it can cope with the errors on that volume (sketch below).
  • [1833] Okay, we've got the RAID mirror up on the two good disks and have the LVM volumes up (sketch below). Mounted the root filesystem; it appears that the earlier xfs_repair run moved everything into lost+found, which is fairly amusing. Proceeding to create a new root filesystem and restore the contents from the backup image on the transit disk.
  • [1820] Performed some further tests, and determined that the error is persistent on that specific disk (as opposed to the disk channel it was occupying). This disk, one of the disks that dates back to Kalimdor's original installation, is now considered suspect and has been removed. We will be rebuilding on the two other disks only.
  • [1755] Manual re-assembly of the RAID mirror showed that disk 2 of 3 had an invalid and out-of-date RAID superblock (sketch below). This raises the possibility that this disk, or its ribbon cable, is bad. We've removed it from the working set and are proceeding on the other two disks.
  • [1732] Machine has passed Memtest86 up to and beyond pass #4; it's fine. Next step: reboot from a rescue disk and restore the root filesystem from our transit disk.
  • [1705] Console hooked up, now running extended hardware memory test.
  • [1650] dwm and nick now on-site and online in Telehouse. Next steps: re-establish a functional root filesystem and conduct a thorough memory hardware check.
  • [1448] Backup complete. Packing up ready to head into Telehouse.
  • [1410] More than 2/3 complete - currently about 170,000 files left to go.
  • [1343] Copy of backup image to transit drive now 50% complete - about 300,000 files left to go.
  • [1317] Returned with the USB-SATA adaptor. Transfer of backup image to the transit drive is approximately 1/3 complete. Nick is now also en route from Southampton to Telehouse; ETA sometime after 1500hrs.
  • [1226] Copying of backup data to transit drive now underway (all 600,000+ files of it; sketch below). Departing now for the local Maplin to pick up a USB-SATA adaptor, back soon..
  • [1208] Called ahead to a Maplin a couple of miles away - they've got the USB-SATA adaptor that we need. Just finished wiring the transit SATA drive into the offsite-backup server; about to start data copy to transit drive.
  • [1204] Received authorization to enter Telehouse building.
  • [1144] Updated plan:
    1. Copy the offsite-backup image to removable media - in practice, a spare 250GB SATA disk.
    2. Whilst that's running, I'll head out to procure a USB-SATA adaptor so that we can copy the data back onto Kalimdor.
    3. Request for authorization to enter Telehouse has been submitted; we'll hopefully get that by the time we're ready to head in. Nick is also heading up from Winchester to assist with the recovery.
    4. Once we're onsite, we should be able to restore Kalimdor to good working order - in the worst case, restoring absolutely everything from the offsite-backup image taken early this morning.
  • [1129] Okay, that unfortunately didn't work. xfs_repair nuked /sbin/init. Off-site recovery methods are now exhausted; physical local access will be necessary.
  • [1125] Root filesystem repair complete. Rebooting again into single-user mode.
  • [1121] Okay, plan: attempt filesystem recovery of /, see if we can get the recovery tools properly functional. In parallel, also preparing for physical entry to Telehouse with mobile copy of offsite-backup image.
  • [1110] Sent request for physical Telehouse access to co-lo.
  • [1059] Planning next steps.
  • [1049] Read-only check of / (root) filesystem is showing fairly extensive corruption. Other filesystems may be similarly affected. It may be necessary to physically go to Telehouse to rebuild the host from offsite backup.
  • [1044] Read-only check of /var XFS filesystem failed to terminate. Rebooting again to return to ground state.
  • [1021] Checking /var filesystem.
  • [1018] Executed restart via serial-console.
  • [1013] First response to issue. SSH and Apache services malfunctioning.
  • [0939] Issue raised via text-message.
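Command sketches, for the curious. These are illustrative reconstructions of the sort of commands involved, not a verbatim record - device names, mount points, volume paths and credentials below are assumptions:
  • [2224] The MySQL cleanup - a manual check-and-repair pass of the kind mysqlcheck provides:
      # Check every table in every database, then repair anything found broken
      mysqlcheck -u root -p --all-databases --check
      mysqlcheck -u root -p --all-databases --repair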
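  • [2114] Confirming the relocated disk's DMA mode with hdparm, and capping it below UDMA-6 if it proves unreliable:
      # Show the supported and currently-selected UDMA modes for the disk now at /dev/hda
      hdparm -i /dev/hda | grep -i udma
      # If UDMA-6 misbehaves, cap the transfer mode at UDMA-5
      hdparm -X udma5 /dev/hda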
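  • [2047] The SMART self-tests, via smartmontools:
      # Kick off a short self-test on each remaining IDE disk...
      smartctl -t short /dev/hda
      smartctl -t short /dev/hdc
      # ...then review the results a few minutes later
      smartctl -l selftest /dev/hda
      smartctl -l selftest /dev/hdc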
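  • [2012] A crude block-level write-and-verify round trip (needs root; drop_caches requires a 2.6.16 or later kernel):
      # Write a known blob of random data and record its checksum
      dd if=/dev/urandom of=/tmp/blocktest bs=1M count=64
      md5sum /tmp/blocktest > /tmp/blocktest.md5
      # Force the data out of the page cache so the re-read hits the disk
      sync && echo 3 > /proc/sys/vm/drop_caches
      # Read it back and verify - a mismatch means the disk is misrecording data
      md5sum -c /tmp/blocktest.md5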
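  • [1947] Comparing the repaired /home against the backup image with a checksum-based rsync dry run:
      # -n: dry run, change nothing; -c: compare file contents by checksum
      # Anything listed is missing from, or differs on, the repaired /home
      rsync -avnc --delete /mnt/backup/home/ /home/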
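  • [1935] The xfs_repair runs, against unmounted filesystems:
      # First pass read-only (-n): report errors without touching anything
      xfs_repair -n /dev/vg0/home
      # Then repair for real; anything unplaceable ends up in lost+found
      xfs_repair /dev/vg0/home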
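  • [1833]/[1755] Inspecting the RAID superblocks, reassembling the mirror degraded on the two good members, and reactivating LVM:
      # Examine each member's superblock - the bad disk showed a stale one
      mdadm --examine /dev/hda1 /dev/hdc1 /dev/hde1
      # Assemble and start the mirror using only the two good members
      mdadm --assemble --run /dev/md0 /dev/hda1 /dev/hdc1
      # Bring the LVM volumes layered on top back online
      vgscan && vgchange -ay vg0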
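  • [1226] Copying the backup image to the transit drive, preserving ownership, permissions and hard links:
      rsync -aH /srv/backup/kalimdor/ /mnt/transit/kalimdor/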
Comments:
  • The bad news is that this has been a several-hour-long outage, for which we deeply apologise. The good news is that recovery seems to have gone well, and we believe any data loss from this significant filesystem failure was minimal.
  • We think the root cause of the recent problems was a faulty disk. Unfortunately, rather than failing outright and refusing to function, we believe that it was silently recording data incorrectly - causing problems when that data was read back in normal operation. We have removed this disk and will be replacing it with a new drive.
  • You may find it amusing to learn that today is Sysadmin Appreciation Day. If only the machines themselves respected such hallowed events..