Friday 27 July 2007

[INCIDENT 2007/002] July 27th 2007 - Filesystem failures

Incident log for July 27th 2007
Attending: dwm, nick, mark
Status: Completed at 22:24hrs, GMT+1
Transcript, times are in GMT+1:
  • [2224] Restored to full multi-user mode with all services. (MySQL proved a little problematic - the automatic table management tooling didn't fix absolutely everything up - but that should be fine now.) Hopefully that's the last we'll see of this particular class of problem, at least for a while - but we'll be keeping a close eye on things just to make sure.
  • [2201] Rebooting into full multi-user now.
  • [2144] New kernel is installed. Rebooting again into single-user to test.
  • [2136] Booted in single-user mode. Next step: download and install a more up-to-date kernel.
  • [2114] Okay, rebooting into single-user mode. We're going to move the UDMA-6-capable disk to /dev/hda, the first disk slot that's not attached to the Promise controller, in case it's the Promise IDE controller itself not liking UDMA-6 operation. (A sketch of how to check the active UDMA mode is below the transcript.)
  • [2050] Quick SMART checks completed with no errors. Quick checks writing test files to temporary filesystems also passed.
  • [2047] Running SMART self-tests on both remaining IDE disks. (Example smartctl invocations are sketched below the transcript.)
  • [2015] Hmm, re-checking the newly reconstructed XFS filesystems showed minor errors again on /, easily fixed - but it may be that we haven't fully eliminated the cause of any underlying IO problems.
  • [2012] All of the filesystems are back and intact. Next steps will be to reboot the machine into single-user and check that everything boots up that's supposed to. Also, I'm going to try some additional block-level tests to check that data being written to the drives is being recorded and read correctly. First, however, the curry Nick and I ordered has arrived at the Security desk..
  • [1956] rsync is showing some files which are absent without leave; these are most likely the files relegated to lost+found. Rather than manually re-sort these files, we're going to restore those that are affected from the backup image - if anything proves to be missing, we still have the file fragments in lost+found that we can dig through.
  • [1947] Okay, /home filesystem has been checked and had its errors fixed, with a few files getting relocated to lost+found. We're going to do some whole-filesystem comparisons between our backup image and the repaired /home to try to make sure that nothing vital is missing. (The rsync comparison is sketched below the transcript.)
  • [1935] Root filesystem has been restored. /boot filesystem passes all checks. /var contained some errors, fixed. Now running xfs_repair on /home to see how well it can cope with the errors on that volume.
  • [1833] Okay, we've got the RAID mirror up on the two good disks and have the LVM volumes up (the assembly commands are sketched below the transcript). Mounted the root filesystem; it appears that the earlier xfs_repair run moved everything into lost+found, which is fairly amusing. Proceeding to create new root filesystem and restore the contents from the backup image on the transit disk.
  • [1820] Performed some further tests, and determined that the error is persistent on that specific disk (as opposed to the disk channel it was occupying). This disk, one of those dating back to Kalimdor's original installation, is now considered suspect and has been removed. We will be rebuilding on the two other disks only.
  • [1755] Manual re-assembly of the RAID mirror showed that disk 2 of 3 had an invalid and out-of-date RAID superblock. This raises the possibility that this disk, or its ribbon cable, is bad. We've removed it from the working set and are proceeding on the other two disks.
  • [1732] Machine has passed Memtest86 up to and beyond pass #4; it's fine. Next step: reboot from a rescue disk and restore the root filesystem from our transit disk.
  • [1705] Console hooked up, now running extended hardware memory test.
  • [1650] dwm and nick now on-site and online in Telehouse. Next steps: re-establish a functional root filesystem and conduct a thorough memory hardware check.
  • [1448] Backup complete. Packing up ready to head into Telehouse.
  • [1410] More than 2/3 complete - currently about 170,000 files left to go.
  • [1343] Copy of backup image to transit drive now 50% complete - about 300,000 files left to go.
  • [1317] Returned with the USB-SATA adaptor. Transfer of backup image to the transit drive is approximately 1/3 complete. Nick is now also en route from Southampton to Telehouse; ETA sometime after 1500hrs.
  • [1226] Copying of backup data to transit drive now underway (all 600,000+ files of it). Departing now for the local Maplin to pick up a USB-SATA adaptor - back soon..
  • [1208] Called ahead to a Maplin a couple of miles away - they've got the USB-SATA adaptor that we need. Just finished wiring the transit SATA drive into the offsite-backup server; about to start data copy to transit drive.
  • [1204] Received authorization to enter Telehouse building.
  • [1144] Updated plan:
    1. Copy the offsite-backup image to removable media - in practice, a spare 250GB SATA disk.
    2. Whilst that's running, I'll head out to procure a USB-SATA adaptor so that we can copy the data back onto Kalimdor.
    3. Request for authorization to enter Telehouse has been submitted; we'll hopefully have that by the time we're ready to head in. Nick is also heading up from Winchester to assist with the recovery.
    4. Once we're onsite, we should be able to restore Kalimdor to good working order - in the worst case, restoring absolutely everything from the offsite-backup image taken early this morning.
  • [1129] Okay, that unfortunately didn't work. xfs_repair nuked /sbin/init. Off-site recovery methods are now exhausted; physical local access will be necessary.
  • [1125] Root filesystem repair complete. Rebooting again into single-user mode.
  • [1121] Okay, plan: attempt filesystem recovery of /, see if we can get the recovery tools properly functional. In parallel, also preparing for physical entry to Telehouse with mobile copy of offsite-backup image.
  • [1110] Sent request for physical Telehouse access to co-lo.
  • [1059] Planning next steps.
  • [1049] Read-only check of / (root) filesystem is showing fairly extensive corruption. Other filesystems may be similarly affected. It may be necessary to physically go to Telehouse to rebuild the host from offsite backup.
  • [1044] Read-only check of /var XFS filesystem failed to terminate. Rebooting again to return to ground state.
  • [1021] Checking /var filesystem.
  • [1018] Executed restart via serial-console.
  • [1013] First response to the issue. SSH and Apache services malfunctioning.
  • [0939] Issue raised via text-message.
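For the curious, checking (and if need be capping) a drive's active UDMA mode is a one-liner with hdparm. This is only a rough sketch - the device names are illustrative and not necessarily Kalimdor's actual assignments:

    # Show the drive's identification data; the active UDMA mode is marked with a '*'
    hdparm -I /dev/hda | grep -i udma
    # If necessary, cap the interface at UDMA-5 rather than UDMA-6 (illustrative only)
    hdparm -X udma5 /dev/hda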
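The SMART self-tests mentioned at [2047] and [2050] amount to roughly the following smartctl invocations (assuming smartmontools is installed; again, device names are illustrative):

    # Kick off a short offline self-test on each remaining IDE disk
    smartctl -t short /dev/hda
    smartctl -t short /dev/hdc
    # A few minutes later, review the self-test log and the overall health assessment
    smartctl -l selftest /dev/hda
    smartctl -H /dev/hda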
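The whole-filesystem comparison at [1947] and [1956] is essentially a dry-run rsync from the backup image into the repaired filesystem. A minimal sketch, assuming the backup image is mounted at /mnt/backup (a hypothetical path):

    # Itemise differences without changing anything (-n dry run, -c compare by checksum)
    rsync -aHnc --itemize-changes /mnt/backup/home/ /home/
    # Then copy back just the files that turn out to be missing or damaged (paths illustrative)
    rsync -aH /mnt/backup/home/some/affected/dir/ /home/some/affected/dir/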
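Finally, for the record, manually inspecting and re-assembling a degraded two-disk mirror and bringing its LVM volumes back up looks roughly like this. Array, partition and volume-group names here are illustrative, not necessarily Kalimdor's real ones:

    # Inspect each member's RAID superblock; a stale or invalid superblock marks the bad member
    mdadm --examine /dev/hda1 /dev/hdc1 /dev/hde1
    # Assemble the mirror degraded, using only the two good members
    mdadm --assemble --run /dev/md0 /dev/hda1 /dev/hdc1
    # Activate the LVM volume groups carried on the array, then mount for inspection
    vgchange -ay
    mount /dev/vg0/root /mnt/root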
Comments:
  • The bad news is that this has been a several-hour-long outage, for which we deeply apologise. The good news is that recovery seemed to go well and we believe any data loss from this significant filesystem failure was very minimal.
  • We think the root cause of the recent problems was a faulty disk. Unfortunately, rather than failing outright and refusing to function, we believe that it was silently recording data incorrectly - causing problems when it was read from again in normal operation. We have removed this disk and will be fitting a fresh replacement.
  • You may find it amusing to learn that today is Sysadmin Appreciation Day. If only the machines themselves respected such hallowed events..

Thursday 26 July 2007

[INCIDENT 2007/001] July 26th 2007 - Kernel error

Incident log for July 26th 2007
Attending: dwm, mark
Status: Completed at 16:45hrs, GMT+1
Transcript, times are in GMT+1:
  • [1645] Back to normal operation.
  • [1643] Final checks complete, booting to full operation.
  • [1642] Reboot complete, executing final checks.
  • [1639] Filesystem check of / complete. Executing reboot to single-user.
  • [1635] Filesystem /home check complete. Proceeding to check / (root filesystem).
  • [1631] Filesystem check of /export/recover complete. Proceeding to check /home.
  • [1616] /var check complete. Now checking /export/recover.
  • [1614] kalimdor.tastycake.net rebooted via SysRq over the serial console (a sketch of the SysRq sequence is below the transcript). Checking /var filesystem.
  • [1611] dwm logged in via remote root shell.
  • [1610] Alarm raised by mark; kernel OOPS reported in XFS filesystem code. SSH services unavailable.
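As a note on the [1614] entry: when a root shell is still available, a SysRq reboot can be driven through /proc; over the serial console the same commands are sent as a serial break followed by the command key. A minimal sketch:

    # Make sure the magic SysRq interface is enabled
    echo 1 > /proc/sys/kernel/sysrq
    # Sync dirty data, remount filesystems read-only, then reboot (the classic s-u-b sequence)
    echo s > /proc/sysrq-trigger
    echo u > /proc/sysrq-trigger
    echo b > /proc/sysrq-trigger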

Sunday 22 July 2007

Kalimdor maintenance log - July 22nd 2007

Kalimdor maintenance log for July 22nd 2007.
Attending: dwm
Status: Completed at 23:30hrs, GMT+1.

Objectives:
  1. [ABORTED] Replace existing power-supply unit (PSU) with new more-efficient model (80PLUS-rated) provided by Jump Networks.
  2. [COMPLETE] Repair faulty inode on /home filesystem.
Transcript, times are in GMT+1:
  • [2330] Kalimdor.tastycake.net has been returned to full multi-user mode, and is running all services. This ends the at-risk period.
  • [2324] Spare disk re-added to RAID, rebuild in progress. Switching to multi-user mode. (The mdadm commands are sketched below the transcript.)
  • [2320] Rebooted successfully. Satisfied that all is well. Rebooting again, this time to replace backup disk.
  • [2313] Minor housekeeping errors on / fixed. Now rebooting. (Still in single-user mode.)
  • [2308] Minor housekeeping errors on /var fixed.
  • [2304] Quota checks complete; now double-checking other filesystems.
  • [2302] xfs_copy of /home complete. xfs_check shows new filesystem is intact. Performing first mount; quotacheck running. (The xfs_repair / xfs_copy steps are sketched below the transcript.)
  • [2255] xfs_copy is now more than 80% complete.
  • [2249] xfs_copy is now more than 60% complete.
  • [2244] xfs_copy is now more than 40% complete.
  • [2239] xfs_copy is now more than 20% complete.
  • [2232] xfs_copy is now running, copying the contents of the previously-created backup volume to /home.
  • [2224] Okay, xfs_repair just isn't working, and my window for getting home tonight is closing. Going to reconstruct and repopulate /home from scratch.
  • [2220] Despite xfs_repair fixing some specific issues, mounting /home and checking shows that the errors have not been corrected. This raises a new hypothesis: the RAID mirror isn't fully synchronized, or isn't syncing data correctly. Investigating.
  • [2206] Hmm, given how the building alarm keeps coming and going (and started at 2200), it's probably a test. Carrying on..
  • [2201] And that's the building fire alarm.
  • [2159] Rebooted with one disk removed. xfs_repair has now run once successfully over /home with minor changes (removals to lost+found) - rerunning again to see if the FS has now settled to a good state.
  • [2145] xfs_repair reported and corrected some errors; however, re-running xfs_repair reported even more errors - I suspect that the /home filesystem is either suffering from a serious problem, or the underlying LVM is malfunctioning badly - most likely the former. However, to be sure, I'm going to pull one of the RAID mirror disks and keep it in reserve. In the worst case, I will be able to repopulate any broken filesystems from the spare disk.
  • [2142] Comical error message of the day: bad (negative) size -2500720168097138090 on inode 580671.
    Fsck continues..
  • [2133] Backup complete. Double-checking integrity of backup FS, then will re-run fsck on /home.
  • [2002] /home filesystem backup running. It's only completed a couple of GB so far, so it'll take a good few minutes to complete. Taking advantage of the delay to go and fetch some food before I pass out!
  • [1953] The filesystem check has turned up the expected single-inode error; however, xfs_repair is unable to fully repair the filesystem. Now making a separate copy of /home before continuing, just to be on the safe side.
  • [1936] Old power supply has been replaced and Kalimdor has been re-installed in the rack. Now rebooting to single-user mode to perform the planned filesystem checks.
  • [1908] Aha: it turns out this particular sub-variant of PSU doesn't include a particular -5V line necessary for correct operation. (We've got an ATX12V PSU, and the new one we have is an ATX12V v2.2 PSU. Frustratingly, they're not backwards compatible.) Now going through the delicate process of removing the new PSU and threading the old one back in.
  • [1851] The reinstalled machine is failing to power up with the new PSU: it's able to drive its networking status lights, but none of the fans are running and it fails to respond to the power switch. Working to identify the fault now, though if we can't fix this very quickly we'll have to fall back to our older (working) PSU.
  • [1827] New power supply installed, machine re-assembled. Getting a power cable to the optical drive was indeed very fiddly, but achieved now. Unfortunately, the new PSU doesn't have a separate IEC break-out socket for mounting on the rear of the case, and there's nowhere to physically attach the new PSU inside the rack itself. About to reinstall in the rack now.
  • [1757] Swapping out the PSU. Cable-running and re-mounting on the inside of the case is a little fiddly, so will take a few more minutes.
  • [1705] Obtained access to TFM-8 server room containing Kalimdor. Proceeding to execute a clean shutdown.
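For reference, the xfs_repair / xfs_copy dance described above looks roughly like this when spelled out. The volume paths are illustrative - Kalimdor's real LVM names may differ:

    # Dry-run check of the damaged volume, then a real repair attempt (volume unmounted)
    xfs_repair -n /dev/vg0/home
    xfs_repair /dev/vg0/home
    # If repair won't settle: block-copy the known-good backup filesystem over the /home volume
    xfs_copy /dev/backupvg/home /dev/vg0/home
    # Sanity-check the result, then mount with quotas enabled; XFS re-runs its internal
    # quota accounting check on this first quota-enabled mount
    xfs_check /dev/vg0/home
    mount -o usrquota,grpquota /dev/vg0/home /home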
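Similarly, pulling one half of the mirror into reserve and later re-adding it is a routine mdadm operation; a sketch with illustrative device names:

    # Mark one mirror member as failed and remove it, keeping it aside as a known-good copy
    mdadm /dev/md0 --fail /dev/hdc1
    mdadm /dev/md0 --remove /dev/hdc1
    # Later, once happy with the repaired filesystem, re-add it and let the mirror rebuild
    mdadm /dev/md0 --add /dev/hdc1
    # Watch the rebuild progress
    cat /proc/mdstat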

Thursday 19 July 2007

[AT RISK] Kalimdor downtime THIS SUNDAY from 1700hrs

Hello all,

Kalimdor.tastycake.net will be taken out of service on

Sunday 22nd July (THIS SUNDAY) from 1700hrs

... in order to carry out preventative maintenance. Whilst these works are in progress, NONE of the tastycake.net services will be available. I anticipate that the works will only take approximately 30 minutes to complete.

Status updates will be posted to our new offsite status page, http://tastycake-status.blogspot.com.

The works to be carried out are as follows:
  • Replacement of Kalimdor's internal power-supply unit (PSU) with a new, more-efficient equivalent that conforms to the 80PLUS specification.
  • A filesystem check on /home in order to clear a stuck inode.
I apologise for the short notice; if these planned works present a problem, or if you have any other comments, concerns or queries, please contact us, the administrators, via all-heroes([a])tastycake.net.

Cheers,
David