Sunday, 22 July 2007

Kalimdor maintenance log - July 22nd 2007

Kalimdor maintenance log for July 22nd 2007.
Attending: dwm
Status: Completed at 23:30hrs, GMT+1.

Objectives:
  1. [ABORTED] Replace existing power-supply unit (PSU) with new more-efficient model (80PLUS-rated) provided by Jump Networks.
  2. [COMPLETE] Repair faulty inode on /home filesystem.
Transcript, times are in GMT+1:
  • [2330] Kalimdor.tastycake.net has been returned to full multi-user mode, and is running all services. This ends the at-risk period.
  • [2324] Spare disk re-added to RAID, rebuild in progress. Switching to multi-user mode.
  • [2320] Rebooted successfully. Satisfied that all is well. Rebooting again, this time to replace backup disk.
  • [2313] Minor housekeeping errors on / fixed. Now rebooting. (Still in single-user mode.)
  • [2308] Minor housekeeping errors on /var fixed.
  • [2304] Quota checks complete; now double-checking other filesystems.
  • [2302] xfs_copy of /home complete. xfs_check shows new filesystem is intact. Performing first mount; quotacheck running.
  • [2255] xfs_copy is now more than 80% complete.
  • [2249] xfs_copy is now more than 60% complete.
  • [2244] xfs_copy is now more than 40% complete.
  • [2239] xfs_copy is now more than 20% complete.
  • [2232] xfs_copy is now running, copying the contents of the previously-created backup volume to /home.
  • [2224] Okay, xfs_repair just isn't working, and my window for getting home tonight is closing. Going to reconstruct and repopulate /home from scratch.
  • [2220] Despite xfs_repair fixing some specific issues, mounting /home and checking shows that the errors have not been corrected. This raises a new hypothesis: the RAID mirror isn't fully synchronized, or isn't syncing data correctly. Investigating.
  • [2206] Hmm, given how the building alarm keeps coming and going (and started at 2200), it's probably a test. Carrying on.
  • [2201] And that's the building fire alarm.
  • [2159] Rebooted with one disk removed. xfs_repair has now run once successfully over /home with minor changes (removals to lost+found) - rerunning to see if the FS has now settled to a good state.
  • [2145] xfs_repair reported and corrected some errors; however, re-running xfs_repair reported even more errors - I suspect that the /home filesystem is either suffering from a serious problem, or the underlying LVM is malfunctioning badly -- most likely the former. However, to be sure, I'm going to pull one of the RAID mirror disks and keep it in reserve. In the worst case, I will be able to repopulate any broken filesystems from the spare disk.
  • [2142] Comical error message of the day: bad (negative) size -2500720168097138090 on inode 580671.
    Fsck continues..
  • [2133] Backup complete. Double-checking integrity of backup FS, then will re-run fsck on /home.
  • [2002] /home filesystem backup running. It's only completed a couple of GB so far, so it'll take a good few minutes to complete. Taking advantage of the delay to go and fetch some food before I pass out!
  • [1953] The filesystem check has turned up the expected single-inode error; however, xfs_repair is unable to fully repair the filesystem. Now making a separate copy of /home before continuing, just to be on the safe side.
  • [1936] Old power supply has been replaced and Kalimdor has been re-installed in the rack. Now rebooting to single-user mode to perform the planned filesystem checks.
  • [1908] Aha: it turns out this particular sub-variant of PSU doesn't include the -5V line necessary for correct operation. (We've got an ATX12V PSU, and the new one we have is an ATX12V v2.2 PSU. Frustratingly, they're not backwards compatible.) Now going through the delicate process of removing the new PSU and threading the old one back in.
  • [1851] The reinstalled machine is failing to power up with the new PSU. Though it's able to drive its networking status lights, none of the fans are running and it fails to respond to the power switch. Working to identify the fault now, though if we can't fix this very quickly we'll have to fall back to our older (working) PSU.
  • [1827] New power supply installed, machine re-assembled. Getting a power cable to the optical drive was indeed very fiddly, but achieved now. Unfortunately, the new PSU doesn't have a separate IEC break-out socket for mounting on the rear of the case, and there's nowhere to physically attach the new PSU inside the rack itself. About to reinstall in the rack now.
  • [1757] Swapping out the PSU. Cable-running and re-mounting on the inside of the case is a little fiddly, so will take a few more minutes.
  • [1705] Obtained access to TFM-8 server room containing Kalimdor. Proceeding to execute a clean shutdown.
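
The repair-then-recheck cycle described in the 2145 and 2159 entries could be scripted roughly as follows. This is only a sketch, not what was run on the night: the function names, the pass limit and the device path are hypothetical, and it relies on the xfs_repair man page's description of the -n (no-modify) mode, which exits with status 1 when corruption is detected and 0 when none is found. The filesystem must be unmounted first.

```python
import subprocess
from typing import Callable, List

# A runner takes an argument vector and returns the command's exit status.
Runner = Callable[[List[str]], int]


def run_command(argv: List[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(argv).returncode


def repair_until_clean(device: str, max_passes: int = 3,
                       run: Runner = run_command) -> bool:
    """Alternate no-modify checks and repairs until a check comes back clean.

    Returns True if a check reports no corruption, or False if the
    filesystem is still dirty after max_passes repair attempts.
    """
    for _ in range(max_passes):
        # Check only: exit status 0 means no corruption was detected.
        if run(["xfs_repair", "-n", device]) == 0:
            return True
        # Corruption found: attempt a real repair, then loop to re-check.
        run(["xfs_repair", device])
    # One last check after the final repair attempt.
    return run(["xfs_repair", "-n", device]) == 0


if __name__ == "__main__":
    # Hypothetical device path; substitute the real (unmounted) volume.
    if not repair_until_clean("/dev/example/home"):
        print("Still dirty after repeated repairs; restore from backup.")
```

A False result corresponds to the situation at 2224, where repeated repairs did not converge and restoring /home from the backup copy was the quicker route.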