<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7934364102285528663</id><updated>2012-02-16T10:43:42.351Z</updated><title type='text'>Tastycake Status</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://tastycake-status.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://tastycake.net/logo.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>12</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-3155763607070779658</id><published>2009-07-14T12:18:00.004+01:00</published><updated>2009-07-14T12:33:58.143+01:00</updated><title type='text'>[INCIDENT 2009/001] July 14th 2009 - Unexpected server failure</title><content type='html'>Incident log for July 14th 2009&lt;br /&gt;Attending: dwm, ncm&lt;br /&gt;Status: Completed at 11:32hrs, July 14th 2009.&lt;br /&gt;Summary: &lt;br /&gt;&lt;ul&gt;&lt;li&gt;The server &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt; stopped functioning correctly at or shortly after 0900.02hrs for reasons unknown.  
This was detected at 1045hrs, and normal service was restored at 1132hrs.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Transcript, times are in GMT+1:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[1132] Normal services restored.&lt;/li&gt;&lt;li&gt;[1130] All filesystems pass checks. Bring server up into normal multi-user mode.&lt;/li&gt;&lt;li&gt;[1113] Server booted into single-user mode using secondary kernel image. Checking all local filesystems for errors.&lt;/li&gt;&lt;li&gt;[1057] Reboot into single-user mode failed; initial ramdisk for primary kernel image found to be corrupted or truncated. Rebooting into backup kernel image.&lt;/li&gt;&lt;li&gt;[1045] Service failure discovered. Emergency reboot triggered after serial console found unresponsive.&lt;/li&gt;&lt;li&gt;[0902] Clients running on &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt; time-out from remote services.&lt;/li&gt;&lt;li&gt;[0900] &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt; stops logging to local system log.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-3155763607070779658?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/3155763607070779658'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/3155763607070779658'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2009/07/incident-2009001-july-14th-2009.html' title='[INCIDENT 2009/001] July 14th 2009 - Unexpected server failure'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' 
src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-5987691233638306974</id><published>2008-02-28T14:22:00.002Z</published><updated>2008-03-11T16:13:22.172Z</updated><title type='text'>[AT RISK 2008/004] March 8th 2008 - Electrical works planned between 0001-0200hrs</title><content type='html'>&lt;p&gt;Maintenance log for March 8th 2008&lt;br /&gt;Attending: dwm&lt;br /&gt;Status: Completed at 10:00hrs, March 8th 2008.&lt;br /&gt;Summary:&lt;br /&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;The rack containing the Tastycake.net server &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt; will be briefly powered down so that the rack can be connected to a newly-installed power-distribution board.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;As a result, no services will be accessible whilst the switchover is in progress. The at-risk period will last until 0200hrs, though the colo engineers hope to have normal services resumed by 0030hrs.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Works to be carried out:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Shut down &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt;. (Completed)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Wait whilst the co-location engineers switch the rack over to the new power-distribution feed. (Completed)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Boot &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt;. (Completed)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Verify services are running normally. (Completed)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;Transcript, times are in GMT:&lt;br /&gt;&lt;br /&gt;March 8th 2008&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[0915] Services verified as functioning correctly.  (There was a minor issue with the current experimental DNS service for the &lt;tt&gt;dwm.me.uk&lt;/tt&gt; domain as a result of invalid zone configuration data, corrected.  
It turns out that you're not allowed a CNAME as well as an SOA for the root of a zone, but A and AAAA records are fine.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[0015] Power restored, automated reboot in progress.  All services running as normal.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;March 7th 2008&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[2355] Automated scheduled shutdown executed.&lt;/li&gt;&lt;li&gt;[2230] Delayed-effect shutdown instruction executed; shutdown will occur at 2355hrs.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2105] Reminder of pending works (tonight!) sent by email to all users.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;February 28th 2008&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[1420] Initial update to off-site status page.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-5987691233638306974?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/5987691233638306974'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/5987691233638306974'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2008/02/at-risk-2008004-march-8th-2008.html' title='[AT RISK 2008/004] March 8th 2008 - Electrical works planned between 0001-0200hrs'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-7209299678341034758</id><published>2008-02-16T20:38:00.015Z</published><updated>2008-02-28T14:13:32.929Z</updated><title type='text'>[INCIDENT 2008/003] February 16th 2008 - Filesystem corruption, suspect defective IDE channel.</title><content type='html'>&lt;p&gt;Incident log for 
February 16th 2008&lt;br /&gt;Attending: dwm, mark, ncm&lt;br /&gt;Status: Completed at 2250hrs GMT.&lt;br /&gt;Summary:&lt;br /&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Tastycake.net server &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt; has suffered a filesystem corruption problem.&lt;/li&gt;&lt;li&gt;As a result, some disk / directory accesses are blocking indefinitely.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;We suspect that this data corruption is occurring somewhere along the disk channel supporting &lt;tt&gt;/dev/hdg&lt;/tt&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Works to be carried out:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Remove &lt;tt&gt;/dev/hdg&lt;/tt&gt; from all RAID mirrors to prevent further filesystem corruption. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Reboot machine into single-user mode.  NOTE: No services will be available whilst in single-user mode. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Run filesystem verification utilities on all disk filesystems. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Restore any damaged files from backups as required. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Reboot machine back into normal production operation. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;Transcript, times are in GMT:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[2250] Incident closed.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2247] Summary: All of the recovered files were old transient copies of data that had been deleted deliberately, with the possible exception of some of &lt;tt&gt;~anton&lt;/tt&gt;'s image files, which have been copied to his home directory for review.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2230] Of the remaining files, all owned by &lt;tt&gt;~jeremy&lt;/tt&gt;, all but one are old versions of existing mailboxes - probably an artifact of normal mailbox re-writing operation.  (Checking unique message IDs shows that the mail messages still exist in the live mailboxes.)  
The remaining file just contains the junk chars "&lt;tt&gt;|a:0:{}&lt;/tt&gt;" and doesn't appear in my filesystem index comparison.  Almost certainly junk, deleted.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2221] Found that most of the disconnected files owned by &lt;tt&gt;~anton&lt;/tt&gt; are temporary files generated by gallery; deleted.  (Christ, &lt;tt&gt;~anton&lt;/tt&gt;, you've got over a gigabyte of temporary files in there going back years!  Clear it out!)  His remaining files appear to be old deleted .jpeg photos, but moved them to a &lt;tt&gt;RECOVERED_FILES&lt;/tt&gt; directory in his &lt;tt&gt;$HOME&lt;/tt&gt; to allow for inspection and recovery.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2217] Filesystem checks on &lt;tt&gt;/dev/mapper/volume-recover&lt;/tt&gt; complete, no errors.  Re-mounting &lt;tt&gt;/vol/recover&lt;/tt&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2210] Generating home directory indexes of affected users on live system and offsite backup for comparison.&lt;/li&gt;&lt;li&gt;[2153] Picking through the disconnected files found in &lt;tt&gt;/home&lt;/tt&gt;:&lt;ul&gt;&lt;li&gt; One mailspool index auto-generated by Dovecot; will be automatically regenerated: deleted.&lt;/li&gt;&lt;li&gt;8.5MB junk mailbox owned by &lt;tt&gt;~jeremy&lt;/tt&gt;; expendable!&lt;/li&gt;&lt;li&gt;Remaining files owned by &lt;tt&gt;~anton&lt;/tt&gt; and &lt;tt&gt;~jeremy&lt;/tt&gt;, no other users affected.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;[2152] Running &lt;tt&gt;xfs_repair&lt;/tt&gt; check on &lt;tt&gt;/dev/mapper/volume-recover&lt;/tt&gt; in the background.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2151] Disconnected inode files in &lt;tt&gt;/var&lt;/tt&gt; are all old Apache logfiles dating to July 2007, which is older than normal retention policy.  Deleted.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2146] Machine back in production.  
Checking contents of &lt;tt&gt;lost+found&lt;/tt&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2144] Reboot in progress.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2138] All filesystem checks complete, bar &lt;tt&gt;/dev/mapper/volume-recover&lt;/tt&gt; which can be done whilst online.  Rebooting to normal production mode.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2137] &lt;tt&gt;xfs_repair&lt;/tt&gt; completed, no errors found.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2135] Running full &lt;tt&gt;xfs_repair&lt;/tt&gt; on &lt;tt&gt;/dev/mapper/volume-root&lt;/tt&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2134] Approximately 50 disconnected inodes detected on &lt;tt&gt;volume-home&lt;/tt&gt;, relocated to &lt;tt&gt;lost+found&lt;/tt&gt;.  These may be real files, or they may simply be historical artifacts.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2132] Minor error (link count) detected on &lt;tt&gt;volume-var&lt;/tt&gt;.  Full repair run also detected some disconnected inodes; running full repair on &lt;tt&gt;volume-home&lt;/tt&gt; for good measure.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2128] Re-checking &lt;tt&gt;volume-home&lt;/tt&gt; and &lt;tt&gt;volume-var&lt;/tt&gt; with &lt;tt&gt;xfs_repair -n&lt;/tt&gt; for good measure.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2127] Second filesystem check of &lt;tt&gt;/dev/mapper/volume-root&lt;/tt&gt; complete, no errors.  We may have been fortunate and only had the kernel BUG trigger as a result of a read error and not an earlier write error as previously feared.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2124] Filesystem check of &lt;tt&gt;/dev/mapper/volume-root&lt;/tt&gt; complete, no errors.  
Checking result with &lt;tt&gt;xfs_repair -n&lt;/tt&gt; (as opposed to &lt;tt&gt;xfs_check&lt;/tt&gt;).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2123] Filesystem check of &lt;tt&gt;/dev/mapper/volume-root&lt;/tt&gt; running, at least minor errors expected.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2121] Filesystem check of &lt;tt&gt;/dev/mapper/volume-home&lt;/tt&gt; complete, no errors.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2119] Filesystem check of &lt;tt&gt;/dev/mapper/volume-home&lt;/tt&gt; running.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2118] Filesystem check of &lt;tt&gt;/dev/mapper/volume-var&lt;/tt&gt; complete, no errors.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2117] Filesystem check of &lt;tt&gt;/dev/md6&lt;/tt&gt; (&lt;tt&gt;/boot&lt;/tt&gt;) complete, no errors.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2116] Machine rebooted into single-user mode.  All services unavailable from this point.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2057] Initial tastycake-status bulletin published.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2025] &lt;tt&gt;wall&lt;/tt&gt;ed all logged-in users to advise that emergency maintenance in progress.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2033] &lt;tt&gt;/dev/hdg&lt;/tt&gt; dropped from all RAID mirrors to avoid further disk corruption.  The next step is to reboot the machine into single-user mode to conduct full filesystem checks and repairs.&lt;/li&gt;&lt;li&gt;[2033] Incident announcement sent to all admins via http://twitter.com/tastycake.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2026] Determined that cause of fault is a faulty data channel to &lt;tt&gt;/dev/hdg&lt;/tt&gt; resulting in incorrect data being written to disk.  
Begun dropping &lt;tt&gt;/dev/hdg&lt;/tt&gt; from all RAID mirrors to avoid further corruption.&lt;/li&gt;&lt;li&gt;[2016] Filesystem corruption detected in &lt;tt&gt;/root/.wajig/kalimdor&lt;/tt&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2014] Kernel BUG (internal error alert) spotted by inspection.&lt;/li&gt;&lt;li&gt;[2002] Host monitoring system generates another Critical warning.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1007] Monitoring system downgrades previous critical warning to minor severity.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1002] Critical warning generated by host monitoring system, indicating that a significantly higher than normal number of &lt;tt&gt;cron&lt;/tt&gt; processes are running.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[0632] Minor warning generated by host monitoring system, indicating that a higher-than-normal number of &lt;tt&gt;cron&lt;/tt&gt; processes are running concurrently.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-7209299678341034758?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/7209299678341034758'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/7209299678341034758'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2008/02/incident-2008003-february-16th-2007.html' title='[INCIDENT 2008/003] February 16th 2008 - Filesystem corruption, suspect defective IDE channel.'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' 
src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-3391378180430572824</id><published>2008-02-11T11:04:00.000Z</published><updated>2008-02-11T14:06:46.941Z</updated><title type='text'>[AT RISK 2008/002] February 11th 2008 - emergency kernel upgrade</title><content type='html'>Maintenance log for February 11th 2008&lt;br /&gt;Attending: dwm&lt;br /&gt;Status: Completed at 1405hrs GMT&lt;br /&gt;Summary:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Tastycake.net server &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt; being rebooted (at least once) at approximately 1200noon GMT for an emergency kernel upgrade.&lt;/li&gt;&lt;li&gt;New kernel needed to patch local root escalation vulnerabilities (CVE-2008-0009, CVE-2008-0010).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;No Tastycake.net services will be available whilst reboots are occurring.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Works to be carried out:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Build new Linux kernel (2.6.24.2) to replace existing build (2.6.24). (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Install new kernel and set as default.  (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Reboot machine to start using new kernel. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;Transcript:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[1405] All tests clear, at-risk period concluded.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1400] Machine rebooted successfully into new kernel.  Running final checks..&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1357] Machine rebooted.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1353] Believed that I have corrected the booting problem (missing &lt;tt&gt;/dev/md0&lt;/tt&gt; entry in &lt;tt&gt;/etc/mdadm/mdadm.conf&lt;/tt&gt;) and rebooting again.  
(Again, with 1-minute grace.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1347] Successfully rebooted using original kernel; will be fixing raid-auto start, then rebooting again.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1334] Backup kernel not functioning; appears to not be auto-starting &lt;tt&gt;/dev/md0&lt;/tt&gt;; will need to configure manually.  This may take a few minutes..&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1331] Failed to boot using new kernel, power-cycled via power-switch interface.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1328] Machine reboot.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1326] Reboot triggered with 1-minute grace delay.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1320] New kernel installed, ready to reboot.  Warning sent via &lt;tt&gt;wall&lt;/tt&gt; to all logged-in users.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1313] Updated kernel package built, installed in Tastycake package repository.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1107] Initial update of maintenance log.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1010] Determined that 2.6.24.1 kernel that had been built overnight has been superseded by 2.6.24.2, building new kernel image.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-3391378180430572824?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/3391378180430572824'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/3391378180430572824'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2008/02/at-risk-2008002-february-11th-2008.html' title='[AT RISK 2008/002] February 11th 2008 - emergency kernel upgrade'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' 
height='27' src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-6191642687896525260</id><published>2008-02-09T10:43:00.000Z</published><updated>2008-02-09T15:12:13.831Z</updated><title type='text'>[AT RISK 2008/001] February 9th 2008 - scheduled maintenance</title><content type='html'>Maintenance log for February 9th 2008&lt;br /&gt;Attending: dwm&lt;br /&gt;Status: completed at 15:11hrs&lt;br /&gt;Summary:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Tastycake.net server &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt; being taken offline at 1200noon GMT for maintenance.&lt;/li&gt;&lt;li&gt;No Tastycake.net services will be available whilst works are in progress.&lt;/li&gt;&lt;li&gt;Works to be carried out:&lt;br /&gt;&lt;ul&gt;        &lt;li&gt;Install third 250GB hard-drive into RAID mirror. (Complete)&lt;br /&gt;&lt;/li&gt;    &lt;li&gt;Install GRUB bootloader on third drive. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Replace old 127GB hard-drive with new 250GB replacement. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Install GRUB bootloader on replacement drive. (Complete)&lt;br /&gt;&lt;/li&gt;    &lt;li&gt;Create new RAID mirror set on as-yet unallocated space. (Complete)&lt;br /&gt;&lt;/li&gt;    &lt;li&gt;Expand LVM working set using new RAID mirror set. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Upgrade local kernel to 2.6.24. (Complete)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Discontinue local NFS server, use read-only bind mount for /vol/recover instead.&lt;br /&gt;(New feature in 2.6.24.) 
(Cancelled)&lt;br /&gt;&lt;/li&gt;  &lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;Transcript, times are in GMT:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[1511] Final checks complete, at-risk period ends.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1506] Performing final checks prior to announcing end of at-risk period.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1505] Read-only bind mounts don't seem to be functioning, we perhaps need an updated &lt;tt&gt;mount-utils&lt;/tt&gt;. This can be done safely at a later date.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1456] Initialized &lt;tt&gt;/dev/md0&lt;/tt&gt; as new LVM PV and added PV to existing &lt;tt&gt;volume&lt;/tt&gt; VG; total capacity: 232GB.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1454] Added new second disk to main RAID mirror set &lt;tt&gt;/dev/md7&lt;/tt&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1451] Created new RAID mirror across previously-unused disk space.&lt;br /&gt;(NOTE: &lt;tt&gt;/dev/md0&lt;/tt&gt; is not the RAID mirror containing &lt;tt&gt;/boot&lt;/tt&gt;, &lt;tt&gt;/dev/md6&lt;/tt&gt; is.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1440] Kernel installed, rebooting to verify correct operation and to reload DOS partition tables.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1437] Installing updated kernel packages.  (And SNMP security updates, whilst we're here.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1434] Partitioned unallocated space on all three disks.  Leaving creation of new RAID mirror array until last, as it would only be interrupted by reboots anyway..&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1431] Partitioned drive 2 to match other disks.  Added drive 2 partition to &lt;tt&gt;/boot&lt;/tt&gt; RAID mirror volume. Installed bootloader on new drive.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1423] Replacement drive 2 installed, booting.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1415] Re-sync completed.  Rebooting to replace drive 2.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1402] Re-sync 90% complete.  
(Unfortunately, it seems to be slowing down to about 15MB/sec again.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1357] Installed spare GRUB bootloader on new disk.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1350] Re-sync 80% complete.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1334] Re-sync 66.6% complete.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1318] Re-sync 50% complete.  (Now peaking at ~22MB/sec; ETA at present rate: 44mins.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1301] Re-sync 33.3% complete. (Now peaking at ~19MB/sec; ETA at present rate: 67mins.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1250] Re-sync 25% complete.  (Looks like it's speeding up as it proceeds, probably due to disk geometry.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1230] Re-sync 10% complete.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1219] RAID re-sync in progress; need to wait for it to complete before replacing disk 2. ETA @ ~15MB/sec: 115mins.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1211] Adding new disk 3 partitions to RAID mirror sets.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1209] Disk installed, server rebooted.  Partitioned disk 3 to match existing layout.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1201] Serial terminal up; sent reboot instruction with 2-minute grace.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1159] Sent final warning via &lt;tt&gt;wall&lt;/tt&gt; to save all state; disk installed in caddy and ready for reboot.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1150] Readying disk three for hot-insertion.  (Though, because we're running on IDE, this will require a reboot..)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1147] Had to abort the transfer drive update; don't have access to the rear of the rack, and the front-side USB port is far too slow.  
Will just have to do today's work carefully..&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1105] Taking full filesystem image backup to spare transfer drive.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1045] Initial update of offsite maintenance log.&lt;/li&gt;&lt;li&gt;[1035] Arrived at Telehouse Docklands.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-6191642687896525260?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/6191642687896525260'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/6191642687896525260'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2008/02/at-risk-2008001-february-9th-2008.html' title='[AT RISK 2008/001] February 9th 2008 - scheduled maintenance'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-1407718378714639372</id><published>2007-11-16T20:35:00.000Z</published><updated>2008-02-09T10:54:33.668Z</updated><title type='text'>[INCIDENT 2007/005] November 16th 2007 - Upstream DNS servers offline</title><content type='html'>Incident log for November 16th 2007&lt;br /&gt;Attending: dwm, mark&lt;br /&gt;Status: Completed&lt;br /&gt;Summary:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;HostEurope.com, who provide DNS hosting services for Tastycake.net, are offline and are not answering queries for Tastycake.net addresses.&lt;/li&gt;&lt;li&gt;This may result in difficulties accessing Tastycake.net services, and delay the delivery of email to Tastycake.net email addresses.&lt;/li&gt;&lt;li&gt;As the fault lies 
somewhere within HostEurope.com's facilities, and not with the Kalimdor server itself, there is little we can do to directly address this problem, and we are waiting for HostEurope.com to implement a fix.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;In the unlikely event that HostEurope.com do not restore service in a timely fashion, we are preparing contingency plans to move the Tastycake.net domain to another hosting provider.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Transcript, times are in GMT:&lt;br /&gt;Sunday Nov 18th&lt;ul&gt;&lt;li&gt;[1200] Upstream DNS problems resolved.&lt;/li&gt;&lt;/ul&gt;Friday Nov 16th&lt;ul&gt;&lt;li&gt;[2045] Incident report posted to http://tastycake-status.blogspot.net/.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2021] HostEurope Support facility identified as offline.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2015] Both authoritative DNS servers (provided by HostEurope.com) identified as failed.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2010] DNS resolution problems first reported for tastycake.net domain.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-1407718378714639372?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/1407718378714639372'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/1407718378714639372'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2007/11/incident-2007005-november-16th-2007.html' title='[INCIDENT 2007/005] November 16th 2007 - Upstream DNS servers offline'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' 
src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-1297598620715743469</id><published>2007-09-19T13:26:00.000+01:00</published><updated>2007-09-19T13:39:12.995+01:00</updated><title type='text'>[INCIDENT 2007/004] September 19th 2007 - /var full</title><content type='html'>Incident log for September 19th 2007&lt;br /&gt;Attending: dwm&lt;br /&gt;Status: Completed at 1230hrs GMT+1&lt;br /&gt;Transcript, times are in GMT+1:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;[1230] Incident closed, all services appear to be running normally.&lt;br /&gt;&lt;li&gt;[1158] Upgraded clamav to latest-stable version to address database retrieval problem.  Continuing to monitor status of system.&lt;br /&gt;&lt;li&gt;[1130] Increased size of &lt;tt&gt;/var&lt;/tt&gt; filesystem online.  Restarted failed services: sysklogd, exim, clamav.&lt;br /&gt;&lt;li&gt;[0726] &lt;tt&gt;/var&lt;/tt&gt; filesystem on Kalimdor goes full resulting in service failures.  
Flash messages dispatched from service monitoring agent indicating that SMTP is unavailable.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-1297598620715743469?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/1297598620715743469'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/1297598620715743469'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2007/09/incident-2007004-september-19th-2007.html' title='[INCIDENT 2007/004] September 19th 2007 - /var full'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-3864240676365276939</id><published>2007-08-02T09:40:00.000+01:00</published><updated>2007-08-02T14:26:22.784+01:00</updated><title type='text'>[INCIDENT 2007/003] August 2nd 2007 - Airconditioning failure</title><content type='html'>Incident log for August 2nd 2007&lt;br /&gt;Attending: dwm, nick&lt;br /&gt;Status: In progress&lt;br /&gt;Transcript, times are in GMT+1:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;[1126] Kalimdor rebooted, throttled CPU to 900MHz until aircon fixed.&lt;br /&gt;&lt;li&gt;[1005] Text to Jump support requesting urgent powercycle - worried about potential CPU busy-wait loop and resulting temperature spike.&lt;br /&gt;&lt;li&gt;[0955] Email to Jump support requesting powercycle.&lt;br /&gt;&lt;li&gt;[0950] Hard lock after attempted CPU throttling to avoid burnout.&lt;br /&gt;&lt;li&gt;[0940] Inspected temperatures on kalimdor - 80-85 degrees CPU.  
This with ondemand throttling!&lt;br /&gt;&lt;li&gt;[0156] Email from Jump informing of air conditioning failure in TFM8.&lt;br /&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-3864240676365276939?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/3864240676365276939'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/3864240676365276939'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2007/08/incident-2007003-august-2nd-2007_02.html' title='[INCIDENT 2007/003] August 2nd 2007 - Airconditioning failure'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-2871715500589353779</id><published>2007-07-27T10:20:00.000+01:00</published><updated>2007-07-27T22:27:09.423+01:00</updated><title type='text'>[INCIDENT 2007/002] July 27th 2007 - Filesystem failures</title><content type='html'>Incident log for July 27th 2007&lt;br /&gt;Attending: dwm, nick, mark&lt;br /&gt;Status: Completed at 22:24hrs, GMT+1&lt;br /&gt;Transcript, times are in GMT+1:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[2224] Restored to full multi-user mode with all services.  (MySQL proved a little problematic - the automatic table management tooling didn't fix absolutely everything up - but that should be fine now.)  
Hopefully that's the last we'll see of this particular class of problem, at least for a while - but we'll be keeping a close eye on things just to make sure.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2201] Rebooting into full multi-user now.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2144] New kernel is installed.  Rebooting again into single-user to test.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2136] Booted in single-user mode.  Next step: download and install a new more-up-to-date kernel.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2114] Okay, rebooting into single-user mode.  We're going to move the UDMA-6-capable disk to &lt;tt&gt;/dev/hda&lt;/tt&gt;, the first disk slot that's not attached to the Promise controller, in case it's the Promise IDE controller itself not liking UDMA-6 operation.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2050] Quick SMART checks completed with no errors.  Quick checks writing test files to temporary filesystems also passed.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2047] Running SMART self-tests on both remaining IDE disks.&lt;/li&gt;&lt;li&gt;[2015] Hmm, re-checking the newly reconstructed XFS filesystems showed minor errors again on &lt;tt&gt;/&lt;/tt&gt;, easily fixed - but it may be that we haven't fully eliminated the cause of any underlying IO problems.&lt;/li&gt;&lt;li&gt;[2012] All of the filesystems are back and intact.  Next steps will be to reboot the machine into single-user and check that everything boots up that's supposed to.  Also, I'm going to try some additional block-level tests to check that data being written to the drives is being recorded and read correctly.  First, however, the curry Nick and I ordered has arrived at the Security desk..&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1956] &lt;tt&gt;rsync&lt;/tt&gt; is showing some files which are absent without leave; these are most likely the files relegated to &lt;tt&gt;lost+found&lt;/tt&gt;.   
Rather than manually re-sort these files, we're going to restore those that are affected from the backup image - if anything proves to be missing, we still have the file fragments in &lt;tt&gt;lost+found&lt;/tt&gt; that we can dig through.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1947] Okay, &lt;tt&gt;/home&lt;/tt&gt; filesystem has been checked and had its errors fixed, with a few files getting relocated to &lt;tt&gt;lost+found&lt;/tt&gt;.  We're going to do some whole-filesystem comparisons between our backup image and the repaired &lt;tt&gt;/home&lt;/tt&gt; to try to make sure that nothing vital is missing.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1935] Root filesystem has been restored.   &lt;tt&gt;/boot&lt;/tt&gt; filesystem passes all checks.   &lt;tt&gt;/var&lt;/tt&gt; contained some errors, fixed.   Now running &lt;tt&gt;xfs_repair&lt;/tt&gt; on /home to see how well it can cope with the errors on that volume.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1833] Okay, we've got the RAID mirror up on the two good disks and have the LVM volumes up.  Mounted the root filesystem; it appears that the earlier &lt;tt&gt;xfs_repair&lt;/tt&gt; run moved &lt;span style="font-weight: bold;"&gt;everything&lt;/span&gt; into &lt;tt&gt;lost+found&lt;/tt&gt;, which is fairly amusing.  Proceeding to create new root filesystem and restore the contents from the backup image on the transit disk.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1820] Performed some further tests, and determined that the error is persistent on that specific disk (as opposed to the disk channel it was occupying.)  This disk, one of the disks that dates back to Kalimdor's original installation, is now considered suspect and has been removed.  We will be rebuilding on the two other disks only.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1755] Manual re-assembly of the RAID mirror showed that disk 2 of 3 had an invalid and out-of-date RAID superblock.    This raises the possibility that this disk, or its ribbon cable, is bad.  
We've removed it from the working set and are proceeding on the other two disks.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1732] Machine has passed Memtest86 up to and beyond pass #4; it's fine.  Next step: reboot from a rescue disk and restore the root filesystem from our transit disk.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1705] Console hooked up, now running extended hardware memory test.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1650] dwm and nick now on-site and online in Telehouse.  Next steps: re-establish a functional root filesystem and conduct a thorough memory hardware check.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1448] Backup complete.  Packing up ready to head into Telehouse.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1410] More than 2/3 complete - currently about 170,000 files left to go.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1343] Copy of backup image to transit drive now 50% complete - about 300,000 files left to go.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1317] Returned with the USB-SATA adaptor.  Transfer of backup image to the transit drive is approximately 1/3 complete.  Nick is now also en-route from Southampton to Telehouse; ETA sometime after 1500hrs.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1226] Copying of backup data to transit drive now underway (all 600,000+ files of it.)  Departing now for the local Maplin to pick up a USB-SATA adaptor, back soon..&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1208] Called ahead to a Maplin a couple of miles away - they've got the USB-SATA adaptor that we need.  Just finished wiring the transit SATA drive into the offsite-backup server; about to start data copy to transit drive.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1204] Received authorization to enter Telehouse building.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1144] Updated plan:&lt;br /&gt;1. Copy the offsite-backup image to removable media - in practice, a spare 250GB SATA disk.&lt;br /&gt;2. Whilst that's running, I'll head out to procure a USB-SATA adaptor so that we can copy the data back onto Kalimdor.&lt;br /&gt;3. 
Request for authorization to enter Telehouse has been submitted, we'll hopefully get that by the time we're ready to head in.  Nick is also heading up from Winchester to assist with the recovery.&lt;br /&gt;4. Once we're onsite, we should be able to restore Kalimdor to good working order - in the worst case, restoring absolutely everything from the offsite-backup image taken early this morning.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1129] Okay, that unfortunately didn't work.  &lt;tt&gt;xfs_repair&lt;/tt&gt; nuked &lt;tt&gt;/sbin/init&lt;/tt&gt;.  Off-site recovery methods are now exhausted, physical local access will now be necessary.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1125] Root filesystem repair complete.  Rebooting again into single-user mode.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1121] Okay, plan: attempt filesystem recovery of /, see if we can get the recovery tools properly functional.  In parallel, also preparing for physical entry to Telehouse with mobile copy of offsite-backup image.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1110] Sent request for physical Telehouse access to co-lo.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1059] Planning next steps.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1049] Read-only check of &lt;tt&gt;/&lt;/tt&gt; (root) filesystem is showing fairly extensive corruption.   Other filesystems may be similarly affected.    It may be necessary to physically go to Telehouse to rebuild the host from offsite backup.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1044] Read-only check of &lt;tt&gt;/var&lt;/tt&gt; XFS filesystem failed to terminate.  Rebooting again to return to ground state.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1021] Checking &lt;tt&gt;/var&lt;/tt&gt; filesystem.&lt;/li&gt;&lt;li&gt;[1018]  Executed restart via serial-console.&lt;/li&gt;&lt;li&gt;[1013] First response to issue.  
SSH and Apache services malfunctioning.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[0939] Issue raised via text-message.&lt;/li&gt;&lt;/ul&gt;Comments:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The bad news is that this has been a several-hour-long outage, for which we deeply apologise.  The good news is that recovery seemed to go well and we believe any data loss from this significant filesystem failure was minimal.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;We think the root cause of the recent problems was a faulty disk.  Unfortunately, we believe that rather than failing outright and refusing to function, it was silently recording data incorrectly - causing problems when that data was read back in normal operation.  We have removed this disk and will be replacing it with a new one.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;You may find it amusing to learn that today is &lt;a href="http://it.slashdot.org/it/07/07/27/1546203.shtml"&gt;Sysadmin Appreciation Day&lt;/a&gt;.   If only the machines themselves respected such hallowed events..&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-2871715500589353779?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/2871715500589353779'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/2871715500589353779'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2007/07/incident-2007002-july-27th-2007.html' title='[INCIDENT 2007/002] July 27th 2007 - Filesystem failures'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' 
src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-8796366422215174745</id><published>2007-07-26T16:28:00.000+01:00</published><updated>2007-07-26T16:57:15.713+01:00</updated><title type='text'>[INCIDENT 2007/001] July 26th 2007 - Kernel error</title><content type='html'>Incident log for July 26th 2007&lt;br /&gt;Attending: dwm, mark&lt;br /&gt;Status: Completed at 16:45hrs, GMT +1&lt;br /&gt;Transcript, Times are in GMT+1:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[1645] Back to normal operation.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1643] Final checks complete, booting to full operation.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1642] Reboot complete, executing final checks.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1639] Filesystem check of &lt;tt&gt;/&lt;/tt&gt; complete.  Executing reboot to single-user.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1635] Filesystem &lt;tt&gt;/home&lt;/tt&gt; check complete.  Proceeding to check &lt;tt&gt;/&lt;/tt&gt; (root filesystem).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1631] Filesystem check of &lt;tt&gt;/export/recover&lt;/tt&gt; complete.  Proceeding to check &lt;tt&gt;/home&lt;/tt&gt;.&lt;/li&gt;&lt;li&gt;[1616] &lt;tt&gt;/var&lt;/tt&gt; check complete.  Now checking &lt;tt&gt;/export/recover&lt;/tt&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1614] &lt;tt&gt;kalimdor.tastycake.net&lt;/tt&gt; rebooted via serial console via SysRq.  Checking &lt;tt&gt;/var&lt;/tt&gt; filesystem.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1611] &lt;tt&gt;dwm&lt;/tt&gt;  logged in via remote root shell.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[1610] Alarm raised by &lt;tt&gt;mark&lt;/tt&gt;; kernel OOPS reported in XFS filesystem code.  
SSH services unavailable.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-8796366422215174745?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/8796366422215174745'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/8796366422215174745'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2007/07/incident-2007001-july-26th-2007-kernel.html' title='[INCIDENT 2007/001] July 26th 2007 - Kernel error'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-7705190636132666561</id><published>2007-07-22T17:04:00.000+01:00</published><updated>2007-07-22T23:33:16.435+01:00</updated><title type='text'>Kalimdor maintainance log - July 22nd 2007</title><content type='html'>Kalimdor maintenance log for July 22nd 2007.&lt;br /&gt;Attending: dwm&lt;br /&gt;Status:  Completed at 23:30hrs, GMT+1.&lt;br /&gt;&lt;br /&gt;Objectives:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;[ABORTED] Replace existing power-supply unit (PSU) with new more-efficient model (80PLUS-rated) provided by Jump Networks.&lt;/li&gt;&lt;li&gt;[COMPLETE] Repair faulty inode on /home filesystem.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Transcript, times are in GMT+1:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;[2330] Kalimdor.tastycake.net has been returned to full multi-user mode, and is running all services.  &lt;span style="font-weight: bold;"&gt;This ends the at-risk period. &lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2324] Spare disk re-added to RAID, rebuild in progress.  
Switching to multi-user mode.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2320] Rebooted successfully.  Satisfied that all is well.  Rebooting again, this time to replace backup disk.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2313] Minor housekeeping errors on &lt;tt&gt;/&lt;/tt&gt; fixed.  Now rebooting.  (Still in single-user mode.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2308] Minor housekeeping errors on &lt;tt&gt;/var&lt;/tt&gt; fixed.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2304] Quota checks complete; now double-checking other filesystems.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2302] &lt;tt&gt;xfs_copy&lt;/tt&gt; of &lt;tt&gt;/home&lt;/tt&gt; complete.  &lt;tt&gt;xfs_check&lt;/tt&gt; shows new filesystem is intact.  Performing first mount; quotacheck running.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2255] &lt;tt&gt;xfs_copy&lt;/tt&gt; is now more than 80% complete.&lt;/li&gt;&lt;li&gt;[2249] &lt;tt&gt;xfs_copy&lt;/tt&gt; is now more than 60% complete.&lt;/li&gt;&lt;li&gt;[2244] &lt;tt&gt;xfs_copy&lt;/tt&gt; is now more than 40% complete.&lt;/li&gt;&lt;li&gt;[2239] &lt;tt&gt;xfs_copy&lt;/tt&gt; is now more than 20% complete.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2232] &lt;tt&gt;xfs_copy&lt;/tt&gt; is now running, copying the contents of the previously-created backup volume to &lt;tt&gt;/home&lt;/tt&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2224] Okay, &lt;tt&gt;xfs_repair&lt;/tt&gt; just isn't working, and my window for getting home tonight is closing.  Going to reconstruct and repopulate &lt;tt&gt;/home&lt;/tt&gt; from scratch.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2220] Despite &lt;tt&gt;xfs_repair&lt;/tt&gt; fixing some specific issues, mounting &lt;tt&gt;/home&lt;/tt&gt; and checking shows that the errors have not been corrected.  This raises a new hypothesis: the RAID mirror isn't fully synchronized, or isn't syncing data correctly.  
Investigating.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2206] Hmm, given how the building alarm keeps coming and going (and started at 2200), it's probably a test.  Carrying on..&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2201] And that's the building fire alarm.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2159] Rebooted with one disk removed.  &lt;tt&gt;xfs_repair&lt;/tt&gt; has now run once successfully over &lt;tt&gt;/home&lt;/tt&gt; with minor changes (removals to &lt;tt&gt;lost+found&lt;/tt&gt;) - re-running to see if the FS has now settled to a good state.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2145] &lt;tt&gt;xfs_repair&lt;/tt&gt; reported and corrected some errors; however, re-running &lt;tt&gt;xfs_repair&lt;/tt&gt; reported &lt;em&gt;even more&lt;/em&gt; errors - I suspect that either the &lt;tt&gt;/home&lt;/tt&gt; filesystem is suffering from a serious problem, &lt;em&gt;or&lt;/em&gt; the underlying LVM is malfunctioning badly -- most likely the former. However, to be sure, I'm going to pull one of the RAID mirror disks and keep it in reserve.  In the worst case, I will be able to repopulate any broken filesystems from the spare disk.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2142] Comical error message of the day: &lt;tt&gt;bad (negative) size -2500720168097138090 on inode 580671&lt;/tt&gt;.&lt;br /&gt;Fsck continues..&lt;br /&gt;&lt;/li&gt;&lt;li&gt;[2133] Backup complete.   Double-checking integrity of backup FS, then will re-run fsck on &lt;tt&gt;/home&lt;/tt&gt;.&lt;/li&gt;&lt;li&gt;[2002] &lt;tt&gt;/home&lt;/tt&gt; filesystem backup running. It's only completed a couple of GB so far, so it'll take a good few minutes to complete. Taking advantage of the delay to go and fetch some food before I pass out!&lt;/li&gt;&lt;li&gt;[1953] The filesystem check has turned up the expected single-inode error; however, &lt;tt&gt;xfs_repair&lt;/tt&gt; is unable to fully repair the filesystem. 
Now making a separate copy of &lt;tt&gt;/home&lt;/tt&gt; before continuing, just to be on the safe side.&lt;/li&gt;&lt;li&gt;[1936] Old power supply has been refitted and Kalimdor has been re-installed in the rack. Now rebooting to single-user mode to perform the planned filesystem checks.&lt;/li&gt;&lt;li&gt;[1908] Aha: it turns out this particular sub-variant of PSU doesn't include a particular -5V line necessary for correct operation. (We've got an ATX12V PSU, and the new one we have is an ATX12V v2.2 PSU. Frustratingly, they're not backwards compatible.) Now going through the delicate process of removing the new PSU and threading the old one back in.&lt;/li&gt;&lt;li&gt;[1851] The reinstalled machine is failing to power up with the new PSU: though it's able to drive its networking status lights, none of the fans are running and it fails to respond to the power switch. Working to identify the fault now, though if we can't fix this very quickly we'll have to fall back to our older (working) PSU.&lt;/li&gt;&lt;li&gt;[1827] New power supply installed, machine re-assembled. Getting a power cable to the optical drive was indeed very fiddly, but achieved now. Unfortunately, the new PSU doesn't have a separate IEC break-out socket for mounting on the rear of the case, and there's nowhere to physically attach the new PSU inside the rack itself. About to reinstall in the rack now.&lt;/li&gt;&lt;li&gt;[1757] Swapping out the PSU. Cable-running and re-mounting on the inside of the case is a little fiddly, so will take a few more minutes.&lt;/li&gt;&lt;li&gt;[1705] Obtained access to TFM-8 server room containing Kalimdor.  
Proceeding to execute a clean shutdown.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-7705190636132666561?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/7705190636132666561'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/7705190636132666561'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2007/07/kalimdor-maintainance-log-july-22nd.html' title='Kalimdor maintainance log - July 22nd 2007'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://tastycake.net/logo.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-7934364102285528663.post-1959655436706531099</id><published>2007-07-19T18:36:00.000+01:00</published><updated>2007-07-19T18:40:59.732+01:00</updated><title type='text'>[AT RISK] Kalimdor downtime THIS SUNDAY from 1700hrs</title><content type='html'>Hello all,&lt;br /&gt;&lt;br /&gt;Kalimdor.tastycake.net will be taken out of service on&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt; Sunday 22nd July (THIS SUNDAY) from 1700hrs&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;... in order to carry out preventative maintenance. Whilst these works are in progress, NONE of the tastycake.net services will be available.  
I anticipate that the works will only take approximately 30 minutes to complete.&lt;br /&gt;&lt;br /&gt;Status updates will be posted to our new offsite status page, http://tastycake-status.blogspot.com.&lt;br /&gt;&lt;br /&gt;The works to be carried out are as follows:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Replacement of Kalimdor's internal power-supply unit (PSU) with a new, more-efficient equivalent that conforms to the 80PLUS specification.&lt;/li&gt;&lt;li&gt;A filesystem check on /home in order to clear a stuck inode.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;I apologise for the short notice; if these planned works present a problem, or if you have any other comments, concerns or queries, please contact us, the administrators, via all-heroes([a])tastycake.net.&lt;br /&gt;&lt;br /&gt;Cheers,&lt;br /&gt;David&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7934364102285528663-1959655436706531099?l=tastycake-status.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/1959655436706531099'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7934364102285528663/posts/default/1959655436706531099'/><link rel='alternate' type='text/html' href='http://tastycake-status.blogspot.com/2007/07/at-risk-kalimdor-downtime-this-sunday.html' title='[AT RISK] Kalimdor downtime THIS SUNDAY from 1700hrs'/><author><name>Tastycake Administrators</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='27' src='http://tastycake.net/logo.jpg'/></author></entry></feed>
