Going Analog and Finding Digital

or what happens when a server blows a drive. I start using the turntable again, since the mp3s are on a crashed drive.

Last week was a bad week for hardware in my life. This webserver randomly turned off one night (as it does every so often). My 160gig media hard drive lost its superblock and a whole lot more, and a server at work one hour away lost a processor fan. But this isn’t a story about fans, this is a story about hard drives. Hard drives with lots of data that are too big to back up to anything but other hard drives.

In the last couple of weeks, I’ve updated to the new Debian stable and replaced a very loud power supply in this machine. I thought that it was preventative maintenance. But then, last week when I was out of town, it locked up, which I later determined was due to IDE errors. Of course, it has to happen when I’m out of town. On reboot, I got errors like:

Jun 30 09:01:50 cabbage kernel: hdg: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jun 30 09:01:50 cabbage kernel: hdg: dma_intr: error=0x40 { UncorrectableError }, LBAsect=191, high=0, low=191, sector=128
Jun 30 09:01:50 cabbage kernel: end_request: I/O error, dev 22:01 (hdg), sector 128
Jun 30 09:01:50 cabbage kernel: sh-2006: reiserfs read_super_block: bread failed (dev 22:01, block 64, size 1024)
Jun 30 09:01:50 cabbage kernel: sh-2021: reiserfs_read_super: can not find reiserfs on ide3(34,1)

Nice. This drive has 300 CDs of mp3s backed up on the original audio disks, 50 or 60 gigs of baby movies that aren’t backed up (the raw footage; I’ve got DVDs and working copies), 30 gigs of pictures that are all backed up, and random other stuff that isn’t critical, but nice to have. Not something that I really want to restore from the original sources. Especially the mp3s. So, off to Fry’s for a new drive, where they have a 200gig Seagate for $50. Plug it in, partition for one big partition, and off to the restoration races.

First attempt is reiserfsck on the original bad drive, but of course, it says that it’s a hardware problem. So, time to copy what I can from the old drive to the new one. dd fails instantly, since the first couple of blocks of the drive are bad.

dd_rescue to the rescue. (umm, sorry). Where dd exits on error, dd_rescue just goes really slowly, and can go either forwards or backwards. I started from the end of the drive, and got about half of the data before the machine hung after resetting the IDE bus.

dd_rescue -r /dev/hdg1 /dev/hde1

Update: GNU ddrescue looks to be a better option than dd_rescue + dd_rhelp.

At this point, I should have given up on recovering the drive on that machine and moved the drives to another machine that didn’t need the Promise IDE controller. I also should have just used dd_rhelp to automate what I wound up doing manually for several hours. It took 10 or 15 more reboots after IDE errors for me to decide to move to the other machine and run off of a better controller.

After about a day of the drives churning, and several thousand unreadable blocks later, I had a copy of the bad partition on a fresh new drive. Next task, reconstruct what I could of the file system. It turns out that the superblock and the volume bitmap were pretty well hosed, so the first task was to recreate the superblock. I then tried the not too invasive --check option, but had many uncorrectable errors. So, time to bring out the big gun of --rebuild-tree, which should reconstruct as much of the filesystem as possible by scanning the whole disk. 18 hours later (or so) I had an error-free file system, with a bunch of missing files.

reiserfsck --create-superblock /dev/hde1
reiserfsck --check /dev/hde1
reiserfsck --rebuild-tree /dev/hde1
reiserfsck --check /dev/hde1

But what files are missing and how many files are silently corrupted? And do I trust the new filesystem? It would have been good to get a list of the block numbers that were bad, but I did that recovery over 2 machines, and one of them started off of a live CD, so I don’t really have that info. I can’t find anything that tells me if the rebuild-tree option will just trash files with bad leaf nodes, or if they are not detected at all. And I don’t trust the new filesystem, so I decided to copy all of the files to other drives, reformat, and copy back.
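For what it's worth, the kernel log does record the failing sectors in its end_request lines, like the ones quoted earlier, so a rough list can be pulled out after the fact. Here's a minimal sketch, assuming the log format matches those lines exactly. bad_sectors is a made-up helper name, and the log path in the usage comment is a guess for a syslog-style setup. It only sees errors that actually made it into a log file, so anything that happened while running off the live CD would still be missing.

```shell
#!/bin/sh
# bad_sectors LOGFILE [DEVICE]
# Hypothetical helper: prints the unique sector numbers found in
# "end_request: I/O error" kernel log lines, sorted numerically.
# If DEVICE (e.g. hdg) is given, only lines for that device are used.
bad_sectors() {
    logfile=$1
    dev=${2:-}
    # Reduce each matching line to "<device> <sector>".
    sed -n 's/.*end_request: I\/O error, dev [^ ]* (\([a-z0-9]*\)), sector \([0-9][0-9]*\).*/\1 \2/p' "$logfile" |
        # Keep every line when no device was requested, else only that device.
        awk -v dev="$dev" 'dev == "" || $1 == dev { print $2 }' |
        sort -n -u
}

# Usage (the log path is an assumption, adjust for your system):
#   bad_sectors /var/log/kern.log hdg > bad-sectors.txt
```

A list like this doesn't say which files sat on those sectors, but it would at least give a count and a range to compare against what reiserfsck reports.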

At this point, it would have been really nice if everything just worked. But, of course, it didn’t or I wouldn’t be writing this paragraph. I moved the drive with the recovered data and added another large drive to the Promise controller. While copying,

cp -a

failed halfway through when the machine hung. Rsync gave a list of missing files, which I found useful, but it too hung the machine a few times, and I got

kernel bug in page_alloc.c

errors that from a quick googling tend to indicate hardware trouble. Memtest86 seems to indicate that this isn’t a memory or memory controller issue, so I’m guessing that the Promise card is bad.
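Rsync's list of missing files was the useful part, and something similar can be had without starting a copy at all, by comparing file listings of the two trees. A rough sketch, assuming plain find, sort, and comm are available. missing_files and the mount points in the usage comment are hypothetical names. It compares paths only, so it says nothing about files that exist on both sides but are corrupted.

```shell
#!/bin/sh
# missing_files SRC DST
# Hypothetical helper: prints the relative paths of regular files that
# exist under SRC but not under DST. File names containing newlines
# are not handled.
missing_files() {
    src=$1
    dst=$2
    tmp_src=$(mktemp) || return 1
    tmp_dst=$(mktemp) || { rm -f "$tmp_src"; return 1; }
    # comm needs both lists sorted the same way, so pin the locale.
    (cd "$src" && find . -type f | LC_ALL=C sort) > "$tmp_src"
    (cd "$dst" && find . -type f | LC_ALL=C sort) > "$tmp_dst"
    # -23 drops lines unique to DST and lines common to both lists.
    LC_ALL=C comm -23 "$tmp_src" "$tmp_dst"
    rm -f "$tmp_src" "$tmp_dst"
}

# Usage (mount points are assumptions):
#   missing_files /mnt/recovered /mnt/newcopy > still-to-copy.txt
```

Since this only reads directory listings, it should put a lot less load on a flaky controller than restarting the copy just to see what is left.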

A year ago, I thought that this machine was old enough that I was on the bubble about getting an IDE controller or replacing the motherboard (which would have cascaded to the processor, memory, and ethernet cards). Since then, I’ve replaced the power supply, the boot drive, and then there’s this fiasco. And now I’m trying to figure out if I should buy a new machine and swap in the hard drives, or just get a Mac mini and get out of the aging x86 hardware repair business.

But the silver linings are that the drive failed 2 weeks before the warranty ran out, and while railing against the crappy sound quality of my analog copies of “The Name of This Band Is Talking Heads”, I found out that it had finally been converted to digital for the first time late last year. And that’s a worthy discovery no matter what.
