Restoring a Broken Linux RAID Array

About 18 months ago I set up a Linux media server for my home. It was made from an old Dell desktop that my neighbor was (literally) discarding, and a pair of new, identical Seagate hard disks. Since I was going to be spending a lot of time copying my CDs to this server, I configured a RAID-1 array that mirrored the hard disks; that way, there would always be a current backup. The OS was Ubuntu Linux 6.06 Server, and it used software RAID.

One of the drives started failing last week, so although I was happy to have a handy backup, I was a bit daunted by the prospect of restoring a broken RAID array. You see, there are plenty of tutorials on how to set up software RAID, but not that many resources on what to do after a drive breaks. It actually turned out to be really easy, thanks in large part to guidance from my friends Yossie and Eric, so I decided to document the process here.

Determining if Your RAID Array is Broken

This part is really easy. Log in as root, and run:

cat /proc/mdstat

You should see something like this if there is a problem:

Personalities : [raid1]
md1 : active raid1 hda2[0]
1927680 blocks [2/1] [U_]

md0 : active raid1 hda1[0]
310640768 blocks [2/1] [U_]

unused devices: <none>

Note the underscore in the [U_]. A healthy RAID array will not have the underscore; instead, it will show [UU]. Also note that hda is active (it’s listed in the md0/md1 lines), but hdb is conspicuously absent. So hdb is the drive in trouble.
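
If you want more detail than /proc/mdstat gives, or you’d rather be notified than have to remember to check, mdadm itself can do both. A hedged example, assuming your arrays are /dev/md0 and /dev/md1 and that they’re listed in mdadm’s config file (/etc/mdadm/mdadm.conf on Ubuntu):

mdadm --detail /dev/md0
mdadm --monitor --scan --mail=root --daemonise

The first command prints the state of each member device (on a degraded array one slot shows up as removed or faulty); the second runs mdadm as a monitoring daemon that emails root whenever an array degrades.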

Finding the Broken Drive

If you know which physical drive hdb is, you could just remove it, but it’s best to run a test to be sure it actually has errors. For this, I found a very versatile free tool called Ultimate Boot CD. You download the ISO, burn it to a CD, and boot your PC from it. It’s packed with diagnostic tools, such as hard disk testers, memory testers, etc. Run the appropriate one for your brand of hard disk, on each disk in the array. The Seagate tester I used lists the hard disks’ serial numbers, so there’s no confusion once you open the case to remove the bad drive.
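
If you’d rather test from within Linux instead of booting a CD, the smartmontools package gives you a quick sanity check. A sketch, assuming the suspect drive is /dev/hdb and smartmontools is installed (apt-get install smartmontools on Ubuntu):

smartctl -H /dev/hdb
smartctl -a /dev/hdb

The -H flag prints the drive’s overall SMART health verdict, and -a dumps the full attribute table and error log, along with the serial number you’ll want when you open the case.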

Ensure You Can Boot From the Good Drive

Depending on which drive is bad, and how you originally configured them, you may not be able to boot properly from the good drive. It’s best to test that you can. Remove the bad drive, and try to boot. You may need to modify jumper settings and/or move the good drive to the master plug on the IDE cable. If you can boot, great. If not, the Grub loader may be missing from the good drive. To reinstall it, put the drives back where they were so that you can boot, log in as root, and run:

grub-install /dev/hda

(Assuming you need to install it on hda).

Retest that you can now boot off the remaining drive with the bad one removed.
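
Before you pull the bad drive for good, you may also want to tell mdadm that its partitions are gone, so the arrays aren’t left referencing a vanished device. A hedged example, assuming hdb is the failing drive and its partitions belong to md0 and md1:

mdadm /dev/md0 -f /dev/hdb1 -r /dev/hdb1
mdadm /dev/md1 -f /dev/hdb2 -r /dev/hdb2

The -f option marks the partition as faulty and -r removes it from the array; if the kernel has already kicked the device out, these commands may simply report that there’s nothing to do.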

Replace the Broken Drive

Obtain a matching drive, and install it in the case.

I only discovered (happily) after this happened that Seagate drives come with a 5-year warranty. If a drive goes bad, you can check whether it’s covered using Seagate’s Warranty Checker. You don’t even need your original receipt to process the return, just the serial and model numbers. The whole process couldn’t be easier. I opted for their $20 premium service, where they immediately send you a refurbished drive that matches yours, via 2-day air shipping. The replacement drive comes with a box and a prepaid label for returning the broken one to Seagate. The return shipping alone would have cost me $10, so that service seems like a bargain.

Rebuild the RAID Array

This is the bit I was daunted by, but it really turned out to be fairly simple. After the new drive is installed, boot up and log in as root.

Verify which drive is now part of the array. Run:

cat /proc/mdstat

You should see something like:

md1 : active raid1 hda2[1]
1927680 blocks [2/1] [_U]

md0 : active raid1 hda1[1]
310640768 blocks [2/1] [_U]

In this case, hda is the working drive.

Look at the current partition table of the working drive (hda in my case). Run:

fdisk /dev/hda
p (for print)
q (to exit)

Output should be something like:

Disk /dev/hda: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1       38673   310640841   fd  Linux raid autodetect
/dev/hda2           38674       38913     1927800   fd  Linux raid autodetect
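
Incidentally, fdisk -l prints the same table without entering the interactive prompt, which is handy for a quick look:

fdisk -l /dev/hda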

Now configure the partitions of the new drive (hdb) to match the working one (hda). Run:

fdisk /dev/hdb

  1. Enter n (for new partition).
  2. Enter p (for primary partition).
  3. Make it the first partition (i.e. 1).
  4. Start at cylinder 1 (the default).
  5. Use the End value of partition 1 on hda as the last cylinder (i.e. 38673).
  6. Change the new partition’s type to match the Id value of your first
    partition (i.e. “fd”, Linux raid autodetect). This is important so that the kernel recognizes the partition as part of the RAID array.
    Enter t (for “change a partition’s system id”).
    Enter fd.
  7. That configures the first partition, hdb1. Repeat steps 1-6 for the remaining partitions on hda. In my case, there was only one more. Remember to copy the Start and End values from hda’s partition table.
  8. When you’re finished, enter w (for “write table to disk and exit”).
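
If you’d rather not retype the partition boundaries by hand, sfdisk can clone the partition table in one step. Something like the following should work, assuming hda is the healthy drive and hdb is the blank replacement (be careful with the direction, or you’ll overwrite the good table):

sfdisk -d /dev/hda | sfdisk /dev/hdb

The -d flag dumps hda’s partition table in a format sfdisk can replay onto hdb; verify the result with fdisk -l /dev/hdb before adding the partitions to the array.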

Now you’re ready to hot-add the new drive to your RAID array. You’ll need to run mdadm for each partition you need to restore (two in my case). Use the output of /proc/mdstat as a guide, e.g.:

md1 : active raid1 hda2[1]
1927680 blocks [2/1] [_U]

md0 : active raid1 hda1[1]
310640768 blocks [2/1] [_U]

Here md1 currently maps to hda2 only. We need to add hdb2 to md1. md0 currently maps to hda1 only. We need to add hdb1 to md0. So in this case, you would run the following commands:

mdadm /dev/md0 -a /dev/hdb1
mdadm /dev/md1 -a /dev/hdb2

That’s it! The RAID array should be doing its magic.

Verify That It’s Updating

Examine the output from /proc/mdstat now, and you should see something like:

Personalities : [raid1]
md1 : active raid1 hdb2[2] hda2[1]
1927680 blocks [2/1] [_U]
resync=DELAYED

md0 : active raid1 hdb1[2] hda1[1]
310640768 blocks [2/1] [_U]
[>....................]  recovery =  1.0% (3196864/310640768) finish=158.8min speed=32257K/sec

unused devices: <none>

Since it will take several hours, you can log out of the shell and let the two disks synchronize.
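
If you’d rather keep an eye on the rebuild, watch will re-run the status check every few seconds:

watch -n 5 cat /proc/mdstat

Press Ctrl-C to stop watching; the rebuild carries on in the background either way.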

Install the Grub Loader on Your New Drive

Since you just added hdb, you should install the Grub loader on it, in case hda ever fails or is removed. Run:

grub-install /dev/hdb

… and you should be good to go.
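
If grub-install complains (it occasionally gets confused about the BIOS drive mapping), the interactive grub shell is a fallback. A hedged sketch, assuming the BIOS sees hdb as the second drive (hd1) and that its first partition holds /boot:

grub
device (hd1) /dev/hdb
root (hd1,0)
setup (hd1)
quit

The device line forces the mapping, root points Grub at the partition holding its stage files, and setup writes the boot loader into hdb’s master boot record.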

5 thoughts on “Restoring a Broken Linux RAID Array”

  1. Nice article. I am considering making a raid 1 homeserver with the cheap Intel D945GCLF2 board.
    How did you notice that one of the drives broke down? I mean, the system won’t warn you about that, right? Unless you run cat /proc/mdstat?

  2. Hi Jack,

    The symptoms were that the machine would freeze, and would have to be rebooted. I could log in to it (it was headless), use it for a few minutes, then it would just hang. The terminal would become unresponsive.

    Then I began searching through the logs and found something that pointed me to RAID.

    And yeah, then I ran cat /proc/mdstat and saw output like this:

    Personalities : [raid1]
    md1 : active raid1 hda2[0]
    1927680 blocks [2/1] [U_]

    md0 : active raid1 hda1[0]
    310640768 blocks [2/1] [U_]

    unused devices: <none>

    … which tells you that hda is the only one currently in use.

    -Antun

  3. Thanks for the clear instructions. I just noticed by accident that one of the drives in my RAID 1 array had died.

    I’m interested that you said your machine was freezing etc. Shouldn’t everything just carry on as normal when one drive in a RAID 1 array breaks? I didn’t notice anything wrong with my server and I don’t even know how long that drive had been offline. Out of interest, do you have an explanation as to why your machine became unstable?

  4. Hi Alex,

    I don’t know why my machine was freezing. I’m kind of glad it was, because otherwise I would never have checked. Maybe it can become unstable depending on the type of failure that the drive experiences? Also, it could be the particular RAID software.

    -Antun
