Recently a friend of mine had a problem with his server (which runs Ubuntu): a failed hard disk in his RAID array had just been replaced (“one of the disks on the server has just undergone a disk change”, as he put it in French over the telephone) and he wanted me to help him out.

In a Linux software RAID array, when a failed hard disk is replaced, or when you need to remove a failed hard disk and add a new one to the array without losing data (in our case, we assume the failed disk has already been removed and replaced with a new one), the array is left with one functioning disk that still carries the software RAID and one empty disk with no partition table.

In such a case, you need to copy the intact partition table from the functioning disk to the new empty disk and then rebuild the software RAID array with the help of the “mdadm” command.

In my friend’s case, /dev/sdb had been replaced and /dev/sda was functioning well.

Warning: Make sure you are working on the correct drive/partition at every step and don’t mix them up, or you will end up wrecking both drives and suffering serious data loss.

Warning: Playing with “mdadm” is risky if you don’t know its workings well, and depending on parameters specific to your system it may not behave for you exactly as shown here. So be extra careful.

Step 1:

Log in to your server via SSH. When my friend asked for help, I was on Mac OS X. Not wanting to restart my machine and boot into Windows 8 (where I would also have had to install “PuTTY”) or Fedora 21, I simply used the Mac OS X Terminal and the “ssh” command.

$ ssh root@yourserver.provider.domain

or with IP address

$ ssh root@xxx.xxx.xxx.xxx

Or, launch “Terminal” → from the “Shell” menu select “New Remote Connection…” or press “Shift+Command+K” → a new “New Remote Connection” window opens. In that window, under “Service” select “Secure Shell (ssh)” → under “Server” click the “+” button and add either your server name or IP address → in the “User” text box type a user name or “root” → “SSH (Automatic)” is selected by default once you have chosen the service → the drop-down box below automatically shows the command that will be issued → click “Connect” → a new shell window opens and asks for the password. Provide the password, and once you are in the server’s shell, proceed from there.

In the above method, the host is recorded in the “known_hosts” file inside a newly created “.ssh” folder in your user home directory, so in future you can pick your server directly from the “New Remote Connection” window. To view the file, you can issue a command like “vi .ssh/known_hosts” from your home directory.

Step 2:

First things first! Back up the disks. Use your usual backup method, Google how to do it, or ask your friends. Since you are maintaining a server, I assume you already back up regularly.

Step 3:

If you don’t have “mdadm”, install it:

$ yum install mdadm (or)

$ apt-get install mdadm (or)

$ aptitude install mdadm (or)

$ rpm -Uvh --nodeps ftp://ftp.ovh.net/made-in-ovh/sources/mdadm-2.5.5p1-1.i386.rpm
(find the right and latest “mdadm” through Google search)

Or use whatever command you are comfortable with for installing packages on your Linux distro.

Step 4:

Now, check the status of the RAID arrays using the command “cat /proc/mdstat”. This prints the status of the kernel’s multi-disk (md) devices.

Below is an example of two functioning healthy RAID arrays.

mdstat Output 01

In the above example, it shows two healthy RAID 1 arrays. Each array has a status of [2/2] and [UU], which means that both of the 2 devices in the array are functional and both are up.
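For reference, in case the screenshot is hard to read, a healthy pair of RAID 1 arrays produces output roughly like the following (the array names, device names and block counts here are made up for illustration):

Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
      20478912 blocks [2/2] [UU]

md2 : active raid1 sdb2[1] sda2[0]
      1953378368 blocks [2/2] [UU]

unused devices: <none>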

When a disk is failing or has been replaced, there are several possibilities.

Possibility 1:

Below is an example where the failed drive (/dev/sdb) has been replaced with a new blank drive, so the status reads [2/1] and [U_]: of the 2 devices in the array, only 1 is functional and only 1 is up. The /dev/sda drive is still functioning, whereas /dev/sdb has been replaced and needs to be added back into the array and rebuilt. The output marks the /dev/sdb members as failed with the trailing (F) behind sdb3[2] and sdb1[2].

mdstat Output 02
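Again, in case the screenshot is hard to read, a degraded array with members flagged as failed looks roughly like this (names and sizes are illustrative); note the trailing (F) and the [2/1] [U_] status:

Personalities : [raid1]
md1 : active raid1 sdb1[2](F) sda1[0]
      20478912 blocks [2/1] [U_]

md3 : active raid1 sdb3[2](F) sda3[0]
      1953378368 blocks [2/1] [U_]

unused devices: <none>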

Possibility 2:

Below is an example where, instead of one of the devices in each array being marked as failed (as in the example above), one of the devices in each array is not listed at all. In that case, the next step can be skipped.

mdstat Output 03

Possibility 3:

In my friend’s case I initially had a different output, like the one below. Notice that in md1 it is not sdb1, sdb2 or sdb3 but plain sdb, and that both sdb[1] and sda1[0] are up [UU], with 2 out of 2 devices functioning [2/2].

mdstat Output 04

A few hours later it changed to the output shown above under “Possibility 2”, so I proceeded as for “Possibility 2”.

Note: “[faulty]” in the “Personalities” line is not an indication of a problem with the array; it is the diagnostic “faulty” personality. The “Personalities” line tells you which RAID levels the kernel currently supports. The last two examples show all possible “Personalities”.

Step 5:

In case of “Possibility 1” and “Possibility 3”, you need to remove the failed devices from both arrays.

$ mdadm --manage /dev/md1 --remove /dev/sdb1
$ mdadm --manage /dev/md3 --remove /dev/sdb3

or

$ mdadm --manage /dev/md1 --remove /dev/sdb

mdadm Output 04

In case of “Possibility 2”, these devices are not listed in either md1 or md2, so there is nothing to remove and you can skip this step altogether.
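Note that “mdadm” will only remove a member it already considers failed. If the remove command complains that the device is busy or that the hot remove failed, mark the member as failed first and then remove it, for example (the partition and array names here follow the examples above):

$ mdadm --manage /dev/md1 --fail /dev/sdb1
$ mdadm --manage /dev/md1 --remove /dev/sdb1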

Step 6:

Now issue the “fdisk -l” command to list the partitions of both disks.

Important:

If “fdisk -l” returns the below message:

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.

Your disk’s partition table is not MBR; it is GPT (GUID Partition Table). In that case, use the “parted” command instead of “fdisk”.

If “fdisk -l” reports “Disk /dev/sdb doesn’t contain a valid partition table” for the failed disk /dev/sdb, as in the example below, that is fine and you can skip this step.

fdisk Output 01

If, instead, “fdisk -l” lists partitions for the failed disk /dev/sdb, as in the example below, you need to open it with “fdisk /dev/sdb” and delete those partitions (Google it and consult a good manual on how to do this, for both “fdisk” and “parted”).

fdisk Output 02
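For reference, deleting the partitions interactively with “fdisk” roughly goes like this (a sketch only; triple-check the disk name before writing anything):

$ fdisk /dev/sdb
Command (m for help): d   (delete a partition; fdisk asks for the partition number if there are several)
Command (m for help): d   (repeat until all partitions are gone)
Command (m for help): w   (write the changes and quit)

With “parted”, the rough equivalent is “parted /dev/sdb rm 1”, and so on for each partition number.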

Once that is done, reboot the server so that the partition deletions take effect and the partition tables are re-read. Issue either “reboot” or “shutdown -r now”.

Step 7:

Now replicate the partitioning by copying the partition table of the healthy disk (/dev/sda) to the empty disk (/dev/sdb). Be extra careful to supply the right disk names, otherwise you will wipe out the data on the functioning healthy drive.

For MBR disks, issue:

$ sfdisk -d /dev/sda | sfdisk /dev/sdb

Two different examples of the above command’s output:

Example 1:

sfdisk Output 01

Example 2:

sfdisk Output 02

Sometimes you may encounter an error like:

sfdisk: ERROR: sector 0 does not have an msdos signature

In that case, issue the command with the “--force” option:

$ sfdisk -d /dev/sda | sfdisk --force /dev/sdb

For GPT disks, issue:

$ sgdisk -R=/dev/sdb /dev/sda
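Because “sgdisk -R” makes an exact copy of the GPT, both disks end up with identical disk and partition GUIDs. It is generally advisable to give the new disk fresh random GUIDs afterwards, which sgdisk can do with the “-G” option:

$ sgdisk -G /dev/sdb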

Now the partition structure of the replaced (empty) disk matches the partition structure of the functioning healthy disk that contains the data.

Step 8:

Now we are going to use the “mdadm” command. Issue:

$ mdadm --help (and/or)
$ mdadm --misc --help

to get help and familiarize yourself with the “mdadm” command.

To get detailed information on the status of the RAID arrays, issue:

$ mdadm --misc --detail /dev/md1
$ mdadm --misc --detail /dev/md2

The output may look like the example below:

mdadm Output 01

Notice that “State :” can read something like “active, degraded” or “clean, degraded” for an array with a failed disk, and “active” or “clean” for a healthy array. Also take note of the last three lines.
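If the screenshot is unreadable, those last lines of a degraded array’s detail output typically look something like this (the numbering is illustrative):

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed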

Step 9:

To find out which partition should be added to which array, issue “cat /etc/mdadm.conf” or “cat /etc/mdadm/mdadm.conf”.

mdadm.conf Output 01
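If that screenshot is not legible, a helpful mdadm.conf usually contains ARRAY lines along these lines (the UUIDs and device lists here are invented for illustration):

ARRAY /dev/md1 level=raid1 num-devices=2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx devices=/dev/sda1,/dev/sdb1
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx devices=/dev/sda2,/dev/sdb2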

Sometimes you may not get the needed information, as in the example below. In that case, don’t worry; you can easily work out the right disks and the right arrays from the output of the commands we have executed so far.

mdadm.conf Output 02

Step 10:

Since /dev/sdb has been replaced, you need to add the /dev/sdb partitions back to the correct arrays. The output from the last step indicates that /dev/sdb1 should be added to the /dev/md1 array. Issue the command:

$ mdadm /dev/md1 --manage --add /dev/sdb1

Now check the RAID array status by issuing the “cat /proc/mdstat” command again. Since the correct partition has been added back to the correct array, the data should start copying over to the new drive, and recovery and thus rebuilding of /dev/sdb1 will take place.

mdadm Output 02
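While the rebuild is running, “cat /proc/mdstat” shows a progress bar for the recovering array, roughly like this (the figures are illustrative):

md1 : active raid1 sdb1[2] sda1[0]
      20478912 blocks [2/1] [U_]
      [==>..................]  recovery = 12.6% (2583040/20478912) finish=8.7min speed=34288K/sec

To follow the progress without retyping the command, you can use “watch -n 5 cat /proc/mdstat”.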

Once the rebuilding process is done, the output of “mdadm --misc --detail /dev/md1” should look like the example below.

mdadm Output 03

Do the same for /dev/sdb2 by issuing:

$ mdadm /dev/md2 --manage --add /dev/sdb2

Step 11:

Now that both partitions /dev/sdb1 and /dev/sdb2 have been recovered, added to the correct arrays and the arrays rebuilt, we need to enable the swap partition. To verify the swap partitions, issue the command:

$ cat /proc/swaps

To enable the swap partition for /dev/sdb, issue the commands:

$ mkswap /dev/sdb3
$ swapon -p 1 /dev/sdb3 (or)
$ swapon -a
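Keep in mind that “mkswap” writes a brand-new UUID to the partition, so if “/etc/fstab” references the swap space by UUID rather than by device name, that entry has to be updated. You can read the new UUID and then confirm that swap is active on both disks with:

$ blkid /dev/sdb3
$ cat /proc/swaps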

Step 12:

Issue a final “fdisk -l” to verify the state of all partitions on both disks.

fdisk Output 03

Note: It’s normal for “fdisk” to report that /dev/md1 and /dev/md2 don’t contain valid partition tables. That is because “fdisk” was written to handle only single disks and their partitions; md devices are multiple-disk (MD) devices and are managed with the “mdadm” command.

Thanks for the read and please leave comments 🙂