The disks in a RAID 1 array may fail at any time.
Situation:
One morning, a message like the following may show up (this is the notification email sent by mdadm's monitoring mode):
A Fail event had been detected on md device /dev/md0. It could be related to component device /dev/sda1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[2](F)
204736 blocks super 1.0 [2/1] [_U]
md2 : active raid1 sda3[2](F) sdb3[1]
483855168 blocks super 1.1 [2/1] [_U]
bitmap: 3/4 pages [12KB], 65536KB chunk
md1 : active raid1 sda2[2](F) sdb2[1]
4192192 blocks super 1.1 [2/1] [_U]
unused devices: <none>
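Besides /proc/mdstat, mdadm itself can give a more detailed view of a degraded array (state, active and failed devices). A minimal check, using md0 as an example:
# mdadm --detail /dev/md0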
It shows that sda has an issue (which may not be an actual hardware failure). Before deciding to replace the hard drive, you can try repairing the RAID 1 array: first remove the failed/missing device with mdadm, then re-add it to start the rebuild/sync.
# mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
# mdadm --manage /dev/md2 --remove /dev/sda3
mdadm: hot removed /dev/sda3 from /dev/md2
# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2 from /dev/md1
# mdadm --manage /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1
# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2
# mdadm --manage /dev/md2 --add /dev/sda3
mdadm: re-added /dev/sda3
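Note: mdadm will only remove a component that is already marked as failed (F) or missing. If a partition you want to pull out is still listed as active, you can mark it as failed first; a sketch, assuming the same device names as above:
# mdadm --manage /dev/md0 --fail /dev/sda1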
Check the status:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[2] sdb1[1]
204736 blocks super 1.0 [2/2] [UU]
md2 : active raid1 sda3[2] sdb3[1]
483855168 blocks super 1.1 [2/1] [_U]
[==========>..........] recovery = 51.6% (249879168/483855168) finish=272.1min speed=14328K/sec
bitmap: 3/4 pages [12KB], 65536KB chunk
md1 : active raid1 sda2[2] sdb2[1]
4192192 blocks super 1.1 [2/2] [UU]
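The md2 rebuild will take a while (about 272 minutes estimated above). To keep an eye on it, or to raise the kernel's resync speed limits if the disks can handle more, you can use the md tunables under /proc/sys/dev/raid; a sketch (the 50000 KB/s value is only an example):
# watch -n 10 cat /proc/mdstat
# cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
# echo 50000 > /proc/sys/dev/raid/speed_limit_min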
Replace the failed drive with a new hard drive
If you are going to replace the hard drive, copy the partition table structure from the existing (healthy) drive to the new drive:
# sfdisk -d /dev/sdb | sfdisk /dev/sda
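Before re-adding the partitions, it is worth double-checking that the new drive's layout now matches the surviving drive (note that on older systems sfdisk only handles MBR/DOS partition tables, not GPT):
# sfdisk -l /dev/sda
# sfdisk -l /dev/sdb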
Then use the commands mentioned before to add the partitions back into the RAID arrays:
# mdadm --manage /dev/md0 --add /dev/sda1
# mdadm --manage /dev/md1 --add /dev/sda2
# mdadm --manage /dev/md2 --add /dev/sda3
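You can verify that each new partition was accepted into its array by inspecting its RAID superblock; a sketch, using the first partition as an example:
# mdadm --examine /dev/sda1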
Install GRUB on the new hard drive's MBR:
We need to install GRUB on the MBR of the newly installed hard drive, so that if the other drive fails, the new drive will be able to boot the OS.
Enter the GRUB command line:
# grub
Locate grub setup files:
grub> find /grub/stage1
On a RAID 1 with two drives present, you should expect to get:
(hd0,0)
(hd1,0)
Install grub on the MBR:
grub> device (hd0) /dev/sda
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
We mapped the new drive /dev/sda to device (hd0) because installing GRUB this way puts a bootable MBR on the new drive, so that when the other drive is missing, the new drive will still boot.
This ensures that if one drive in the RAID array fails or has already failed, you can still boot the operating system from the other drive.
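The steps above are for GRUB legacy. If your distribution uses GRUB 2 instead, the equivalent is a single command; a sketch, assuming /dev/sda is the new drive:
# grub-install /dev/sda
(On some distributions the binary is named grub2-install.)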
How can I check whether GRUB is installed in the MBR of /dev/sda and /dev/sdb?
You can issue these commands:
# dd if=/dev/sda bs=512 count=1 | xxd | grep -i grub
# dd if=/dev/sdb bs=512 count=1 | xxd | grep -i grub
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00103986 s, 492 kB/s
0000180: 4752 5542 2000 4765 6f6d 0048 6172 6420 GRUB .Geom.Hard
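If xxd is not available, piping the boot sector through strings works just as well; an alternative sketch:
# dd if=/dev/sda bs=512 count=1 2>/dev/null | strings | grep -i grub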
To check the failed hard drive's info:
# smartctl -a /dev/sda
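If the SMART attributes look suspicious, you can also run a short self-test and read the result afterwards; a sketch, assuming smartmontools is installed:
# smartctl -t short /dev/sda
# smartctl -l selftest /dev/sda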
or
# hdparm -I /dev/sda
Check the disk temperature:
# hddtemp /dev/sda
/dev/sda: Maxtor 6H500F0: 49°C
194 Temperature_Celsius 0x0032 035 253 000 Old_age Always - 49

# hddtemp /dev/sdb
/dev/sdb: Maxtor 6H500F0: 49°C
194 Temperature_Celsius 0x0032 035 253 000 Old_age Always - 49
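If hddtemp is not installed, the temperature can usually be read from the SMART attributes instead; a sketch using smartctl:
# smartctl -A /dev/sda | grep -i temperature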
To replace the failed hard drive and rebuild the RAID 1, you can also check these links:
http://www.kernelhardware.org/replacing-failed-raid-drive/
http://wiki.contribs.org/Raid#Resynchronising_a_Failed_RAID
http://serverfault.com/questions/481774/degradedarray-event-on-dev-md1
https://www.centos.org/forums/viewtopic.php?t=24641
http://www.ducea.com/2009/03/08/mdadm-cheat-sheet/
http://serverfault.com/questions/97565/raid1-how-do-i-fail-a-drive-thats-marked-as-removed
http://techblog.tgharold.com/2009/01/removing-failed-non-existent-drive-from.shtml
https://bbs.archlinux.org/viewtopic.php?id=106919
They were really helpful when I repaired my software RAID 1!