Tuesday, June 18, 2013

Crash during md array recovery

I just got a machine crash while my RAID5 array was rebuilding. After the system came back, the array was marked as inactive:
md2 : inactive sdb1[0] sde1[4](S) sdc1[3] sdd1[1]
      7814047744 blocks
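The /proc/mdstat line above is easy to read mechanically too. Here is a minimal parsing sketch for the inactive-array format shown (the regex and dictionary field names are my own, not any mdadm API; active arrays print an extra personality token such as raid5 that this sketch does not handle):

```python
import re

def parse_mdstat_entry(entry):
    """Parse a /proc/mdstat line like
    'md2 : inactive sdb1[0] sde1[4](S) sdc1[3] sdd1[1]'."""
    name, _, rest = entry.partition(" : ")
    state, *members = rest.split()
    devices = []
    for member in members:
        # e.g. 'sde1[4](S)' -> device sde1, slot 4, spare
        match = re.fullmatch(r"(\w+)\[(\d+)\](\(S\))?", member)
        if match:
            devices.append({
                "dev": match.group(1),
                "slot": int(match.group(2)),
                "spare": match.group(3) is not None,
            })
    return {"name": name, "state": state, "devices": devices}

info = parse_mdstat_entry("md2 : inactive sdb1[0] sde1[4](S) sdc1[3] sdd1[1]")
print(info["state"])  # prints "inactive"
```

The (S) suffix marks the spare, and the number in brackets is the slot role, which matters later when figuring out which member is being rebuilt.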
No matter what I tried, the array wouldn't activate. For example, I tried the following command:
mdadm --assemble --force --scan /dev/md2
But there was no message and no change. Rebooting didn't help either. The thing is, I have very important data on that array and I didn't want to experiment with commands and risk data loss. Googling around, I found several possible solutions, for example this, this, or this. Even though those had some similarities, none of them helped me. In the end, I realized that the array was already started (it is shown in /proc/mdstat), and that is why the previous assemble command didn't do anything, and also why I was receiving errors like this one:
# mdadm -A /dev/md/2 /dev/sd{b,c,d,e}1
mdadm: cannot open device /dev/sdb1: Device or resource busy
In the end, what helped was that I stopped the array using:
mdadm -S /dev/md2
and then I started it again. First, I tried:
# mdadm --assemble --scan /dev/md2
mdadm: /dev/md2 assembled from 3 drives and 1 spare - not enough to start the array while not clean - consider --force.
which obviously didn't help. But, as suggested, I tried the --force option:
mdadm --assemble --force --scan /dev/md2
After this, the array was still in an inactive state. Then I stopped it and started it again, and suddenly it was active and in recovery mode. I'm not sure this restart was necessary. The main reason I did it was that I thought the array had been brought up in a recovered state, which I believed would destroy the data on it, so I quickly stopped it. But after I stopped it I realized I was mistaken, so I started it again, and this time it was in recovery mode.

One more thing that bothered me was whether the array was rebuilding properly, i.e. writing reconstructed data onto the failed device and not onto some healthy one. Namely, looking into /proc/mdstat shows the disk status [UU_U], which could mean that the third disk, i.e. sdd1, was being rebuilt, which would be wrong. But then, using the command:
mdadm --misc --detail /dev/md2
I saw that sde1 was being rebuilt, which was what I expected. And while I'm at the output of this command, it is interesting to know that the MD subsystem knows which disks are out of sync using an event counter (the Events field) that is displayed too.
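The event counter can also be read per member with mdadm --examine /dev/sdX1, which prints an Events line. The comparison MD effectively performs can be sketched like this (the snippets below are hypothetical sample outputs, and the plain-integer Events format corresponds to 1.x metadata; only the general "Events : N" shape is taken from mdadm's output):

```python
import re

def events(examine_output):
    """Extract the Events counter from `mdadm --examine` output."""
    match = re.search(r"Events\s*:\s*(\d+)", examine_output)
    return int(match.group(1))

def out_of_sync(counts):
    """A member whose event count lags behind the newest one
    missed some updates and must be rebuilt."""
    newest = max(counts.values())
    return [dev for dev, count in counts.items() if count < newest]

# Hypothetical --examine fragments for an in-sync and a stale member
in_sync = "          Events : 10204\n"
stale   = "          Events : 10197\n"

print(out_of_sync({"sdb1": events(in_sync), "sde1": events(stale)}))
# prints ['sde1']
```

This is also why --force was needed earlier: forcing tells mdadm to accept members whose event counts disagree and start the array anyway.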

Finally, as disks become larger and larger, rebuilding an array takes more and more time. My array of four 2 TB disks takes about 10 hours to rebuild. That is a lot of time. So maybe it's time to switch to btrfs or ZFS, which have integrated RAID functionality and can therefore rebuild an array much faster, since they know which blocks are actually in use. Alternatively, the MD subsystem should be taught to take note of which blocks changed and rebuild only those blocks instead of the whole disk.
