
Tuesday, June 18, 2013

Crash during md array recovery

I just had a machine crash while my RAID5 array was rebuilding. After the system came back, the array was marked as inactive:
md2 : inactive sdb1[0] sde1[4](S) sdc1[3] sdd1[1]
      7814047744 blocks
No matter what I tried, the array wouldn't activate. For example, I tried the following command:
mdadm --assemble --force --scan /dev/md2
But there was no output and no change. I also tried rebooting, but that didn't help either. The thing is, I have very important data on that array, so I didn't want to experiment with commands and risk data loss. Googling around, I found several possible solutions, for example this, this, or this. Even though they had some similarities, none of them helped. In the end, I realized that the array was already started (it is shown in /proc/mdstat), which is why the previous assemble command didn't do anything, and also why I was receiving errors like this one:
# mdadm -A /dev/md/2 /dev/sd{b,c,d,e}1
mdadm: cannot open device /dev/sdb1: Device or resource busy
In the end, what helped was to stop the array using:
mdadm -S /dev/md2
and then I started it again. First, I tried:
# mdadm --assemble --scan /dev/md2
mdadm: /dev/md2 assembled from 3 drives and 1 spare - not enough to start the array while not clean - consider --force.
which obviously didn't help. But, as suggested, I tried with the --force option:
mdadm --assemble --force --scan /dev/md2
After this, the array was again in an inactive state. Then I stopped it and started it again, and suddenly it was active and in recovery mode. I'm not sure if this restart was necessary. The main reason I did it was that I initially thought the array had been brought up in a recovered state, which I feared would destroy the data on it, so I quickly stopped it. After stopping it I realized I was mistaken, so I started it again, and this time it was in recovery mode.
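To recap, assuming the same situation (an array that shows up as inactive in /proc/mdstat after a crash), the sequence that finally worked for me boils down to this:
mdadm --stop /dev/md2                         # stop the inactive array
mdadm --assemble --force --scan /dev/md2      # force-assemble it from the remaining members
cat /proc/mdstat                              # if it is still inactive, repeat the stop/assemble once more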

One more thing that bothered me was whether the array was rebuilding properly, i.e. writing corrected data to the failed device and not to some healthy one. Namely, looking into /proc/mdstat showed the disk status [UU_U], which could mean that the third disk, i.e. sdd1, was being rebuilt, which would be wrong. But then, using the command:
mdadm --misc --detail /dev/md2
I saw that sde1 was being rebuilt, which was what I expected. And while I'm at the output of this command, it is interesting to know that the MD subsystem knows which disks are out of sync using an epoch (event) number that is also displayed there.
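If you want to double-check this yourself, the commands below show both the overall progress and the per-device state and event counts (the device names are the ones from my setup):
watch -n 10 cat /proc/mdstat                  # overall rebuild progress
mdadm --misc --detail /dev/md2                # per-device state, i.e. which member is spare/rebuilding
mdadm --examine /dev/sde1 | grep -i events    # event count stored on an individual member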

Finally, as disks become larger and larger, rebuilding an array takes more and more time. My array of 4 disks, each 2T in size, takes about 10 hours to rebuild. That is a lot of time. So maybe it's time to switch to btrfs or zfs, which have integrated RAID functionality and can therefore rebuild an array much faster. Alternatively, the MD subsystem should be taught to take note of which blocks changed and rebuild only those blocks instead of the whole disk.
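On a related note, the rebuild speed itself can be tuned a bit through the md proc knobs: raising the minimum guaranteed rebuild speed makes the rebuild finish sooner at the expense of regular I/O. The value below is only an example:
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
echo 100000 > /proc/sys/dev/raid/speed_limit_min    # KB/s per device, example value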


Sunday, February 5, 2012

Getting CentOS on RAID after using text mode installation...

Well, this was a hack. I got a firewall to install with only 512MB of RAM, which isn't enough for a graphical install. And using the text install means that there are no customizations available, in particular no custom disk partitioning. But since I don't install anything important without RAID, I had to do it somehow. The options were:
  1. Temporarily add more RAM only during installation process.
  2. Use kickstart file.
  3. Juggle with partitions.
Option 1 wasn't acceptable since I didn't have any extra RAM modules available, and option 2 seemed too complicated as I don't have a USB flash drive to store the kickstart file on, and a network install is a bit too much for a home network. So, I decided to go with option 3. Note that there was one more thing in favor of option 3: one disk has a capacity of 80GB while the other has 250GB, which means there is extra space I can use as temporary storage. Still, even if I didn't have it, option 3 would be viable with a bit more juggling.

So, the general idea is as follows:
  1. Install CentOS on a temporary partition.
  2. Create RAID array and move CentOS there.
  3. Fix boot.
Those three steps are not in a strict sequence nor so clearly separated, as we'll see, but they are logically grouped. Also, just to clarify, the helper partition is the disk space beyond the 80th GB on the second (larger) disk!

Install CentOS on a helper partition

This is the first step and it's easy. Boot from the DVD and start the installation process. Except for one small detail: how do you persuade the installer to use the helper partition when it does almost everything automatically? Well, that's actually easy to solve. After Anaconda starts, but before doing anything, switch to the second virtual terminal (Alt+F2) and use fdisk to create the partitions for RAID. In my case those were:
  1. /dev/sda1 and /dev/sdb1 of size 256MB for /boot partition
  2. /dev/sda2 and /dev/sdb2 of size 2G for swap
  3. /dev/sda3 and /dev/sdb3 that take the rest of the space up to 80GB and that will hold the root (/) partition.
This will take the first 80GB on both disks and leave empty space on the second disk. Now, go back to the installer and continue the installation. When the installer asks where to install Linux, select "Empty space". Note that later I realized it would have been better to create one more partition so that the free space is smaller; this speeds up the installation process because creating the file systems is faster! Anyway, for a suggestion on a minimal CentOS installation you can look at what I wrote in this post.
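For the record, roughly the same layout could also be created non-interactively. The following is only a sketch using sfdisk; the exact size syntax depends on the sfdisk version (this one assumes a version that accepts M/G size suffixes, and fd is the Linux raid autodetect partition type):
for d in /dev/sda /dev/sdb; do
    sfdisk "$d" <<'EOF'
,256M,fd
,2G,fd
,77G,fd
EOF
done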

Create RAID arrays and move CentOS there

After the installer finishes, don't boot into the new system. Boot again from the DVD and select rescue mode. Also, allow the installer to search for existing installations and select to mount them in read-write mode! Finally, select the shell from the menu that appears.

Now, create the RAID arrays. If you used the same partitions as I did, the following sequence of commands will do the job.
mdadm -C /dev/md0 --metadata=0.90 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
This creates the first RAID1 array, for the boot partition. Note that I'm using metadata version 0.90. This is because grub doesn't understand the later formats! That "little" fact cost me a lot of time!
mdadm -C /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mdadm -C /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
Those two commands create the RAID1 arrays for swap and root. Maybe at this point you'll need to activate the arrays (in case it wasn't done automatically by the above mdadm commands):
mdadm --assemble --scan
Now, create file systems on those arrays:
mkfs.ext4 /dev/md0
mkfs.ext4 /dev/md2
mkswap /dev/md1
The RAID arrays are now created. The next step is to move the installed CentOS from the temporary partition. First, you have to mount the destination filesystems:
mkdir /mnt/s
mount /dev/md2 /mnt/s
mkdir /mnt/s/boot
mount /dev/md0 /mnt/s/boot
After they are mounted, copy all the files:
cd /mnt/sysimage
rsync -av bin boot etc home lib lib64 opt root sbin tmp usr var /mnt/s/
Note that I skipped some in-memory file systems like proc, dev, and similar ones. You should only create those directories, without content, as their content is recreated during each boot and held in memory:
cd /mnt/s
mkdir dev media mnt proc selinux srv sys
That's it for the filesystems. Now we need to adjust the /etc/fstab and /etc/grub.conf files as they reference the temporary partition/filesystem used by the installer. So, change /etc/fstab to contain the following lines/filesystems:
/dev/md2    /       ext4    defaults   1  1
/dev/md0    /boot   ext4    defaults   1  2
/dev/md1    swap    swap    defaults   0  0
There will also be lines starting with tmpfs, devpts, sysfs and proc. Leave those as they are and remove all the other lines. As for the /etc/grub.conf file, you need to modify any occurrence of (hdN,M). Those appear in two places: one on the splashimage line, which isn't so important, and the other on the line starting with the root keyword. That one IS important! Also, remove from the line that starts with the kernel keyword any word that contains the substring LV (logical volume!). After the change, this line should look something like this:
kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=/dev/md2 LANG=en_US.UTF-8 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us
Note that this is a single line! The important part is root=; the others might differ if you selected some other language or keyboard layout during the installation process. Also, I removed the keywords rhgb and quiet. Those prevent me from seeing kernel messages during the boot process, and those messages can be very important, especially in this case when we are doing something that could impact the early boot process itself!
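For orientation, after those edits the whole grub.conf entry should end up looking roughly like this (the kernel and initramfs file names are the ones from my installation, yours may differ):
title CentOS (2.6.32-220.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=/dev/md2 LANG=en_US.UTF-8 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us
        initrd /initramfs-2.6.32-220.el6.x86_64.img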

Finally, remove the temporary partition. Do that using fdisk: just delete the partition. Note that this takes effect on the next reboot, so nothing will happen right away and the computer won't crash. :)

And, that's it for moving CentOS to RAID partition.

Fixing boot loader

This is the last step. Boot from the DVD and select rescue mode. Again, allow the installer to scan the disks for Linux partitions. This time the installer should find the RAID partitions and the CentOS installed on them. Mount them in read-write mode and enter the shell again. Execute the following command:
chroot /mnt/sysimage
This will put you inside the target CentOS installation on RAID.

First, we have to recreate /etc/mdadm.conf. Note that if you don't do this, the system won't boot. This troubled me until I figured out that the file contained stale data filled in by Anaconda and that I hadn't refreshed its content. So, to refresh it, do the following:
mdadm --examine --scan > /etc/mdadm.conf
Open /etc/mdadm.conf in the editor and add the following line at the beginning:
DEVICES /dev/sd*[0-9] /dev/hd*[0-9]
Also, be careful: I typed etc instead of dev and it cost me two additional reboots. :)
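For reference, the resulting file should look roughly like this. The ARRAY lines come from the --examine --scan output, so the UUIDs below are only placeholders:
DEVICES /dev/sd*[0-9] /dev/hd*[0-9]
ARRAY /dev/md0 metadata=0.90 UUID=<uuid-of-md0>
ARRAY /dev/md1 metadata=1.2 UUID=<uuid-of-md1>
ARRAY /dev/md2 metadata=1.2 UUID=<uuid-of-md2>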

Now, recreate the initramfs image. Do that with the following command:
mkinitrd -f -v /boot/initramfs-2.6.32-220.el6.x86_64.img 2.6.32-220.el6.x86_64
Be careful: -f forces mkinitrd to overwrite the existing initramfs file, so it might be good to make a copy of that file first, just in case. For example:
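cp /boot/initramfs-2.6.32-220.el6.x86_64.img /boot/initramfs-2.6.32-220.el6.x86_64.img.bak    # the .bak name is arbitrary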

One final step and that's it: reinstall the boot loader! Do this with the following commands:
grub-install /dev/sda
grub-install /dev/sdb
If the two commands return an error (some problem with stage1 or stage2 files), then do it "manually" like this:
# grub
grub> root (hd0,0)
grub> setup (hd0)
grub> root (hd1,0)
grub> setup (hd1)
grub> exit
Note that you type the text after the grub> prompt, while everything else is the response from the system. And that's it! Reboot the system, this time from the hard disk, and you should have a minimal CentOS installation on RAID partitions!

Thursday, November 24, 2011

Re-adding SATA disk to software RAID without rebooting...

It happened for the second time that, on one of the servers I maintain, one of the SATA disks was suddenly disconnected from the server. Looking into the log files, I found the following error messages:
kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptbase_reply
kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
last message repeated 62 times
and then a lot of messages like the following ones:
kernel: sd 0:0:1:0: SCSI error: return code = 0x00010000
kernel: end_request: I/O error, dev sdb, sector 1264035833
This triggered the RAID subsystem to log the following type of messages:
kernel: raid5:md0: read error not correctable (sector 28832 on sdb2)
and finally to remove the failed disk from the array:
kernel: RAID5 conf printout:
kernel:  --- rd:3 wd:2 fd:1
kernel:  disk 0, o:1, dev:sda2
kernel:  disk 1, o:0, dev:sdb2
kernel:  disk 2, o:1, dev:sdc2
kernel: RAID5 conf printout:
kernel:  --- rd:3 wd:2 fd:1
kernel:  disk 0, o:1, dev:sda2
kernel:  disk 2, o:1, dev:sdc2
I have yet to find out what happened, but in the meantime the consequence of those error messages was that one disk was disconnected and removed from the RAID array, and I received the following mail from the mdmonitor process on the server:
This is an automatically generated mail message from mdadm
running on mail.somedomain

A Fail event had been detected on md device /dev/md0.

It could be related to component device /dev/sdb2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc2[2] sdb2[3](F) sda2[0]
      1952989696 blocks level 5, 256k chunk, algorithm 2 [3/2] [U_U]
     
unused devices:
Since this happened exactly at noon, which is the time when everybody uses the mail server, rebooting it wasn't really an option, not unless I absolutely had to. In this case I decided that I would reboot it after work hours, and in the meantime I could either just wait or try to rebuild the RAID. If I just waited, there would be a risk of another disk failing, which would bring the server down. So, as this had already happened before, and I knew that the disk was OK and would be re-added after a reboot, I decided to try to do that immediately, on the live system.

So, the first thing is to ask the kernel to rescan the SATA/SCSI bus in order to find "new" devices. This is done using the following command:
 echo "- - -" > /sys/class/scsi_host/host0/scan
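By the way, if you are not sure which hostN the disk hangs off of, it is harmless to simply rescan all of them:
for h in /sys/class/scsi_host/host*; do
    echo "- - -" > "$h/scan"
done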
After this, the disk reappeared, but the problem was that its name was now /dev/sde and not /dev/sdb. To make the disk always get the same name I would need to mess with udev, which I was not prepared to do at that moment. (And, by the way, I have recently read about a patch that allows you to do just that, i.e. rename an existing device, but I think it was rejected on the grounds that this kind of thing is better done in user space, i.e. by modifying udev rules.)

Now, the only problem was to "convince" the RAID subsystem to re-add the disk. I thought it would find the disk and attach it by itself, but eventually I just used the following command:
mdadm --manage /dev/md0 --add /dev/sde2
The command notified me that the disk was already a member of the array and that it was being re-added. Afterwards, the sync process started, which would take some time:
 # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde2[3] sdc2[2] sdb2[4](F) sda2[0]
      1952989696 blocks level 5, 256k chunk, algorithm 2 [3/2] [U_U]
      [=>...................]  recovery =  7.6% (74281344/976494848) finish=204.9min speed=73355K/sec
    
unused devices:
It would be ideal for transient errors like this one if the RAID subsystem memorized only the changes and, when the disk is re-added, applied just those changes. But I didn't manage to find a way to do that, and I'm not even sure that functionality is implemented at all.
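That said, as far as I understand, mdadm's write-intent bitmap is meant for roughly this scenario: with a bitmap enabled, a member that is re-added after a transient failure only resynchronizes the blocks that were written while it was out of the array. I haven't tried it on this server, so take the command below only as a sketch:
mdadm --grow --bitmap=internal /dev/md0     # add an internal write-intent bitmap to a running array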

Anyway, after the synchronization process finished, this is the content of the /proc/mdstat file:
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde2[1] sdc2[2] sdb2[3](F) sda2[0]
      1952989696 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]
     
unused devices:
As you can see, sdb2 is still there. Trying to remove it isn't possible because there is no corresponding device node:
# mdadm --manage /dev/md0 -r /dev/sdb2
mdadm: cannot find /dev/sdb2: No such file or directory
[root@mail ~]# mdadm --manage /dev/md0 -r sdb2
mdadm: cannot find sdb2: No such file or directory
So, I decided to wait until reboot.
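For completeness: depending on the mdadm version, the special words failed and detached can be given to --remove, which should cover exactly this case where the device node is gone; I didn't try it at the time:
mdadm --manage /dev/md0 --remove detached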

Edit: I did reboot a few days later, and after the reboot everything came back to its normal state, i.e. as it was before the disk was removed from the array!

[20121114] Update: Again this happened almost exactly at noon. Here is what was recorded in the log files:
Nov 14 12:00:02 mail kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptbase_reply
Nov 14 12:00:07 mail kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: Unhandled error code
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: SCSI error: return code = 0x00010000
Nov 14 12:00:08 mail kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: Unhandled error code
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: SCSI error: return code = 0x00010000
Nov 14 12:00:08 mail kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 14 12:00:08 mail kernel: raid5: Disk failure on sdc2, disabling device. Operation continuing on 2 devices
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: Unhandled error code
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: SCSI error: return code = 0x00010000
Nov 14 12:00:08 mail kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 14 12:00:08 mail kernel: raid5:md0: read error not correctable (sector 1263629840 on sdc2).
Nov 14 12:00:08 mail kernel: RAID5 conf printout:
Nov 14 12:00:08 mail kernel:  --- rd:3 wd:2 fd:1
Nov 14 12:00:08 mail kernel:  disk 0, o:1, dev:sda2
Nov 14 12:00:08 mail kernel:  disk 1, o:1, dev:sdb2
Nov 14 12:00:08 mail kernel:  disk 2, o:0, dev:sdc2
Nov 14 12:00:08 mail kernel: RAID5 conf printout:
Nov 14 12:00:08 mail kernel:  --- rd:3 wd:2 fd:1
Nov 14 12:00:08 mail kernel:  disk 0, o:1, dev:sda2
Nov 14 12:00:08 mail kernel:  disk 1, o:1, dev:sdb2
And then the system rescanned the bus by itself, but it didn't re-add the disk to the array:
Nov 14 12:00:44 mail kernel: mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 6, phy 2, sas_addr 0x8a843926a69f9691
Nov 14 12:00:44 mail kernel:   Vendor: ATA       Model: WDC WD1001FALS-0  Rev: 0K05
Nov 14 12:00:44 mail kernel:   Type:   Direct-Access                      ANSI SCSI revision: 05
Nov 14 12:00:44 mail kernel: SCSI device sde: 1953525168 512-byte hdwr sectors (1000205 MB)
Nov 14 12:00:44 mail kernel: sde: Write Protect is off
Nov 14 12:00:44 mail kernel: SCSI device sde: drive cache: write back
Nov 14 12:00:44 mail kernel: SCSI device sde: 1953525168 512-byte hdwr sectors (1000205 MB)
Nov 14 12:00:44 mail kernel: sde: Write Protect is off
Nov 14 12:00:44 mail kernel: SCSI device sde: drive cache: write back
Nov 14 12:00:44 mail kernel:  sde: sde1 sde2
Nov 14 12:00:44 mail kernel: sd 0:0:4:0: Attached scsi disk sde
Nov 14 12:00:44 mail kernel: sd 0:0:4:0: Attached scsi generic sg2 type 0
So I had to manually issue the following command:
mdadm --manage /dev/md0 --add /dev/sde2
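As a side note, mdadm also has an explicit --re-add option; in my case plain --add noticed that the disk used to be an array member and re-added it on its own, but the more explicit form would be:
mdadm --manage /dev/md0 --re-add /dev/sde2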
