kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptbase_reply
and then a lot of messages like the following one:
kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
last message repeated 62 times
kernel: sd 0:0:1:0: SCSI error: return code = 0x00010000
This triggered RAID to log the following type of messages:
kernel: end_request: I/O error, dev sdb, sector 1264035833
kernel: raid5:md0: read error not correctable (sector 28832 on sdb2)
and finally to remove the failed disk from the array:
kernel: RAID5 conf printout:
kernel: --- rd:3 wd:2 fd:1
kernel: disk 0, o:1, dev:sda2
kernel: disk 1, o:0, dev:sdb2
kernel: disk 2, o:1, dev:sdc2
kernel: RAID5 conf printout:
kernel: --- rd:3 wd:2 fd:1
kernel: disk 0, o:1, dev:sda2
kernel: disk 2, o:1, dev:sdc2
I still need to find out what happened, but in the meantime the consequence of those error messages was that one disk was disconnected and removed from the RAID array, and I received the following mail from the mdmonitor process on the server:
This is an automatically generated mail message from mdadm
running on mail.somedomain
A Fail event had been detected on md device /dev/md0.
It could be related to component device /dev/sdb2.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc2[2] sdb2[3](F) sda2[0]
1952989696 blocks level 5, 256k chunk, algorithm 2 [3/2] [U_U]
unused devices: <none>
Since this happened exactly at noon, a time when everybody is using the mail server, rebooting the server wasn't really an option unless I absolutely had to. I decided to reboot it after work hours; in the meantime I could either just wait or try to rebuild the RAID array. If I waited, there was a risk of another disk failing, and that would bring the server down. So, as this had happened already, and I knew that the disk was OK and would be re-added after a reboot, I decided to try re-adding it immediately, on the live system.
So, the first thing is to ask the kernel to rescan the SATA/SCSI bus in order to find "new" devices. This is done using the following command:
echo "- - -" > /sys/class/scsi_host/host0/scan
After this, the disk reappeared, but the problem was that its name was now /dev/sde and not /dev/sdb. To make the disk always get the same name I would need to mess with udev, which I was not prepared to do right then. (And, btw, I recently read about a patch that allows you to do just that, i.e. rename an existing device, but I think it was rejected on the grounds that this kind of thing is better done in user space, i.e. by modifying udev rules.)
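The rescan command above touches only host0. On a machine with more than one controller, the same write can be repeated for every host entry. Here is a minimal sketch, assuming a POSIX shell and root privileges; the function name rescan_scsi_hosts and its optional directory argument are made up for illustration (the argument exists only so that the loop can be tried against a scratch directory first):

```shell
# Hypothetical helper: write the "- - -" wildcard (channel, target, LUN)
# to the scan file of every SCSI host found under the given directory.
rescan_scsi_hosts() {
    base="${1:-/sys/class/scsi_host}"
    for scan in "$base"/host*/scan; do
        # Skip entries that do not exist or cannot be written (e.g. not root).
        [ -w "$scan" ] || continue
        echo "- - -" > "$scan"
        echo "rescanned ${scan%/scan}"
    done
}

# Usage (as root):
#   rescan_scsi_hosts
```

Afterwards, the kernel log should show whether any disk was attached and under which name.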
Now, the only problem was to "convince" the RAID subsystem to re-add the disk. I thought that it would find the disk and attach it by itself, but eventually I just used the following command:
mdadm --manage /dev/md0 --add /dev/sde2
The command notified me that the disk was already a member of the array and that it was being re-added. Afterwards, the sync process started, which takes some time:
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde2[3] sdc2[2] sdb2[4](F) sda2[0]
1952989696 blocks level 5, 256k chunk, algorithm 2 [3/2] [U_U]
[=>...................] recovery = 7.6% (74281344/976494848) finish=204.9min speed=73355K/sec
unused devices: <none>
For transient errors like this one, it would be ideal if the RAID subsystem remembered only the changes made while the disk was missing and applied only those changes when the disk is re-added. But I didn't manage to find a way to do that, and I also think that this functionality is not implemented at all.
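Whether an array is degraded can be read from the [U_U] field in the mdstat output above, where an underscore marks a missing member. Here is a rough sketch of checking that from a script. The function name degraded_arrays is hypothetical, and the sketch assumes the layout shown in this post, i.e. a line starting with the array name followed by a status line that carries the bracketed field:

```shell
# Hypothetical helper: print the name of every md array whose status
# field (e.g. [U_U]) contains an underscore, i.e. a missing member.
# Reads /proc/mdstat by default; another file can be passed for testing.
degraded_arrays() {
    mdstat="${1:-/proc/mdstat}"
    awk '
        # Remember the array name from lines such as "md0 : active raid5 ..."
        /^md[^ ]* :/ { name = $1; next }
        # The first following line with a [UU_] style field decides the state.
        name != "" && match($0, /\[[U_]+\]/) {
            if (substr($0, RSTART, RLENGTH) ~ /_/) print name
            name = ""
        }
    ' "$mdstat"
}

# Usage:
#   degraded_arrays
```

With the output shown above this prints md0 for as long as the array is still rebuilding.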
Anyway, after the synchronization process finished, this was the content of the /proc/mdstat file:
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde2[1] sdc2[2] sdb2[3](F) sda2[0]
1952989696 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>
As you can see, sdb2 is still listed. Removing it isn't possible because there is no corresponding device node:
# mdadm --manage /dev/md0 -r /dev/sdb2
mdadm: cannot find /dev/sdb2: No such file or directory
[root@mail ~]# mdadm --manage /dev/md0 -r sdb2
mdadm: cannot find sdb2: No such file or directory
So, I decided to wait until reboot.
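As a side note, the mdadm man page documents the special words failed and detached as arguments to --remove, the latter being meant for members whose device node no longer exists. Whether the mdadm version on this server supports them is an assumption that was not verified here, so the following is only an untested sketch. The wrapper name remove_detached_members is made up, and by default it only prints the command instead of running it:

```shell
# Hypothetical wrapper around "mdadm --remove detached".
# With DRY_RUN=1 (the default) the command is only printed;
# set DRY_RUN=0 and run as root to actually execute it.
remove_detached_members() {
    array="$1"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "mdadm --manage $array --remove detached"
    else
        mdadm --manage "$array" --remove detached
    fi
}

# Usage:
#   remove_detached_members /dev/md0
```

If the installed mdadm does not know the keyword, it will most likely complain about a missing device just as it did above, in which case waiting for the reboot remains the fallback.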
Edit: I rebooted a few days ago, and after the reboot everything returned to the normal state, i.e. as it was before the disk was removed from the array!
[20121114] Update: This happened again, almost exactly at noon. Here is what was recorded in the log files:
Nov 14 12:00:02 mail kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptbase_reply
Nov 14 12:00:07 mail kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: Unhandled error code
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: SCSI error: return code = 0x00010000
Nov 14 12:00:08 mail kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: Unhandled error code
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: SCSI error: return code = 0x00010000
Nov 14 12:00:08 mail kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 14 12:00:08 mail kernel: raid5: Disk failure on sdc2, disabling device. Operation continuing on 2 devices
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: Unhandled error code
Nov 14 12:00:08 mail kernel: sd 0:0:2:0: SCSI error: return code = 0x00010000
Nov 14 12:00:08 mail kernel: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Nov 14 12:00:08 mail kernel: raid5:md0: read error not correctable (sector 1263629840 on sdc2).
Nov 14 12:00:08 mail kernel: RAID5 conf printout:
Nov 14 12:00:08 mail kernel: --- rd:3 wd:2 fd:1
Nov 14 12:00:08 mail kernel: disk 0, o:1, dev:sda2
Nov 14 12:00:08 mail kernel: disk 1, o:1, dev:sdb2
Nov 14 12:00:08 mail kernel: disk 2, o:0, dev:sdc2
Nov 14 12:00:08 mail kernel: RAID5 conf printout:
Nov 14 12:00:08 mail kernel: --- rd:3 wd:2 fd:1
Nov 14 12:00:08 mail kernel: disk 0, o:1, dev:sda2
Nov 14 12:00:08 mail kernel: disk 1, o:1, dev:sdb2
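The return code 0x00010000 in the log above packs four bytes into one 32-bit value. In the Linux SCSI layer the host byte occupies bits 16 to 23, and a host byte of 0x01 is DID_NO_CONNECT, which matches the Result: lines in the same log. Here is a small sketch of the split, with decode_scsi_result being a made-up helper name:

```shell
# Hypothetical helper: split a SCSI result code into its four bytes
# (status, message, host and driver byte, from lowest to highest).
decode_scsi_result() {
    code=$(( $1 ))
    printf 'status=0x%02x msg=0x%02x host=0x%02x driver=0x%02x\n' \
        $(( code & 0xff )) \
        $(( (code >> 8) & 0xff )) \
        $(( (code >> 16) & 0xff )) \
        $(( (code >> 24) & 0xff ))
}

decode_scsi_result 0x00010000
# prints: status=0x00 msg=0x00 host=0x01 driver=0x00
```

In other words, the kernel could no longer reach the disk at all, which points to the link or controller rather than to a bad sector.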
And then the system re-scanned the bus by itself and found the disk again, but it didn't re-add the disk to the array:
Nov 14 12:00:44 mail kernel: mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 6, phy 2, sas_addr 0x8a843926a69f9691
Nov 14 12:00:44 mail kernel: Vendor: ATA Model: WDC WD1001FALS-0 Rev: 0K05
Nov 14 12:00:44 mail kernel: Type: Direct-Access ANSI SCSI revision: 05
Nov 14 12:00:44 mail kernel: SCSI device sde: 1953525168 512-byte hdwr sectors (1000205 MB)
Nov 14 12:00:44 mail kernel: sde: Write Protect is off
Nov 14 12:00:44 mail kernel: SCSI device sde: drive cache: write back
Nov 14 12:00:44 mail kernel: SCSI device sde: 1953525168 512-byte hdwr sectors (1000205 MB)
Nov 14 12:00:44 mail kernel: sde: Write Protect is off
Nov 14 12:00:44 mail kernel: SCSI device sde: drive cache: write back
Nov 14 12:00:44 mail kernel: sde: sde1 sde2
Nov 14 12:00:44 mail kernel: sd 0:0:4:0: Attached scsi disk sde
Nov 14 12:00:44 mail kernel: sd 0:0:4:0: Attached scsi generic sg2 type 0
So I had to manually issue the following command:
mdadm --manage /dev/md0 --add /dev/sde2