Tuesday, June 18, 2013

Crash during md array recovery

My machine just crashed while my RAID5 array was rebuilding. After the system came back, the array was marked as inactive:
md2 : inactive sdb1[0] sde1[4](S) sdc1[3] sdd1[1]
      7814047744 blocks
No matter what I tried, the array would not activate. For example, I tried the following command:
mdadm --assemble --force --scan /dev/md2
But there was no message and no change. I also tried rebooting, which didn't help either. The thing is, I have very important data on that array and I didn't want to experiment with commands and risk data loss. Googling around, I found several possible solutions, for example this, this, or this. Even though those had some similarities, none helped me. In the end, I realized that the array was already started (it is shown in /proc/mdstat), and that this was the reason why the previous assemble command didn't do anything, and also why I was receiving errors like this one:
# mdadm -A /dev/md/2 /dev/sd{b,c,d,e}1
mdadm: cannot open device /dev/sdb1: Device or resource busy
What finally helped was stopping the array using:
mdadm -S /dev/md2
and then I started it again. First, I tried:
# mdadm --assemble --scan /dev/md2
mdadm: /dev/md2 assembled from 3 drives and 1 spare - not enough to start the array while not clean - consider --force.
which obviously didn't help. But, as suggested, I tried again with the --force option:
mdadm --assemble --force --scan /dev/md2
After this, the array was again in the inactive state. Then I stopped it and started it again, and suddenly it was active and in recovery mode. I'm not sure whether this restart was necessary. The main reason I did it was that I thought the array had been brought up in a recovered state, which I feared would destroy the data on it, so I quickly stopped it. After stopping it I realized that I was mistaken, so I started it again, and this time it came up in recovery mode.
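
The sequence above can be sketched roughly as follows. The helper name inactive_arrays is made up for this illustration and is not part of mdadm; it only reads a file in /proc/mdstat format and never touches an array. The commands that change array state are left as comments so that nothing runs by accident.

```shell
#!/bin/bash
# Hypothetical helper (not part of mdadm): print the names of arrays that
# /proc/mdstat reports as inactive. An optional argument names another file
# in the same format, which is handy for trying it out safely.
inactive_arrays() {
        awk '$2 == ":" && $3 == "inactive" { print $1 }' "${1:-/proc/mdstat}"
}

# The steps from this post for one array, kept as comments because they
# change array state and are better typed by hand:
#   mdadm -S /dev/md2                           # stop the half-started array
#   mdadm --assemble --force --scan /dev/md2    # then force the assembly
```

With the mdstat contents shown at the top of this post, inactive_arrays would print md2.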

One more thing that bothered me was whether the array was rebuilding properly, i.e. writing reconstructed data to the failed device and not to a healthy one. Namely, /proc/mdstat showed the disk status [UU_U], which could be read as the third disk, i.e. sdd1, being rebuilt, and that would be wrong. But then, using the command:
mdadm --misc --detail /dev/md2
I saw that sde1 was being rebuilt, which was what I expected. And while I'm on the output of this command, it is interesting to know that the MD subsystem tells which disks are out of sync by using an event counter (the Events field), which is displayed too.
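
To compare that counter across the member devices, something like the following sketch could be used. I'm assuming here that the number in question is the Events field that mdadm --examine prints for each member; the function names are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: read `mdadm --examine` output on stdin and print the
# value of its "Events" field (assumed to be the counter discussed above).
events_of() {
        awk -F' : ' '/^[[:space:]]*Events[[:space:]]*:/ { print $2; exit }'
}

# Print the event counter of each member so that a stale one stands out.
# Needs root, because it runs mdadm --examine on every device given.
show_events() {
        local dev
        for dev in "$@"; do
                printf '%s %s\n' "$dev" "$(mdadm --examine "$dev" | events_of)"
        done
}
```

On my machine the call would be show_events /dev/sd{b,c,d,e}1. A member whose number is lower than the others is the one that fell behind.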

Finally, as disks become larger and larger, rebuilding an array takes more and more time. My array of 4 disks, 2 TB each, takes about 10 hours to rebuild. That is a lot of time. So maybe it's time to switch to btrfs or zfs, which have integrated RAID functionality and can therefore rebuild an array much faster. Alternatively, the MD subsystem should be taught to take note of which blocks changed and then rebuild only those blocks instead of the whole disk.
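
To put a number on it, here is a back-of-the-envelope sketch. It assumes that 2 TB means 2 * 10^12 bytes and that the rebuild runs at a constant speed, which is a simplification; the function names are made up.

```shell
#!/bin/bash
# Average rebuild throughput in MB/s (decimal megabytes) for a member of
# the given size in bytes that was rebuilt in the given number of hours.
rebuild_rate_mb() {
        local bytes="$1" hours="$2"
        echo $(( bytes / (hours * 3600) / 1000000 ))
}

# Whole hours needed to rebuild a member of the given size in bytes at the
# given throughput in MB/s.
rebuild_hours() {
        local bytes="$1" rate_mb="$2"
        echo $(( bytes / (rate_mb * 1000000) / 3600 ))
}
```

rebuild_rate_mb 2000000000000 10 gives 55, so my rebuild averaged about 55 MB/s. At the same speed, rebuild_hours 4000000000000 55 gives 20, so a 4 TB disk would need about 20 hours.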


Upgrading Alfresco

This is a short note on how I upgraded Alfresco. The basic idea was to just replace the WAR files while keeping configuration files with local modifications intact. To be able to do that, I unpacked the new WAR archives, integrated the changes I had made to the running instance of Alfresco, created new WAR archives and placed them in Tomcat's webapps folder so that Tomcat unpacks and deploys them.

Preparation

So, I downloaded Alfresco 4.2.c. More specifically, I downloaded the file alfresco-community-4.2.c.zip. The version I had was 4.0.d.

To find out which configuration files were changed in the running instance, I unpacked the alfresco.war archive (that file is in the downloaded archive) into a separate directory using the unzip command. I suggest that you create a directory called alfresco, enter that directory and then run unzip. There, I ran the following script:
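#!/bin/bash
# Directory of the running (old) Alfresco webapp
OLDPATH=/var/lib/tomcat6/webapps/alfresco
# Walk the properties files of the new tree, skipping the messages directory;
# reading line by line keeps file names with spaces intact
find . -path ./WEB-INF/classes/alfresco/messages -prune -o -name '*properties' -print |
while IFS= read -r i
do
        # If the file doesn't exist in the old tree we don't need to check it, go to the next one
        [ -f "$OLDPATH/$i" ] || continue
        # If the old and new files are the same, skip it as well
        cmp -s "$OLDPATH/$i" "$i" && continue
        # diff -uN "$OLDPATH/$i" "$i" | less
        echo "$i"
done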
The script showed me which files had changed:
./WEB-INF/classes/test/alfresco/test-hibernate-cfg.properties
./WEB-INF/classes/alfresco/model/dataTypeAnalyzers_en.properties
./WEB-INF/classes/alfresco/workflow/invitation-nominated-workflow-messages_ja.properties
./WEB-INF/classes/alfresco/templates/webscripts/org/alfresco/slingshot/wiki/move.post_it.properties
./WEB-INF/classes/alfresco/templates/webscripts/org/alfresco/slingshot/wiki/move.post_de.properties
./WEB-INF/classes/alfresco/templates/webscripts/org/alfresco/slingshot/wiki/move.post_nl.properties
./WEB-INF/classes/alfresco/templates/webscripts/org/alfresco/slingshot/wiki/move.post_es.properties
./WEB-INF/classes/alfresco/templates/webscripts/org/alfresco/slingshot/wiki/move.post_fr.properties
./WEB-INF/classes/alfresco/templates/webscripts/org/alfresco/slingshot/wiki/move.post_ja.properties
./WEB-INF/classes/alfresco/templates/webscripts/org/alfresco/slingshot/wiki/move.post.properties
./WEB-INF/classes/alfresco/domain/hibernate-cfg.properties
./WEB-INF/classes/alfresco/repository.properties
./WEB-INF/classes/alfresco/subsystems/email/OutboundSMTP/outboundSMTP.properties
./WEB-INF/classes/alfresco/subsystems/thirdparty/default/swf-transform.properties
./WEB-INF/classes/alfresco/subsystems/thirdparty/default/imagemagick-transform.properties
./WEB-INF/classes/alfresco/subsystems/fileServers/default/file-servers.properties
./WEB-INF/classes/alfresco/subsystems/Synchronization/default/default-synchronization.properties
./WEB-INF/classes/alfresco/subsystems/Search/solr/solr-search.properties
./WEB-INF/classes/alfresco/subsystems/Search/solr/solr-backup.properties
./WEB-INF/classes/alfresco/subsystems/Authentication/ldap-ad/ldap-ad-authentication.properties
./WEB-INF/classes/alfresco/subsystems/Authentication/ldap/ldap-authentication.properties
./WEB-INF/classes/alfresco/subsystems/Authentication/kerberos/kerberos-authentication.properties
./WEB-INF/classes/alfresco/subsystems/Authentication/alfrescoNtlm/alfresco-authentication.properties
./WEB-INF/classes/alfresco/subsystems/Authentication/passthru/passthru-authentication-context.properties
./WEB-INF/classes/alfresco/subsystems/OOoDirect/default/openoffice-transform.properties
./WEB-INF/classes/alfresco/version.properties
./WEB-INF/classes/log4j.properties
Of those, the localization files (the ones ending in _[a-z][a-z].properties) are not important. If you uncomment the line containing less, the script will also show the differences for each file in less. Based on that, I singled out the following configuration files that I had changed:
./WEB-INF/classes/alfresco/repository.properties
./WEB-INF/classes/alfresco/subsystems/Authentication/ldap/ldap-authentication.properties
./WEB-INF/classes/alfresco/subsystems/Authentication/kerberos/kerberos-authentication.properties
./WEB-INF/classes/log4j.properties
I also found that I didn't change mail configuration data in the file:
./WEB-INF/classes/alfresco/subsystems/email/OutboundSMTP/outboundSMTP.properties
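
The localization files can also be filtered out mechanically instead of by eye. This is a minimal sketch; the function name drop_localized is made up, and it uses the same _[a-z][a-z].properties pattern mentioned above.

```shell
#!/bin/bash
# Hypothetical helper: read paths on stdin and drop localization files,
# that is, names ending in _xx.properties where xx is two lowercase letters.
drop_localized() {
        grep -Ev '_[a-z][a-z]\.properties$'
}
```

Piping the output of the comparison script through drop_localized leaves only the files worth looking at.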
The next step was to find out which differences were due to local configuration and which were due to changes upstream. Namely, I wanted to keep my old settings, but the changes made in the new version had to be carried over too. It turned out that repository.properties had no upstream changes, while the other three did. So I edited the files in the new version of Alfresco that I had unpacked. Finally, when all the changes were done, I created a new archive:
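
One way to get a first overview of such differences is to compare only the property keys of the old and the new file. I did this by reading diffs, so take the following as a rough sketch rather than the method I used; the function names are made up.

```shell
#!/bin/bash
# Hypothetical helper: print the sorted, unique keys of a .properties file.
# Rough by design: comments and blank lines are skipped, but values that
# continue over several lines with a trailing backslash are not handled.
prop_keys() {
        grep -Ev '^[[:space:]]*([#!]|$)' "$1" |
                sed -E -e 's/^[[:space:]]+//' -e 's/[[:space:]]*[=:].*$//' |
                LC_ALL=C sort -u
}

# Print the keys that exist only in the second file. Called as
# keys_only_in_new OLD NEW it lists the settings that were added upstream.
keys_only_in_new() {
        local old_keys
        old_keys=$(mktemp) || return 1
        prop_keys "$1" > "$old_keys"
        prop_keys "$2" | LC_ALL=C comm -13 "$old_keys" -
        rm -f "$old_keys"
}
```

Swapping the two arguments lists the keys that exist only in the old file, which are either local additions or settings dropped upstream. Changed values of common keys still have to be checked with diff.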
cd ..
mv alfresco.war alfresco.war.OLD
cd alfresco
zip -9rv ../alfresco .
cd ..
mv alfresco.zip alfresco.war
The first two commands rename the old archive, the next two create a new archive, and the last two give it the .war extension. I assumed here that you unpacked the original WAR archive into a directory called alfresco.

All this has to be done with share.war archive too. In my case, the script showed that only log4j.properties has been changed so I incorporated changes and created a modified share.war archive.

Update

Finally, I stopped Tomcat, and with it Alfresco, using:
service tomcat6 stop
and created a copy of the existing alfresco and share directories in Tomcat's webapps directory. I also renamed the old alfresco.war and share.war, and moved the ones I had prepared into their place. Take care of permissions: the tomcat user has to be the owner of everything! Then, I started Tomcat with:
service tomcat6 start
and also started to pray that it works. Well, almost, I watched logs (/var/log/tomcat6/catalina.out) because I doubt that praying would help. ;)
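
Watching the log is easier when only the lines that look like errors are shown. The following is a sketch under assumptions: the function name is made up, and the keywords it searches for (Exception, ERROR, SEVERE) are my guess at what marks a problem in catalina.out.

```shell
#!/bin/bash
# Hypothetical helper: print log lines that look like errors. The optional
# second argument is an extended regex for known, harmless noise to hide.
log_errors() {
        local log="$1" ignore="${2:-}"
        if [ -n "$ignore" ]; then
                grep -E 'Exception|ERROR|SEVERE' "$log" | grep -Eiv "$ignore"
        else
                grep -E 'Exception|ERROR|SEVERE' "$log"
        fi
}
```

For example, log_errors /var/log/tomcat6/catalina.out 'openoffice|pdf2swf' hides complaints about components that are deliberately not installed.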

Everything was OK, i.e. the errors that I received (openoffice, pdf2swf) were expected because I hadn't installed those programs. But two errors were not expected:
java.io.FileNotFoundException: alfresco.log (Permission denied)
Well, that one was caused by share.war not being able to reopen a log file already in use by alfresco.war. So, I changed the appropriate log4j.properties so that share.war uses its own separate log file. Except that it turned out I had forgotten to carry over my changes to log4j.properties. Anyway, I gave share.war a separate log file just to be on the safe side and to finally finish this. The second error was:
java.net.BindException: Permission denied
That one was caused by the FTP server not being able to bind to a low-numbered port. This is OK, because I'm not running Tomcat as root. So, it is safe to ignore.
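
As for the separate log file for share.war, the change can be scripted roughly like this. The key name log4j.appender.File.File is an assumption on my part, so check which appender the file really defines; the function name and the path in the example call are made up as well.

```shell
#!/bin/bash
# Hypothetical helper: point the file appender in a log4j.properties file
# at a different log file. The key log4j.appender.File.File is an assumption.
# The new path must not contain the characters | or &, which sed would
# interpret. The file is rewritten in place with cat, so its owner and
# permissions stay as they were.
set_log_file() {
        local props="$1" logfile="$2" tmp
        tmp=$(mktemp) || return 1
        sed "s|^log4j\.appender\.File\.File=.*|log4j.appender.File.File=$logfile|" "$props" > "$tmp" &&
                cat "$tmp" > "$props"
        rm -f "$tmp"
}
```

An example call would be set_log_file WEB-INF/classes/log4j.properties /var/log/tomcat6/share.log, run inside the unpacked share directory before creating the new share.war.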
