RAID5 - OS crash & HDD replacement

  • Hi guys, the problem is the following: my NAS is a Dell T20 server currently running 5x 2TB disks in RAID5. It was set up a while back with OMV4, running from a USB stick. In the last week I noticed a couple of issues (mainly that I couldn't log into the web interface), so I just restarted the server. There seemed to be some consistency issues on the filesystem (it booted to initramfs by default, which I was able to fix with fsck). Independently, once it was back online I had a look at the web interface and could see that there were issues with one of the HDDs (red alarm in SMART, some bad sectors) and the RAID was reported as 'clean, degraded'. Files were still accessible.


    So I got a new HDD and wanted to swap it in over the weekend. Unfortunately there was a power outage on Thursday/Friday and the system went off. On reboot the USB thumb drive seemed unreadable, so I set up a new OMV installation (v5) on a different thumb drive. After successfully setting it up I replaced the drive that was reported to have bad sectors with the new one. Unfortunately I no longer see the RAID in the web interface. I thought software RAID would be hot swappable, especially because 4 drives are still operating fine, but it looks like that is now a bit of an issue... I have now put the old (bad sectors) HDD back in and am having trouble getting the RAID back into active mode:


    Code
    root@openmediavault:~# cat /proc/mdstat
    Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
    md127 : inactive sdd[3] sda[0] sde[5] sdc[4]
    7813542240 blocks super 1.2
    unused devices: <none>

    This is the defective drive:

    And these are the ones that operate correctly:


    Now I am feeling a bit hopeless - I am not sure how to interpret this.


    Though it looks like OMV can still see all HDDs and correctly considers them to be RAID members:


    Code
    root@openmediavault:~# blkid
    /dev/sdf1: UUID="fe602a13-21c8-49b7-a9b0-de6c397bd6a9" TYPE="ext4" PARTUUID="ad5bf966-01"
    /dev/sdf5: UUID="83f6ba28-0283-47b2-98a4-f832ad8486c1" TYPE="swap" PARTUUID="ad5bf966-05"
    /dev/sde: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="70f0181e-be31-b75e-e78f-498a7379f630" LABEL="Nas:RAID5" TYPE="linux_raid_member"
    /dev/sda: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="c762fde1-9e8f-0b91-cca4-ab0c6ae35729" LABEL="Nas:RAID5" TYPE="linux_raid_member"
    /dev/sdc: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="e392ac97-159d-f386-b562-983e0ed8929a" LABEL="Nas:RAID5" TYPE="linux_raid_member"
    /dev/sdd: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="9590adb7-f274-c2eb-b2f9-b5041dae9a1a" LABEL="Nas:RAID5" TYPE="linux_raid_member"
    /dev/sdb: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="82f55d04-1e52-1e01-6af8-d94b6c2aa6cf" LABEL="Nas:RAID5" TYPE="linux_raid_member"


    It still says active, but FAILED/not started. Does anyone have an idea how to get around this?


    Code
    root@openmediavault:~# cat /sys/block/md127/md/array_state
    inactive


    Thanks in advance!!

  • I thought software RAID would be hot swappable

    I wish it were :)


    The post is clear but also confusing. I'm guessing that the cat /proc/mdstat output is from when you added the new drive, and the blkid output is from when you put the failing drive back in.
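
    To check that guess, it can help to list which member disks the kernel currently counts towards each md array. A minimal sketch, assuming the usual /proc/mdstat layout shown earlier in the thread; md_members is a hypothetical helper name, not part of mdadm:

```shell
# List each md array with its state and member devices, read from an
# mdstat-formatted file (defaults to /proc/mdstat).
md_members() {
    awk '/^md[0-9]+ :/ {
        printf "%s %s", $1, $3            # array name and state (active/inactive)
        for (i = 4; i <= NF; i++)
            if ($i ~ /\[[0-9]+\]/) {      # member tokens look like sdd[3]
                sub(/\[.*/, "", $i)       # drop the [slot] suffix
                printf " %s", $i
            }
        print ""
    }' "${1:-/proc/mdstat}"
}

# Usage on the NAS itself:  md_members
```

    Run against the mdstat output posted above, this prints "md127 inactive sdd sda sde sdc", i.e. four members with sdb absent.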


    To get the array active it needs to be stopped and then reassembled. You may also have to create an mdadm conf, but first things first:

    mdadm --stop /dev/md127

    mdadm --assemble --force --verbose /dev/md127 /dev/sd[abcde]

    This usually works, but there is no guarantee.
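
    The two commands can be wrapped so that the assemble step only runs once the array has really disappeared from /proc/mdstat; if it is still listed, mdadm skips the member disks as busy. A rough sketch, not a tested recovery tool; reassemble, MDADM and MDSTAT are hypothetical names introduced here so the function can be dry-run with MDADM=echo:

```shell
# Hypothetical override variables, only there to allow a dry run.
MDADM=${MDADM:-mdadm}
MDSTAT=${MDSTAT:-/proc/mdstat}

# reassemble ARRAY DISK...   e.g. reassemble /dev/md127 /dev/sd[abcde]
reassemble() {
    array=$1; shift
    name=${array##*/}                     # /dev/md127 -> md127
    $MDADM --stop "$array" || return 1
    # If the array is still listed, assembling would report "busy - skipping".
    if grep -q "^$name :" "$MDSTAT"; then
        echo "$name is still listed in $MDSTAT, not assembling" >&2
        return 1
    fi
    $MDADM --assemble --force --verbose "$array" "$@"
}
```

    With MDADM=echo set beforehand, the function only prints the two mdadm command lines instead of running them.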

    Raid is not a backup! Would you go skydiving without a parachute?

  • Doesn't look like this did the trick - it reports them as 'busy'?


    Code
    root@openmediavault:~# mdadm --assemble --force --verbose /dev/md127 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mdadm: looking for devices for /dev/md127
    mdadm: /dev/sda is busy - skipping
    mdadm: /dev/sdb is busy - skipping
    mdadm: /dev/sdc is busy - skipping
    mdadm: /dev/sdd is busy - skipping
    mdadm: /dev/sde is busy - skipping
    root@openmediavault:~#
  • Did you stop it first, as per my post?

    Weird, I just tried it again, same sequence! Now it shows up in OMV - still as clean, degraded. I suspect I need to decouple the defective HDD now somehow?


  • If mdadm doesn't report the array as stopped, the second command will report the array as busy :)


    Is /dev/sdb the drive with the bad sectors?


  • If mdadm doesn't report the array as stopped, the second command will report the array as busy :)


    Is /dev/sdb the drive with the bad sectors?

    Yes it is. It is the one that is missing in the web interface (under RAID Management).


    Can I just remove the faulty one, add the functioning one and start to grow the RAID/add to the array or is there anything else I need to do?

  • Can I just remove the faulty one

    No!! You have to tell mdadm what to do. This is the normal procedure:


    mdadm /dev/md127 --fail /dev/sdb

    mdadm /dev/md127 --remove /dev/sdb


    You should then be able to physically remove the drive and install the new one. Then:

    Storage -> Disks: select the new drive, click Wipe on the menu and select Short.

    Raid Management: select the RAID and click Recover on the menu. A dialog box should display the new drive; select it and click OK. The RAID should now rebuild.
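
    For anyone working from the shell instead of the web interface, the same swap can be sketched with mdadm alone. This is an assumption about what the Recover button amounts to (adding the new disk with --add), not a statement of what OMV runs internally; replace_plan is a hypothetical helper that only prints the commands so they can be reviewed first, and the wipe step is not covered:

```shell
# Print (do not run) the mdadm commands for swapping one member disk.
# replace_plan ARRAY OLD_DISK NEW_DISK
replace_plan() {
    array=$1 old=$2 new=$3
    echo "mdadm $array --fail $old"       # mark the bad disk as failed
    echo "mdadm $array --remove $old"     # detach it from the array
    echo "# swap the physical disk, confirm the new device name, then:"
    echo "mdadm $array --add $new"        # add the new disk, which starts the rebuild
}

# Example: replace_plan /dev/md127 /dev/sdb /dev/sdb
```

    The new disk may come up under a different device name than the old one, so it is worth checking the name before running the --add line.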


  • Just checked - /dev/sdb seems to be already removed from the array, could it be that mdadm has done that just by itself? Would it be safe now to physically remove the faulty drive and connect the new one?


  • Just checked - /dev/sdb seems to be already removed from the array, could it be that mdadm has done that just by itself

    Highly probable. The (possibly out of date) error is correctable. Whilst mdadm knows about sdb, it has effectively removed it due to the error. You should be good to carry on: replace the drive and rebuild.
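
    The rebuild can be followed from the shell as well. A small sketch, assuming the kernel's usual "recovery = N%" progress line in /proc/mdstat; rebuild_progress is a hypothetical helper name:

```shell
# Print the rebuild percentage found in an mdstat-formatted file
# (defaults to /proc/mdstat), or a note if no rebuild line is present.
rebuild_progress() {
    awk '/(recovery|resync) *=/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) { print $i; found = 1 }   # token such as 12.6%
    }
    END { if (!found) print "no rebuild in progress" }' "${1:-/proc/mdstat}"
}

# Usage on the NAS itself:  rebuild_progress
```

    Calling it every few minutes gives a rough idea of how far along the rebuild is without opening the web interface.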

