Disk failure during RAID 5 rebuild

  • Hello there,

    In what seems to be my annual OMV disk-failure post: I am running a RAID 5 array across 6 discs (3 x 3TB Seagate and 3 x 4TB WD) and was alerted to a SMART error on one of the discs the other day (24 pending sectors on 1 of the 3TB Seagates). I bought and installed a new drive (4TB WD) to replace the disc with the error, and the rebuild has been running for the last day or so.

    I received an email from OMV earlier this morning, stating that the rebuild had failed at 70.5% :( It appears that the spare disk in the array is now also exhibiting a SMART error (one of the other 3TB Seagates ... I've actually got 2 other new 4TB WD intended to replace the Seagates).

    I have run the following commands from PuTTY this morning:

    cat /proc/mdstat

    Personalities : [raid6] [raid5] [raid4]
    md127 : active raid5 sde[6](S) sda[7](F) sdf[8] sdd[11] sdc[9] sdb[10]
    14651325440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/4] [_UUUU_]
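
    For anyone reading along, the mdstat output above can be summarised programmatically. What follows is only a minimal Python sketch, not part of OMV or mdadm: `summarize_mdstat` is a hypothetical helper name, and it assumes an active array whose member list uses the usual `name[slot](flag)` form, with (S) marking a spare and (F) a faulty device.

```python
import re

# Sample transcribed from the /proc/mdstat output quoted in the post.
MDSTAT_SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sde[6](S) sda[7](F) sdf[8] sdd[11] sdc[9] sdb[10]
14651325440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/4] [_UUUU_]
"""

# Matches entries such as "sde[6](S)" or "sdf[8]".
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\](?:\((\w)\))?")
# Matches the "[expected/working]" counter, e.g. "[6/4]".
SLOTS_RE = re.compile(r"\[(\d+)/(\d+)\]")


def summarize_mdstat(text):
    """Summarise the first md array found in /proc/mdstat-style text."""
    summary = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("md") and " : " in line:
            name, rest = line.split(" : ", 1)
            fields = rest.split()
            summary["name"] = name
            summary["state"] = fields[0]   # assumes "active"
            summary["level"] = fields[1]   # e.g. "raid5"
            members = {}
            for dev, _slot, flag in MEMBER_RE.findall(rest):
                members[dev] = {"S": "spare", "F": "faulty"}.get(flag, "active")
            summary["members"] = members
        elif "blocks" in line and summary:
            match = SLOTS_RE.search(line)
            if match:
                summary["expected"] = int(match.group(1))
                summary["working"] = int(match.group(2))
    return summary
```

    Calling `summarize_mdstat(MDSTAT_SAMPLE)` reports sda as faulty, sde as a spare, and four working members out of the six expected. RAID 5 can only run with one member missing, which is why the rebuild stopped.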


    blkid

    /dev/sdb: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="5e6cbf95-0477-b0a2-a422-78f7f29ab984" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/sdc: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="39ffaeb0-5e20-0b00-9000-2dfbbe71e62e" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/md127: LABEL="MEDIAVAULT" UUID="127de519-96fc-4eb8-8c38-21fb02009bd1" TYPE="xfs"
    /dev/sdd: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="6c06fcfa-fdc5-a834-295d-114dd8a848a7" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/sdf: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="2ec934c5-f798-aa47-7218-e5319d493298" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/sda: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="5fdfd228-cccd-f51e-0ee5-e8e6eaf3d87c" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/sdg1: UUID="2621b640-5f9a-46d6-ae41-fcce0e913052" TYPE="ext4"
    /dev/sdg5: UUID="e4b50853-771c-4585-8c1e-e9afdc87d55b" TYPE="swap"
    /dev/sdg3: UUID="204DF80071966050" TYPE="ntfs"
    /dev/sde: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="8bba6526-3bcf-ca2a-ba95-6df86d4c07fa" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"

    fdisk -l

    Is there any hope of being able to add the original disk back in and force a rebuild using mdadm --assemble, and then swap out the spare/failed disc and rebuild again?

    Any other suggestions to try and salvage the array? I know I should have migrated from RAID 5 some time ago, and I probably won't get any sympathy in that respect, but would really appreciate any input in terms of next steps.

  • Update: I was able to successfully rebuild the array with the original drives and have access to my files with the array reporting as ‘Clean’.

    Given the failure that occurred when swapping out the drive with 24 pending sectors, what would be my next best course of action? Swapping out the drive with 40 pending sectors (which is the one I believe caused the failure)? I’m sure I’ve swapped out discs with just a few errors without issue previously.

    Thanks in advance!

  • Swapping out the drive with 40 pending sectors

    You should be able to remove the drive using Raid Management in the GUI:

    1) Select the array, click Delete on the menu, select the drive to be removed in the dialog and click OK. This removes the drive from the array.
    2) Remove the drive from the machine and install the new drive.
    3) Under Storage -> Disks, select the new drive and choose Wipe from the menu; a short wipe will be enough.
    4) In Raid Management, select the array and click Recover on the menu. This step will either work or fail: if the drive does not show in the dialog, you will have to format it (the same as your array) and then repeat the recover.
    5) The array will now rebuild.

  • Hi @geaves

    I followed your instructions and the recover process started successfully, but only got 0.4% through before failing. I received a notification email stating that "A FailSpare event had been detected on md device /dev/md/MEDIAVAULT. It could be related to component device /dev/sda". sda is, unsurprisingly, the other drive which is showing the SMART error. sda is shown as a 'faulty spare' in the Array Detail dialogue.

    Do I have any other options to recover the array and swap these discs out? Could cloning the discs with errors be an option, or running SpinRite on one (or both) of the discs and rebuilding the array from there? Or should I just accept that my chances of rebuilding the array are gone, and try to get as much data as possible off the 'clean' mounted array with the 2 erroring drives in place?

    Thanks in advance for your support!

  • Do I have any other options to recover the array and swap these discs out?

    The array at present is fine; Raid5 will allow one disk failure. The spare is as it says and should not affect the array itself, but TBH I have never dealt with spares on mdadm and I'm surprised you're using one, given that you have 6 drives.

    6x4TB in a Raid6 will give you 16TB of space and would allow up to 2 drive failures.
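
    The capacity figures quoted in this thread follow from simple arithmetic, sketched below in Python. `usable_capacity_tb` and `mdstat_blocks_to_tb` are hypothetical helper names; the sketch assumes every member contributes only as much space as the smallest drive, and that /proc/mdstat counts blocks of 1 KiB.

```python
def usable_capacity_tb(drive_sizes_tb, level):
    """Usable capacity in TB for an md RAID 5 or RAID 6 array.

    Each member contributes only as much as the smallest drive, and
    one (RAID 5) or two (RAID 6) drives' worth of space holds parity.
    """
    parity_drives = {5: 1, 6: 2}[level]
    if len(drive_sizes_tb) <= parity_drives:
        raise ValueError("not enough drives for this RAID level")
    return (len(drive_sizes_tb) - parity_drives) * min(drive_sizes_tb)


def mdstat_blocks_to_tb(blocks):
    """Convert a /proc/mdstat block count (assumed 1 KiB each) to decimal TB."""
    return blocks * 1024 / 1e12


# Six 4TB drives in RAID 6, as suggested in the post.
raid6_new = usable_capacity_tb([4, 4, 4, 4, 4, 4], level=6)

# The existing mixed RAID 5: three 3TB and three 4TB drives.
raid5_mixed = usable_capacity_tb([3, 3, 3, 4, 4, 4], level=5)
```

    Here `raid6_new` is 16 and `raid5_mixed` is 15. The second figure agrees with the 14651325440 blocks reported by mdstat in the first post, which is roughly 15.0 TB, and it shows that the extra terabyte on each 4TB drive is unused while any 3TB drive remains in the array.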

    I read your post when you first posted it, but did you create the array using the GUI or did you use the command line? The reason I ask is that the last time I used the Raid option, I don't remember an option for setting a drive as a spare.
    The object of a spare is to take over when a drive fails.
    You should be able to remove it as technically it's not part of the array. But, there is some information written to that spare which would allow mdadm to use it in the event of drive failure.

    If the rebuild failed after adding a drive that is part of the array, that would point to a hardware failure of the spare drive, as you said. I know SpinRite well; I have a copy and it's saved my a**** more than once, but all it will do is move data, if it can, from a bad sector to a good area of the disk and then mark the sector as bad. The drive's own firmware is already doing that for you, and SMART is reporting it.

    Have you got a backup of data you don't want to lose?

    The reason I ask is that this could be a PIA. Using different sizes in your array is not a good idea: leaving the array as is and replacing the 3TB drives one at a time (if possible) requires a rebuild for each drive change, which means extra drive stress. Then, once the array has rebuilt completely, the array will need to be grown to make use of the extra space, as will the file system.
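
    For reference, the replace, rebuild and grow cycle described above corresponds roughly to the following mdadm command sequence. This Python sketch only builds the command strings and runs nothing. `replacement_plan` is a hypothetical helper; /dev/sdh and /srv/mediavault are made-up names for the new drive and the mount point. It assumes the xfs file system shown in the first post and is not necessarily what the OMV GUI does behind the scenes.

```python
def replacement_plan(array, old_dev, new_dev, mount_point, grow=False):
    """Return the commands for swapping one array member, as strings.

    Nothing is executed: the list is meant to be read and checked first.
    Set grow=True only for the last swap, once no smaller drive remains.
    """
    steps = [
        f"mdadm --manage {array} --fail {old_dev}",
        f"mdadm --manage {array} --remove {old_dev}",
        "# power down, swap the physical drive, boot again",
        f"mdadm --manage {array} --add {new_dev}",
        f"mdadm --wait {array}",  # blocks until the rebuild has finished
    ]
    if grow:
        steps += [
            f"mdadm --grow {array} --size=max",  # use the full size of each member
            f"xfs_growfs {mount_point}",         # then grow the file system to match
        ]
    return steps


# Hypothetical device names and mount point, for illustration only.
plan = replacement_plan("/dev/md127", "/dev/sda", "/dev/sdh",
                        "/srv/mediavault", grow=True)
```

    Each swap triggers one full rebuild, so replacing three drives this way means three rebuilds before the grow step. That is the extra drive stress mentioned above.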

  • did you create the array using the GUI or did you use the command line

    To be completely honest, I don't remember how the RAID was set up initially. I've been running this NAS since 2012, and it's steadily grown from 2TB drives in the Proliant's 4 main bays, to adding a further 2 discs, and steadily replacing the 2TB discs with 3TB discs, and now replacing the 3TB discs with 4TB discs, when each existing disc has exhibited errors or died.

    I've always used OMV as the system software, and have encountered a few major issues with drives dying and dropping out of the array - including 2 drives dying at the same time, previously - but I guess I've always been lucky in the sense that I've been able to add new drives and rebuild successfully.

    Have you got a backup of data you don't want to lose?

    Most of it is backed up in one sense or another, but not my most recent files. I'm thinking at this stage, it would be an idea to remount the array with 5 discs and start copying the remaining files off the NAS. Once done, I'd swap out the remaining 3 Seagate 3TB drives (2 with errors, 1 still 'good' ... I'll never buy Seagate again ... every drive seems to have died around the same time of its life!) and replace them with the 3 WD 4TB drives and start the new array using Raid 6 as you suggested.

    What do you think?

  • Most of it is backed up in one sense or another, but not my most recent files.

    :thumbup: Well you get 8 out of 10 for that :D As far as I am concerned using a raid is fine if the end user is comfortable with that and it seems you are, but there are other ways to set this up.

    If this were me I would;

    Ensure the most important files are backed up.
    Start again; this means using the GUI:

    1) A complete rebuild not a reinstall.
    2) Remove SMB/CIFS shares
    3) Remove all shared folders that point to the array
    4) Remove each drive in turn from the array before removing the array itself
    5) Wipe each drive including the new ones under Storage -> Disks
    6) Recreate the array using the GUI. Personally I would use Raid6 formatted to ext4; the 6x4TB drives will give you 16TB of space and it will also allow for 2 drive failures.

    Bear in mind that the array will take some time to build, and you will have to recreate the shares and any SMB/CIFS.

    The above, whilst it will take time, will be the least frustrating option. Trying to replace, rebuild, replace, rebuild, replace, rebuild could potentially go wrong, with another drive failing during one of the rebuilds.

  • Hey @geaves

    So, I've successfully managed to copy all the remaining data, and the previously backed-up files, from the RAID array to 2 new 10TB external HDDs. What's puzzling me is that at some point during the copy-off process, both the Seagate drives that were showing 24 and 40 pending-sector SMART errors turned 'green' with no errors reported, while the remaining Seagate drive that was reporting as error free is now 'red' and reporting 8 pending sectors. Does this suggest anything more sinister at play? Are the drives really approaching their end of life, or should I use them for other purposes?

    Next, to rebuild with the new WD 4TBs and reinstall OMV ...

  • Does this suggest anything more sinister at play?

    It might, and you may not know until your new drives are up and running, but TBH I would just get yourself sorted.
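
    One way to keep an eye on this is to log the relevant SMART counters over time. The sketch below parses the attribute table printed by `smartctl -A`. `sector_counters` is a hypothetical helper, and the sample table is illustrative only: the pending count of 24 is taken from this thread and the other values are made up. As far as I understand it, a pending-sector count can fall back to zero once the drive manages to read or rewrite those sectors, or remaps them, so it is worth watching Reallocated_Sector_Ct alongside it.

```python
# Illustrative sample in the layout printed by `smartctl -A /dev/sdX`.
# Only the pending count of 24 comes from the thread; the rest is made up.
SMARTCTL_SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       24
"""

WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def sector_counters(smartctl_text):
    """Extract the raw values of the sector-health attributes listed in WATCHED."""
    counters = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have the raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            counters[fields[1]] = int(fields[9])
    return counters


counters = sector_counters(SMARTCTL_SAMPLE)
```

    Running this periodically against each drive and keeping the results would show whether the pending sectors were remapped (the reallocated count rises) or simply cleared.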

    Are the drives really approaching their end of life, or should I use them for other purposes?

    I have over a dozen 3.5-inch drives of various sizes and ages. Are they still usable? Yes. I have run SpinRite across them followed by DBAN. Some have old data stored on them; they are not in use, but every now and then I test them just to make sure they are accessible. And just to show how bad I am, I have a Maxtor 80GB SATA 3.0 drive dated 16th Jan 2006 :)

    You can always use mismatched drives in an external USB dock; you just don't use them for daily use. Before we moved I had a drive that contained images of DOS 6.2 and Windows 3.1. Why? Don't know really, perhaps it's a nostalgia thing :)
