Disk failure during RAID 5 rebuild


    • Disk failure during RAID 5 rebuild

      Hello there,

      In what seems to be my annual OMV disk-failure post: I am running a RAID 5 array across 6 discs (3 x 3TB Seagate and 3 x 4TB WD) and was alerted to a SMART error on one of the discs the other day (24 pending sectors on 1 of the 3TB Seagates). I bought and installed a new drive (4TB WD) to replace the disc with the error, and the rebuild has been running for the last day or so.

      I received an email from OMV earlier this morning, stating that the rebuild had failed at 70.5% :( It appears that the spare disk in the array is now also exhibiting a SMART error (one of the other 3TB Seagates ... I've actually got 2 other new 4TB WD intended to replace the Seagates).

      Have run the following commands from Putty this morning:

      cat /proc/mdstat

      Source Code

Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sde[6](S) sda[7](F) sdf[8] sdd[11] sdc[9] sdb[10]
      14651325440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/4] [_UUUU_]

      blkid

      Source Code

/dev/sdb: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="5e6cbf95-0477-b0a2-a422-78f7f29ab984" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
/dev/sdc: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="39ffaeb0-5e20-0b00-9000-2dfbbe71e62e" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
/dev/md127: LABEL="MEDIAVAULT" UUID="127de519-96fc-4eb8-8c38-21fb02009bd1" TYPE="xfs"
/dev/sdd: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="6c06fcfa-fdc5-a834-295d-114dd8a848a7" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
/dev/sdf: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="2ec934c5-f798-aa47-7218-e5319d493298" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
/dev/sda: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="5fdfd228-cccd-f51e-0ee5-e8e6eaf3d87c" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
/dev/sdg1: UUID="2621b640-5f9a-46d6-ae41-fcce0e913052" TYPE="ext4"
/dev/sdg5: UUID="e4b50853-771c-4585-8c1e-e9afdc87d55b" TYPE="swap"
/dev/sdg3: UUID="204DF80071966050" TYPE="ntfs"
/dev/sde: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="8bba6526-3bcf-ca2a-ba95-6df86d4c07fa" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
      fdisk -l


      Source Code

Disk /dev/sdb: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sda: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sda doesn't contain a valid partition table

Disk /dev/sdf: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sdf doesn't contain a valid partition table

Disk /dev/sde: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sde doesn't contain a valid partition table

Disk /dev/sdc: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdd: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sdd doesn't contain a valid partition table

Disk /dev/md127: 15003.0 GB, 15002957250560 bytes
2 heads, 4 sectors/track, -632135936 cylinders, total 29302650880 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 2621440 bytes
Disk identifier: 0x00000000

Disk /dev/md127 doesn't contain a valid partition table

Disk /dev/sdg: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00014414

   Device Boot      Start         End      Blocks   Id  System
/dev/sdg1   *        2048    41945087    20971520   83  Linux
/dev/sdg2       41945088    49807359     3931136    5  Extended
/dev/sdg3       49807360   234440703    92316672    7  HPFS/NTFS/exFAT
/dev/sdg5       41947136    49807359     3930112   82  Linux swap / Solaris
Is there any hope of being able to add the original disk back in and force a rebuild using mdadm --assemble, then swap out the spare/failed disc and rebuild again?
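For what it's worth, the usual command-line approach to this is a forced assemble from the surviving members. A rough sketch, assuming the device names from the mdstat output above (verify the per-disk event counts with mdadm --examine before forcing anything, and don't run this against a live array without a plan):

```shell
# Compare event counts across members; the "failed" disk that dropped out
# mid-rebuild will usually be only slightly behind the others.
mdadm --examine /dev/sd[abcdef] | grep -E "/dev/sd|Events"

# Stop the partially assembled array, then force-assemble it from the
# original members (device names here are illustrative, from the thread).
mdadm --stop /dev/md127
mdadm --assemble --force /dev/md127 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sdf

# Check the result: the array should come up active, possibly degraded.
cat /proc/mdstat
mdadm --detail /dev/md127
```

--force tells mdadm to accept a member whose event count is slightly stale, which is exactly the situation after a rebuild aborts; the trade-off is that any writes the stale disk missed are silently inconsistent, so copy critical data off before stressing the array further.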

Any other suggestions for trying to salvage the array? I know I should have migrated away from RAID 5 some time ago, and I probably won't get much sympathy in that respect, but I would really appreciate any input on next steps.
    • Update: I was able to successfully rebuild the array with the original drives and have access to my files with the array reporting as ‘Clean’.

      Given the failure that occurred when swapping out the drive with 24 pending sectors, what would be my next best course of action? Swapping out the drive with 40 pending sectors (which is the one I believe caused the failure)? I’m sure I’ve swapped out discs with just a few errors without issue previously.

      Thanks in advance!
      Brian
    • brifletch wrote:

      Swapping out the drive with 40 pending sectors
You should be able to remove the drive using Raid Management in the GUI:

1) Select the array, click Delete on the menu, select the drive to be removed in the dialog and click OK; this removes the drive from the array.
2) Remove the drive from the machine and install the new drive.
3) Under Storage -> Disks, select the new drive and wipe it from the menu; a short wipe is enough.
4) Under Raid Management, select the array and click Recover on the menu. This step will either work or it will fail: if the drive does not show in the dialog, format it (the same as your array) and then repeat the recover.
5) The array will now rebuild.
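For reference, the GUI steps above correspond roughly to these mdadm commands (a sketch only; /dev/sdX stands for the old drive and /dev/sdY for its replacement, and the array name matches the thread's /dev/md127):

```shell
# Mark the old drive failed and drop it from the array.
mdadm --manage /dev/md127 --fail /dev/sdX
mdadm --manage /dev/md127 --remove /dev/sdX

# ...physically swap the drive, then clear any leftover signatures
# on the replacement (equivalent to the GUI's short wipe):
wipefs -a /dev/sdY

# Add the new drive; this triggers the rebuild automatically.
mdadm --manage /dev/md127 --add /dev/sdY

# Monitor rebuild progress.
watch cat /proc/mdstat
```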
      Raid is not a backup! Would you go skydiving without a parachute?
    • Hi @geaves

I followed your instructions and the recover process started successfully, but it only got 0.4% through before failing. I received a notification email stating that "A FailSpare event had been detected on md device /dev/md/MEDIAVAULT. It could be related to component device /dev/sda". sda is - unsurprisingly - the other drive showing the SMART error, and it is shown as a 'faulty spare' in the Array Detail dialogue.

Do I have any other options to recover the array and swap these discs out? Could cloning the discs with errors be an option, or running SpinRite on one (or both) of the discs and rebuilding the array from there? Or should I just accept that my chances of rebuilding the array are gone, and try to get as much data as possible off the 'clean' mounted array with the 2 erroring drives in place?

      Thanks in advance for your support!


    • brifletch wrote:

      Do I have any other options to recover the array and swap these discs out?
The array at present is fine; Raid5 will tolerate one disk failure, and the spare is just that, so it should not affect the array itself. But TBH I have never dealt with spares on mdadm, and I'm surprised you're using one given that you have 6 drives.

      6x4TB in a Raid6 will give you 16TB of space but would allow up to 2 drive failures.
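The capacity arithmetic behind that can be sketched quickly (usable space is the number of drives minus the parity drives, times the drive size):

```shell
# Usable capacity rule of thumb: (n - parity_disks) * disk_size
n=6; size_tb=4
raid5=$(( (n - 1) * size_tb ))   # RAID5 sacrifices one disk to parity
raid6=$(( (n - 2) * size_tb ))   # RAID6 sacrifices two disks to parity
echo "RAID5: ${raid5}TB usable, RAID6: ${raid6}TB usable"
```

So moving from Raid5 to Raid6 on 6x4TB drives costs 4TB of usable space in exchange for surviving a second drive failure, which is exactly the scenario that bit this rebuild.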

I read your post when you first posted it, but did you create the array using the GUI or the command line? The reason I ask is that the last time I used the Raid option, I don't remember there being an option for setting a drive as a spare.
The object of a spare is to take over when a drive fails.
You should be able to remove it, as technically it's not part of the array. But there is some information written to that spare which would allow mdadm to use it in the event of a drive failure.

If it failed after adding a drive that is part of the array, that would point to a hardware drive failure, as you said of the spare. I know Spinrite well - I have a copy and it's saved my a**** more than once - but all it will do is move data, if it can, from a bad sector to a good area of the disk and then mark the sector as bad. SMART is already doing that for you.

      Have you got a backup of data you don't want to lose?

The reason being, this could be a PIA. Using different sizes in your array is not a good idea. Leaving the array as is and replacing the 3TB drives one at a time (if possible) requires a rebuild for each drive change, which means extra drive stress. Then, once the array has rebuilt completely, the array will need to grow to make use of the extra space, as will the file system.
      Raid is not a backup! Would you go skydiving without a parachute?
    • geaves wrote:

      did you create the array using the GUI or did you use the command line
To be completely honest, I don't remember how the RAID was set up initially. I've been running this NAS since 2012, and it's steadily grown from 2TB drives in the ProLiant's 4 main bays, to adding a further 2 discs, then steadily replacing the 2TB discs with 3TB discs, and now replacing the 3TB discs with 4TB discs as each existing disc has exhibited errors or died.

      I've always used OMV as the system software, and have encountered a few major issues with drives dying and dropping out of the array - including 2 drives dying at the same time, previously - but I guess I've always been lucky in the sense that I've been able to add new drives and rebuild successfully.


      geaves wrote:

      Have you got a backup of data you don't want to lose?
Most of it is backed up in one sense or another, but not my most recent files. I'm thinking at this stage it would be an idea to remount the array with 5 discs and start copying the remaining files off the NAS. Once done, I'd swap out the remaining 3 Seagate 3TB drives (2 with errors, 1 still 'good' ... I'll never buy Seagate again ... every drive seems to have died at around the same point in its life!), replace them with the 3 WD 4TB drives, and start the new array using Raid 6 as you suggested.

      What do you think?
    • brifletch wrote:

      Most of it is backed up in one sense or another, but not my most recent files.
:thumbup: Well, you get 8 out of 10 for that :D As far as I'm concerned, using a raid is fine if the end user is comfortable with that, and it seems you are, but there are other ways to set this up.

If this were me, I would:

Ensure the most important files are backed up.
Start again, using the GUI:

      1) A complete rebuild not a reinstall.
      2) Remove SMB/CIFS shares
      3) Remove all shared folders that point to the array
      4) Remove each drive in turn from the array before removing the array itself
      5) Wipe each drive including the new ones under Storage -> Disks
6) Recreate the array using the GUI. Personally I would use Raid6 formatted to ext4; the 6x4TB drives will give you 16TB of space and will allow for 2 drive failures.

      Bear in mind that the array will take some time to build, and you will have to recreate the shares and any SMB/CIFS.

The above, whilst it will take time, will be the least frustrating. The other option - replace, rebuild, replace, rebuild, replace, rebuild - could potentially go wrong with another drive failing during one of the rebuilds.
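Under the hood, step 6 amounts to something like the following (a sketch only; the GUI does this for you, the device names are illustrative, and mdadm --create destroys any existing data on the drives):

```shell
# Clear leftover raid signatures from all six drives first.
wipefs -a /dev/sd[abcdef]

# Create a 6-drive RAID6 array and put ext4 on it.
mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[abcdef]
mkfs.ext4 -L MEDIAVAULT /dev/md0

# The initial sync runs in the background; check on it with:
mdadm --detail /dev/md0
cat /proc/mdstat
```

The array is usable while the initial sync runs, but performance will be reduced until it completes.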
      Raid is not a backup! Would you go skydiving without a parachute?

      Hey @geaves

So, I've successfully managed to copy all the remaining data, plus the previously backed-up files, from the RAID array to 2 new 10TB external HDDs. What's puzzling me is that at some point during the copy-off process, both the Seagate drives that were showing the 24 and 40 pending-sector SMART errors turned 'green' with no errors reported, while the remaining Seagate drive that had been reporting as error-free is now 'red' and reporting 8 pending sectors. Does this suggest something more sinister at play? Are the drives really approaching their end of life, or could I still use them for other purposes?

      Next, to rebuild with the new WD 4TBs and reinstall OMV ...

      brifletch wrote:

      Does this suggest anything more sinister at play?
      It might, and you may not know until your new drives are up and running, but TBH I would just get yourself sorted.

      brifletch wrote:

      Are the drives really approaching their end of life, or should I use them for other purposes?
I have over a dozen 3.5" drives of various sizes and ages. Are they still usable? Yes - I have run Spinrite across them followed by DBan. Some have old data stored on them; they are not in use, but every now and then I test them just to make sure they are accessible. And just to show how bad I am, I have a Maxtor 80Gb Sata 3.0 drive dated 16th Jan 2006 :)

You can always use mismatched drives in an external USB dock; you just don't use them for daily duty. Before we moved, I had a drive that contained images of Dos 6.2 and Windows 3.1. Why? Don't know really, perhaps it's a nostalgia thing :)
      Raid is not a backup! Would you go skydiving without a parachute?