8Tb RAID 5 array is Clean, FAILED. Pls help before I do something to make things worse.

IamTed · 17. April 2015

Hi everyone,
A few days ago, the RAID in my OMV box disappeared. The details are in this post. RAID 5 array has vaporized! Advice needed please
Last night the RAID was rebuilding, it was accessible and when I went to bed, the rebuild was up to 80% complete. All looked good in my world.

Then, I woke up this morning... and found that it was reporting the RAID as clean, FAILED. Then I checked the details and found this:

Code

mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
     Array Size : 7813523456 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 1953380864 (1862.89 GiB 2000.26 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent


    Update Time : Thu Apr 16 03:21:20 2015
          State : clean, FAILED
 Active Devices : 3
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 1


         Layout : left-symmetric
     Chunk Size : 512K


           Name : OMV2:OMV
           UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
         Events : 1816228


    Number   Major   Minor   RaidDevice State
       5       8       64        0      active sync   /dev/sde
       6       8       80        1      active sync   /dev/sdf
       2       8       32        2      active sync   /dev/sdc
       3       0        0        3      removed
       4       0        0        4      removed


       3       8       48        -      faulty spare   /dev/sdd
       7       8       16        -      spare   /dev/sdb

Alles anzeigen

BLKID gave this:

Code

blkid
/dev/sda1: UUID="d81eacfe-439c-4e12-bbb2-a933e69d4dfa" TYPE="ext4"
/dev/sda5: UUID="6e724718-95ae-4e0c-9e17-a469c4a7627e" TYPE="swap"
/dev/sdc: UUID="3e952187-f4e8-e08a-19b7-63a4cdc912c7" LABEL="OMV2:OMV" TYPE="linux_raid_member"
/dev/sdd: UUID="3e952187-f4e8-e08a-19b7-63a4cdc912c7" LABEL="OMV2:OMV" TYPE="linux_raid_member"
/dev/sde: UUID="3e952187-f4e8-e08a-19b7-63a4cdc912c7" LABEL="OMV2:OMV" TYPE="linux_raid_member"
/dev/sdf: UUID="3e952187-f4e8-e08a-19b7-63a4cdc912c7" LABEL="OMV2:OMV" TYPE="linux_raid_member"
/dev/md127: LABEL="OMV" UUID="13a7164c-7be5-49e9-ab63-d704f96f890e" TYPE="ext4"
/dev/sdg: UUID="3e952187-f4e8-e08a-19b7-63a4cdc912c7" LABEL="OMV2:OMV" TYPE="linux_raid_member"
/dev/sdb: UUID="3e952187-f4e8-e08a-19b7-63a4cdc912c7" LABEL="OMV2:OMV" TYPE="linux_raid_member"

cat /proc/mdstat gave this:

Code

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid5 sdb[7](S) sde[5] sdd[3](F) sdc[2] sdf[6]
      7813523456 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/3] [UUU__]


unused devices: <none>

Looking at each drive with mdadm --examine, I found this:

Code

root@OMV:/# mdadm --examine /dev/sdb
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
           Name : OMV2:OMV
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
   Raid Devices : 5


 Avail Dev Size : 3906762752 (1862.89 GiB 2000.26 GB)
     Array Size : 15627046912 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 3906761728 (1862.89 GiB 2000.26 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 174d04b9:c97d1665:84cd30bd:8ed0ecd1


    Update Time : Thu Apr 16 11:02:19 2015
       Checksum : 69e9846d - correct
         Events : 1827322


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : spare
   Array State : AAA.. ('A' == active, '.' == missing)
root@OMV:/# mdadm --examine /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
           Name : OMV2:OMV
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
   Raid Devices : 5


 Avail Dev Size : 3906762752 (1862.89 GiB 2000.26 GB)
     Array Size : 15627046912 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 3906761728 (1862.89 GiB 2000.26 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 099d5100:821d9963:776e9700:cc04ecbc


    Update Time : Thu Apr 16 11:02:29 2015
       Checksum : de1f4954 - correct
         Events : 1827330


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 2
   Array State : AAA.. ('A' == active, '.' == missing)
root@OMV:/# mdadm --examine /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
           Name : OMV2:OMV
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
   Raid Devices : 5


 Avail Dev Size : 3906762752 (1862.89 GiB 2000.26 GB)
     Array Size : 15627046912 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 3906761728 (1862.89 GiB 2000.26 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 0c21a9e5:9f039ea9:12ac15c7:c11a2e3c


    Update Time : Thu Apr 16 03:21:11 2015
       Checksum : 4f46ee71 - correct
         Events : 1816220


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 3
   Array State : AAAAA ('A' == active, '.' == missing)
root@OMV:/# mdadm --examine /dev/sde
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
           Name : OMV2:OMV
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
   Raid Devices : 5


 Avail Dev Size : 3906762752 (1862.89 GiB 2000.26 GB)
     Array Size : 15627046912 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 3906761728 (1862.89 GiB 2000.26 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : ea732e82:10c69291:df0a29b0:7d4aad34


    Update Time : Thu Apr 16 11:02:49 2015
       Checksum : b548ab04 - correct
         Events : 1827346


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 0
   Array State : AAA.. ('A' == active, '.' == missing)
root@OMV:/# mdadm --examine /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
           Name : OMV2:OMV
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
   Raid Devices : 5


 Avail Dev Size : 3906762752 (1862.89 GiB 2000.26 GB)
     Array Size : 15627046912 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 3906761728 (1862.89 GiB 2000.26 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 8886034f:2e225de1:55989cc1:3ef2fea4


    Update Time : Thu Apr 16 11:02:49 2015
       Checksum : 53ad4ef9 - correct
         Events : 1827346


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 1
   Array State : AAA.. ('A' == active, '.' == missing)
root@OMV:/# mdadm --examine /dev/sdg
/dev/sdg:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
           Name : OMV2:OMV
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
   Raid Devices : 5


 Avail Dev Size : 3906762752 (1862.89 GiB 2000.26 GB)
     Array Size : 15627046912 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 3906761728 (1862.89 GiB 2000.26 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 44029c1f:b0a408e3:6d6d35c2:b5660be0


    Update Time : Sun Mar  1 06:44:42 2015
       Checksum : 614b8b1f - correct
         Events : 153067


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 4
   Array State : AAAAA ('A' == active, '.' == missing)

Alles anzeigen

So, I figured that if I add another drive in as a spare, that the RAID would start to rebuild, but it didn't.

Code

mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
     Array Size : 7813523456 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 1953380864 (1862.89 GiB 2000.26 GB)
   Raid Devices : 5
  Total Devices : 6
    Persistence : Superblock is persistent


    Update Time : Thu Apr 16 21:09:24 2015
          State : clean, FAILED
 Active Devices : 3
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 2


         Layout : left-symmetric
     Chunk Size : 512K


           Name : OMV2:OMV
           UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
         Events : 1833930


    Number   Major   Minor   RaidDevice State
       5       8       64        0      active sync   /dev/sde
       6       8       80        1      active sync   /dev/sdf
       2       8       32        2      active sync   /dev/sdc
       3       0        0        3      removed
       4       0        0        4      removed


       3       8       48        -      faulty spare   /dev/sdd
       7       8       16        -      spare   /dev/sdb
       8       8      128        -      spare   /dev/sdi

Alles anzeigen

(Continued)

IamTed · 17. April 2015

(Continued)
Looking at the events for each drive, I find this:

Code

mdadm --examine /dev/sdb | egrep 'Event|
/dev/sd'/dev/sdb:Events : 1834204
root@OMV:/# mdadm --examine /dev/sdc | egrep 'Event|
/dev/sd'/dev/sdc:Events : 1834212
root@OMV:/# mdadm --examine /dev/sdd | egrep 'Event|
/dev/sd'/dev/sdd:Events : 1816220
root@OMV:/# mdadm --examine /dev/sde | egrep 'Event|
/dev/sd'/dev/sde:Events : 1834228
root@OMV:/# mdadm --examine /dev/sdf | egrep 'Event|
/dev/sd'/dev/sdf:Events : 1834236
root@OMV:/# mdadm --examine /dev/sdg | egrep 'Event|
/dev/sd'/dev/sdg:Events : 153067

Alles anzeigen

From my noob limited knowledge, it looks like there is a good chance that the RAID can be rebuilt with sdb, sdc, sde, and sdf. I am thinking that if I run mdadm --assemble --run --force /dev/md127 /dev/sd[b-c,f], will it work to rebuild the array? Or, am I going to loose everything?

I am on the verge of panicking, as I have over 1200 movies and over 150 TV series that we have collected over the past 10 years, I am hoping that that I'm not totally..... I don't want to even think about it.
If there is anything that I can do to save this, I would be grateful for any help that anyone can provide.
Thank you!

ryecoaaron · 17. April 2015

Raid 5 can only handle one drive failing. Anymore than that and all data is gone Raid isn't backup...

You have such a mess. How many drives are in the original array? Which drives are they?

IamTed · 17. April 2015

Five drives in the original. They were: sdc, sdd, sde, sdf and sdg.

It appears that sdg was the original one that dropped out of the array.

sdb was added to replace sdg when the array tried to rebuild.

IamTed · 17. April 2015

Looking through the syslog, I've come across this from when it was rebuilding:

Code

Apr 16 03:21:20 OMV kernel: [46357.035162] end_request: I/O error, dev sdd, sector 3673200640
Apr 16 03:21:20 OMV kernel: [46357.209100] md: md127: recovery done.
Apr 16 03:21:20 OMV kernel: [46357.217561] Aborting journal on device md127-8.
Apr 16 03:21:20 OMV kernel: [46357.217611] EXT4-fs error (device md127) in ext4_reserve_inode_write: Journal has aborted
Apr 16 03:21:20 OMV kernel: [46357.217670] JBD2: I/O error detected when updating journal superblock for md127-8.
Apr 16 03:21:20 OMV kernel: [46357.217725] EXT4-fs error (device md127) in ext4_dirty_inode: Journal has aborted
Apr 16 03:21:20 OMV kernel: [46357.217770] EXT4-fs (md127): previous I/O error to superblock detected
Apr 16 03:21:20 OMV kernel: [46357.217818] EXT4-fs error (device md127): ext4_journal_start_sb: Detected aborted journal
Apr 16 03:21:20 OMV kernel: [46357.217867] EXT4-fs (md127): Remounting filesystem read-only
Apr 16 03:21:20 OMV kernel: [46357.219970] RAID5 conf printout:
Apr 16 03:21:20 OMV kernel: [46357.219973]  --- rd:5 wd:3
Apr 16 03:21:20 OMV kernel: [46357.219975]  disk 0, o:1, dev:sde
Apr 16 03:21:20 OMV kernel: [46357.219977]  disk 1, o:1, dev:sdf
Apr 16 03:21:20 OMV kernel: [46357.219979]  disk 2, o:1, dev:sdc
Apr 16 03:21:20 OMV kernel: [46357.219981]  disk 3, o:0, dev:sdd
Apr 16 03:21:20 OMV kernel: [46357.219983]  disk 4, o:1, dev:sdb
Apr 16 03:21:20 OMV kernel: [46357.229080] RAID5 conf printout:
Apr 16 03:21:20 OMV kernel: [46357.229082]  --- rd:5 wd:3
Apr 16 03:21:20 OMV kernel: [46357.229084]  disk 0, o:1, dev:sde
Apr 16 03:21:20 OMV kernel: [46357.229086]  disk 1, o:1, dev:sdf
Apr 16 03:21:20 OMV kernel: [46357.229088]  disk 2, o:1, dev:sdc
Apr 16 03:21:20 OMV kernel: [46357.229090]  disk 3, o:0, dev:sdd
Apr 16 03:21:20 OMV kernel: [46357.229097] RAID5 conf printout:
Apr 16 03:21:20 OMV kernel: [46357.229099]  --- rd:5 wd:3
Apr 16 03:21:20 OMV kernel: [46357.229101]  disk 0, o:1, dev:sde
Apr 16 03:21:20 OMV kernel: [46357.229102]  disk 1, o:1, dev:sdf
Apr 16 03:21:20 OMV kernel: [46357.229104]  disk 2, o:1, dev:sdc
Apr 16 03:21:20 OMV kernel: [46357.229106]  disk 3, o:0, dev:sdd
Apr 16 03:21:20 OMV kernel: [46357.245004] RAID5 conf printout:
Apr 16 03:21:20 OMV kernel: [46357.245006]  --- rd:5 wd:3
Apr 16 03:21:20 OMV kernel: [46357.245008]  disk 0, o:1, dev:sde
Apr 16 03:21:20 OMV kernel: [46357.245010]  disk 1, o:1, dev:sdf
Apr 16 03:21:20 OMV kernel: [46357.245012]  disk 2, o:1, dev:sdc
Apr 16 03:21:31 OMV monit[8084]: 'fs_media_13a7164c-7be5-49e9-ab63-d704f96f890e' filesytem flags changed to 0x1009

Alles anzeigen

What that tells me, is that sdd put it's fingers in it's ears and stopped talking, possibly due to a bad sector. I know that when I went to bed at midnight, the array was at 80% rebuilt and this happened at 3:20, so it was pretty close to finishing the rebuild. If I run Spinrite on sdd and am able to get the drive reading again in that area, would I be able to try and rebuild again?

I realize that RAID isn't a form of backup, which is why I have been eyeing SnapRaid. The problem is that this collection has been growing slowly over the years and has sort of grown out of control size wise. If I could have found someone that was willing to loan me about 8TB of drives to move this stuff to while I switched to SnapRaid, that would have been great, but it never happened....

ryecoaaron · 17. April 2015

If spinrite fixes the drive, I would try rebuilding. That should be just to get the content off the raid array though.

If you were close, I might be able to help

IamTed · 17. April 2015

Thanks Rye. I'm going to run SpinRite. Unfortunately, it's not a quick process. (There is so much promise with the speed of scanning drives with the new version, when it comes out...)

Thanks for the offer of help.... Let's see, drive time from Toronto to Wisconsin is about 12 hours.......... Just kidding, I have driving duty, getting my wife back and forth to dialysis.

IamTed · 22. April 2015

Well, I decided to run Spinrite on all the drives and sure enough, sdb and sdd had read issues. SpinRite seems to have corrected the issues on sdd, but sdb seems to be a little more of a challenge. Seeing that the array was originally sdc, sdd, sde, sdf and sdg, then sdd dropped out and the whole thing crapped out when it was trying to rebuild on to sdb to replace sdd, can I try to force assemble it with sdc-sdg? There was no changes to the content originally while this was going on with the array, so I hope that helps with the odds. Or would it be better to wait to see if SpinRite can bring sdb back from the dead?

ryecoaaron · 22. April 2015

I would try to use the original drives if possible.

IamTed · 24. April 2015

Ok, what a wild ride so far.... I ran SpinRite on all the drives and sdb is toast. So, I focused on the five original disks. I did a

Code

mdadm --assemble --run --force /dev/md127 /dev/sd[b-f]

which seemed to work. (Being that I had booted up without the old sdb, all the drive letters got reassigned.) I tried mdadm --detail /dev/md127 and got this:

Code

/dev/md127:
        Version : 1.2
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
     Array Size : 7813523456 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 1953380864 (1862.89 GiB 2000.26 GB)
   Raid Devices : 5
  Total Devices : 4
    Persistence : Superblock is persistent


    Update Time : Thu Apr 23 16:25:59 2015
          State : active, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0


         Layout : left-symmetric
     Chunk Size : 512K


           Name : OMV2:OMV
           UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
         Events : 1881421


    Number   Major   Minor   RaidDevice State
       5       8       48        0      active sync   /dev/sdd
       6       8       64        1      active sync   /dev/sde
       2       8       16        2      active sync   /dev/sdb
       3       8       32        3      active sync   /dev/sdc
       4       0        0        4      removed

Alles anzeigen

I shut the box down to add another drive in to replace the toasted sdb and when it booted, it hung for a while "Checking Quotas". When it finally finished that and booted up, I got this:

Code

/dev/md127:
        Version : 1.2
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
     Array Size : 7813523456 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 1953380864 (1862.89 GiB 2000.26 GB)
   Raid Devices : 5
  Total Devices : 4
    Persistence : Superblock is persistent


    Update Time : Thu Apr 23 19:30:13 2015
          State : clean, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0


         Layout : left-symmetric
     Chunk Size : 512K


           Name : OMV2:OMV
           UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
         Events : 1884280


    Number   Major   Minor   RaidDevice State
       5       8       64        0      active sync   /dev/sde
       6       8       80        1      active sync   /dev/sdf
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       4       0        0        4      removed

Alles anzeigen

So far, so good. Through the webGUI, I chose the Recover option and added the new sdb to the array. The array started to do the rebuild and I went to bed.

This morning, I woke to find that the array was listed as clean, FAILED, and I could not access any files on it. It looks like it ran up against more issues with drive sdd. Here is the mdadm --detail /dev/md127:

Code

/dev/md127:
        Version : 1.2
  Creation Time : Sun Jan  1 13:44:03 2006
     Raid Level : raid5
     Array Size : 7813523456 (7451.56 GiB 8001.05 GB)
  Used Dev Size : 1953380864 (1862.89 GiB 2000.26 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent


    Update Time : Fri Apr 24 10:48:31 2015
          State : clean, FAILED
 Active Devices : 3
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 1


         Layout : left-symmetric
     Chunk Size : 512K


           Name : OMV2:OMV
           UUID : 3e952187:f4e8e08a:19b763a4:cdc912c7
         Events : 1903164


    Number   Major   Minor   RaidDevice State
       5       8       64        0      active sync   /dev/sde
       6       8       80        1      active sync   /dev/sdf
       2       8       32        2      active sync   /dev/sdc
       3       0        0        3      removed
       4       0        0        4      removed


       3       8       48        -      faulty spare   /dev/sdd
       7       8       16        -      spare   /dev/sdb

Alles anzeigen

I then added another drive to the array, thinking that it would rebuild on to that, but no dice. At that point, I realized that I was back at square one, so I rebooted the box, and issued this:

Code

mdadm --assemble --run --force /dev/md127 /dev/sd[b-f]
mdadm: forcing event count in /dev/sdd(3) from 1902723 upto 1903306
mdadm: clearing FAULTY flag for device 2 in /dev/md127 for /dev/sdd
mdadm: /dev/md127 has been started with 4 drives (out of 5) and 1 spare.

I was able to mount the array and access the files. Obviously, there is something with the sdd drive, as it seems to crap out during the rebuilding process. I am looking for the best option to go with from here. I realize that I am on the edge of the cliff with my toes danging over. If one more drive fails, I'm pooched. I am sitting here with an array that is listed as clean, degraded, recovering, but craps out during the rebuild due to a drive that if cranky. I realize that there could be a very small area on sdd that is causing problems. Here is what I've come up with for ideas:

Run SpinRite again on the sdd drive to see it can access the trouble area. If I do this on a Level 3 or 4, it was saying that it would take about 2 weeks to complete, is it worked. Then add the sdd drive back to the array and rebuild to a spare.
Take the sdd drive and use Clonezilla to copy it to a spare 2Tb drive that I have and use the -rescue switch. Then replace the old sdd drive with the cloned one and rebuild with a second spare drive. What I don't know, is if the rebuild will handle the missing sector differently that it did when it was getting a I/O error back from the old sdd drive during the prior rebuild attempts.
Say screw it, and run the array on the four out of five drives, so the array is clean and degraded and copy everything off the array and go from there, with either Greyhole or SnapRAID.

Opinions?

IamTed · 25. April 2015

BTW, if the best option is the last, then I would like to copy the files off by connecting the drive to the OMV box and moving them. I could do it the painful way, through the CLI, but is there a better way? I know, that moving 8Tb off through the NIC, to hard drives in another box, will take forever and a day.
Thanks!

8Tb RAID 5 array is Clean, FAILED. Pls help before I do something to make things worse.

Jetzt mitmachen!

Tags