RAID Disappeared - need help to rebuild


    • RAID Disappeared - need help to rebuild

      Hi Guys,

      I've woken up this morning and can't access any media on my server. Logged into OMV GUI, rebooted and the drives are all still there, but the RAID is missing (I believe it was a RAID 5 array). Have looked at a few threads on the forum, and tried to start a self-diagnosis, but I was hoping one of you would kindly offer me some guidance.

      blkid:
      /dev/sdb: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="ef7151df-f61f-ce6e-1612-4dbfc0e9d1cf" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
      /dev/sdd: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="55ad5625-0b73-7d46-248e-aedccfc05460" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
      /dev/sdf: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="6c9b69a2-58ef-cced-95f1-11233b066a54" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
      /dev/sde: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="7a6cb4e9-902f-3852-3c8a-01196f74dcea" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
      /dev/sda: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="5fdfd228-cccd-f51e-0ee5-e8e6eaf3d87c" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
      /dev/sdc: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="1e70980d-7799-6a00-9786-2be2e6522558" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"

      cat /proc/mdstat:
      Personalities : [raid6] [raid5] [raid4]
      md126 : inactive sdf[5](S) sdc[2](S)
      5860531120 blocks super 1.2

      md127 : inactive sda[7] sde[6] sdd[3] sdb[1]
      11721062240 blocks super 1.2

      I found some responses to other issues that guided the user to force a rebuild, but because (for an unknown reason) the 6 drives seem to be split across 2 mds I wasn't really sure what my next step should be. All 6 drives were, as of last night, in the same single RAID array.

      Your help would be most appreciated!
      Thanks in advance,
      Brian
    • Re,

      the command will be something like:
      mdadm --assemble /dev/mdX /dev/sd[abcdef] (change the X to 0, 126 or 127 ... whatever you want)
      if that fails, try to force it:
      mdadm --assemble --force /dev/mdX /dev/sd[abcdef]

      BUT:
      You should find the root-cause for this behavior first - just check the logs for any error message!
      cat /var/log/messages | grep KEYWORD (KEYWORD is something like md, sda, sdb, sd..., raid, ...)
      cat /var/log/syslog | grep KEYWORD

      Sc0rp
    • Re,

      tkaiser wrote:

      Just for my personal understanding: the above /proc/mdstat output talking about two RAIDs with different members can be ignored?
      I hope so ... since both /dev/sdf and /dev/sdc are marked "spare" (S) in the md126 array, I hope the superblocks are intact for assembling with the remaining disks. Btw. the "/dev/mdX" numbering is more OS-related than array-related ... md works here much like a hardware controller does.

      You can issue the command:
      mdadm --examine /dev/sd[abcdef]
      to get a clear picture of the status of all drives - I hope there are no major "event mismatches" on the array ...
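      A quick way to compare the event counters across drives is to filter that output with awk - a sketch, demonstrated here on a captured snippet so it can be run anywhere (on the server you would pipe `mdadm --examine /dev/sd[abcdef]` into the same awk program):

      ```shell
      # Remember each device header line, then print it next to its Events value.
      # Live system: mdadm --examine /dev/sd[abcdef] | awk '/^\/dev\//{dev=$1} /Events :/{print dev, $NF}'
      awk '/^\/dev\//{dev=$1} /Events :/{print dev, $NF}' <<'EOF'
      /dev/sda:
               Events : 288625
      /dev/sde:
               Events : 275260
      EOF
      ```

      Drives whose counters lag far behind the rest are the ones md kicked out longest ago.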

      Sc0rp
    • Sc0rp wrote:

      You can issue the command:
      mdadm --examine /dev/sd[abcdef]
      to get a clear picture of the status of all drives - I hope there are no major "event mismatches" on the array ...

      Sc0rp
      The command you suggested returned the following:

      Source Code

      /dev/sda:
      Magic : a92b4efc
      Version : 1.2
      Feature Map : 0x0
      Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
      Name : openmediavault:MEDIAVAULT (local to host openmediavault)
      Creation Time : Mon Sep 17 01:03:50 2012
      Raid Level : raid5
      Raid Devices : 6
      Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
      Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
      Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
      Data Offset : 2048 sectors
      Super Offset : 8 sectors
      State : clean
      Device UUID : 7a6cb4e9:902f3852:3c8a0119:6f74dcea
      Update Time : Thu Nov 23 22:46:35 2017
      Checksum : 2b886914 - correct
      Events : 288625
      Layout : left-symmetric
      Chunk Size : 512K
      Device Role : Active device 5
      Array State : AA.A.A ('A' == active, '.' == missing)
      /dev/sdb:
      Magic : a92b4efc
      Version : 1.2
      Feature Map : 0x0
      Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
      Name : openmediavault:MEDIAVAULT (local to host openmediavault)
      Creation Time : Mon Sep 17 01:03:50 2012
      Raid Level : raid5
      Raid Devices : 6
      Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
      Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
      Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
      Data Offset : 2048 sectors
      Super Offset : 8 sectors
      State : active
      Device UUID : 6c9b69a2:58efcced:95f11123:3b066a54
      Update Time : Thu Nov 23 22:46:27 2017
      Checksum : f37650e5 - correct
      Events : 288622
      Layout : left-symmetric
      Chunk Size : 512K
      Device Role : Active device 4
      Array State : AA.AAA ('A' == active, '.' == missing)
      /dev/sdc:
      Magic : a92b4efc
      Version : 1.2
      Feature Map : 0x0
      Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
      Name : openmediavault:MEDIAVAULT (local to host openmediavault)
      Creation Time : Mon Sep 17 01:03:50 2012
      Raid Level : raid5
      Raid Devices : 6
      Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
      Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
      Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
      Data Offset : 2048 sectors
      Super Offset : 8 sectors
      State : clean
      Device UUID : 5fdfd228:cccdf51e:0ee5e8e6:eaf3d87c
      Update Time : Thu Nov 23 22:46:35 2017
      Checksum : 97485482 - correct
      Events : 288625
      Layout : left-symmetric
      Chunk Size : 512K
      Device Role : Active device 0
      Array State : AA.A.A ('A' == active, '.' == missing)
      /dev/sdd:
      Magic : a92b4efc
      Version : 1.2
      Feature Map : 0x0
      Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
      Name : openmediavault:MEDIAVAULT (local to host openmediavault)
      Creation Time : Mon Sep 17 01:03:50 2012
      Raid Level : raid5
      Raid Devices : 6
      Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
      Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
      Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
      Data Offset : 2048 sectors
      Super Offset : 8 sectors
      State : clean
      Device UUID : ef7151df:f61fce6e:16124dbf:c0e9d1cf
      Update Time : Thu Nov 23 22:46:35 2017
      Checksum : c8fc5ca5 - correct
      Events : 288625
      Layout : left-symmetric
      Chunk Size : 512K
      Device Role : Active device 1
      Array State : AA.A.A ('A' == active, '.' == missing)
      /dev/sde:
      Magic : a92b4efc
      Version : 1.2
      Feature Map : 0x0
      Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
      Name : openmediavault:MEDIAVAULT (local to host openmediavault)
      Creation Time : Mon Sep 17 01:03:50 2012
      Raid Level : raid5
      Raid Devices : 6
      Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
      Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
      Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
      Data Offset : 2048 sectors
      Super Offset : 8 sectors
      State : active
      Device UUID : 1e70980d:77996a00:97862be2:e6522558
      Update Time : Wed Nov 1 08:02:57 2017
      Checksum : 33f8af78 - correct
      Events : 275260
      Layout : left-symmetric
      Chunk Size : 512K
      Device Role : Active device 2
      Array State : AAAAAA ('A' == active, '.' == missing)
      /dev/sdf:
      Magic : a92b4efc
      Version : 1.2
      Feature Map : 0x0
      Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
      Name : openmediavault:MEDIAVAULT (local to host openmediavault)
      Creation Time : Mon Sep 17 01:03:50 2012
      Raid Level : raid5
      Raid Devices : 6
      Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
      Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
      Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
      Data Offset : 2048 sectors
      Super Offset : 8 sectors
      State : clean
      Device UUID : 55ad5625:0b737d46:248eaedc:cfc05460
      Update Time : Thu Nov 23 22:46:35 2017
      Checksum : 9495447e - correct
      Events : 288625
      Layout : left-symmetric
      Chunk Size : 512K
      Device Role : Active device 3
      Array State : AA.A.A ('A' == active, '.' == missing)


    • brifletch wrote:

      The command you suggested returned the following
      But the log output is still missing ;) Something like this would send it to an online pasteboard service:

      Source Code

      zgrep -E "md|sda|sdb|sdc|sde|sdf|raid" /var/log/syslog* | grep -v systemd | curl -F 'sprunge=<-' http://sprunge.us
      zgrep -E "md|sda|sdb|sdc|sde|sdf|raid" /var/log/messages* | grep -v systemd | curl -F 'sprunge=<-' http://sprunge.us
    • Thanks tkaiser, that was really helpful. I was in the process of manually copying and pasting the outputs ... I'm a bit of a novice, but can follow instructions! :)

      zgrep -E "md|sda|sdb|sdc|sde|sdf|raid" /var/log/syslog* | grep -v systemd | curl -F 'sprunge=<-' http://sprunge.us
      http://sprunge.us/HCaD

      zgrep -E "md|sda|sdb|sdc|sde|sdf|raid" /var/log/messages* | grep -v systemd | curl -F 'sprunge=<-' http://sprunge.us
      http://sprunge.us/KLXL
    • brifletch wrote:

      http://sprunge.us/HCaD
      Ok, RAID was not coming up since

      Source Code

      /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.332607] md/raid:md127: device sda operational as raid disk 0
      /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.332610] md/raid:md127: device sde operational as raid disk 5
      /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.332612] md/raid:md127: device sdd operational as raid disk 3
      /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.332615] md/raid:md127: device sdb operational as raid disk 1
      /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.333341] md/raid:md127: allocated 0kB
      /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.333459] md/raid:md127: not enough operational devices (2/6 failed)
      And the troubles started at '/var/log/syslog.1:Nov 23 22:46:31' on sdb. Everything else is beyond my mdraid knowledge (since I hate it wholeheartedly ;) ), but I would at least check SMART attribute 199 (UDMA_CRC_Error_Count) of sdb now, and check sdc and sdf too (since they are also reported as missing today).
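      Checking that attribute comes down to reading the raw value (last column) out of the `smartctl -A` table - a sketch, demonstrated on a captured attribute line so it can be run anywhere (on the server it would be `smartctl -A /dev/sdb`):

      ```shell
      # SMART attribute table lines start with the attribute ID; the raw value is
      # the last field. Live: smartctl -A /dev/sdb | awk '$1 == 199 {print $NF}'
      awk '$1 == 199 {print $NF}' <<'EOF'
      199 UDMA_CRC_Error_Count   -OSRCK   200   200   000    -    0
      EOF
      ```

      A raw value above 0 for attribute 199 usually points at cabling rather than the platters.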
    • Re,

      according to the output there is good news ... and bad news:
      - good: the "Magic" and the Array UUID are the same on all drives, so they all still belong to the same array
      - bad: the Event counters differ (but they seem close enough for a reassemble)

      Any conclusion about the root cause? Finding it would be highly necessary ...

      Maybe you can "copy" (read: back up) the log files for later searching ...
      cp -v /var/log/messages* /root/20171124-raid-issue (this will create the subdir "20171124-raid-issue" under the home of root)
      cp -v /var/log/syslog* /root/20171124-raid-issue

      After the files are copied, you can try to reassemble ...

      Sc0rp
    • Re,

      checked the logs ... here are the most recent error-lines:

      /var/log/syslog.1:Nov 23 22:46:35 openmediavault kernel: [23464678.453691] md/raid:md127: Disk failure on sdb, disabling device.
      /var/log/syslog.1:Nov 23 22:46:35 openmediavault kernel: [23464678.453691] md/raid:md127: Operation continuing on 4 devices.

      That means:
      - sdb was disabled due to massive errors on the device (read errors) ... and with that, your array went from "degraded" to "dead"
      - the 2nd line states that there was already one missing drive before that ... and with that, your redundancy was gone

      Don't you have email notification?
      Which drives (vendor/model) do you use in this setup?
      Which power supply do you use?

      Sc0rp

      EDIT/ps: you should also check the rotated backlogs ...
      ls -la /var/log | grep syslog (shows the backlogs for syslog)
      ls -la /var/log | grep messages (shows the backlogs for messages)
      Do both commands and remember the numbers, then do:
      zcat /var/log/syslog.X.gz | grep sdY (X is a number between 1-7 - look at the lists from the commands above; Y is a, b, c, d, e or f)
      zcat /var/log/messages.X.gz | grep sdY
      one drive at a time (start with Y=a), since the drive naming between mdstat and the provided logs looks weird ...
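      The whole drive-by-drive search above can also be done in one nested loop - a sketch, demonstrated on a throwaway directory so it can be run anywhere (on the server the path would be /var/log and the real syslog.X.gz files):

      ```shell
      # Build a fake rotated log so the loop can be demonstrated safely.
      logdir=$(mktemp -d)
      printf '%s\n' 'Nov 23 22:46:31 kernel: sdb read error' \
                    'Nov 23 22:46:35 kernel: md127 disk failure' \
          | gzip > "$logdir/syslog.1.gz"

      # One drive letter at a time, over every rotated log:
      for y in a b c d e f; do
          for f in "$logdir"/syslog.*.gz; do
              zcat "$f" | grep "sd$y" || true   # || true: no match is not an error
          done
      done
      rm -rf "$logdir"
      ```

      Only the sdb line is printed here, since that is the only drive mentioned in the fake log.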

      After that, you have to check the SMART status of all drives, as @tkaiser mentioned already!

      Sc0rp


    • @tkaiser

      I can see in the logs that sdb encountered a number of "read error not correctable" errors at that time last night.

      sdb: SMART attribute 199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0

      That drive has some pending sectors too, which have increased slightly in the last week. Was planning to replace the drive (clearly should've done it sooner), but that's not one of the ones that was kicked out, right?

      sdc and sdf both have no pending or reallocated sectors, but ALL 6 drives are showing the same CRC error count as sdb: 199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0

      @Sc0rp

      That sounds reasonably positive. No conclusions drawn, except the info above. To be honest, I don't really know what else I should be looking for in order to come to a conclusion.

      I'll do as you suggested with the logs (though the code you suggested is returning an error, currently: `/root/20171124-raid-issue' is not a directory), and give the reassembly a go.
    • From reasonably positive to, really not positive at all ...

      Yes, I do have email notification ... I'd received emails about the pending sectors, but nothing to the effect of any failed disk.
      I'm using Seagate Barracuda 3TB disks (which I recently learned were probably not the best ones to use).
      Not sure about the power supply, specifically - it is the standard one that came in my HP MicroServer.

      I can't understand about the missing drive ... I'm sure all 6 were in the array, previously.

      Is that it then? No chance of resurrecting or rebuilding?
    • Re,

      brifletch wrote:

      I'll do as you suggested with the logs (though the code you suggested is returning an error, currently: `/root/20171124-raid-issue' is not a directory), and give the reassembly a go.
      Sorry, that was my error while typing fast ... add a slash at the end:
      cp -v /var/log/messages* /root/20171124-raid-issue/

      After copying the files, you can do the search on this directory ... just change the path from "/var/log/" to "/root/20171124-raid-issue/" ...

      brifletch wrote:

      I can't understand about the missing drive ... I'm sure all 6 were in the array, previously.
      But the logs don't lie :P

      brifletch wrote:

      Is that it then? No chance of resurrecting or rebuilding?
      You can always try the reassemble with --force. The chance is 50:50 ... md is safe in this regard, but you have to expect data loss since the fs layer (XFS) may be damaged too ...

      And as always: RAID is not backup ... I hope you have a working backup.
      For the future you should keep in mind that you have to think about changing from RAID5 to ZFS RAID-Z1, or moving to SnapRAID/mergerfs ...

      PSU: a standard ATX one is good, I was only afraid of another PicoPSU setup ...

      HDD: Barracudas are not problematic at all, my "old" 2TB ones are working flawlessly 24/7 ... md-RAID5 @ OMV3 (of course with continuous rsync backup ... and UPS ... and email notification ... and other scripts)

      Sc0rp


    • Sc0rp wrote:

      Sorry, that was my error while typing fast ... add a slash at the end:
      cp -v /var/log/messages* /root/20171124-raid-issue/


      After copying the files, you can do the search on this directory ... just change the path from "/var/log/" to "/root/20171124-raid-issue/" ...

      Sc0rp
      I'm still getting the same error, even when adding the ending slash?
    • brifletch wrote:

      Yes, I do have email notification ... I'd received emails about the pending sectors, but nothing to the effect of any failed disk.
      And that's because I only had the notifications for SMART events turned on :( I have to go out for a few hours, but will follow up with the log copies when I get back in ... but is there any point? Is there still any hope of salvaging any data?
    • Hi Guys,

      So, I finally got a chance to save the logs as suggested (I had to manually create the subdirectory to do so). Then the forced reassemble worked, and the array is visible and mounted again. All files appear to be 'visible', but as you said, I'm expecting some to have become corrupted or to be missing data.
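      For anyone else following along: the earlier `not a directory' error happened because cp does not create the destination directory, so it has to exist first. A sketch of what finally worked, demonstrated on throwaway paths so it can be run anywhere (on the server the source is /var/log and the target /root/20171124-raid-issue):

      ```shell
      # On the server:
      #   mkdir -p /root/20171124-raid-issue
      #   cp -v /var/log/syslog* /var/log/messages* /root/20171124-raid-issue/
      # Demonstrated with temporary paths:
      src=$(mktemp -d)
      dst=$(mktemp -d)/20171124-raid-issue
      touch "$src/syslog" "$src/syslog.1"
      mkdir -p "$dst"               # without this, cp fails with "... is not a directory"
      cp -v "$src"/syslog* "$dst"/
      ls "$dst"
      ```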

      For now, it's a big "phew" and thanks for your help so far!

      To answer your question, no .. I foolishly don't have a backup, and that will be the next step once I've got this array as stable as I can for now, before I look at alternative filesystems like you said.

      What would you suggest the sequence of steps should be to minimise risk of causing more problems? There are 3 drives in the array that are reporting SMART issues:

      Drive   Reallocated Sectors   Pending Sectors   Offline Uncorrectable   CRC Error Rate
      sdb     64                    0                 0                       0
      sdc     1136                  32                32                      0
      sdf     16                    1680              1680                    0

      Which one of the drives above should I swap out and replace first, second and third?

      Thanks again for your help, guys - really appreciate it!