RAID 5 Degraded after Reboot


    • RAID 5 Degraded after Reboot

      I have a RAID 5 with three hard drives on my system. I'm using OpenMediaVault 3.0.59 with Debian 8.9.

      A couple of days ago, after rebooting, I got a message that my RAID was degraded. So I checked the hard drive for failures using SMART, but it shows no failure (or at least none that I can see...):
      pastebin.com/3KESMji3

      More precisely, /dev/sdb is missing from the RAID after reboot.
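
      For reference, the kind of checks involved here (current array membership plus SMART health of all three members) - a sketch, with the device names used in this thread:

      Source Code

      cat /proc/mdstat                  # which devices the array currently contains
      mdadm --detail /dev/md0           # detailed array state, including removed/failed members
      for d in sdb sdc sdd; do smartctl -H -A /dev/$d; done   # SMART health and attributes of all three disks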

      So I just used the OMV interface to recover the RAID with the missing drive. After it finished, everything looked fine:

      Source Code

      /dev/sdc: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="954aa619-68be-616b-76c8-4feb16f64316" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"
      /dev/sda1: UUID="e7484e9d-96fa-48cf-8972-a9e0755321c5" TYPE="ext4" PARTUUID="7c0d5bdf-01"
      /dev/sda5: UUID="b281244e-402c-46e4-8ff8-f2b35530dbb0" TYPE="swap" PARTUUID="7c0d5bdf-05"
      /dev/md0: LABEL="Raid" UUID="e394d3a6-1f9d-431b-b778-4a65d24f3cd2" TYPE="ext4"
      /dev/sdd: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="dcb862d0-4ac5-3a40-b25d-2209e3f56ea3" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"
      /dev/sdb: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="139941d9-8723-997d-3517-9f081ac2e8db" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"


      All three drives (sdb, sdc, sdd) are now part of the RAID again.
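
      For reference, the command-line equivalent of that re-add would be something like the following - a sketch, with /dev/md0 and /dev/sdb as used in this thread:

      Source Code

      mdadm /dev/md0 --add /dev/sdb   # re-add the missing disk to the array
      cat /proc/mdstat                # follow the rebuild until all three members show up again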

      But after another reboot I had the same problem as before:


      Source Code

      /dev/sdd: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="dcb862d0-4ac5-3a40-b25d-2209e3f56ea3" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"
      /dev/sda1: UUID="e7484e9d-96fa-48cf-8972-a9e0755321c5" TYPE="ext4" PARTUUID="7c0d5bdf-01"
      /dev/sda5: UUID="b281244e-402c-46e4-8ff8-f2b35530dbb0" TYPE="swap" PARTUUID="7c0d5bdf-05"
      /dev/md0: LABEL="Raid" UUID="e394d3a6-1f9d-431b-b778-4a65d24f3cd2" TYPE="ext4"
      /dev/sdc: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="954aa619-68be-616b-76c8-4feb16f64316" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"
      /dev/sdb: PTUUID="4ce8b032-80ac-426d-9029-4f3eeaf8ee98" PTTYPE="gpt"

      dmesg gave me the following output: pastebin.com/w1qCMfMS

      I then formatted the HDD with ext4 and mounted it. That also works perfectly, so I guess the hard drive is not failing...
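
      For a deeper check than the quick SMART health status, something like a long self-test or a read-only surface scan could be run - a sketch, assuming the disk in question is /dev/sdb:

      Source Code

      smartctl -t long /dev/sdb       # start an extended SMART self-test (takes hours on a 6 TB disk)
      smartctl -l selftest /dev/sdb   # show the self-test results once it has finished
      badblocks -sv /dev/sdb          # read-only surface scan, non-destructive
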
      At this point I'm a little unsure what to do next, but maybe someone else has some ideas. Just in case, here is my /etc/fstab:


      Source Code

      # <file system> <mount point> <type> <options> <dump> <pass>
      # / was on /dev/sdb1 during installation
      UUID=e7484e9d-96fa-48cf-8972-a9e0755321c5 / ext4 errors=remount-ro 0 1
      # swap was on /dev/sdb5 during installation
      UUID=b281244e-402c-46e4-8ff8-f2b35530dbb0 none swap sw 0 0
      tmpfs /tmp tmpfs defaults 0 0
      # >>> [openmediavault]
      UUID=e394d3a6-1f9d-431b-b778-4a65d24f3cd2 /media/e394d3a6-1f9d-431b-b778-4a65d24f3cd2 ext4 defaults,nofail,user_xattr,noexec,usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,acl 0 2
      # <<< [openmediavault]

      and my mdadm.conf:

      Source Code

      DEVICE partitions
      # auto-create devices with Debian standard permissions
      CREATE owner=root group=disk mode=0660 auto=yes
      # automatically tag new arrays as belonging to the local system
      HOMEHOST <system>
      # definitions of existing MD arrays
      ARRAY /dev/md0 metadata=1.2 spares=1 name=tdog42:MainRaid UUID=cb7391ea:b6dd64df:ffe7775c:1d15cb62

      cat /proc/mdstat:

      Source Code

      Personalities : [raid6] [raid5] [raid4]
      md0 : active raid5 sdc[1] sdd[2]
      11720782848 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
      bitmap: 19/44 pages [76KB], 65536KB chunk
      unused devices: <none>

      fdisk -l | grep "Disk ":


      Source Code

      Disk /dev/sdc: 5,5 TiB, 6001175126016 bytes, 11721045168 sectors
      Disk /dev/sdb: 5,5 TiB, 6001175126016 bytes, 11721045168 sectors
      Disk identifier: 4CE8B032-80AC-426D-9029-4F3EEAF8EE98
      Disk /dev/sda: 298,1 GiB, 320072933376 bytes, 625142448 sectors
      Disk identifier: 0x7c0d5bdf
      Disk /dev/sdd: 5,5 TiB, 6001175126016 bytes, 11721045168 sectors
      Disk /dev/md0: 10,9 TiB, 12002081636352 bytes, 23441565696 sectors

      mdadm --detail --scan --verbose:




      Source Code

      ARRAY /dev/md0 level=raid5 num-devices=3 metadata=1.2 name=tdog42:MainRaid UUID=cb7391ea:b6dd64df:ffe7775c:1d15cb62
      devices=/dev/sdc,/dev/sdd
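
      Side note from me: the ARRAY line in my mdadm.conf above still says spares=1, while the scan output does not mention a spare at all. Maybe regenerating that line and refreshing the initramfs would help? A rough sketch of what I mean, assuming the Debian default path /etc/mdadm/mdadm.conf:

      Source Code

      mdadm --detail --scan     # current ARRAY definition of the running array
      # compare with /etc/mdadm/mdadm.conf and update the ARRAY line there if it differs
      update-initramfs -u       # refresh the copy of mdadm.conf inside the initramfs used at boot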


    • You have set up a 3 disk RAID5 (why exactly?), /dev/sdb is most probably the problem, you show SMART output for one disk without telling which one (why not all three?) and provide no useful logs (what about checking /var/log/syslog* for the events that led to a degraded RAID?).

      What feedback do you expect? :)
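
      For example something along these lines (zgrep so the rotated logs are included as well):

      Source Code

      zgrep -iE 'md0|raid|sdb|ata' /var/log/syslog*   # md/RAID and disk events, including rotated logs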
    • Yes, sorry, /dev/sdb is the problem.

      tkaiser wrote:

      You have set up a 3 disk RAID5 (why exactly?)
      My understanding is that in a RAID 5 one disk can fail while the other two still operate normally. Is that wrong? Might another setup be better? Right now, two disks are working as "one" and the other one can fail... I want to achieve maximum storage capacity. I/O isn't that important to me...
      But I'm open to suggestions on how to do it better :)


      tkaiser wrote:

      you show SMART output for one disk without telling which one (why not all three?)
      Because the other ones seem fine? I will add them in due time...


      tkaiser wrote:

      and provide no useful logs (what about checking /var/log/syslog* for the events that led to a degraded RAID?).
      /var/log/syslog did not give me anything useful (at least nothing I could see). Are there any other logs I could check? From what I can tell, there are no specific mdadm logs, are there?

      Thankfully, the problem is reproducible, so I will post the log output sometime tomorrow, once the RAID has been rebuilt...


      //Edit: Here's my syslog from right before I rebooted (and including the reboot): pastebin.com/Mtu25dAw

      However, I don't see much in there that would indicate why this one disk keeps going missing...
    • tdog42 wrote:

      Here's my syslog from right before I rebooted (and including the reboot): pastebin.com/Mtu25dAw
      Why? What's the purpose of providing an uninteresting log and hiding the interesting stuff?

      tdog42 wrote:

      A couple of days ago, after rebooting
      So obviously logs from before 'A couple of days ago' would be interesting to get a clue about what happened?
      I added the log from after the reboot because that's when my RAID was broken. The information is exactly the same as before, and I did not hide any "interesting stuff".

      Like I said: the problem is reproducible. I rebuilt the RAID, and all three drives were working the way they should. I rebooted the system and suddenly /dev/sdb was missing. This is the log from the point at which it went missing...

      Like I said: there has never been any clue in the syslog that would point to the problem - it seems as if, for some reason, the system does not detect the /dev/sdb drive correctly...

      Oh, and now you are complaining about a long "uninteresting log", but before you asked why I didn't add the SMART log for every hard drive. Please make up your mind: either as much information as possible or just the "interesting stuff"...
    • Re,

      please forgive @tkaiser - he's too much of a pro :D

      tdog42 wrote:

      My understanding is that in a RAID 5 one disk can fail while the other two still operate normally. Is that wrong?
      That is not exactly what it does, or your understanding is a bit off: RAID-5 uses striping with distributed parity to provide redundancy N=1 (one drive in the array can fail without losing data). So "the other two still operate normally" is not exactly what will happen if one disk dies ... you then have a degraded array, and that means degraded performance too - along with no more redundancy (N=0) ...
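
      Capacity-wise you still get what you would expect: a RAID-5 of N disks offers roughly (N-1) x disk size, so your three 6 TB drives give about 2 x 6 TB = 12 TB, which matches the ~10.9 TiB reported for /dev/md0 above.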

      tdog42 wrote:

      Like I said: the problem is reproducible. I rebuilt the RAID, and all three drives were working the way they should. I rebooted the system and suddenly /dev/sdb was missing. This is the log from the point at which it went missing...

      Like I said: there has never been any clue in the syslog that would point to the problem - it seems as if, for some reason, the system does not detect the /dev/sdb drive correctly...
      That seems to me like a sporadic issue that has come up often recently: because of some unknown problem, the DCB (superblock in md terms) sometimes cannot be written to the disk (it is written to the first 4 KiB of the disk ...), and then after a reboot it still holds nothing for auto-detection ... adding the drive later on will then use a backup superblock.
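
      A read-only way to check whether a valid superblock actually survives on the disk after a reboot (assuming /dev/sdb is the affected one):

      Source Code

      mdadm --examine /dev/sdb   # prints the md superblock if one is found, otherwise complains
      wipefs /dev/sdb            # without -a this only lists the signatures present on the disk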

      Try this:
      - make a backup of your data
      ... then:
      - if the drive is missing again after the next reboot, zero the first sectors with dd
      - add the drive to the array again (as a completely new drive)
      - wait for the rebuild to finish
      - try a reboot
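
      Something along these lines - a sketch only, assuming the affected disk really is /dev/sdb and the array is /dev/md0; double-check the device name first, because these commands destroy whatever is on that disk:

      Source Code

      # destructive for /dev/sdb - backup first, verify the device name
      mdadm --zero-superblock /dev/sdb             # drop any stale md superblock
      dd if=/dev/zero of=/dev/sdb bs=1M count=16   # zero the first sectors (also removes the stray GPT signature)
      mdadm /dev/md0 --add /dev/sdb                # re-add the disk as a brand-new member
      cat /proc/mdstat                             # follow the rebuild until it shows [UUU]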

      Sc0rp
    • Sc0rp wrote:

      please forgive @tkaiser - he's too much of a pro
      I have dealt with (md)RAID for over 20 years now, and I know exactly why I avoid it when possible.

      In the context of OMV, I'm really scared by the lack of understanding on the users' side and by OMV's 'data loss made easy' approach.

      These anachronistic RAID implementations date back to a time when administrators hopefully knew what they were doing. They knew the difference between backup (data protection/safety) and RAID (availability/business continuity). They knew that any measure to improve availability (reliability in a broader sense) also implies eliminating as many SPoFs as possible (SPoF = single point of failure). Things today couldn't be more different: RAID has turned into a stupid religion.

      Today people play RAID with OMV on setups that are just a joke (e.g. using a weak picoPSU and then wondering how it's possible that all drives vanish one after another). They never test anything but blindly trust concepts they do not even remotely understand, and then they even fail to report what happened (IMO the tools OMV provides to collect support data are insufficient, but since there is also no 'policy' to enforce such essential data collection and to get people to submit it, I would call this just a huge mess).
    • Re,

      tkaiser wrote:

      I have dealt with (md)RAID for over 20 years now, and I know exactly why I avoid it when possible.
      Same amount of time here, but a different approach ... I use RAID in hardware and software wherever it is needed for availability and/or business continuity, tuning it for maximum data safety ... together with a working and tested backup strategy, a UPS, and filesystems that provide a working level of integrity. Most people out there underestimate the complexity of ZFS as well! RAID just seems to be "easy" ...

      But what's the alternative? Using ZFS? Unfinished btrfs? Buying an appliance? Looking at the mass of prebuilt NAS boxes (QNAP, Synology, Thecus, Asustor, ...), it seems that even software RAID serves home users well - read it again: HOME users. I think it is an easy approach to "block-layer concatenation" compared with the other approaches out there. Sure, it has its disadvantages and some "obstacles" you should avoid, but hammering "bah, don't use that old-fashioned RAID crap" into users won't work given the lack of alternatives ... and you give no hints about other technologies.

      Write an article on it - please - covering the alternatives, call it "better approaches" ... then you can link to it whenever needed (and it seems you will need it more and more often). You can use it in your sig as well :D

      BR Sc0rp