RAID 5 Degraded after Reboot

  • I have a RAID 5 with three hard drives on my system. I'm using Open Media Vault 3.0.59 with Debian 8.9.


    A couple of days ago, after rebooting, I got a message that my RAID was degraded. So I checked the hard drive for failures using SMART, but it shows no failures (or at least none that I can see):
    https://pastebin.com/3KESMji3


    More precisely, /dev/sdb is missing from the RAID after reboot.


    So I just used the OMV interface to recover the RAID with the missing drive. After the rebuild finished, everything looked fine:


    Code
    /dev/sdc: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="954aa619-68be-616b-76c8-4feb16f64316" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"
    /dev/sda1: UUID="e7484e9d-96fa-48cf-8972-a9e0755321c5" TYPE="ext4" PARTUUID="7c0d5bdf-01"
    /dev/sda5: UUID="b281244e-402c-46e4-8ff8-f2b35530dbb0" TYPE="swap" PARTUUID="7c0d5bdf-05"
    /dev/md0: LABEL="Raid" UUID="e394d3a6-1f9d-431b-b778-4a65d24f3cd2" TYPE="ext4"
    /dev/sdd: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="dcb862d0-4ac5-3a40-b25d-2209e3f56ea3" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"
    /dev/sdb: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="139941d9-8723-997d-3517-9f081ac2e8db" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"


    All three drives (sdb, sdc, sdd) are now part of the RAID again.
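    (For reference, I assume the recover action in the OMV interface boils down to something like the following on the command line - /dev/sdb being the member that dropped out; this is just a sketch, not necessarily what OMV runs internally:)


    Code
    # add the dropped disk back into the array; md then rebuilds onto it
    mdadm /dev/md0 --add /dev/sdb
    # watch the rebuild progress
    cat /proc/mdstat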


    But after another reboot I had the same problem as before:



    Code
    /dev/sdd: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="dcb862d0-4ac5-3a40-b25d-2209e3f56ea3" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"
    /dev/sda1: UUID="e7484e9d-96fa-48cf-8972-a9e0755321c5" TYPE="ext4" PARTUUID="7c0d5bdf-01"
    /dev/sda5: UUID="b281244e-402c-46e4-8ff8-f2b35530dbb0" TYPE="swap" PARTUUID="7c0d5bdf-05"
    /dev/md0: LABEL="Raid" UUID="e394d3a6-1f9d-431b-b778-4a65d24f3cd2" TYPE="ext4"
    /dev/sdc: UUID="cb7391ea-b6dd-64df-ffe7-775c1d15cb62" UUID_SUB="954aa619-68be-616b-76c8-4feb16f64316" LABEL="tdog42:MainRaid" TYPE="linux_raid_member"
    /dev/sdb: PTUUID="4ce8b032-80ac-426d-9029-4f3eeaf8ee98" PTTYPE="gpt"

    dmesg gave me the following output: https://pastebin.com/w1qCMfMS
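    Something like the following should show whether the md superblock on /dev/sdb survived the reboot at all - I'm just guessing here, since blkid above only reports a GPT signature on it:


    Code
    # inspect the md metadata directly on the member disks
    mdadm --examine /dev/sdb
    # compare with a member that is still assembled into the array
    mdadm --examine /dev/sdc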


    I then formatted the HDD as an ext4 drive and mounted it. That also works perfectly, so I guess the hard drive is not failing...
    At this point I'm a little unsure what to do next, but maybe someone else has some ideas. Just in case, here is my /etc/fstab:



    Code
    # <file system> <mount point>   <type>  <options>       <dump>  <pass>
    # / was on /dev/sdb1 during installation
    UUID=e7484e9d-96fa-48cf-8972-a9e0755321c5 /               ext4    errors=remount-ro 0       1
    # swap was on /dev/sdb5 during installation
    UUID=b281244e-402c-46e4-8ff8-f2b35530dbb0 none            swap    sw              0       0
    tmpfs           /tmp            tmpfs   defaults        0       0
    # >>> [openmediavault]
    UUID=e394d3a6-1f9d-431b-b778-4a65d24f3cd2 /media/e394d3a6-1f9d-431b-b778-4a65d24f3cd2 ext4 defaults,nofail,user_xattr,noexec,usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,acl 0 2
    # <<< [openmediavault]

    and my mdadm.conf


    cat /proc/mdstat:


    Code
    Personalities : [raid6] [raid5] [raid4]
    md0 : active raid5 sdc[1] sdd[2]
          11720782848 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
          bitmap: 19/44 pages [76KB], 65536KB chunk
    
    
    unused devices: <none>

    fdisk -l | grep "Disk ":



    Code
    Disk /dev/sdc: 5,5 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/sdb: 5,5 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk identifier: 4CE8B032-80AC-426D-9029-4F3EEAF8EE98
    Disk /dev/sda: 298,1 GiB, 320072933376 bytes, 625142448 sectors
    Disk identifier: 0x7c0d5bdf
    Disk /dev/sdd: 5,5 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/md0: 10,9 TiB, 12002081636352 bytes, 23441565696 sectors

    mdadm --detail --scan --verbose:





    Code
    ARRAY /dev/md0 level=raid5 num-devices=3 metadata=1.2 name=tdog42:MainRaid UUID=cb7391ea:b6dd64df:ffe7775c:1d15cb62
       devices=/dev/sdc,/dev/sdd
  • Yes, sorry, /dev/sdb is the problem.

    You have set up a 3 disk RAID5 (why exactly?)

    My understanding is that in a RAID 5 one disk can fail while the other two still operate normally. Is that wrong? Might another setup be better? Right now, two disks are working as "one" and the third one can fail... I want to achieve maximal storage capacity; I/O isn't that important to me.
    But I'm open to suggestions how to do it better :)



    you show SMART output for one disk without telling which one (why not all three?)

    b/c the other ones seem fine? I will add them in time....



    and provide no useful logs (what about checking /var/log/syslog* for the events that led to a degraded RAID?).

    /var/log/syslog did not give me anything (at least nothing I could see). Are there any other logs I could check? From what I can tell, there are no specific mdadm logs, are there?
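    For the next attempt I will also grab the following (just the places I know to look; the journalctl part assumes the systemd journal on Debian 8 actually keeps the previous boot):


    Code
    # kernel messages about the disk and the array during/after boot
    dmesg | grep -iE 'sdb|md0|raid'
    # syslog entries around the array assembly (zgrep for the rotated .gz files)
    grep -iE 'mdadm|md0|sdb' /var/log/syslog
    # previous boot, if the journal is persistent
    journalctl -b -1 | grep -iE 'md|sdb'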


    Thankfully, the problem is reproducible, so I will post the log output sometime tomorrow, once the RAID has been rebuilt...



    //Edit: Here's my syslog right before I reboot (and including the reboot): https://pastebin.com/Mtu25dAw


    However, I don't see much in it that would indicate why the one disk is always missing...

  • Here's my syslog right before I reboot (and including the reboot): pastebin.com/Mtu25dAw

    Why? What's the purpose of providing an uninteresting log and hiding the interesting stuff?


    A couple of days ago, after rebooting

    So obviously logs from before 'A couple of days ago' would be interesting to get a clue what happened?

  • I added the log after the reboot because that's when my RAID was broken. The information is exactly the same as before and I did not hide any "interesting stuff".


    Like I said: the problem is reproducible. I rebuilt the RAID, and all three drives were working the way they should. I rebooted the system and suddenly /dev/sdb was missing. This is the log from the point when it went missing...


    Like I said: there has never been any clue in the syslog that would point to the problem - it seems as if the system for some reason does not detect the /dev/sdb drive correctly...


    Oh, and now you are complaining about a long "uninteresting log", but before you asked me why I didn't add the SMART log for every hard drive. Please make up your mind: either as much information as possible or just the "interesting stuff"...

  • Re,


    please forgive @tkaiser - he's too much of a pro :D


    My understanding is that in a RAID 5 one disk can fail while the other two still operate normally. Is that wrong?

    That is not exactly what it does, or your understanding is a bit off: RAID-5 uses a striped data layout with parity to provide redundancy of N=1 (one drive in the array can fail without losing data). So "the other two still operate normally" is not exactly what will happen if one disk dies ... you have a degraded array then, and that means degraded performance too - along with no more redundancy (N=0) ...
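    Capacity-wise, an N-disk RAID-5 gives you (N-1) disks' worth of usable space; with the three 6 TB drives from the fdisk output above that works out roughly like this (the real array is slightly smaller because of the md metadata and the bitmap):


    Code
    # usable capacity = (N-1) x disk size, with the 6001175126016-byte disks from fdisk:
    $ echo $(( 2 * 6001175126016 )) bytes
    12002350252032 bytes   # ~10.9 TiB, which matches the /dev/md0 size above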


    Like I said: the problem is reproducible. I rebuilt the RAID, and all three drives were working the way they should. I rebooted the system and suddenly /dev/sdb was missing. This is the log from the point when it went missing...


    Like I said: there has never been any clue in the syslog that would point to the problem - it seems as if the system for some reason does not detect the /dev/sdb drive correctly...

    That seems to me like a sporadic issue, seen quite often lately: because of some unknown problem the DCB (superblock in md terms) sometimes cannot be written to the disk (it is written in the first 4 KiB of the disk ...), so after a reboot the disk still holds nothing for the auto-detection ... adding the drive later on will then use a backup superblock.


    Try this:
    - make a backup of your data
    ... then:
    - if the drive is missing again after the next reboot, zero its first sectors with dd (see the sketch below)
    - add the drive to the array again (as a completely new drive)
    - wait for the rebuild to finish
    - try another reboot
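    A rough sketch of those steps on the command line - assuming the dropped disk is still /dev/sdb and the array is /dev/md0, so double-check the device names before running anything destructive:


    Code
    # wipe the start of the dropped disk, including the md superblock (destroys that disk's contents!)
    dd if=/dev/zero of=/dev/sdb bs=1M count=16
    # alternatively, mdadm can remove just its own metadata:
    # mdadm --zero-superblock /dev/sdb
    # add the disk back to the array as a fresh member and let it resync
    mdadm /dev/md0 --add /dev/sdb
    # watch the resync, then try the reboot once mdstat shows [UUU] again
    cat /proc/mdstat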


    Sc0rp

  • please forgive @tkaiser - he's too much of a pro

    I have dealt with (md)RAID for over 20 years now and I know exactly why I avoid it when possible.


    In the context of OMV I'm really scared of the lack of understanding on users' side and OMV's 'data loss made easy' approach.


    The anachronistic RAID implementations date back to a time when administrators hopefully knew what they were doing. They knew the difference between backup (data protection/safety) and RAID (availability/business continuity). They knew that any measure to improve availability (reliability in a broader sense) also implies reducing single points of failure (SPoF) as much as possible. Today it couldn't be more different: RAID has turned into a stupid religion.


    Today people play RAID with OMV using setups that are just a joke (e.g. a weak picoPSU, and then wondering how it's possible that all drives vanish one after another). They never test anything but blindly trust concepts they do not even remotely understand, and then they even fail to report what happened (IMO the tools OMV provides for collecting support data are insufficient, but since there is also no 'policy' to enforce such essential data collection and to get people to submit it, I would call this just a huge mess).

  • Re,

    I have dealt with (md)RAID for over 20 years now and I know exactly why I avoid it when possible.

    Same time frame, but a different approach ... I use RAID in hardware and software wherever it is needed for availability and/or business continuity, tuning it for maximum data safety ... with a working and tested backup strategy, a UPS and filesystems that provide a working integrity level. Most people out there underestimate the complexity of ZFS as well! RAID seems to be "easy" ...


    But what's the alternative? Using ZFS? Unfinished BTRFS? Buying an appliance? Looking at the mass of prebuilt NAS boxes (QNAP, Synology, Thecus, Asustor, ...), it seems that even software RAID is doing well for home users - read it again: HOME users. I think it is an easy approach to "block-layer concatenation" compared to the other approaches. Sure, it has its disadvantages and some "obstacles" you should avoid, but hammering "bah, don't use that old-fashioned RAID crap" into users won't work, because of the lack of alternatives ... and you give no hints about other technologies.


    Write an article on it - please - with the alternatives, call it "better approaches" ... then you can link to it whenever needed (and it seems you will need it quite often). You can put it in your sig as well :D


    BR Sc0rp
