RAID 5 gone after reboot

  • Code
    cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md127 : inactive sdb[1] sda[4](S) sdd[3]
          2929894536 blocks super 1.2
    
    
    unused devices: <none>


    blkid

    Code
    blkid
    /dev/sdc: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="5366aea8-9876-f78d-a23e-99e6ad6fea31" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sdb: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="175ea7f3-8e75-caaa-a728-75435266c481" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sda: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="5c0b0f88-e883-6acd-ce5d-bb2b95a4bb9b" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sdd: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="cec25e6e-e6f8-bc06-f4c7-d33148c0f9bd" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sde1: UUID="5e284973-7730-436e-aeb5-491c3cfe6446" TYPE="ext4"
    /dev/sde5: UUID="0f12fb13-59f7-456c-a1e7-628b8f4375a2" TYPE="swap"
    Code
    mdadm --detail --scan --verbose
    mdadm: cannot open /dev/md/Datengrab: No such file or directory

    That is all the info I can give you. My RAID vanished after it started to rebuild itself. Before the reboot the status showed the message "clean, failed".


    Can I somehow get the data back? It is also strange that sdb, sda and sdd are there but sdc isn't. Maybe that could be the faulty drive?

  • Well, the funny thing is that I used the mdadm command to force the RAID back together, and it always starts recovering; after a while I get "clean, failed" again. While it recovers I can see all my files and they are OK, but after the recovery I cannot open these files anymore.


    I used these commands


    Code
    mdadm --stop /dev/md127
    mdadm --assemble /dev/md127 /dev/sd[abcd] --verbose --force


    Now it recovers and I can get at my files. But after it fails, the data seems to still be there but cannot be opened anymore.


    If that happens again, can I somehow stop the recovery process?

  • Hi,


    it seems your RAID did suffer from multiple failures:
    md127 : inactive sdb[1] sda[4](S) sdd[3] <- (S) indicates a spare drive, while sdc is completely missing ...


    So it seems to me that sda has lost some DCB information and sdc is not reliable enough for RAID ...


    Anyway, you forced the drives to work in a RAID - and so you got two issues:
    - the RAID will now try to recover itself (to get back to a synced state at all), starting to recover/resync immediately
    - blocks (data) with wrong CRCs will be deleted when no rescue is possible at all


    If that happens again, can I somehow stop the recovery process?

    It's possible, see https://serverfault.com/questi…rupt-software-raid-resync for all the options ;)
    (I use the "speed down" option too, because it works in every environment)


    Sc0rp


    EDIT: *OMG* I totally forgot to mention that you have to check your /var/log/messages and /var/log/syslog for the errors that occurred!

  • Code
    mdadm --detail --scan --verbose
    ARRAY /dev/md/Datengrab level=raid5 num-devices=4 metadata=1.2 spares=1 name=NAS:Datengrab UUID=79d94d6c:7030af53:7c7b1956:c4564987
       devices=/dev/sdb,/dev/sdd,/dev/sdc,/dev/sda
    root@NAS:~#

    Here is the syslog after a fresh restart


    https://pastebin.com/k4aWp8qG


    and here is the messages log


    https://pastebin.com/9rYDHNMm


    Code
    cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md127 : inactive sdb[1] sda[4](S) sdd[3]
          2929894536 blocks super 1.2


    I find it strange that when I type in these commands I can access all the data on the RAID, but when it fails after the automatic repair, the data is not usable anymore.

    Code
    blkid
    /dev/sdc: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="5366aea8-9876-f78d-a23e-99e6ad6fea31" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sdb: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="175ea7f3-8e75-caaa-a728-75435266c481" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sda: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="5c0b0f88-e883-6acd-ce5d-bb2b95a4bb9b" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sdd: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="cec25e6e-e6f8-bc06-f4c7-d33148c0f9bd" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sde1: UUID="5e284973-7730-436e-aeb5-491c3cfe6446" TYPE="ext4"
    /dev/sde5: UUID="0f12fb13-59f7-456c-a1e7-628b8f4375a2" TYPE="swap"

    fdisk output
    https://pastebin.com/raw/RCB3q1K2


    mdadm.conf
    https://pastebin.com/a1niQBda


    mdadm
    https://pastebin.com/h73Dx9A8


    That's what I get now. But now the data is not there anymore (for example, a video file won't play). When I restart the RAID with the mdadm commands and it recovers, all the files are readable again. How can I find out which HDD is faulty? When I restart the NAS I get the error that 2 of the 4 HDDs are not found.



    Is drive sdc corrupt? How can I identify the drive so that I don't pull the wrong one out?


    I did the following after the reboot



  • Re,


    Since the log was taken after the reboot, I find only these lines:


    syslog/messages:
    Sep 19 10:41:02 NAS kernel: [ 4.553638] md: kicking non-fresh sdc from array! *1)
    Sep 19 10:41:02 NAS kernel: [ 4.553660] md: unbind<sdc>


    ... there is nothing in these logs that points at the drive failures ...


    And this line from your commands after the reboot says the same:
    mdadm: forcing event count in /dev/sdc(2) from 46700 upto 46712 *1)


    *1) every member of an array has an (individual) event counter - if this counter is the same on all members, the array is in sync. If a drive deviates from the others, it is marked as out of sync and kicked out of the array, because it is outdated. This normally forces the RAID to a resync, but in your case I assume it will destroy your data completely, because sda is marked as a spare ...
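
    If you want to compare the event counters yourself, a small sketch (assuming mdadm can still read the 1.2 superblocks your blkid output shows):

    Code
    # print the event counter from each member's superblock;
    # the member with a lower count is the outdated one
    for d in /dev/sd[abcd]; do
        echo "=== $d ==="
        mdadm --examine "$d" | grep -i 'events'
    done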


    ToDo (just a recommendation):
    1st: try to get as much data off the array as you can (do whatever it takes to back it up!)
    2nd: do non-invasive searching:


    Please check the SMART attributes on your drives:
    smartctl -a /dev/sda
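
    To check all four members in one go, a small sketch (the attribute names below are the common ones; your drives may label them differently):

    Code
    # dump the most telling SMART attributes for every RAID member
    for d in sda sdb sdc sdd; do
        echo "=== /dev/$d ==="
        smartctl -A "/dev/$d" | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'
    done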


    Then you need to find the cause of the failure (aka the root cause) - do more searching like this:
    zcat /var/log/syslog.1.gz | grep sdc
    zcat /var/log/syslog.1.gz | grep sda
    zcat /var/log/messages.1.gz | grep sdc
    zcat /var/log/messages.1.gz | grep sda
    - adjust the digit yourself (possibly 1...x, depending on your logrotate); check the directory with ls -la /var/log | grep syslog and | grep messages
    - adjust the search string yourself (instead of sdc/sda you can use disk, scsi, mdadm, or any other related keyword, just try it)
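
    To search the current and all rotated logs in one go, a small sketch (assuming the usual logrotate naming; swap in your own keywords as suggested above):

    Code
    # grep the live logs and every rotated copy for drive-related messages
    for f in /var/log/syslog* /var/log/messages*; do
        [ -e "$f" ] || continue
        echo "=== $f ==="
        zcat -f "$f" | grep -iE 'sdc|sda|mdadm|ata|i/o error'
    done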


    Good luck!


    Sc0rp

  • First of all, thank you very much for these hints. I will try to recover the data whilst OMV is recovering.


    After that I will create a new RAID 5 with no spare. Is there a way to scan for HDD errors whilst creating the RAID?

  • Re,

    Is there a way to scan for HDD errors whilst creating the RAID?

    More than one, I assume ...


    - smartctl should be running daemonized (ps -ef | grep smart to check)
    - smartctl -a /dev/sda (change "a" to b, c, d, e, ...) if you have a suspect (the output is very long)
    - tail -f /var/log/messages (on a second console)
    - maybe configure email reporting on your box too - it works for me (and it's fast); one more active option is sketched below
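
    If you want to actively scan a disk rather than only watch the logs, one more option (assuming smartmontools is installed, which the daemonized smartctl implies) is a SMART self-test:

    Code
    # start a long (full surface) self-test; it runs inside the drive's firmware
    smartctl -t long /dev/sda

    # later: read back the self-test result and the drive's error log
    smartctl -l selftest /dev/sda
    smartctl -l error /dev/sda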


    While building (rebuilding, syncing, resyncing) an array, mdadm takes care of faulty sectors (it's like formatting), and will log to syslog and/or messages ... so track these files via tail -f ...
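
    A filtered variant of that tail, in case the logs get noisy (the keywords are only a guess at what is relevant here):

    Code
    # follow both logs and show only md/ata/error related lines
    tail -f /var/log/syslog /var/log/messages | grep -iE 'md[0-9]|ata|error'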


    Also check the SMART status of your drives to see whether some attributes are (possibly) rising - the daemonized smartctl does this for you at a 180 s interval (standard setting), and that's enough. Take a look at the smartmontools documentation.


    Sc0rp
