RAID Disappeared - need help to rebuild

  • Hi Guys,


    I've woken up this morning and can't access any media on my server. Logged into OMV GUI, rebooted and the drives are all still there, but the RAID is missing (I believe it was a RAID 5 array). Have looked at a few threads on the forum, and tried to start a self-diagnosis, but I was hoping one of you would kindly offer me some guidance.


    blkid:
    /dev/sdb: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="ef7151df-f61f-ce6e-1612-4dbfc0e9d1cf" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/sdd: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="55ad5625-0b73-7d46-248e-aedccfc05460" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/sdf: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="6c9b69a2-58ef-cced-95f1-11233b066a54" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/sde: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="7a6cb4e9-902f-3852-3c8a-01196f74dcea" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/sda: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="5fdfd228-cccd-f51e-0ee5-e8e6eaf3d87c" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"
    /dev/sdc: UUID="6ff00f35-b3aa-6d29-25ac-1d7e2a2b2007" UUID_SUB="1e70980d-7799-6a00-9786-2be2e6522558" LABEL="openmediavault:MEDIAVAULT" TYPE="linux_raid_member"


    cat /proc/mdstat:
    Personalities : [raid6] [raid5] [raid4]
    md126 : inactive sdf[5](S) sdc[2](S)
    5860531120 blocks super 1.2


    md127 : inactive sda[7] sde[6] sdd[3] sdb[1]
    11721062240 blocks super 1.2


    I found some responses to other issues that guided the user to force a rebuild, but because (for an unknown reason) the 6 drives seem to be split across 2 mds I wasn't really sure what my next step should be. All 6 drives were, as of last night, in the same single RAID array.


    Your help would be most appreciated!
    Thanks in advance,
    Brian

  • Re,


    the command will be something like:
    mdadm --assemble /dev/mdX /dev/sd[abcdef] (change the X to 0,126 or 127 ... whatever you want)
    if that fails try to force it:
    mdadm --assemble --force /dev/mdX /dev/sd[abcdef]
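Since /proc/mdstat above shows the members already grabbed by two inactive arrays (md126/md127), those usually have to be stopped before a fresh assemble can succeed. A minimal sketch of the full sequence, printed as a dry run rather than executed; the /dev/md0 name is just an example:

```shell
# Dry run: the plan is only printed, not executed; run the lines one at a
# time as root, and use the --force variant only if the plain assemble fails.
assemble_plan='mdadm --stop /dev/md126
mdadm --stop /dev/md127
mdadm --assemble /dev/md0 /dev/sd[abcdef]
mdadm --assemble --force /dev/md0 /dev/sd[abcdef]'

printf '%s\n' "$assemble_plan"
```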


    BUT:
    You should find the root cause of this behavior first - just check the logs for any error messages!
    cat /var/log/messages | grep KEYWORD (KEYWORD is something like md, sda, sdb, sd..., raid, ...)
    cat /var/log/syslog | grep KEYWORD


    Sc0rp

  • Re,

    Just for my personal understanding: the above /proc/mdstat output talking about two RAIDs with different members can be ignored?

    I hope so ... since the state of both /dev/sdf and /dev/sdc in the md126 array is "spare", I hope the backup superblock is intact for assembling with the remaining disks. Btw. the "/dev/mdX" numbering is more OS-related than array-related ... md works here more like a hardware controller does.


    You can issue the command:
    mdadm --examine /dev/sd[abcdef]
    to get clarity on the status of all drives - I hope there are no larger "event mismatches" across the array ...
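To compare the event counters quickly, the --examine output can be filtered down to just the device names and Events lines. The sketch below uses an illustrative sample of that output (the values are made up); on the real box the commented line is the one to run:

```shell
# On the real system:  mdadm --examine /dev/sd[abcdef] | grep -E '^/dev/|Events'
# Illustrative sample of what --examine prints (values are made up):
examine_sample='/dev/sda:
         Events : 28660
/dev/sdb:
         Events : 28651'

# Members whose Events counters are close (a handful apart) can usually be
# reassembled; a large gap means that member is badly out of date.
echo "$examine_sample" | grep -E '^/dev/|Events'
```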


    Sc0rp

  • You can issue the command:
    mdadm --examine /dev/sd[abcdef]
    to get clarity on the status of all drives - I hope there are no larger "event mismatches" across the array ...


    Sc0rp

    The command you suggested returned the following:


  • The command you suggested returned the following

    But the log output is still missing ;) Something like this would put it on an online pasteboard service:

    Code
    zgrep -E "md|sda|sdb|sdc|sde|sdf|raid" /var/log/syslog* | grep -v systemd | curl -F 'sprunge=<-' http://sprunge.us
    zgrep -E "md|sda|sdb|sdc|sde|sdf|raid" /var/log/messages* | grep -v systemd | curl -F 'sprunge=<-' http://sprunge.us
  • Thanks tkaiser, that was really helpful. I was in the process of manually copying and pasting the outputs ... I'm a bit of a novice, but can follow instructions! :)


    zgrep -E "md|sda|sdb|sdc|sde|sdf|raid" /var/log/syslog* | grep -v systemd | curl -F 'sprunge=<-' http://sprunge.us
    http://sprunge.us/HCaD


    zgrep -E "md|sda|sdb|sdc|sde|sdf|raid" /var/log/messages* | grep -v systemd | curl -F 'sprunge=<-' http://sprunge.us
    http://sprunge.us/KLXL

  • sprunge.us/HCaD

    Ok, the RAID was not coming up since:


    Code
    /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.332607] md/raid:md127: device sda operational as raid disk 0
    /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.332610] md/raid:md127: device sde operational as raid disk 5
    /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.332612] md/raid:md127: device sdd operational as raid disk 3
    /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.332615] md/raid:md127: device sdb operational as raid disk 1
    /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.333341] md/raid:md127: allocated 0kB
    /var/log/syslog:Nov 24 09:32:33 openmediavault kernel: [ 4.333459] md/raid:md127: not enough operational devices (2/6 failed)

    And the trouble started at '/var/log/syslog.1:Nov 23 22:46:31' on sdb. Everything else is beyond my mdraid knowledge (since I hate it wholeheartedly ;) ) but I would at least check SMART attribute 199 of sdb now, and check sdc and sdf as well (since they were also reported as missing today)
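For checking those attributes, smartctl's table can be filtered down to the three IDs that matter most here (5 = reallocated sectors, 197 = pending sectors, 199 = UDMA CRC errors). The sample values below are made up purely for illustration; on a real drive the commented smartctl line is the one to run:

```shell
# Real system:  smartctl -A /dev/sdb | grep -E '^ *(5|197|199) '
# Made-up sample of the smartctl attribute table, for illustration only:
smart_sample='  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    64
197 Current_Pending_Sector  -O--CK   200   200   000    -    8
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0'

# Keep only the lines for attribute IDs 5, 197 and 199:
echo "$smart_sample" | grep -E '^ *(5|197|199) '
```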

  • Re,


    according to the output there is good news ... and bad news:
    - good: the "Magic" is the same on all drives
    - bad: the event counters differ (but they seem close enough to reassemble)


    Any conclusion about the root cause? Finding it is really necessary ...


    Maybe you can "copy" (read: back up) the log files for later searching ...
    cp -v /var/log/messages* /root/20171124-raid-issue (this will create the subdir "20171124-raid-issue" under root's home)
    cp -v /var/log/syslog* /root/20171124-raid-issue


    After the files are copied, you can try to reassemble ...


    Sc0rp

  • Re,


    checked the logs ... here are the most recent error-lines:


    /var/log/syslog.1:Nov 23 22:46:35 openmediavault kernel: [23464678.453691] md/raid:md127: Disk failure on sdb, disabling device.
    /var/log/syslog.1:Nov 23 22:46:35 openmediavault kernel: [23464678.453691] md/raid:md127: Operation continuing on 4 devices.


    That means:
    - sdb was disabled due to massive errors on the device (read errors) ... and with that, your array went from "degraded" to "dead"
    - the 2nd line states that one drive was already missing before that ... and with that, your redundancy was gone


    Don't you have email notifications enabled?
    Which drives (vendor/model) do you use in this setup?
    Which powersupply do you use?


    Sc0rp


    EDIT/PS: you should also check the backlogs ...
    ls -la /var/log | grep syslog (shows the backlogs for syslog)
    ls -la /var/log | grep messages (shows the backlogs for messages)
    Do both commands and note the numbers, then do:
    zcat /var/log/syslog.X.gz | grep sdY (X is a number between 1-7 - look at the lists from the commands above; Y is a, b, c, d, e or f)
    zcat /var/log/messages.X.gz | grep sdY
    for each drive, one by one (start with Y=a), since the drive naming between mdstat and the provided logs looks weird ...
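The per-drive, per-backlog search above can be put into a small loop instead of typing every combination by hand. Printed as a dry run here: each generated command is echoed rather than executed, so nothing is touched until you run the lines yourself:

```shell
# Dry run: print every zcat/grep combination instead of executing it.
# Remove the surrounding echo "..." to run the commands for real.
for y in a b c d e f; do
  for x in 1 2 3 4 5 6 7; do
    echo "zcat /var/log/syslog.$x.gz | grep sd$y"
    echo "zcat /var/log/messages.$x.gz | grep sd$y"
  done
done
```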


    After that, you have to check the SMART status of all drives, as @tkaiser already mentioned!


    Sc0rp

  • @tkaiser


    I can see in the logs that sdb encountered a number of "read error not correctable" errors at that time last night.


    sdb: SMART attribute 199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0


    That drive has some pending sectors too, which have increased slightly in the last week. Was planning to replace the drive (clearly should've done it sooner), but that's not one of the ones that was kicked out, right?


    sdc and sdf both have no pending or reallocated sectors, but ALL 6 drives are showing the same CRC error count as sdb: 199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0


    @Sc0rp


    That sounds reasonably positive. No conclusions drawn, except the info above. To be honest, I don't really know what else I should be looking for in order to come to a conclusion.


    I'll do as you suggested with the logs (though the code you suggested is returning an error, currently: `/root/20171124-raid-issue' is not a directory), and give the reassembly a go.

  • From reasonably positive to, really not positive at all ...


    Yes, I do have email notification ... I'd received emails about the pending sectors, but not to the effect of any failed disc.
    I'm using Seagate Barracuda 3TB discs (which I recently learned were probably not the best ones to use)
    Not sure about the power supply specifically - it is the standard one that came in my HP MicroServer


    I can't understand about the missing drive ... I'm sure all 6 were in the array, previously.


    Is that it then? No chance of resurrecting or rebuilding?

  • Re,

    I'll do as you suggested with the logs (though the code you suggested is returning an error, currently: `/root/20171124-raid-issue' is not a directory), and give the reassembly a go.

    Sorry, that was my error while typing fast ... add a slash at the end:
    cp -v /var/log/messages* /root/20171124-raid-issue/


    After copying the files, you can do the search on this directory ... just change the path from "/var/log/" to "/root/20171124-raid-issue/" ...


    I can't understand about the missing drive ... I'm sure all 6 were in the array, previously.

    But the logs don't lie :P


    Is that it then? No chance of resurrecting or rebuilding?

    You can always try the reassemble with force. The chance is 50:50 ... md is safe in this respect, but you have to expect data loss, since the fs layer (XFS) is damaged too ...
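Before mounting a force-assembled array, the XFS layer can be checked without changing anything via xfs_repair's no-modify mode. Dry run below; /dev/md127 and the mount point are assumed names, not taken from this thread:

```shell
# Dry run: the filesystem check plan is only printed; run it as root once
# the array is assembled. "-n" means no-modify: it only reports problems.
fs_check_plan='xfs_repair -n /dev/md127
mount /dev/md127 /srv/dev-disk-example   # mount only after -n looks sane'

printf '%s\n' "$fs_check_plan"
```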


    And as always: RAID is not backup ... I hope you have a working backup.
    For the future, you should think about changing from RAID5 to ZFS RAID-Z1, or moving to SnapRAID/mergerfs ...


    PSU: standard ATX is good; I was only afraid of the next PicoPSU setup ...


    HDD: Barracudas are not problematic at all; my "old" 2TB ones have been working flawlessly 24/7 ... md-RAID5 @ OMV3 (of course with continuous rsync backup ... and a UPS ... and email notifications ... and other scripts)


    Sc0rp

  • I'm still getting the same error, even when adding the ending slash?

  • Yes, I do have email notification ... I'd received emails about the pending sectors, but not to the effect of any failed disc.

    And that's because I only had the notifications for SMART events turned on :( . I have to go out for a few hours, but will follow up with the log copies when I get back in ... but is there any point? Is there still any hope of salvaging any data?

  • Re,

    I'm still getting the same error, even when adding the ending slash?

    Uhm ... just issue
    ls -la /root to see what's going on in this dir. If there is a plain file called "20171124-raid-issue" (its line then starts with "-"), delete it with:
    rm /root/20171124-raid-issue and try again ... maybe you have to create the subdir first:
    mkdir /root/20171124-raid-issue
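A way to sidestep the whole problem: mkdir -p creates the directory if it is missing and stays silent if it already exists, so directory creation and copy become one robust step. The sketch uses /tmp so it is harmless to try; swap in /root/20171124-raid-issue on the real box:

```shell
# mkdir -p never fails just because the directory already exists.
dest=/tmp/20171124-raid-issue      # use /root/20171124-raid-issue for real
mkdir -p "$dest"

# Then, on the real system:
# cp -v /var/log/messages* /var/log/syslog* "$dest"/
```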


    Sc0rp

  • Hi Guys,


    So, I finally got a chance to save the logs as suggested (I had to manually create the subdirectory to do so). Then the force reassemble worked and the array is visible and mounted again. All files appear to be 'visible', but as you said I'm expecting some to have become corrupted or to be missing data.


    For now, it's a big "phew" and thanks for your help so far!


    To answer your question, no .. I foolishly don't have a backup, and that will be the next step once I've got this array as stable as I can for now, before I look at alternative filesystems like you said.


    What would you suggest the sequence of steps should be to minimise risk of causing more problems? There are 3 drives in the array that are reporting SMART issues:


    Drive | Reallocated Sectors | Pending Sectors | Offline Uncorrectable | CRC Error Rate
    sdb   | 64                  | 0               | 0                     | 0
    sdc   | 1136                | 32              | 32                    | 0
    sdf   | 16                  | 1680            | 1680                  | 0
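Whatever order the SMART numbers suggest, the safe pattern is one drive at a time: fail and remove the worst member, swap the hardware, add the new disk, and wait for the resync to finish before touching the next one. Dry run below; /dev/md127, the choice of sdf first, and the replacement name /dev/sdg are all placeholders, not taken from this thread:

```shell
# Dry run: the replacement sequence is only printed here. One drive at a
# time, and wait until /proc/mdstat shows the resync finished in between.
replace_plan='mdadm /dev/md127 --fail /dev/sdf --remove /dev/sdf
# (physically swap the disk; it may come back under a new name, e.g. /dev/sdg)
mdadm /dev/md127 --add /dev/sdg
cat /proc/mdstat    # watch the rebuild progress'

printf '%s\n' "$replace_plan"
```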



    Which one of the drives above should I swap out and replace first, second and third?


    Thanks again for your help, guys - really appreciate it!
