RAID Missing After Re-Boot - The Sequel

  • Made it almost a week thanks to geaves' help, but, sadly, here I am again.


    RAID status is now showing "clean, FAILED", rather than missing.


    Here are the results of the initially required (ryecoarron's) inquiries, as well as some of geaves' requested info from the first go-round. I scanned discs C and D, as they showed (to my novice eye) reporting issues (marked faulty).


    Again, my setup is/was RAID5 with (4) 6 TB WD Red discs, and I haven't rebooted since discovering this last night. Any help is most appreciated.

    Code
    root@Server:~# cat /proc/mdstat
    Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
    md0 : active raid5 sda[4] sdc[1](F) sde[3] sdd[2](F)
    17581174272 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U__U]
    bitmap: 3/44 pages [12KB], 65536KB chunk
    unused devices: <none>
    root@Server:~#
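    For quick reference, the mdstat line above can be decoded mechanically: `[4/2]` means 4 slots with only 2 active, and `(F)` marks members mdadm has kicked out as faulty. A sketch against the pasted text (not the live array):

```shell
# Decode the mdstat line pasted above (sample text, not read from /proc).
# Each "(F)" is a member mdadm considers failed.
line='md0 : active raid5 sda[4] sdc[1](F) sde[3] sdd[2](F)'
failed=$(echo "$line" | grep -o '(F)' | wc -l)
echo "failed members: $failed"
```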
    Code
    root@Server:~# blkid
    /dev/sdf1: UUID="b8f86e19-3cb3-4d0a-b1e2-623620314887" TYPE="ext4" PARTUUID="79c1501c-01"
    /dev/sdf5: UUID="a631ab49-ab21-4169-8352-aa1829c8a95b" TYPE="swap" PARTUUID="79c1501c-05"
    /dev/sdb1: LABEL="BackUp" UUID="9238bbb9-e494-487d-941e-234cad83a670" TYPE="ext4" PARTUUID="d6e47150-672f-4fb8-a57d-72c6ff0ca4ae"
    /dev/sde: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="97feb0f7-c46c-0e05-4b6f-4c40a9448f9f" LABEL="Server:Raid1" TYPE="linux_raid_member"
    /dev/sdc: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="98a8cd6c-cb21-5f16-8540-aa6c88960541" LABEL="Server:Raid1" TYPE="linux_raid_member"
    /dev/sdd: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="df979bad-92c3-ac42-f3e5-512838996555" LABEL="Server:Raid1" TYPE="linux_raid_member"
    /dev/md0: LABEL="Raid1" UUID="0f1174dc-fa73-49b0-8af3-c3ddb3caa7ef" TYPE="ext4"
    /dev/sda: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="cd9ad946-ea0f-65a1-a2a3-298a258b2f76" LABEL="Server:Raid1" TYPE="linux_raid_member"
    root@Server:~#
    Code
    root@Server:~# mdadm --detail --scan --verbose
    ARRAY /dev/md0 level=raid5 num-devices=4 metadata=1.2 name=Server:Raid1 UUID=98379905:d139d263:d58d5eb3:893ba95b
    devices=/dev/sda,/dev/sdc,/dev/sdd,/dev/sde
    root@Server:~#
  • Sorry, but the RAID's toast; it's showing 2 drives as removed and failed, and you can't recover a RAID 5 with more than one drive failure.
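    For anyone following along, the arithmetic behind that verdict is simple; a sketch using this array's numbers:

```shell
# RAID 5 back-of-envelope for 4 x 6 TB: one drive's worth of capacity
# goes to parity, so usable space is (n-1) x size, and the array
# survives exactly one failed member.
n=4; size_tb=6
usable=$(( (n - 1) * size_tb ))
echo "usable: ${usable} TB"   # consistent with the 17581174272 1K-blocks in mdstat
echo "tolerates: 1 failed drive; this array shows 2"
```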

    Raid is not a backup! Would you go skydiving without a parachute?

  • Hi geaves,


    Sorry for the intrusion, but upon reflection (and before I do anything crazy), I'm curious why 3 out of 4 drives (A, and then C and D) would crap out within a week of each other when they hummed along just fine for a year and a half. I obviously have a problem, but I'm not sure where to begin. The discs look good in the GUI; could it be a corrupt OS issue (or maybe a SAS card issue)? I hate to ask, but do you have any ideas?


    OMV was my first foray into RAID, but in all my years with PCs (I started in college with an Atari 800), I've only had one disc fail (hard fail), and that was years ago. Since then, I've used/installed more than a couple of hundred drives, from tapes to SSDs, and built over two dozen PCs (with both Windows and Linux OSs); all were extremely reliable. But I don't know where to begin with this NAS I built.


    As an aside, I have an exact duplicate build using smaller drives (4 WD 4 TB Reds) but running Windows Server 2016 that I use for business; should I be concerned about those drives and that pool as well? Everything has been rock solid with that machine for the year it's been online, but so was my OMV machine until last week :(. My research verified WD NAS discs were pretty reliable (at the time I bought them).


    Before I begin replacing the drives and upgrading to a later version of OMV, do you have any words of wisdom as to direction? Should I continue with a RAID? My 75% disc failure rate gives me pause. Right now, I'm leaning towards my old strategy of spreading all my files over several discs and backing them up separately. As I mentioned before, I used the array mostly for media, and Plex doesn't care where it gets the files, so either strategy will work. I could install five discs and go with RAID 6, but with my luck, three discs would then die ^^ . I have room for 10 discs, but c'mon, nobody needs that, hahaha.


    If there is anything you could offer, I would appreciate it. I hope you're having a good day.

  • OK, before I eat my #2, have you done an mdadm --examine of sda and sde, as well as a long SMART test?
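    On the live box those checks would be along the lines of `mdadm --examine /dev/sda /dev/sde` and `smartctl -t long /dev/sda`. When comparing `--examine` output across members, the Events counter is the field that matters; a sketch with made-up sample values (not from the actual drives):

```shell
# Hypothetical sketch: members whose Events counters match (or are very
# close) can usually be re-assembled. The values below are illustrative.
examine_sda='Events : 24682'
examine_sde='Events : 24682'
ev_a=$(echo "$examine_sda" | awk '{print $3}')
ev_e=$(echo "$examine_sde" | awk '{print $3}')
[ "$ev_a" = "$ev_e" ] && echo "event counters match"
```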


    but running Windows Server 2016 that I use for business; should I be concerned about those drives and that pool as well?

    As with anything that runs 24/7, you need some sort of notification in place that warns you of a potential failure. RAID failures have happened to me, and that was with notifications: one 15,000 rpm SAS drive reported as failing, so I ordered a new drive and replaced the failing one. As the RAID was rebuilding, another drive reported as failing, and the whole RAID was toast. It then took 3 days of ordering another drive, rebuilding, and restoring the VMs from backup; the school was not impressed. There's nothing you can do but have a backup in place. The alternative is to run 2 DCs, but you have to know what you're doing, and schools don't have that budget.


  • Thank you geaves; it may be straw, but I'm grasping for it :D. Thank you as well for sharing that my situation is anything but unique (actually, quite pedestrian by comparison); the school was very lucky to have you.


    Results of drives sda and sde follow. I had not run long SMART tests, just regularly scheduled short tests. I am running them now and they show as completing somewhere around midnight my time. I'll report results tomorrow.


    Oddly, I know I set up email notifications when I installed OMV (all are still checked), but the email server info seems to have disappeared. Never realized I was not getting emails anymore; great catch. I'll fix that when I get a functioning NAS back.
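    On the mdadm side, one quick sanity check later (paths and the address below are illustrative; OMV normally manages /etc/mdadm/mdadm.conf itself) is whether mdadm's monitor has a MAILADDR line to send failure mail to:

```shell
# Illustrative only: write a sample mdadm.conf to a temp file and look
# for the MAILADDR line mdadm's monitor uses. On the real box the file
# is typically /etc/mdadm/mdadm.conf; the address here is made up.
sample=$(mktemp)
printf 'ARRAY /dev/md0 metadata=1.2 name=Server:Raid1 UUID=98379905:d139d263:d58d5eb3:893ba95b\nMAILADDR admin@example.com\n' > "$sample"
mail_set=no
grep -q '^MAILADDR' "$sample" && mail_set=yes
echo "mdadm mail target set: $mail_set"
rm -f "$sample"
```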


    I appreciate all your insights and help.


  • I'll report results tomorrow

    :thumbup:


    Looking at the output from sda and sde, there's nothing wrong ?( I'm now wondering if you're getting an intermittent hardware failure:

    1) sata ports dropping out

    2) issues with the sata cables and/or power cables to the drives

    3) the motherboard or power supply

    4) OS drive failing

    This could be difficult to pin down.


  • Hi geaves,


    We've had some power blips lately, but I have a UPS to mitigate the damage (in theory, of course). But the two issues (last week and last night) seem to coincide closely with power outages. Hmmm.


    I did clean (mildly) the case with compressed air two weeks ago; maybe I loosened a connection or two. I can dig into the BIOS to see if the MB and chipset (and the I/O SAS card) are saying anything; I just have to attach a monitor and keyboard/mouse.


    After the long SMART tests are done.


    Once they are, do you think it's okay to shut down the NAS and take a look?

  • Hi geaves,


    I believe these are the results of the long SMART tests; I found them in SMART/Devices/Information/Extended information. The only email I received was the following (in case you need it):


    I noticed there is no 'fingers crossed' emoji; I could use one right about now :).

    sda.txt,sdc.txt,sdd.txt,sde.txt

  • Well, the good news is the output from the long test is showing no errors. You're looking at attributes 5, 196, 197, 198, and 199; all are showing zero, which implies the drives are fine.
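    Those attribute IDs can be pulled out of `smartctl -A` output with a one-liner; the names and sample values below are illustrative, not from the actual drives:

```shell
# Filter the health-critical SMART attributes geaves names and flag any
# nonzero value. The sample table stands in for real 'smartctl -A' output.
sample='  5 Reallocated_Sector_Ct 0
196 Reallocated_Event_Count 0
197 Current_Pending_Sector 0
198 Offline_Uncorrectable 0
199 UDMA_CRC_Error_Count 0'
result=$(echo "$sample" | awk '$3 != 0 { bad = 1; print "nonzero:", $2 } END { if (!bad) print "all clear" }')
echo "$result"
```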


    So, shut down and have a look at the connections, and if you have some spare new cables, switch them over; a restart then may bring it back up.


  • Bless you, my son. Oh, I've got cables hahaha. Thank you, thank you, thank you.


    I might not get to it today; too much 'on the docket'. But certainly tomorrow.


    I'll report back. I can't thank you enough for your expertise, geaves.


    Have a great weekend.

  • Hi geaves,


    After shutting down, cleaning the case, and replacing the SATA and power cables, the RAID has gone missing (unmounted, actually); everything else is reporting normally. I did not run any BIOS checks, hoping that a simple cleaning and reboot would get me back on track, so I can't confirm the MB/chipset and other components are fine.


    Is this fixable or has my quixotic journey come to an end? System logs repeat the email message warning of no mount point.


    I have not rebooted nor tried anything on my own.


    Code
    Status failed Service mountpoint_srv_dev-disk-by-label-Raid1
    Date: Sat, 17 Apr 2021 09:02:16
    Action: alert
    Host: Server
    Description: status failed (1) -- /srv/dev-disk-by-label-Raid1 is not a mountpoint
    Code
    root@Server:~# cat /proc/mdstat
    Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
    md0 : inactive sdc[1](S) sda[4](S) sde[3](S) sdd[2](S)
    23441566048 blocks super 1.2
    unused devices: <none>
    root@Server:~#
    Code
    root@Server:~# blkid
    /dev/sdf1: UUID="b8f86e19-3cb3-4d0a-b1e2-623620314887" TYPE="ext4" PARTUUID="79c1501c-01"
    /dev/sdf5: UUID="a631ab49-ab21-4169-8352-aa1829c8a95b" TYPE="swap" PARTUUID="79c1501c-05"
    /dev/sdc: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="98a8cd6c-cb21-5f16-8540-aa6c88960541" LABEL="Server:Raid1" TYPE="linux_raid_member"
    /dev/sda: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="cd9ad946-ea0f-65a1-a2a3-298a258b2f76" LABEL="Server:Raid1" TYPE="linux_raid_member"
    /dev/sdd: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="df979bad-92c3-ac42-f3e5-512838996555" LABEL="Server:Raid1" TYPE="linux_raid_member"
    /dev/sde: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="97feb0f7-c46c-0e05-4b6f-4c40a9448f9f" LABEL="Server:Raid1" TYPE="linux_raid_member"
    /dev/sdb1: LABEL="BackUp" UUID="9238bbb9-e494-487d-941e-234cad83a670" TYPE="ext4" PARTUUID="d6e47150-672f-4fb8-a57d-72c6ff0ca4ae"
    root@Server:~#
    Code
    root@Server:~# fdisk -l | grep "Disk "
    Disk /dev/sdf: 232.9 GiB, 250059350016 bytes, 488397168 sectors
    Disk identifier: 0x79c1501c
    Disk /dev/sdc: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/sda: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/sdd: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/sde: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
    Disk /dev/sdb: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
    Disk identifier: 9D074D37-5C66-4B80-9C9D-9B55480833ED
    root@Server:~#
    Code
    root@Server:~# mdadm --detail --scan --verbose
    INACTIVE-ARRAY /dev/md0 num-devices=4 metadata=1.2 name=Server:Raid1 UUID=98379905:d139d263:d58d5eb3:893ba95b
    devices=/dev/sda,/dev/sdc,/dev/sdd,/dev/sde
    root@Server:~#
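    One hopeful detail in the new mdstat: when an array is inactive, every member is listed as `(S)` (spare) regardless of its real role, and all four drives are still present. A quick count against the pasted line:

```shell
# Count members in the inactive mdstat line pasted above (sample text,
# not read from /proc). "(S)" here just means "not assembled", since an
# inactive array reports every member as a spare.
line='md0 : inactive sdc[1](S) sda[4](S) sde[3](S) sdd[2](S)'
members=$(echo "$line" | grep -o '(S)' | wc -l)
echo "members present: $members"
```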
  • Hopefully the failed mountpoint is due to the array being inactive, so:


    mdadm --stop /dev/md0


    mdadm --assemble --force --verbose /dev/md0 /dev/sd[acde]
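
    A cautious variant of those two steps is to confirm the array really is inactive first; a dry-run sketch (swap the pasted sample for `grep '^md0' /proc/mdstat` on the live system):

```shell
# Dry-run guard around the stop/assemble pair: only proceed when mdstat
# actually reports md0 as inactive. Sample line pasted in for illustration.
mdstat_line='md0 : inactive sdc[1](S) sda[4](S) sde[3](S) sdd[2](S)'
if echo "$mdstat_line" | grep -q 'inactive'; then
    echo "would run: mdadm --stop /dev/md0"
    echo "then:      mdadm --assemble --force --verbose /dev/md0 /dev/sd[acde]"
    acted=yes
fi
```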


  • :D I'm baaaaaaaack. You, sir, are my hero. 'Thank you' just doesn't seem like enough.


    geaves, I hope you have a fantastic weekend; I will be, LOL.

