RAID Missing After Re-Boot - The Sequel

piscdi · 15. April 2021

Made it almost a week thanks to geaves help, but, sadly, here I am again.

RAID status is now showing "clean, FAILED", rather than missing.

Here are the results of the initially required (ryecoarron's) inquiries as well as some of geaves' requested info from the first go-round. I scanned discs C and D as they showed (to my novice eye) issues with reporting (faulty).

Again, my set up is/was RAID5 with (4) 6 TB WD Red discs. And I haven't rebooted since discovering this last night. Any help is most appreciated.

Code

root@Server:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sda[4] sdc[1](F) sde[3] sdd[2](F)
      17581174272 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U__U]
      bitmap: 3/44 pages [12KB], 65536KB chunk

unused devices: <none>
root@Server:~#

Code

root@Server:~# blkid
/dev/sdf1: UUID="b8f86e19-3cb3-4d0a-b1e2-623620314887" TYPE="ext4" PARTUUID="79c1501c-01"
/dev/sdf5: UUID="a631ab49-ab21-4169-8352-aa1829c8a95b" TYPE="swap" PARTUUID="79c1501c-05"
/dev/sdb1: LABEL="BackUp" UUID="9238bbb9-e494-487d-941e-234cad83a670" TYPE="ext4" PARTUUID="d6e47150-672f-4fb8-a57d-72c6ff0ca4ae"
/dev/sde: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="97feb0f7-c46c-0e05-4b6f-4c40a9448f9f" LABEL="Server:Raid1" TYPE="linux_raid_member"
/dev/sdc: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="98a8cd6c-cb21-5f16-8540-aa6c88960541" LABEL="Server:Raid1" TYPE="linux_raid_member"
/dev/sdd: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="df979bad-92c3-ac42-f3e5-512838996555" LABEL="Server:Raid1" TYPE="linux_raid_member"
/dev/md0: LABEL="Raid1" UUID="0f1174dc-fa73-49b0-8af3-c3ddb3caa7ef" TYPE="ext4"
/dev/sda: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="cd9ad946-ea0f-65a1-a2a3-298a258b2f76" LABEL="Server:Raid1" TYPE="linux_raid_member"
root@Server:~#

Code

root@Server:~# cat /etc/mdadm/mdadm.conf
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
# Note, if no DEVICE line is present, then "DEVICE partitions" is assumed.
# To avoid the auto-assembly of RAID devices a pattern that CAN'T match is
# used if no RAID devices are configured.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# definitions of existing MD arrays
ARRAY /dev/md0 metadata=1.2 name=Server:Raid1 UUID=98379905:d139d263:d58d5eb3:893ba95b
root@Server:~#

Alles anzeigen

Code

root@Server:~# mdadm --detail --scan --verbose
ARRAY /dev/md0 level=raid5 num-devices=4 metadata=1.2 name=Server:Raid1 UUID=98379905:d139d263:d58d5eb3:893ba95b
   devices=/dev/sda,/dev/sdc,/dev/sdd,/dev/sde
root@Server:~#

Code

root@Server:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Nov 25 18:05:25 2019
     Raid Level : raid5
     Array Size : 17581174272 (16766.71 GiB 18003.12 GB)
  Used Dev Size : 5860391424 (5588.90 GiB 6001.04 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Apr 15 05:40:00 2021
          State : clean, FAILED
 Active Devices : 2
Working Devices : 2
 Failed Devices : 2
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : Server:Raid1  (local to host Server)
           UUID : 98379905:d139d263:d58d5eb3:893ba95b
         Events : 191669

    Number   Major   Minor   RaidDevice State
       4       8        0        0      active sync   /dev/sda
       -       0        0        1      removed
       -       0        0        2      removed
       3       8       64        3      active sync   /dev/sde

       1       8       32        -      faulty   /dev/sdc
       2       8       48        -      faulty   /dev/sdd
root@Server:~#

Alles anzeigen

Code

root@Server:~# mdadm --examine /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 98379905:d139d263:d58d5eb3:893ba95b
           Name : Server:Raid1  (local to host Server)
  Creation Time : Mon Nov 25 18:05:25 2019
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 11720783024 (5588.90 GiB 6001.04 GB)
     Array Size : 17581174272 (16766.71 GiB 18003.12 GB)
  Used Dev Size : 11720782848 (5588.90 GiB 6001.04 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=176 sectors
          State : clean
    Device UUID : 98a8cd6c:cb215f16:8540aa6c:88960541

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Apr 14 04:13:31 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 802c48bb - correct
         Events : 191597

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
root@Server:~#

Alles anzeigen

Code

root@Server:~# mdadm --examine /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 98379905:d139d263:d58d5eb3:893ba95b
           Name : Server:Raid1  (local to host Server)
  Creation Time : Mon Nov 25 18:05:25 2019
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 11720783024 (5588.90 GiB 6001.04 GB)
     Array Size : 17581174272 (16766.71 GiB 18003.12 GB)
  Used Dev Size : 11720782848 (5588.90 GiB 6001.04 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=176 sectors
          State : clean
    Device UUID : df979bad:92c3ac42:f3e55128:38996555

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Apr 14 04:13:31 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : bd4f81e8 - correct
         Events : 191597

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
root@Server:~#

Alles anzeigen

geaves · 15. April 2021

Sorry but the Raids toast it's showing 2 drives as removed and failed, you can't recover a Raid 5 with more than one drive failure.

piscdi · 15. April 2021

Thanks geaves, that's what I saw but was hoping it wasn't so.

I appreciate your time. Thanks again.

piscdi · 15. April 2021

Hi geaves,

Sorry for the intrusion, but upon reflection (and before I do anything crazy), I'm curious why 3 out 4 drives (A and then C and D) would crap out within a week of each other when for a year and a half they hummed along just fine. I obviously have a problem, but I'm not sure where to begin. The discs look good in the GUI; could it be a corrupt OS issue (or maybe an SAS card issue)? I hate to ask, but do you have any ideas?

OMV was my first foray into RAID, but in all my years with PCs (I started in college with an Atari 800), I've only had one disc fail (hard fail), and that was years ago. Since then, I've used/installed more than a couple of hundred drives from tapes to SSDs and built over two dozen PCs (with both Windows and Linux OSs), all were extremely reliable. But I don't know where to begin with this NAS I built.

As an aside, I have an exact duplicate build using smaller drives (4 WD 4 TB Reds), but running Windows Server 2016 that I use for business; should I be concerned about those drives and that pool as well? Everything has been rock solid with that machine for the year it's been online, but it was with my OMV machine, too, until last week . My research verified WD NAS discs were pretty reliable (at the time I bought them).

Before I begin replacing the drives and upgrading to a later version of OMV, do you have any words of wisdom as to direction? Should I continue with a RAID? My 75% disc failure rate gives me pause. Right now, I'm leaning towards my old strategy of parsing all my files over several discs and backing them up separately. As I mentioned before, I used the array for media mostly and Plex doesn't care where it gets the files, so either strategy will work. I could install five discs and go with RAID 6, but with my luck, three discs would then die . I have room for 10 discs, but c'mon, nobody needs that, hahaha.

If there is anything you could offer, I would appreciate it. I hope you're having a good day.

geaves · 15. April 2021

OK before I eat my #2 have you done an mdadm --examine of sda and sde, as well as a long SMART test

Zitat von piscdi

but running Windows Server 2016 that I use for business; should I be concerned about those drives and that pool as well?

As with anything that runs 24/7 you need some sort of notification in place that warns you of a potential failure. Raid failures have happened to me and that was with notifications, one 15,000rpm SAS reported as failing, ordered a new drive, replaced failing drive. As Raid was rebuilding another drive reports as failing, the whole raid is toast, it now takes 3 days from ordering another drive and rebuilding and restoring the VM's from backup, the school was not impressed. But there's nothing you can do but have a backup in place, the alternative to run 2 DC's, but you have to know what your doing, plus schools don't have that budget.

piscdi · 15. April 2021

Thank you geaves; it may be straw, but I'm grasping for it . Thank you as well for sharing that my situation is anything but unique (actually, quite pedestrian by comparison); the school was very lucky to have you.

Results of drives sda and sde follow. I had not run long SMART tests, just regularly scheduled short tests. I am running them now and they show as completing somewhere around midnight my time. I'll report results tomorrow.

Oddly, I know I set up email notifications when I installed OMV (all are still checked), but the email server info seems to have disappeared. Never realized I was not getting emails anymore; great catch. I'll fix that when I get a functioning NAS back.

I appreciate all your insights and help.

Code

root@Server:~# mdadm --examine /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 98379905:d139d263:d58d5eb3:893ba95b
           Name : Server:Raid1  (local to host Server)
  Creation Time : Mon Nov 25 18:05:25 2019
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 11720783024 (5588.90 GiB 6001.04 GB)
     Array Size : 17581174272 (16766.71 GiB 18003.12 GB)
  Used Dev Size : 11720782848 (5588.90 GiB 6001.04 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=176 sectors
          State : clean
    Device UUID : cd9ad946:ea0f65a1:a2a3298a:258b2f76

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Apr 15 11:30:27 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 37e7390a - correct
         Events : 191685

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : A..A ('A' == active, '.' == missing, 'R' == replacing)
root@Server:~#

Alles anzeigen

Code

root@Server:~# mdadm --examine /dev/sde
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 98379905:d139d263:d58d5eb3:893ba95b
           Name : Server:Raid1  (local to host Server)
  Creation Time : Mon Nov 25 18:05:25 2019
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 11720783024 (5588.90 GiB 6001.04 GB)
     Array Size : 17581174272 (16766.71 GiB 18003.12 GB)
  Used Dev Size : 11720782848 (5588.90 GiB 6001.04 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=176 sectors
          State : clean
    Device UUID : 97feb0f7:c46c0e05:4b6f4c40:a9448f9f

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Apr 15 11:30:27 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 2bea7eda - correct
         Events : 191685

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : A..A ('A' == active, '.' == missing, 'R' == replacing)
root@Server:~#

Alles anzeigen

geaves · 15. April 2021

Zitat von piscdi

I'll report results tomorrow

Looking at the output from sda and sde there's nothing wrong I'm now wondering if you're getting intermittent hardware failure;

1) sata ports dropping out

2) issues with the sata cables and/or power cables to the drives

3) this could be the m'board, power supply

4) OS drive failing

this could be difficult to pin down

piscdi · 15. April 2021

Hi geaves,

We've had some power blips lately, but I have a UPS to mitigate the damage (in theory, of course). But the two issues (last week and last night) seem to coincide closely with power outages. Hmmm.

I did clean (mildly) the case with compressed air two weeks ago, maybe I loosened a connection or two. I can dig into the BIOS to see if the MB and chipset are saying anything (and the i/o SAS card); just have to attach a monitor and keyboard/mouse.

After the long SMART tests are done.

Once they are, do you think its okay to shut down the NAS and take a look?

geaves · 15. April 2021

Zitat von piscdi

Once they are, do you think its okay to shut down the NAS and take a look

Yes, but after you post the smart test output, a power or connection issue is the only thing I can think of.

piscdi · 15. April 2021

Excellent. Thanks, geaves.

I just checked and the first disc is 70% complete. I set up (and confirmed) email notifications again; the results will be conveyed through that medium, correct?

geaves · 15. April 2021

Zitat von piscdi

the results will be conveyed through that medium, correct?

They should be, but this is looking more like hardware or power

piscdi · 15. April 2021

Noted. Wait and see. Just as soon as I know anything (which will be a first)

Thanks again, geaves.

piscdi · 16. April 2021

Hi geaves,

I believe these are the results of the long SMART tests; I found them in SMART/Devices/Information/Extended information . The only email I received was the following (in case you need it):

I noticed there is no 'fingers crossed' emoji; I could use one right about now .

Code

This is an automatically generated mail message from mdadm running on Server

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sda[4] sdc[1](F) sde[3] sdd[2](F)
      17581174272 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U__U]
      bitmap: 3/44 pages [12KB], 65536KB chunk

unused devices: <none>

Alles anzeigen

sda.txt,sdc.txt,sdd.txt,sde.txt

geaves · 16. April 2021

Well the good news is the output from the long test is showing no errors, you're looking at 5, 196, 197 198, and 199, all are showing zero which implies the drives are fine.

So, shutdown and have a look at the connections and if you have some spare new cables switch them over, a restart then may bring it back up

piscdi · 16. April 2021

Bless you, my son. Oh, I've got cables hahaha. Thank you, thank you, thank you.

I might not get to it today; too much 'on the docket'. But certainly tomorrow.

I'll report back. I can't thank you enough for your expertise, geaves.

Have a great weekend.

piscdi · 17. April 2021

Hi geaves,

After shutting down, cleaning the case and replacing the sata and power cables, the RAID has gone missing (unmounted actually); everything else is reporting normally. I did not run any BIOS checks, hoping that a simple cleaning and reboot would get me back on track, so I can't confirm the MB/chipset and other components are fine.

Is this fixable or has my quixotic journey come to an end? System logs repeat the email message warning of no mount point.

I have not rebooted nor tried anything on my own.

Code

Status failed Service mountpoint_srv_dev-disk-by-label-Raid1 

    Date:        Sat, 17 Apr 2021 09:02:16
    Action:      alert
    Host:        Server
    Description: status failed (1) -- /srv/dev-disk-by-label-Raid1 is not a mountpoint

Code

root@Server:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdc[1](S) sda[4](S) sde[3](S) sdd[2](S)
      23441566048 blocks super 1.2

unused devices: <none>
root@Server:~#

Code

root@Server:~# blkid
/dev/sdf1: UUID="b8f86e19-3cb3-4d0a-b1e2-623620314887" TYPE="ext4" PARTUUID="79c1501c-01"
/dev/sdf5: UUID="a631ab49-ab21-4169-8352-aa1829c8a95b" TYPE="swap" PARTUUID="79c1501c-05"
/dev/sdc: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="98a8cd6c-cb21-5f16-8540-aa6c88960541" LABEL="Server:Raid1" TYPE="linux_raid_member"
/dev/sda: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="cd9ad946-ea0f-65a1-a2a3-298a258b2f76" LABEL="Server:Raid1" TYPE="linux_raid_member"
/dev/sdd: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="df979bad-92c3-ac42-f3e5-512838996555" LABEL="Server:Raid1" TYPE="linux_raid_member"
/dev/sde: UUID="98379905-d139-d263-d58d-5eb3893ba95b" UUID_SUB="97feb0f7-c46c-0e05-4b6f-4c40a9448f9f" LABEL="Server:Raid1" TYPE="linux_raid_member"
/dev/sdb1: LABEL="BackUp" UUID="9238bbb9-e494-487d-941e-234cad83a670" TYPE="ext4" PARTUUID="d6e47150-672f-4fb8-a57d-72c6ff0ca4ae"
root@Server:~#

Code

root@Server:~# fdisk -l | grep "Disk "
Disk /dev/sdf: 232.9 GiB, 250059350016 bytes, 488397168 sectors
Disk identifier: 0x79c1501c
Disk /dev/sdc: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Disk /dev/sda: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Disk /dev/sdd: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Disk /dev/sde: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Disk /dev/sdb: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Disk identifier: 9D074D37-5C66-4B80-9C9D-9B55480833ED
root@Server:~#

Code

root@Server:~# cat /etc/mdadm/mdadm.conf
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
# Note, if no DEVICE line is present, then "DEVICE partitions" is assumed.
# To avoid the auto-assembly of RAID devices a pattern that CAN'T match is
# used if no RAID devices are configured.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# definitions of existing MD arrays
ARRAY /dev/md0 metadata=1.2 name=Server:Raid1 UUID=98379905:d139d263:d58d5eb3:893ba95b

# instruct the monitoring daemon where to send mail alerts
MAILADDR xxxxxxxxxxxx
MAILFROM rootroot@Server:~#

Alles anzeigen

Code

root@Server:~# mdadm --detail --scan --verbose
INACTIVE-ARRAY /dev/md0 num-devices=4 metadata=1.2 name=Server:Raid1 UUID=98379905:d139d263:d58d5eb3:893ba95b
   devices=/dev/sda,/dev/sdc,/dev/sdd,/dev/sde
root@Server:~#

geaves · 17. April 2021

Hopefully the failed mountpoint is due to the array being inactive, so;

mdadm --stop /dev/md0

mdadm --assemble --force --verbose /dev/md0 /dev/sd[acde]

piscdi · 17. April 2021

I'm baaaaaaaack. You, sir, are my hero. 'Thank you' just doesn't seem like enough.

geaves, I hope you have a fantastic weekend; I will be, LOL.

Code

root@Server:~# mdadm --assemble --force --verbose /dev/md0 /dev/sd[acde]
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sde is identified as a member of /dev/md0, slot 3.
mdadm: forcing event count in /dev/sdc(1) from 191597 upto 191723
mdadm: forcing event count in /dev/sdd(2) from 191597 upto 191723
mdadm: clearing FAULTY flag for device 1 in /dev/md0 for /dev/sdc
mdadm: clearing FAULTY flag for device 2 in /dev/md0 for /dev/sdd
mdadm: Marking array /dev/md0 as 'clean'
mdadm: added /dev/sdc to /dev/md0 as 1
mdadm: added /dev/sdd to /dev/md0 as 2
mdadm: added /dev/sde to /dev/md0 as 3
mdadm: added /dev/sda to /dev/md0 as 0
mdadm: /dev/md0 has been started with 4 drives.
root@Server:~#

Alles anzeigen

geaves · 17. April 2021

well done

piscdi · 17. April 2021

Right back at you. Thanks again.

Jetzt mitmachen!