"clean, degraded" array state only a couple of weeks after setting it up

  • I just set up OMV in a VM on ESXi a couple of weeks ago, and the RAID array state is now showing "clean, degraded". I'm using 4 x 8TB WD80EFZX 5400RPM drives in RAID10. I'm not sure what happened to cause this; I copied all the data to it just a couple of days ago, and about 3TB of the 14.4TB is in use. Here's the output of the commands requested in the pinned post. If you need any more information, please let me know. The SMART status of all the drives shows green/good.


    cat /proc/mdstat

    Code
    Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
    md0 : active raid10 sde[3] sdc[1]
          15627790336 blocks super 1.2 512K chunks 2 near-copies [4/2] [_U_U]
          bitmap: 117/117 pages [468KB], 65536KB chunk


    blkid

    Code
    /dev/sda1: UUID="5fd7c9d7-d9b4-4c03-ba51-7017ae8018fa" TYPE="ext4" PARTUUID="9856ddaf-01"
    /dev/sda5: UUID="1671fa99-4244-4fab-9cad-975e63b1b012" TYPE="swap" PARTUUID="9856ddaf-05"
    /dev/sdc: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="3e5f75fb-76e5-6ebb-bd0c-77473af3ad0f" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
    /dev/sdb: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="56d31a5b-4c6d-258d-b997-5e807d463250" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
    /dev/sdd: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="944b81c7-e25c-988d-971f-2d1f5a3cc058" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
    /dev/md0: LABEL="acmraid" UUID="3301097f-2458-4e36-94e5-e633cd21dfcc" TYPE="ext4"
    /dev/sde: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="6158fc62-6e62-d339-ecb2-ca05b94822b9" LABEL="acmomv:acmraid" TYPE="linux_raid_member"

    fdisk -l | grep "Disk "

    Code
    Disk /dev/sda: 8 GiB, 8589934592 bytes, 16777216 sectors
    Disk identifier: 0x9856ddaf
    Disk /dev/sdc: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
    Disk /dev/sdb: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
    Disk /dev/sdd: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
    Disk /dev/sde: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
    Disk /dev/md0: 14.6 TiB, 16002857304064 bytes, 31255580672 sectors

    cat /etc/mdadm/mdadm.conf

    mdadm --detail --scan --verbose

    Code
    ARRAY /dev/md0 level=raid10 num-devices=4 metadata=1.2 name=acmomv:acmraid UUID=d856f092:84997949:b3f9705e:26a12002
       devices=/dev/sdc,/dev/sde

    I noticed that when I click Details in the RAID Management section, I see this. What does "removed" mean?

    Code
    Number   Major   Minor   RaidDevice State
           -       0        0        0      removed
           1       8       32        1      active sync set-B   /dev/sdc
           -       0        0        2      removed
           3       8       64        3      active sync set-B   /dev/sde


    I also noticed that my read/write speeds are nothing to rave about: about 67 MB/sec write and 95 MB/sec read on average. But since this is the first time I've ever set up RAID, I don't know what to expect. Any help is appreciated.
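
    In case the numbers are suspect, speeds can be sanity-checked directly on the box like this (the test file path is just a placeholder; oflag=direct bypasses the page cache so the write figure isn't inflated):

    Code
    # sequential read straight from the md device
    hdparm -t /dev/md0

    # sequential write of a 2GB test file through the filesystem
    dd if=/dev/zero of=/srv/raid/testfile bs=1M count=2048 oflag=direct
    rm /srv/raid/testfile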

  • Could you be more specific about the hardware?


    Mobo, CPU, RAM ...


    We can't guess and compare between a VM and something real.

  • Sorry, forgot that while adding all the other details.


    ESXi is running on a Dell PowerEdge T110 II with a Xeon E3-1230 V2 and 32GB of ECC RAM. The OMV VM has 2GB of RAM and 4 vCPUs. The four 8TB HDDs are connected via an LSI 9211-8i HBA in IT mode, set to direct passthrough to the VM.


    EDIT: What exactly does "degraded" mean in OMV?

  • Degraded means something is wrong.


    It could be hardware or data related (for example, a data sync that hasn't finished).


    In your case, I have seen posts about needing to convert the LSI 9211-8i HBA card to Initiator Target (IT) mode under some Linux distros (also under FreeNAS) in order to have all the HDDs show up. This is done through the LSI BIOS. I also remember a minimum firmware level to respect ... but I can't remember where ...
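
    If it helps, the card's firmware level can be read from a running Linux system with LSI's sas2flash utility, assuming you have it installed (it is LSI's own tool, not something OMV ships):

    Code
    sas2flash -list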


    found this:


    This is just troubleshooting:

    • Is your LSI 9211-8i controller healthy?
    • When the <<<Press Ctrl-C to start LSI Logic Configuration Utility>>> prompt is displayed, press Ctrl+C.
    • View the disks in the MPT SAS BIOS utility. Select "SAS Topology"... Do you see your disks? (very important)
    • Try different cables. (important)
    • Try a different PCIe slot.
    • Are the disks known to be good?
  • As I said in my last post, the controller is already in IT (Initiator Target) mode; I purchased it like that. Also, I was able to see all 4 disks in OMV when I initially set up the RAID array, and I can still see all 4 disks in the Disks section of OMV.


    In the Detail popup of the RAID Management section, I see this. What does it mean when it says "Raid Devices: 4" but "Total Devices: 2", and below that "Active Devices: 2" and "Working Devices: 2" while "Failed Devices" says 0?
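
    For reference, that part of the Detail popup reads:

    Code
         Raid Devices : 4
        Total Devices : 2
                State : clean, degraded
       Active Devices : 2
      Working Devices : 2
       Failed Devices : 0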



    I also checked the SMART logs and I see a couple of entries like this:

    Code
    ATA error count increased from 64475 to 64477

    and this one on a different HDD:

    Code
    SMART Usage Attribute: 199 UDMA_CRC_Error_Count changed from 200 to 198

    I think the problem is I don't know what I SHOULD be seeing in OMV. What should Raid Devices, Total Devices, Active Devices and Working Devices show in a properly healthy environment? Should all the devices be listed in the RAID Management section? This is all new to me, so I don't have a benchmark for what is OK and what isn't. And why is everything seemingly still working? The only reason I noticed the "clean, degraded" state was that the read/write speeds were a tad slow and I did some digging; other than a few not-so-obvious warning flags, the OMV UI looks like business as usual. There's no big red warning saying one or more drives are failing or that something is wrong. I had to dig into the details, which seems kind of backwards to me.
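
    In case the raw values matter, the attributes can be pulled straight off a drive with smartctl (the device name here is just one of the drives that logged errors):

    Code
    smartctl -A /dev/sdc | grep -iE 'udma_crc|pending|reallocated'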

  • Just tested swapping the SATA connectors on the SFF-8087 breakout cable I have, and I've confirmed that it seems to be a faulty cable.


    Before swap:


    Code
    Port  Dev       Status
    0     /dev/sdb  Removed
    1     /dev/sdc  Connected
    2     /dev/sdd  Removed
    3     /dev/sde  Connected



    After swap:


    Code
    Port  Dev       Status
    0     /dev/sde  Removed
    1     /dev/sdd  Connected
    2     /dev/sdc  Removed
    3     /dev/sdb  Connected



    Based on that alone, I think I can safely say it's a problem with the cable.
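
    Assuming the cable really is the culprit, my understanding is that the dropped disks won't rejoin on their own and have to be re-added manually, something like this (device names taken from the listing above, and they can shift across reboots):

    Code
    mdadm /dev/md0 --re-add /dev/sdb
    mdadm /dev/md0 --re-add /dev/sdd

    With the write-intent bitmap that mdstat shows, the re-add should only need to resync the blocks written while the disks were out.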


    EDIT: I'm also concerned about the SMART errors in the logs. Any thoughts on that?

  • So step four was right! Don't worry about the UDMA_CRC error; that error is usually a bad SATA cable or a SATA cable wrapped around a power wire. ATA errors can be caused by something other than the drives: cables, power, etc. The two errors are linked.
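
    If you want to see what actually triggered them, each drive keeps an ATA error log that smartctl can dump (point it at the affected drive):

    Code
    smartctl -l error /dev/sdc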

  • If you want to laugh at me: I've just experienced a bad SATA cable in my own RAID ...


    After cleaning the tons of dust out of my tower, I started swapping out one of my defective test HDDs, taking the components apart (which meant letting the cables drop on the floor).


    After reassembling the tower, the new disk was not showing up. I spent 20 minutes trying to understand what the hell was going on!


    Well, the explanation was simple: a bad cable, due to a kitten attack!!!


    BEWARE !!!


  • Haha. Gotta keep those kittens clear of the equipment!


    Thanks again for the help in troubleshooting! My biggest concern is that there were almost no warning flags in the OMV UI. I would expect that if one (or more) drives were failing or not connected, OMV would be screaming bloody murder. Instead, I had to do a whole bunch of investigation, and even then it really wasn't obvious what had occurred. The shares were still working, which was good, but it would be nice to have some big red warnings next time. Any idea if this is just some configuration I need to set? I'm still new to OMV.

  • Well, the notifications are sent to me via mail, on my smartphone, so I can react to RAID/SMART/filesystem/CPU/load issues ...
    You need to configure your notification mail server ...
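
    Once the mail settings are saved, one way to confirm that mail actually leaves the box is from the shell (assuming a mail command such as the one from mailutils is installed; the address is a placeholder):

    Code
    echo "notification test" | mail -s "OMV test" you@example.com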

  • Ugh. Bought a new cable and same thing! Two of the drives appear as removed and the state is still "clean, degraded".


    What else could this be?


    The CRC error counts are still increasing and the ATA errors are still showing up too.

  • Multiple causes:


    - bad SATA cable
    - PSU undervoltage
    - too many HDDs on the same power line
    - cats
    - bad SATA controller (defective or overheating)
    - bad BIOS or bad AHCI config; check your BIOS settings (be sure to be in AHCI mode, check any BIOS RAID config, follow logical HDD numbering)


    Given the results of your cable switch, I don't think the HDDs are faulty, and I don't think the array is defective. If you have an extra fan, try to cool your SATA chip ... Sorry, not much help here, but it clearly looks like a hardware fault.
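
    One more thing worth checking is the kernel log; link resets and CRC problems usually leave a trail there (the grep pattern is just a starting point):

    Code
    dmesg | grep -iE 'ata|sas|link|reset'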

  • Is it possible I accidentally/unintentionally "removed" the disks from the array in OMV? I mean, I don't remember doing it (or even know HOW to do it, for that matter), but is that a possibility?


    If it were faulty hardware or a bad BIOS config, why would it still be able to see all 4 disks and read various information off them (SMART data, size, etc.)?

  • Each HDD under Linux gets a UUID, which is linked to an sdX name. When switching SATA cables, you might have screwed up the pairing:
    Sataconn1---uuid455---sda
    Sataconn2---uuid788---sdb
    is not the same as
    Sataconn1---uuid788---sda ...
    So it might be possible ...
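
    A quick way to see which physical drive currently sits behind which sdX name is the by-id listing (the drive serial number is part of the link name):

    Code
    ls -l /dev/disk/by-id/ | grep -v part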

  • Well, I meant "removed" via the UI options. The removed state was showing before I touched any of the cables. That said, I did switch the cables afterward, and the sda, sdb, sdc ... mappings do seem to be different now, but 2 drives are always removed. When I switched the SATA cables, I was tracking which HDD serial numbers were mapped to which port, so each drive was uniquely identified.


    Is there anything I can do now?


    Can I delete the RAID array, re-add the drives, and still keep all the data on them?

  • I don't think it's a power issue, because the tower (a Dell PowerEdge T110 II) has 4 HDD bays plus 2 bays at the top for more HDDs or optical drives, so it's made to support that many devices. I don't think it's a problem with the cable, since I already replaced it. I don't think it's the LSI 9211-8i card, because its BIOS page can see all the drives, and OMV can see all the drives and show the SMART status for all of them. And when I initially set up the RAID array, I was able to select all 4 drives and create the RAID10 array. Sure, it's possible that some hardware failed afterward, but based on the information I have, I really don't think that's the case. At this point I'm kind of out of solutions, as I don't know enough about OMV or mdadm yet and haven't found an answer. The HDD all my data was previously on was failing badly; I barely managed to copy the data off of it, and I no longer have a separate HDD that can hold the 3TB I copied to the RAID array. So this is my plan:


    • Buy a new 4TB drive (or maybe a fifth 8TB WD80EFZX as a spare for when one of the drives in the array fails)
    • Copy all the data from the RAID array to the new drive temporarily
    • Kill the RAID array and recreate it
    • Copy all the data back

    If you have any other suggestions before I do this, please let me know.
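
    Before I kill anything, I'm also going to dump the RAID metadata from each member disk and keep a copy, in case the recreate goes sideways (the glob assumes the members are still sdb through sde):

    Code
    mdadm --examine /dev/sd[bcde] > /root/md0-examine.txt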

  • Well, I am not sure, and today I am away, so I can't test on my VM.

  • Well... I finally got a spare HDD to copy all the files from the RAID array, and about 10% into the copy it seems like the RAID drives are failing. The shared drive is offline, and when I access the File Systems or RAID Management section in the OMV UI, I get "Communication Error".


    I feel like a drive (or multiple drives) has failed. How can I confirm?


    I've attached the output from the VM while it was trying to shut down -- it's been a long time and it still hasn't managed to shut down.


    EDIT: After rebooting OMV, it seemed to start up fine, but there's no RAID array in the RAID Management section. All the disks appear in the Disks section and they're all green in the SMART section.
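
    The first thing I'm going to try from the shell is asking mdadm what it sees on each member and then letting it attempt a reassembly (the device names are a guess at this point):

    Code
    mdadm --examine /dev/sd[bcde]
    mdadm --assemble --scan --verbose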
