"clean, degraded" array state only a couple of weeks after setting it up

  • I just set up OMV in a VM on ESXi a couple of weeks ago, and the RAID array state is now showing "clean, degraded". I'm using 4 x 8TB WD80EFZX 5400RPM drives in RAID10. I'm not sure what happened to cause this; I copied all the data to it just a couple of days ago, and about 3TB of the 14.4TB is in use. Here's the output of the commands requested in the pinned post. If you need any more information, please let me know. The SMART status of all the drives shows green/good.


    cat /proc/mdstat

    Code
    Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
    md0 : active raid10 sde[3] sdc[1]
          15627790336 blocks super 1.2 512K chunks 2 near-copies [4/2] [_U_U]
          bitmap: 117/117 pages [468KB], 65536KB chunk


    blkid

    Code
    /dev/sda1: UUID="5fd7c9d7-d9b4-4c03-ba51-7017ae8018fa" TYPE="ext4" PARTUUID="9856ddaf-01"
    /dev/sda5: UUID="1671fa99-4244-4fab-9cad-975e63b1b012" TYPE="swap" PARTUUID="9856ddaf-05"
    /dev/sdc: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="3e5f75fb-76e5-6ebb-bd0c-77473af3ad0f" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
    /dev/sdb: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="56d31a5b-4c6d-258d-b997-5e807d463250" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
    /dev/sdd: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="944b81c7-e25c-988d-971f-2d1f5a3cc058" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
    /dev/md0: LABEL="acmraid" UUID="3301097f-2458-4e36-94e5-e633cd21dfcc" TYPE="ext4"
    /dev/sde: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="6158fc62-6e62-d339-ecb2-ca05b94822b9" LABEL="acmomv:acmraid" TYPE="linux_raid_member"

    fdisk -l | grep "Disk "

    Code
    Disk /dev/sda: 8 GiB, 8589934592 bytes, 16777216 sectors
    Disk identifier: 0x9856ddaf
    Disk /dev/sdc: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
    Disk /dev/sdb: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
    Disk /dev/sdd: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
    Disk /dev/sde: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
    Disk /dev/md0: 14.6 TiB, 16002857304064 bytes, 31255580672 sectors

    cat /etc/mdadm/mdadm.conf

    mdadm --detail --scan --verbose

    Code
    ARRAY /dev/md0 level=raid10 num-devices=4 metadata=1.2 name=acmomv:acmraid UUID=d856f092:84997949:b3f9705e:26a12002
       devices=/dev/sdc,/dev/sde

    I noticed that when I click Details in the RAID Management section, I see this. What does "removed" mean?

    Code
    Number   Major   Minor   RaidDevice State
           -       0        0        0      removed
           1       8       32        1      active sync set-B   /dev/sdc
           -       0        0        2      removed
           3       8       64        3      active sync set-B   /dev/sde


    I also noticed that my read/write speeds are nothing to rave about: about 67 MB/sec write and 95 MB/sec read on average. But since this is the first time I've ever set up RAID, I don't know what to expect. Any help is appreciated.
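
    In case the numbers are suspect, speeds can be sanity-checked directly on the box like this (the test file path is just a placeholder; oflag=direct bypasses the page cache so the write figure isn't inflated):

    Code
    # sequential read straight from the md device
    hdparm -t /dev/md0

    # sequential write of a 2GB test file through the filesystem
    dd if=/dev/zero of=/srv/raid/testfile bs=1M count=2048 oflag=direct
    rm /srv/raid/testfile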

  • Could you be more specific about the hardware?


    Mobo, CPU, RAM ...


    We can't guess and compare between a VM and something real.

  • Sorry, forgot that while adding all the other details.


    ESXi is running on a Dell PowerEdge T110 II with a Xeon E3-1230 V2 and 32GB of ECC RAM. The OMV VM has 2GB of RAM and 4 vCPUs. The four 8TB HDDs are connected via an LSI 9211-8i HBA in IT mode, set to direct passthrough to the VM.


    EDIT: What exactly does "degraded" mean in OMV?

  • Degraded means something is wrong.


    It could be hardware or data related (for example, a data sync that hasn't finished).


    In your case, I have seen posts about needing to convert the LSI 9211-8i HBA card to Initiator Target (IT) mode under some Linux distros (also under FreeNAS) in order to have all the HDDs show up. This is done through the LSI BIOS. I also remember a minimum firmware level to respect ... but I can't remember where ...
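
    If it helps, the card's firmware level can be read from a running Linux system with LSI's sas2flash utility, assuming you have it installed (it is LSI's own tool, not something OMV ships):

    Code
    sas2flash -list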


    found this:


    This is just troubleshooting:

    • Is your LSI 9211-8i controller healthy?
    • When the <<<Press Ctrl-C to start LSI Logic Configuration Utility>>> prompt is displayed, press Ctrl+C.
    • View the disks in the MPT SAS BIOS utility. Select "SAS Topology"... Do you see your disks? (very important)
    • Try different cables. (important)
    • Try a different PCIe slot.
    • Are the disks known to be good?
  • As I said in my last post, the controller is already in IT (Initiator Target) mode; I purchased it like that. Also, I was able to see all 4 disks in OMV when I initially set up the RAID array, and I can still see all 4 disks in the Disks section of OMV.


    In the Detail popup of the RAID Management section, I see this. What does it mean when it says "Raid Devices: 4" but "Total Devices: 2", and below that "Active Devices: 2" and "Working Devices: 2" while "Failed Devices" says 0?
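
    For reference, that part of the Detail popup reads:

    Code
         Raid Devices : 4
        Total Devices : 2
                State : clean, degraded
       Active Devices : 2
      Working Devices : 2
       Failed Devices : 0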



    I also checked the SMART logs and I see a couple of entries like this:

    Code
    ATA error count increased from 64475 to 64477

    and this one on a different HDD:

    Code
    SMART Usage Attribute: 199 UDMA_CRC_Error_Count changed from 200 to 198

    I think the problem is I don't know what I SHOULD be seeing in OMV. What should Raid Devices, Total Devices, Active Devices and Working Devices show in a properly healthy environment? Should all the devices be listed in the RAID Management section? This is all new to me, so I don't have a benchmark for what is OK and what isn't. And why is everything seemingly still working? The only reason I noticed the "clean, degraded" state was that the read/write speeds were a tad slow and I did some digging; other than a few not-so-obvious warning flags, the OMV UI looks like business as usual. There's no big red warning saying one or more drives are failing or that something is wrong. I had to dig into the details, which seems kind of backwards to me.
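
    In case the raw values matter, the attributes can be pulled straight off a drive with smartctl (the device name here is just one of the drives that logged errors):

    Code
    smartctl -A /dev/sdc | grep -iE 'udma_crc|pending|reallocated'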

  • Just tested swapping the SATA connectors on the SFF-8087 breakout cable I have, and I've confirmed that it seems to be a faulty cable.


    Before swap:


    Code
    Port  Dev       Status
    0     /dev/sdb  Removed
    1     /dev/sdc  Connected
    2     /dev/sdd  Removed
    3     /dev/sde  Connected



    After swap:


    Code
    Port  Dev       Status
    0     /dev/sde  Removed
    1     /dev/sdd  Connected
    2     /dev/sdc  Removed
    3     /dev/sdb  Connected



    Based on that alone, I think I can safely say it's a problem with the cable.
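
    Assuming the cable really is the culprit, my understanding is that the dropped disks won't rejoin on their own and have to be re-added manually, something like this (device names taken from the listing above, and they can shift across reboots):

    Code
    mdadm /dev/md0 --re-add /dev/sdb
    mdadm /dev/md0 --re-add /dev/sdd

    With the write-intent bitmap that mdstat shows, the re-add should only need to resync the blocks written while the disks were out.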


    EDIT: I'm also concerned about the SMART errors in the logs. Any thoughts on that?

  • So step four was right! Don't worry about the UDMA_CRC error; that error is usually a bad SATA cable or a SATA cable wrapped around a power wire. ATA errors can be caused by something other than the drives: cables, power, etc. The two errors are linked.
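
    If you want to see what actually triggered them, each drive keeps an ATA error log that smartctl can dump (point it at the affected drive):

    Code
    smartctl -l error /dev/sdc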

  • If you want to laugh at me: I've just experienced a bad SATA cable in my own RAID ...


    After cleaning the tons of dust out of my tower, I started swapping out one of my defective test HDDs, taking the components apart (which meant letting the cables drop on the floor).


    After reassembling the tower, the new disk was not showing up. I spent 20 minutes trying to understand what the hell was going on!


    Well, the explanation was simple: a bad cable, due to a kitten attack!!!


    BEWARE !!!


  • Haha. Gotta keep those kittens clear of the equipment!


    Thanks again for the help in troubleshooting! My biggest concern is that there were almost no warning flags in the OMV UI. I would expect that if one (or more) drives were failing or not connected, OMV would be screaming bloody murder. Instead, I had to do a whole bunch of investigation, and even then it really wasn't obvious what had occurred. The shares were still working, which was good, but it would be nice to have some big red warnings next time. Any idea if this is just some configuration I need to set? I'm still new to OMV.

  • Well, the notifications are sent to me via mail, on my smartphone, so I can react to RAID/SMART/filesystem/CPU/load issues ...
    You need to configure your notification mail server ...
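
    Once the mail settings are saved, one way to confirm that mail actually leaves the box is from the shell (assuming a mail command such as the one from mailutils is installed; the address is a placeholder):

    Code
    echo "notification test" | mail -s "OMV test" you@example.com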

  • Ugh. Bought a new cable and same thing! Two of the drives appear as removed and the state is still "clean, degraded".


    What else could this be?


    The CRC error counts are still increasing and the ATA errors are still showing up too.

  • Multiple causes:


    - bad SATA cable
    - PSU undervoltage
    - too many HDDs on the same power line
    - cats
    - bad SATA controller (defective or overheating)
    - bad BIOS or bad AHCI config; check your BIOS settings (be sure to be in AHCI mode, check any BIOS RAID config, follow logical HDD numbering)


    Given the results of your cable switch, I don't think the HDDs are faulty, and I don't think the array is defective. If you have an extra fan, try to cool your SATA chip ... Sorry, not much help here, but it clearly looks like a hardware fault.
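
    One more thing worth checking is the kernel log; link resets and CRC problems usually leave a trail there (the grep pattern is just a starting point):

    Code
    dmesg | grep -iE 'ata|sas|link|reset'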

  • Is it possible I accidentally/unintentionally "removed" the disks from the array in OMV? I mean, I don't remember doing it (or even know HOW to do it, for that matter), but is that a possibility?


    If it were faulty hardware or a bad BIOS config, why would it still be able to see all 4 disks and read various information off them (SMART data, size, etc.)?

  • Each HDD under Linux gets a UUID, which is linked to an sdX name. When switching SATA cables, you might have screwed up the pairing:
    Sataconn1---uuid455---sda
    Sataconn2---uuid788---sdb
    is not the same as
    Sataconn1---uuid788---sda ...
    So it might be possible ...
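
    A quick way to see which physical drive currently sits behind which sdX name is the by-id listing (the drive serial number is part of the link name):

    Code
    ls -l /dev/disk/by-id/ | grep -v part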

  • Well, I meant "removed" via the UI options. The removed state was showing before I touched any of the cables. That said, I did switch the cables afterward, and the sda, sdb, sdc ... mappings do seem to be different now, but 2 drives are always removed. When I switched the SATA cables, I was tracking which HDD serial numbers were mapped to which port, so each drive was uniquely identified.


    Is there anything I can do now?


    Can I delete the RAID array, re-add the drives, and still keep all the data on them?

  • I don't think it's a power issue, because the tower (a Dell PowerEdge T110 II) has 4 HDD bays plus 2 bays at the top for more HDDs or optical drives, so it's made to support that many devices. I don't think it's a problem with the cable, since I already replaced it. I don't think it's the LSI 9211-8i card, because its BIOS page can see all the drives, and OMV can see all the drives and show the SMART status for all of them. And when I initially set up the RAID array, I was able to select all 4 drives and create the RAID10 array. Sure, it's possible that some hardware failed afterward, but based on the information I have, I really don't think that's the case. At this point I'm kind of out of solutions, as I don't know enough about OMV or mdadm yet and haven't found an answer. The HDD all my data was previously on was failing badly; I barely managed to copy the data off of it, and I no longer have a separate HDD that can hold the 3TB I copied to the RAID array. So this is my plan:


    • Buy a new 4TB drive (or maybe a fifth 8TB WD80EFZX as a spare for when one of the drives in the array fails)
    • Copy all the data from the RAID array to the new drive temporarily
    • Kill the RAID array and recreate it
    • Copy all the data back

    If you have any other suggestions before I do this, please let me know.
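
    Before I kill anything, I'm also going to dump the RAID metadata from each member disk and keep a copy, in case the recreate goes sideways (the glob assumes the members are still sdb through sde):

    Code
    mdadm --examine /dev/sd[bcde] > /root/md0-examine.txt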

  • Well, I am not sure, and today I am away, so I can't test on my VM.

  • Well... I finally got a spare HDD to copy all the files from the RAID array, and about 10% into the copy it seems like the RAID drives are failing. The shared drive is offline, and when I access the File Systems or RAID Management section in the OMV UI, I get "Communication Error".


    I feel like a drive (or multiple drives) has failed. How can I confirm?


    I've attached the output from the VM while it was trying to shut down -- it's been a long time and it still hasn't managed to shut down.


    EDIT: After rebooting OMV, it seemed to start up fine, but there's no RAID array in the RAID Management section. All the disks appear in the Disks section and they're all green in the SMART section.
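
    The first thing I'm going to try from the shell is asking mdadm what it sees on each member and then letting it attempt a reassembly (the device names are a guess at this point):

    Code
    mdadm --examine /dev/sd[bcde]
    mdadm --assemble --scan --verbose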
