clean, degraded array state after only a couple weeks of setting it up

    • I just set up OMV in a VM under ESXi a couple of weeks ago, and the RAID array state is now showing "clean, degraded". I'm using 4 x 8TB WD80EFZX 5400RPM drives in RAID10, and I'm not sure what happened to cause this. I finished copying all my data over a couple of days ago; about 3TB of the 14.4TB is in use. Here's the output from the various commands requested in the pinned post. If you need any more information, please let me know. The SMART status of all the drives shows green/good.

      cat /proc/mdstat

      Source Code

      Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
      md0 : active raid10 sde[3] sdc[1]
            15627790336 blocks super 1.2 512K chunks 2 near-copies [4/2] [_U_U]
            bitmap: 117/117 pages [468KB], 65536KB chunk
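      From what I've read since (I'm new to this, so take it with a grain of salt): in "[4/2] [_U_U]", the 4 is the number of slots in the array, the 2 is the number currently active, and each "_" is a missing member, so two of my four disks have dropped out of the array. A quick way to see the per-slot state:

      Source Code

      # shows the same counters as the OMV Detail popup, plus which slots are empty
      mdadm --detail /dev/md0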

      blkid

      Source Code

      /dev/sda1: UUID="5fd7c9d7-d9b4-4c03-ba51-7017ae8018fa" TYPE="ext4" PARTUUID="9856ddaf-01"
      /dev/sda5: UUID="1671fa99-4244-4fab-9cad-975e63b1b012" TYPE="swap" PARTUUID="9856ddaf-05"
      /dev/sdc: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="3e5f75fb-76e5-6ebb-bd0c-77473af3ad0f" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
      /dev/sdb: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="56d31a5b-4c6d-258d-b997-5e807d463250" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
      /dev/sdd: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="944b81c7-e25c-988d-971f-2d1f5a3cc058" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
      /dev/md0: LABEL="acmraid" UUID="3301097f-2458-4e36-94e5-e633cd21dfcc" TYPE="ext4"
      /dev/sde: UUID="d856f092-8499-7949-b3f9-705e26a12002" UUID_SUB="6158fc62-6e62-d339-ecb2-ca05b94822b9" LABEL="acmomv:acmraid" TYPE="linux_raid_member"
      fdisk -l | grep "Disk "

      Source Code

      Disk /dev/sda: 8 GiB, 8589934592 bytes, 16777216 sectors
      Disk identifier: 0x9856ddaf
      Disk /dev/sdc: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
      Disk /dev/sdb: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
      Disk /dev/sdd: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
      Disk /dev/sde: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
      Disk /dev/md0: 14.6 TiB, 16002857304064 bytes, 31255580672 sectors
      cat /etc/mdadm/mdadm.conf

      Source Code

      # mdadm.conf
      #
      # Please refer to mdadm.conf(5) for information about this file.
      #
      # by default, scan all partitions (/proc/partitions) for MD superblocks.
      # alternatively, specify devices to scan, using wildcards if desired.
      # Note, if no DEVICE line is present, then "DEVICE partitions" is assumed.
      # To avoid the auto-assembly of RAID devices a pattern that CAN'T match is
      # used if no RAID devices are configured.
      DEVICE partitions
      # auto-create devices with Debian standard permissions
      CREATE owner=root group=disk mode=0660 auto=yes
      # automatically tag new arrays as belonging to the local system
      HOMEHOST <system>
      # definitions of existing MD arrays
      ARRAY /dev/md0 metadata=1.2 name=acmomv:acmraid UUID=d856f092:84997949:b3f9705e:26a12002
      # instruct the monitoring daemon where to send mail alerts
      MAILADDR example@domain.com
      mdadm --detail --scan --verbose

      Source Code

      ARRAY /dev/md0 level=raid10 num-devices=4 metadata=1.2 name=acmomv:acmraid UUID=d856f092:84997949:b3f9705e:26a12002
         devices=/dev/sdc,/dev/sde
      I noticed that when I click Details in the RAID Management section, I see the following. What does "removed" mean?

      Source Code

      Number   Major   Minor   RaidDevice   State
         -       0       0        0         removed
         1       8      32        1         active sync set-B   /dev/sdc
         -       0       0        2         removed
         3       8      64        3         active sync set-B   /dev/sde

      I also notice that my read/write speeds are nothing to rave about: about 67 MB/s write and 95 MB/s read on average. But since this is the first time I've ever set up RAID, I don't know what to expect. Any help is appreciated.
    • Could you be more specific about the hardware?

      Mobo, CPU, RAM ...

      We can't guess, and we can't compare a VM against something real without details.
      ---------------------------------------------------------------------------------------------------------------------
      French, so forgive my English
      Personal Rig: valid.x86.fr/v72uek as a test bench with Oracle VM.
      And YES, my avatar is real; I fly paragliders at St Hilaire du Touvet and at the Coupe Icare.
    • stratege1401 wrote:

      Could you be more specific about the hardware?

      Mobo, CPU, RAM ...

      We can't guess, and we can't compare a VM against something real without details.
      Sorry, I forgot that while adding all the other details.

      ESXi is running on a Dell PowerEdge T110 II with a Xeon E3-1230 V2 and 32GB of ECC RAM. The OMV VM has 2GB of RAM and 4 vCPUs. The four 8TB HDDs are connected via an LSI 9211-8i HBA in IT mode, set to direct passthrough to the VM.

      EDIT: What exactly does "degraded" mean in OMV?
    • Degraded means something is wrong.

      It could be hardware or data related (for example, a data sync that has not finished).

      In your case, I have seen posts about the need to convert the LSI 9211-8i HBA card to Initiator Target (IT) mode under some Linux distros (also under FreeNAS) in order to have all the HDDs show up. This is done through the LSI BIOS. I also remember a minimum firmware level to respect ... but I can't remember where ...

      Found this. It's just troubleshooting (a quick Linux-side check is sketched after the list):
      • Is your LSI 9211-8i controller healthy?
      • When the <<<Press Ctrl-C to start LSI Logic Configuration Utility>>> prompt is displayed, press Ctrl+C.
      • View the disks in the MPT SAS BIOS utility. Select "SAS Topology" ... Do you see your disks? (very important)
      • Try different cables. (important)
      • Try a different PCIe slot.
      • Are the disks known to be good?
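      And the Linux-side check, assuming OMV loaded the stock mpt2sas driver for the 9211-8i (it should have):

      Source Code

      # shows the HBA firmware/BIOS versions and whether all four disks
      # were detected at boot
      dmesg | grep -i mpt2sas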

    • As I said in my earlier post, the controller is already in IT (Initiator Target) mode; I purchased it like that. Also, I was able to see all 4 disks in OMV when I initially set up the RAID array, and I can still see all 4 disks in the Disks section of OMV.

      In the Detail popup of the RAID Management section, I see the following. What does it mean when it says "Raid Devices: 4" but "Total Devices: 2", and below that "Active Devices: 2" and "Working Devices: 2", yet "Failed Devices" says 0?


      Source Code

      Version : 1.2
      Creation Time : Thu Dec 14 00:51:03 2017
      Raid Level : raid10
      Array Size : 15627790336 (14903.82 GiB 16002.86 GB)
      Used Dev Size : 7813895168 (7451.91 GiB 8001.43 GB)
      Raid Devices : 4
      Total Devices : 2
      Persistence : Superblock is persistent
      Intent Bitmap : Internal
      Update Time : Sat Jan 27 16:53:18 2018
      State : clean, degraded
      Active Devices : 2
      Working Devices : 2
      Failed Devices : 0
      Spare Devices : 0
      I also checked the SMART logs and I see a couple of entries like this:

      Source Code

      ATA error count increased from 64475 to 64477
      and this one on a different HDD:

      Source Code

      SMART Usage Attribute: 199 UDMA_CRC_Error_Count changed from 200 to 198
      I think the problem is that I don't know what I SHOULD be seeing in OMV. What should Raid Devices, Total Devices, Active Devices and Working Devices show in a properly healthy setup? Should all the devices be listed in the RAID Management section? This is all new to me, so I don't have a benchmark for what is OK and what isn't. Why is everything seemingly still working? The only reason I noticed the "clean, degraded" state was that the read/write speeds were a tad slow and I did some digging; but other than a few non-obvious warning flags, the OMV UI looks like business as usual. There's no big red warning saying one or more drives are failing or that something is wrong. I had to dig into the details. That seems kind of backwards to me.
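      EDIT: Partially answering my own question after more reading (hedging, since I'm still new to mdadm): on a healthy array, all four of those counters should read 4, Failed Devices should be 0, and no slot should say "removed"; "removed" just means that slot currently has no member device. To compare the metadata on the active members against the dropped ones, this looks useful:

      Source Code

      # print each member's RAID superblock summary; Events counters that are
      # close together mean the dropped disks are not far behind the active ones
      mdadm --examine /dev/sd[bcde] | grep -E 'Events|Array State'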


    • I just tested swapping the SATA connectors on the SFF-8087 cable I have, and it does look like a faulty cable.

      Before swap:

      Port   Dev        Status
      0      /dev/sdb   Removed
      1      /dev/sdc   Connected
      2      /dev/sdd   Removed
      3      /dev/sde   Connected



      After swap:

      Port   Dev        Status
      0      /dev/sde   Removed
      1      /dev/sdd   Connected
      2      /dev/sdc   Removed
      3      /dev/sdb   Connected



      Based on that alone, I think I can safely say it's a problem with the cable.
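      From what I've been reading, once the cabling is sorted out, the dropped members can be re-attached so the array resyncs. A sketch of what I plan to try, where sdX and sdY stand for whichever device names the two missing members have after the swap (worth re-checking with blkid first):

      Source Code

      # --re-add attempts a quick catch-up resync using the write-intent bitmap
      mdadm /dev/md0 --re-add /dev/sdX /dev/sdY
      # if that is refused, a plain --add rebuilds the members from scratch
      # mdadm /dev/md0 --add /dev/sdX /dev/sdY
      # then watch the recovery
      cat /proc/mdstat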

      EDIT: I'm also concerned about the SMART errors in the logs. Any thoughts on that?
    • So step four was right! Don't worry about the UDMA_CRC error; that one is usually a bad SATA cable, or a SATA cable wrapped around a power wire. ATA errors can be caused by things other than the drives: cables, power, etc. The two errors are linked.
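      You can watch that counter to confirm the fix; a minimal check, assuming smartmontools is installed (OMV uses it for its SMART page) and /dev/sdc is one of the disks on the suspect cable:

      Source Code

      # attribute 199 counts CRC errors on the link; its raw value never resets,
      # but it should stop climbing once the bad cable is out
      smartctl -A /dev/sdc | grep UDMA_CRC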
    • If you want to laugh at me: I've just experienced a bad SATA cable in my own RAID ...

      After cleaning tons of dust out of my tower, I started swapping out one of my defective test HDDs, taking the components apart (which meant letting the cables drop onto the floor).

      After reassembling the tower, the new disk was not showing up. I spent 20 minutes trying to understand what the hell was going on!

      Well, the explanation was simple: a bad cable, due to a kitten attack!!!

      BEWARE !!!

      [IMG:https://zupimages.net/up/18/04/q7p2.jpg]
    • Haha. Gotta keep those kittens clear of the equipment!

      Thanks again for the help in troubleshooting! My biggest concern is that there were little to no warning flags in the OMV UI. I would expect that if one (or more) drives were failing or not connected, OMV would be screaming bloody murder. Instead I had to do a whole bunch of investigation, and even then it really wasn't obvious what had occurred. The shares were still working, which was good, but it would be nice to have some big red warnings next time. Any idea whether this is just some configuration I need to set? I'm still new to OMV.
    • Well, the notifications are sent to me via mail, on my smartphone, so I can react to RAID/SMART/filesystem/CPU/load events ...
      You need to configure your notification mail server ...
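      Once the mail server is set up, you can test the whole chain; mdadm has a built-in test alert that mails the MAILADDR from mdadm.conf:

      Source Code

      # sends a TestMessage event for each array, then exits
      mdadm --monitor --scan --test --oneshot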
    • Multiple causes:

      - bad SATA cable
      - PSU undervoltage
      - too many HDDs on the same power line
      - cats
      - bad SATA controller (defective or overheating)
      - bad BIOS or bad AHCI config; check your BIOS settings (be sure to be in AHCI mode, check any BIOS RAID config, follow the logical HDD numbering)

      Due to the results of your cable switch, I don't think the HDDs are faulty, and I don't think the array itself is defective. If you have an extra fan, try to cool your SATA chip ... sorry, not much more help here, but it clearly looks like a hardware fault.
    • Is it possible I accidentally/unintentionally "removed" the disks from the array in OMV? I mean, I don't remember doing it (or even know HOW to do it, for that matter), but is that a possibility?

      If it was faulty hardware or a bad BIOS config, why would OMV still be able to see all 4 disks and read various information off them (SMART data, size, etc.)?
    • Each HDD under Linux gets a UUID, which is linked to the sdX naming. When switching SATA cables, you might have screwed up the pairing:
      Sataconn1---uuid455---sda
      Sataconn2---uuid788---sdb
      is not the same as
      Sataconn1---uuid788---sda ...
      So it might be possible ...
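      One way to keep track of the drives regardless of the sdX shuffling: the /dev/disk/by-id names embed the model and serial number, so they follow the physical disk rather than the cable:

      Source Code

      # stable names; match these against the serial numbers on the drive labels
      ls -l /dev/disk/by-id/ | grep -v part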
    • Well, I meant "removed" from the UI options. The removed state was showing before I touched any of the cables. That being said, I did switch the cables afterward, and the sda, sdb, sdc ... mappings do seem different now, but 2 drives are always removed. When I switched the SATA cables, I checked which HDD serial numbers were mapped to which port, so the mapping was unambiguous.

      Is there anything I can do now?

      Can I delete the RAID Array and re-add the drives and still keep all the data on the drives?
    • I don't think it's a power issue, because the tower (a Dell PowerEdge T110 II) has 4 HDD bays plus 2 bays at the top for more HDDs or optical drives, so it's built to support that many devices. I don't think it's a problem with the cable, since I already replaced it. I don't think it's the LSI 9211-8i card, because its BIOS page can see all the drives, and OMV can see all the drives and show the SMART status for each of them. And when I initially set up the RAID array, I was able to select all 4 drives and create the RAID10 array. Sure, it's possible some hardware failed afterward, but based on the information I have, I really don't think that's the case. At this point I'm kind of out of solutions, as I don't know enough about OMV or mdadm yet and I haven't found a fix. The HDD all my data was previously on was failing badly; I barely managed to copy the data off it, and I no longer have a separate HDD that can hold the 3TB I copied to the RAID array. So this is my plan:

      1. Buy a new 4TB drive (or maybe a fifth 8TB WD80EFZX, as a spare for when one of the drives in the array fails).
      2. Copy all the data from the RAID array to the new drive temporarily.
      3. Kill the RAID array and recreate it.
      4. Copy all the data back.
      If you have any other suggestions before I do this, please let me know.
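      For step 2, a sketch of the copy I have in mind, with /srv/raid and /srv/backup as hypothetical mount points for the array and the temporary drive:

      Source Code

      # /srv/raid and /srv/backup are hypothetical mount points
      # -a preserves permissions and timestamps, -H keeps hard links
      rsync -aH --progress /srv/raid/ /srv/backup/
      # before killing the array, verify the copy by checksum (slow but safe)
      rsync -aHc --dry-run --itemize-changes /srv/raid/ /srv/backup/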
    • stratege1401 wrote:

      Well, I am not sure. And today I am away, so I can't test on my VM.
      Well ... I finally got a spare HDD to copy all the files from the RAID array, and about 10% into the copy, it seems like the RAID drives are failing. The shared drive is offline, and when I access the File Systems or RAID Management sections in the OMV UI, I get "Communication Error".

      I feel like a drive (or more than one) has failed. How can I confirm?

      I've attached the output from the VM while it was trying to shut down -- it's been a long time and it still hasn't managed to shut down.

      EDIT: After rebooting OMV, it seemed to start up, but there's no RAID array in the RAID Management section. All the disks appear in the Disks section, and they're all green in the SMART section.
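      Before writing the drives off, I'm going to check whether the RAID superblocks are still intact and whether the array can be reassembled by hand; a sketch, assuming the four disks still enumerate as /dev/sdb through /dev/sde:

      Source Code

      # check whether each member still carries a valid RAID superblock
      mdadm --examine /dev/sd[bcde]
      # try to reassemble whatever mdadm can find, with details on any failure
      mdadm --assemble --scan --verbose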
      [Attachment: omv-shutdown.PNG, 868×650]
