RAID10 fails with “not enough operational mirrors” with 10 of 12 drives available.

  • RAID10 Fails w/ “not enough operational mirrors.”

    Issue Description:
    Not sure why this happened, but we suspect a power outage or brown-out. One of our OMV RAID10 arrays went down, and since there were no HDD failures we tried a reboot; the dmesg log then showed one of my iSCSI arrays failing to start. Attempts to initiate a re-assemble also failed, with errors as shown below...


    Our Environment:

    We have a Supermicro High-Density Storage Server with two 1TB SATA SSD drives for the OS, and 72 4TB HDDs in 36 dual hot-swap bays. There are 24 drives per LSI Logic SAS controller, configured as six RAID10 arrays.


    We are running OpenMediaVault 1.19 (Kralizec).

    The arrays are configured as follows....


    The 2 SSD drives are stand-alone ext4 partitions:

    /dev/sda1 - For the OS
    /dev/sdb1 - For OS storage.


    The RAID arrays and their associated LUNs and types are as follows...

    /dev/md0 - LUN1 - ext4 - NFS Share
    /dev/md1 - LUN2 - ext4 - NFS Share
    /dev/md2 - LUN3 - ext4 - SMB Share
    /dev/md3 - LUN4 - ext4 - SMB Share
    /dev/md4 - LUN5 - ext4 - iSCSI Share (This is my problem Array)
    /dev/md5 - LUN6 - ext4 - iSCSI Share


    ----------------
    Logs and pertinent information.

    1.) Boot-up Dmesg log:


    [ 16.954740] md: md4 stopped.
    ....
    [ 16.960695] md/raid10:md4: not enough operational mirrors.
    [ 16.960775] md: pers->run() failed ...


    2.) mdadm.conf


    :~# cat /etc/mdadm/mdadm.conf
    # mdadm.conf
    .......
    # definitions of existing MD arrays
    ARRAY /dev/md0 metadata=1.2 name=hydromediavault:vol1 UUID=1cfbe551:59608320:d05a6c0b:36472514
    ARRAY /dev/md1 metadata=1.2 name=hydromediavault:vol2 UUID=92786b08:2998971f:e43b629f:9fae9d5c
    ARRAY /dev/md2 metadata=1.2 name=hydromediavault:vol3 UUID=2e027586:e0836061:8c51d19a:25e1de4e
    ARRAY /dev/md3 metadata=1.2 name=hydromediavault:vol4 UUID=7142528a:142b2fcf:1864bd51:ab53d7c1
    ARRAY /dev/md4 metadata=1.2 name=hydromediavault:vol5 UUID=f8964aaf:801f5634:e358c097:1d146306
    ARRAY /dev/md5 metadata=1.2 name=hydromediavault:vol6 UUID=a42c055f:a4c7b1c4:ab83d732:80b2b780
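    (Side note: the UUID passed to the assemble attempts later in this thread comes straight from these ARRAY lines. A small sketch of pulling device-to-UUID mappings out of mdadm.conf — my own illustration in Python, not part of mdadm itself:)

```python
# Sketch (illustration only): map each md device in an mdadm.conf to its
# array UUID, so the right value can be passed to "mdadm --assemble --uuid=".
def array_uuids(conf_text):
    """Parse ARRAY lines from mdadm.conf text into {device: uuid}."""
    uuids = {}
    for line in conf_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "ARRAY":
            dev = parts[1]
            for field in parts[2:]:
                if field.startswith("UUID="):
                    uuids[dev] = field[len("UUID="):]
    return uuids

conf = "ARRAY /dev/md4 metadata=1.2 name=hydromediavault:vol5 UUID=f8964aaf:801f5634:e358c097:1d146306"
print(array_uuids(conf)["/dev/md4"])  # f8964aaf:801f5634:e358c097:1d146306
```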


    3.) Stopped the array, then checked disk status...


    :~# smartctl -d scsi -a /dev/sdaw | grep "Status"
    SMART Health Status: OK


    :~# smartctl -d scsi -a /dev/sdax | grep "Status"
    SMART Health Status: OK

    4.) Ran the following to start the array


    :~# mdadm --assemble -v --scan --force --run --uuid=f8964aaf:801f5634:e358c097:1d146306


    ...key results...


    mdadm: failed to RUN_ARRAY /dev/md4: Input/output error
    mdadm: Not enough devices to start the array.


    5.) Results of mdstat...


    :~# cat /proc/mdstat for /dev/md4


    Personalities : [raid10]
    md4 : inactive sdav[2] sdam[11] sdan[10] sdao[9] sdap[8] sdaq[7] sdar[6] sdas[5] sdat[4] sdau[3]
    39068875120 blocks super 1.2


    6.) Results of mdadm examine showing problem with HD devices /dev/sdaw & sdax...

    Drives /dev/sda[mnopqrstuv] - State OK.


    mdadm: No md superblock detected on /dev/sdaw.


    mdadm: No md superblock detected on /dev/sdax.
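    (For anyone wanting to double-check this at a lower level: for metadata version 1.2, the md superblock sits 4 KiB from the start of the member device and begins with the magic value 0xa92b4efc. A read-only sketch of that check in Python — an illustration, not an mdadm tool, and be careful pointing anything at raw block devices:)

```python
import struct

MD_SB_MAGIC = 0xa92b4efc   # md superblock magic (little-endian on disk)
SB_1_2_OFFSET = 4096       # v1.2 superblock sits 4 KiB into the device

def has_md12_superblock(path):
    """Return True if an md v1.2 superblock magic is present at 4 KiB."""
    with open(path, "rb") as dev:
        dev.seek(SB_1_2_OFFSET)
        raw = dev.read(4)
    if len(raw) < 4:
        return False
    return struct.unpack("<I", raw)[0] == MD_SB_MAGIC

# e.g. has_md12_superblock("/dev/sdaw") returning False would match the
# "No md superblock detected" output above (device names from this thread).
```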


    7.) Tried to fail both drives, then remove and re-add them, with the following...


    mdadm /dev/md4 --fail /dev/sdaw
    mdadm: set device faulty failed for /dev/sdaw: No such device


    mdadm /dev/md4 --remove /dev/sdaw
    mdadm: hot remove failed for /dev/sdaw: No such device or address


    mdadm /dev/md4 --fail /dev/sdax
    mdadm: set device faulty failed for /dev/sdax: No such device


    mdadm /dev/md4 --remove /dev/sdax
    mdadm: hot remove failed for /dev/sdax: No such device or address


    mdadm --add /dev/md4 /dev/sdaw
    mdadm: /dev/md4 has failed so using --add cannot work and might destroy
    mdadm: data on /dev/sdaw. You should stop the array and re-assemble it.


    mdadm --add /dev/md4 /dev/sdax
    mdadm: /dev/md4 has failed so using --add cannot work and might destroy
    mdadm: data on /dev/sdax. You should stop the array and re-assemble it.


    8.) Second assemble attempt....


    mdadm --assemble --force /dev/md4 /dev/sda[utsrqponmvwx]
    ...
    mdadm: no recogniseable superblock on /dev/sdaw
    mdadm: /dev/sdaw has no superblock - assembly aborted


    9.) Drives show as removed from the array....


    mdadm -D /dev/md4
    /dev/md4:
    Version : 1.2
    Creation Time : Tue Mar 31 13:02:06 2015
    Raid Level : raid10
    Used Dev Size : -1
    Raid Devices : 12
    Total Devices : 10
    Persistence : Superblock is persistent


    Update Time : Thu Jun 4 18:24:15 2015
    State : active, FAILED, Not Started
    Active Devices : 10
    Working Devices : 10
    Failed Devices : 0
    Spare Devices : 0


    Layout : near=2
    Chunk Size : 512K


    Name : hydromediavault:vol5 (local to host hydromediavault)
    UUID : f8964aaf:801f5634:e358c097:1d146306
    Events : 239307


    Number Major Minor RaidDevice State
    0 0 0 0 removed
    1 0 0 1 removed
    2 66 240 2 active sync /dev/sdav
    3 66 224 3 active sync /dev/sdau
    4 66 208 4 active sync /dev/sdat
    5 66 192 5 active sync /dev/sdas
    6 66 176 6 active sync /dev/sdar
    7 66 160 7 active sync /dev/sdaq
    8 66 144 8 active sync /dev/sdap
    9 66 128 9 active sync /dev/sdao
    10 66 112 10 active sync /dev/sdan
    11 66 96 11 active sync /dev/sdam


    THE QUESTIONS:

    #1: If RAID10 arrays can lose up to 2 drives and still be operational, why will a 12 disk RAID10 array not start up w/ 10 good drives? - Is there a way to force this?
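    (Thinking about #1 some more: I suspect the answer is mirror pairing. In md's default near=2 layout, consecutive raid-device slots — (0,1), (2,3), and so on — hold copies of the same data, so losing slots 0 and 1 together, as here, destroys both copies of one stripe set even with 10 of 12 drives present. A rough Python sketch of that logic — my own illustration, not mdadm code:)

```python
# Sketch (illustration, not mdadm source): in the default near=2 layout,
# consecutive slots form mirror groups of `near` devices each.
def raid10_can_start(raid_devices, failed_slots, near=2):
    """Return True if every mirror group still has at least one member."""
    for group in range(raid_devices // near):
        members = range(group * near, (group + 1) * near)
        if all(slot in failed_slots for slot in members):
            return False  # both copies of this stripe set are gone
    return True

# A 12-disk array missing slots 0 and 1 (one whole pair) cannot start:
print(raid10_can_start(12, {0, 1}))   # False
# ...while two failures in different pairs would survive:
print(raid10_can_start(12, {0, 2}))   # True
```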

    #2: With the 2 superblock-missing drives showing as OK at the hardware level (not failed), why can I not fail and remove them, then re-add them to the array for re-assembly?


    Any insight would be very helpful. I have key data on this array that is otherwise unrecoverable. Please Help!

  • Did you ever stop md4 before assembling: mdadm --stop /dev/md4?
    Do sdaw and sdax show up in fdisk -l and blkid?


    You have a system this large with this many drives and no backup??

    omv 5.5.11 usul | 64 bit | 5.4 proxmox kernel | omvextrasorg 5.3.6
    omv-extras.org plugins source code and issue tracker - github


    Please read this before posting a question.
    Please don't PM for support... Too many PMs!

  • Hello 'ryecoaaron'

    Thank you for your reply. To answer your questions....


    1. Did I run mdadm --stop /dev/md4? Yes, but the re-assemble would not even show any sign of starting unless I used the --scan switch, as in the following command and its results...


    :~# mdadm --assemble -v --scan --force --run --uuid=f8964aaf:801f5634:e358c097:1d146306


    ...key results...


    mdadm: /dev/sdav is identified as a member of /dev/md4, slot 2.
    mdadm: /dev/sdau is identified as a member of /dev/md4, slot 3.
    mdadm: /dev/sdat is identified as a member of /dev/md4, slot 4.
    mdadm: /dev/sdas is identified as a member of /dev/md4, slot 5.
    mdadm: /dev/sdar is identified as a member of /dev/md4, slot 6.
    mdadm: /dev/sdaq is identified as a member of /dev/md4, slot 7.
    mdadm: /dev/sdap is identified as a member of /dev/md4, slot 8.
    mdadm: /dev/sdao is identified as a member of /dev/md4, slot 9.
    mdadm: /dev/sdan is identified as a member of /dev/md4, slot 10.
    mdadm: /dev/sdam is identified as a member of /dev/md4, slot 11.
    mdadm: no uptodate device for slot 0 of /dev/md4
    mdadm: no uptodate device for slot 1 of /dev/md4
    mdadm: added /dev/sdau to /dev/md4 as 3
    mdadm: added /dev/sdat to /dev/md4 as 4
    mdadm: added /dev/sdas to /dev/md4 as 5
    mdadm: added /dev/sdar to /dev/md4 as 6
    mdadm: added /dev/sdaq to /dev/md4 as 7
    mdadm: added /dev/sdap to /dev/md4 as 8
    mdadm: added /dev/sdao to /dev/md4 as 9
    mdadm: added /dev/sdan to /dev/md4 as 10
    mdadm: added /dev/sdam to /dev/md4 as 11
    mdadm: added /dev/sdav to /dev/md4 as 2
    mdadm: failed to RUN_ARRAY /dev/md4: Input/output error
    mdadm: Not enough devices to start the array.


    2. Yes, the drives show up under fdisk -l...


    :~# fdisk -l


    ...Key results...


    Disk /dev/sdaw: 4000.8 GB, 4000787030016 bytes
    255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0xb23c4c98


    Disk /dev/sdaw doesn't contain a valid partition table


    Disk /dev/sdax: 4000.8 GB, 4000787030016 bytes
    255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0xc99768eb


    Disk /dev/sdax doesn't contain a valid partition table


    ...But not under the blkid command.


    :~# blkid


    ....the drives on either side of the "sdaw" & "sdax" assignments appear, but sdaw and sdax themselves do not....


    /dev/sdav: UUID="f8964aaf-801f-5634-e358-c0971d146306" UUID_SUB="2566b614-2e8b-d9b7-325f-1ce452f7a7f0" LABEL="hydromediavault:vol5" TYPE="linux_raid_member"
    /dev/sday: UUID="7142528a-142b-2fcf-1864-bd51ab53d7c1" UUID_SUB="90a249cd-942b-fc3b-1cbc-b06e10977137" LABEL="hydromediavault:vol4" TYPE="linux_raid_member"


    3. As far as a backup solution? I get that question, as I work a lot with EMC's Avamar product.
    This system, however, is not being backed up yet: it's a relatively new install and the client has not yet purchased a backup solution. I warned them this could happen, since the storage server is a single point of failure, so this is damage control in a FUBAR situation.


    I do have one interesting command result that looks promising, however. (Once again, understand I am used to hardware RAID and am relatively new to mdadm and Linux software RAID.) It is the following...


    :~# mdadm --monitor /dev/md4
    mdadm: Warning: One autorebuild process already running.


    ...Could this mean the Array is rebuilding before it comes back online? 8)
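    (From what I've since read — and I'm hedging here — that warning appears to just mean another mdadm monitor daemon is already running, not that a rebuild is in progress. An actual resync/recovery shows up as a progress line in /proc/mdstat. A small parser sketch against a made-up sample, for illustration only:)

```python
import re

def rebuild_progress(mdstat_text, array="md4"):
    """Return the resync/recovery percentage for an array, or None if idle."""
    in_array = False
    for line in mdstat_text.splitlines():
        if line.startswith(array + " :"):
            in_array = True
        elif in_array:
            m = re.search(r"(recovery|resync)\s*=\s*([\d.]+)%", line)
            if m:
                return float(m.group(2))
            if not line.strip():
                in_array = False  # blank line ends this array's stanza
    return None

# Hypothetical sample output (invented for illustration, not from this system):
sample = """md4 : active raid10 sdav[2] sdau[3]
      23441324544 blocks super 1.2 512K chunks 2 near-copies
      [=>...................]  recovery = 7.5% (879/11720) finish=200min
"""
print(rebuild_progress(sample))  # 7.5
```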


    Please any insight would be enormously helpful. Thanks in advance for the assistance here. :thumbup:

  • I never use scan. Try:


    mdadm --stop /dev/md4
    mdadm --assemble /dev/md4 /dev/sda[mnopqrstuvwx] --verbose --force


  • Hello "Dropkick Murphy"


    The command you mentioned shows the following....


    mdadm -D /dev/md4
    /dev/md4:
    Version : 1.2
    Creation Time : Tue Mar 31 13:02:06 2015
    Raid Level : raid10
    Used Dev Size : -1
    Raid Devices : 12
    Total Devices : 10
    Persistence : Superblock is persistent


    Update Time : Thu Jun 4 18:24:15 2015
    State : active, FAILED, Not Started
    Active Devices : 10
    Working Devices : 10
    Failed Devices : 0
    Spare Devices : 0


    Layout : near=2
    Chunk Size : 512K


    Name : hydromediavault:vol5 (local to host hydromediavault)
    UUID : f8964aaf:801f5634:e358c097:1d146306
    Events : 239307


    Number Major Minor RaidDevice State
    0 0 0 0 removed
    1 0 0 1 removed
    2 66 240 2 active sync /dev/sdav
    3 66 224 3 active sync /dev/sdau
    4 66 208 4 active sync /dev/sdat
    5 66 192 5 active sync /dev/sdas
    6 66 176 6 active sync /dev/sdar
    7 66 160 7 active sync /dev/sdaq
    8 66 144 8 active sync /dev/sdap
    9 66 128 9 active sync /dev/sdao
    10 66 112 10 active sync /dev/sdan
    11 66 96 11 active sync /dev/sdam

  • Hello 'ryecoaaron',


    I've actually tried this very approach, and multiple others, before I finally caved and posted to this forum. The system is not behaving as expected. I am logged in as the 'root' user, but I even tried these commands with 'sudo' and got the same results.


    I can stop the array just fine with the '--stop' command, but when I try to re-assemble the stopped array with the following command...


    :~# mdadm --assemble /dev/md4 /dev/sda[mnopqrstuvwx] --verbose --force


    ....It returns an error that the array is currently stopped.


    But when I add the --scan switch, as in the following.....


    :~# mdadm --assemble -v --scan --force --run --uuid=f8964aaf:801f5634:e358c097:1d146306


    .....it seems to start things up, but again returns the result I showed in my original post...


    mdadm: failed to RUN_ARRAY /dev/md4: Input/output error
    mdadm: Not enough devices to start the array.


    However, I wonder if I'm being impatient. I see the following when I run the '--monitor' switch....

    :~# mdadm --monitor /dev/md4
    mdadm: Warning: One autorebuild process already running.


    ...Could this mean the Array is rebuilding before it comes back online?

  • Hello All,


    In the interest of following the instructions given in the "Degraded raid array questions" sticky for the RAID forum, here are the results of the commands requested for down-array issues...
    NOTE: Due to post character limits, I cannot include the full output of some of these commands, so only the key results are given...

    1. First - cat /proc/mdstat



    2. Second - fdisk -l (Including only results for 2 affected disks.)


    3. Third - mdadm -D /dev/md4 (To demonstrate the issue at hand)


    4. Fourth - blkid (Showing results indicating these 2 drives do not show in the results of this command)



    5. As noted in a previous post in this thread, the 2 HDDs in question show SMART status OK.


    6. For the past few days now, mdadm --monitor has shown the following result...


    root@hostname:~# mdadm --monitor /dev/md4
    mdadm: Warning: One autorebuild process already running.


    Could the array be auto-rebuilding?

    I know it took 3 to 4 days for the array to initialize when I created it, since it's 12 x 4 TB drives (roughly 24 TB usable in RAID10).
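    (For reference, the sizing arithmetic for a near=2 RAID10 — data is mirrored once, so usable capacity is roughly raw capacity divided by two:)

```python
# Usable capacity of a 2-copy (near=2) RAID10: raw capacity / number of copies.
drives, size_tb, copies = 12, 4, 2
raw_tb = drives * size_tb
usable_tb = raw_tb / copies
print(raw_tb, usable_tb)  # 48 24.0
```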


    Thanks again in advance for any help. (I am not clear on which code box I should be using. The one I chose gave line breaks, where the other did not.)

    Thanks for the reply. I tried mdadm --monitor against one of my good RAID arrays and it showed the same result, as shown.... :(


    root@hydromediavault:~# mdadm --monitor /dev/md5
    mdadm: Warning: One autorebuild process already running.


    ....Any thoughts on what I can do to get the RAID array up, even in a crippled state?


    It's RAID10 backing an iSCSI target, if that is helpful, but I find it hard to understand why 10 out of 12 drives would not be enough to at least bring the array back up.


    Help Please.
