RAID10 Fails w/ “not enough operational mirrors.”
Issue Description:
We're not sure why this happened, but we suspect a power outage or brown-out. It took down one of our OMV RAID10 arrays. Since there were no HDD failures we tried a reboot, but the dmesg log showed one of our iSCSI arrays failing, and attempts to re-assemble it failed with the errors shown below...
Our Environment:
We have a Supermicro high-density storage server with two 1TB SATA SSDs for the OS and 72 4TB HDDs in 36 dual hot-swap bays. There are 24 drives per LSI Logic SAS controller, and the HDDs are configured as six RAID10 arrays.
We are running OpenMediaVault 1.19 (Kralizec).
The arrays are configured as follows....
The two SSDs are stand-alone ext4 partitions:
/dev/sda1 - For the OS
/dev/sdb1 - For OS storage.
The RAID arrays and their associated LUNs and share types are as follows...
/dev/md0 - LUN1 - ext4 - NFS Share
/dev/md1 - LUN2 - ext4 - NFS Share
/dev/md2 - LUN3 - ext4 - SMB Share
/dev/md3 - LUN4 - ext4 - SMB Share
/dev/md4 - LUN5 - ext4 - iSCSI Share (this is the problem array)
/dev/md5 - LUN6 - ext4 - iSCSI Share
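For reference, we normally sanity-check this mapping from the shell (assuming nothing exotic about the setup):
:~# cat /proc/mdstat
:~# blkid /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
blkid should report TYPE="ext4" for each md device.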
----------------
Logs and pertinent information.
1.) Boot-up dmesg log:
[ 16.954740] md: md4 stopped.
....
[ 16.960695] md/raid10:md4: not enough operational mirrors.
[ 16.960775] md: pers->run() failed ...
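(For anyone reproducing this, the md4-related messages can be isolated from the kernel log with:
:~# dmesg | grep -E 'md4|raid10'
)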
2.) mdadm.conf
:~# cat /etc/mdadm/mdadm.conf
# mdadm.conf
.......
# definitions of existing MD arrays
ARRAY /dev/md0 metadata=1.2 name=hydromediavault:vol1 UUID=1cfbe551:59608320:d05a6c0b:36472514
ARRAY /dev/md1 metadata=1.2 name=hydromediavault:vol2 UUID=92786b08:2998971f:e43b629f:9fae9d5c
ARRAY /dev/md2 metadata=1.2 name=hydromediavault:vol3 UUID=2e027586:e0836061:8c51d19a:25e1de4e
ARRAY /dev/md3 metadata=1.2 name=hydromediavault:vol4 UUID=7142528a:142b2fcf:1864bd51:ab53d7c1
ARRAY /dev/md4 metadata=1.2 name=hydromediavault:vol5 UUID=f8964aaf:801f5634:e358c097:1d146306
ARRAY /dev/md5 metadata=1.2 name=hydromediavault:vol6 UUID=a42c055f:a4c7b1c4:ab83d732:80b2b780
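The ARRAY lines above can be cross-checked against the on-disk superblocks of the member drives, which works even when an array is not running:
:~# mdadm --examine --scan
The UUIDs it prints should match mdadm.conf entry for entry.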
3.) Stopped the array, then checked disk status...
:~# smartctl -d scsi -a /dev/sdaw | grep "Status"
SMART Health Status: OK
:~# smartctl -d scsi -a /dev/sdax | grep "Status"
SMART Health Status: OK
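We only pasted two drives above; a loop like the following (assuming the twelve members really are /dev/sdam through /dev/sdax) checks them all:
:~# for d in /dev/sda[m-x]; do printf '%s: ' $d; smartctl -d scsi -H $d | grep -i status; done
None of the twelve reported a hardware failure.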
4.) Ran the following to start the array
:~# mdadm --assemble -v --scan --force --run --uuid=f8964aaf:801f5634:e358c097:1d146306
...key results...
mdadm: failed to RUN_ARRAY /dev/md4: Input/output error
mdadm: Not enough devices to start the array.
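To double-check which physical devices mdadm associates with each array UUID, there is also:
:~# mdadm --examine --scan -v
which appends a devices= list to each ARRAY line; for md4's UUID we would expect only the ten sda[m-v] drives to appear, matching step 6 below.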
5.) Results of mdstat...
:~# cat /proc/mdstat    (output trimmed to the /dev/md4 entry)
Personalities : [raid10]
md4 : inactive sdav[2] sdam[11] sdan[10] sdao[9] sdap[8] sdaq[7] sdar[6] sdas[5] sdat[4] sdau[3]
39068875120 blocks super 1.2
6.) Results of mdadm --examine, showing the problem with drives /dev/sdaw & /dev/sdax...
Drives /dev/sda[mnopqrstuv] - State OK.
mdadm: No md superblock detected on /dev/sdaw.
mdadm: No md superblock detected on /dev/sdax.
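To rule out a simple read glitch, the raw superblock area can be inspected directly. If we have the v1.2 layout right, the superblock sits 4 KiB from the start of the member device and begins with the magic number 0xa92b4efc (on disk, little-endian: fc 4e 2b a9):
:~# dd if=/dev/sdaw bs=4096 skip=1 count=1 2>/dev/null | hexdump -C | head -4
An intact member such as /dev/sdav should show the magic there; on sdaw and sdax it would presumably be absent, matching mdadm's report.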
7.) Tried to fail, remove, and then re-add both drives with the following...
:~# mdadm /dev/md4 --fail /dev/sdaw
mdadm: set device faulty failed for /dev/sdaw: No such device
:~# mdadm /dev/md4 --remove /dev/sdaw
mdadm: hot remove failed for /dev/sdaw: No such device or address
:~# mdadm /dev/md4 --fail /dev/sdax
mdadm: set device faulty failed for /dev/sdax: No such device
:~# mdadm /dev/md4 --remove /dev/sdax
mdadm: hot remove failed for /dev/sdax: No such device or address
:~# mdadm --add /dev/md4 /dev/sdaw
mdadm: /dev/md4 has failed so using --add cannot work and might destroy
mdadm: data on /dev/sdaw. You should stop the array and re-assemble it.
:~# mdadm --add /dev/md4 /dev/sdax
mdadm: /dev/md4 has failed so using --add cannot work and might destroy
mdadm: data on /dev/sdax. You should stop the array and re-assemble it.
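For what it's worth, the kernel's current view of the membership can be read from sysfs:
:~# ls /sys/block/md4/md/ | grep '^dev-'
We assume only the ten drives with superblocks appear there, which would explain why --fail and --remove answer "No such device" for sdaw and sdax: as far as md is concerned, they were never attached.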
8.) Second assemble attempt....
:~# mdadm --assemble --force /dev/md4 /dev/sda[utsrqponmvwx]
...
mdadm: no recogniseable superblock on /dev/sdaw
mdadm: /dev/sdaw has no superblock - assembly aborted
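We also considered stopping md4 and re-assembling from only the ten drives that still carry superblocks, but have held off in case it makes matters worse:
:~# mdadm --stop /dev/md4
:~# mdadm --assemble --force --run /dev/md4 /dev/sda[m-v]
(Presumably this still fails the same way if the two missing devices formed one complete mirror pair.)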
9.) Drives show as removed from the array....
:~# mdadm -D /dev/md4
/dev/md4:
Version : 1.2
Creation Time : Tue Mar 31 13:02:06 2015
Raid Level : raid10
Used Dev Size : -1
Raid Devices : 12
Total Devices : 10
Persistence : Superblock is persistent
Update Time : Thu Jun 4 18:24:15 2015
State : active, FAILED, Not Started
Active Devices : 10
Working Devices : 10
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 512K
Name : hydromediavault:vol5 (local to host hydromediavault)
UUID : f8964aaf:801f5634:e358c097:1d146306
Events : 239307
Number Major Minor RaidDevice State
0 0 0 0 removed
1 0 0 1 removed
2 66 240 2 active sync /dev/sdav
3 66 224 3 active sync /dev/sdau
4 66 208 4 active sync /dev/sdat
5 66 192 5 active sync /dev/sdas
6 66 176 6 active sync /dev/sdar
7 66 160 7 active sync /dev/sdaq
8 66 144 8 active sync /dev/sdap
9 66 128 9 active sync /dev/sdao
10 66 112 10 active sync /dev/sdan
11 66 96 11 active sync /dev/sdam
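Our possibly-wrong reading of the near=2 layout is that consecutive raid-device roles form the mirror pairs:
pair 0: roles 0, 1   <- both removed
pair 1: roles 2, 3
pair 2: roles 4, 5
pair 3: roles 6, 7
pair 4: roles 8, 9
pair 5: roles 10, 11
If that is right, the two absent devices were the two halves of the same mirror, not single drives from two different mirrors.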
THE QUESTIONS:
#1: If a RAID10 array can lose up to two drives and remain operational, why won't this 12-disk RAID10 array start with 10 good drives? Is there a way to force it?
#2: Given that the two drives missing their superblocks report healthy at the hardware level and are not failed, why can't we fail and remove them so they can be re-added to the array for re-assembly?
Any insight would be very helpful. We have key data on this array that is otherwise unrecoverable. Please help!
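One last-resort idea we have seen mentioned elsewhere, and have NOT tried, is re-creating the array in place with --assume-clean, which rewrites only the superblocks and leaves the data blocks alone, provided every parameter and the device order exactly match the original. A sketch, assuming sdax and sdaw occupied roles 0 and 1 (we are not certain of that order, though for a near=2 pair both halves should hold identical data):
:~# mdadm --stop /dev/md4
:~# mdadm --create /dev/md4 --assume-clean --level=10 --raid-devices=12 --layout=n2 --chunk=512 --metadata=1.2 /dev/sdax /dev/sdaw /dev/sdav /dev/sdau /dev/sdat /dev/sdas /dev/sdar /dev/sdaq /dev/sdap /dev/sdao /dev/sdan /dev/sdam
A wrong chunk size, layout, or device order here would scramble the filesystem, and the array gets a new UUID (so mdadm.conf would need regenerating), so we would want confirmation, and ideally copy-on-write overlays of the drives, before running anything like this. Is that a sane way forward?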