RAID10 Fails w/ “not enough operational mirrors.” w/ 10 of 12 available.

lrwaldon · 11. Juni 2015

RAID10 Fails w/ “not enough operational mirrors.”

Issue Description:
Not sure why this happened, but we suspect a power outage, or brown-out. This caused one of our OMV RAID10 arrays to go down, and when we tried a reboot because there were no HDD failures the dmesg log showed one of my iSCSI arrays failing. Also attempts to initiate a re-assemble failed with errors as shown below...

Our Environment:

We have a Supermicro High-Density Storage Server w/ two 1Tb SATA SSD drive for OS, and 72 4Tb HDDs in 36 dual hot-swap bays. There are 24 drives per LSI Logic SAS SCSI controler, configured w/ six RAID10 Arrays.

We are running OpenMediaVault 1.19 (Kralizec).

The arrays are configured as follows....

The 2 SSD drives as stand-alone ext4 paritions.

/dev/sda1 - For the OS
/dev/sdb1 - For OS storage.

The RAID Arrays and their associated LUN and types as as follows...

/dev/md0 - LUN1 - ext4 - NFS Share
/dev/md1 - LUN2 - ext4 - NFS Share
/dev/md2 - LUN3 - ext4 - SMB Share
/dev/md3 - LUN4 - ext4 - SMB Share
/dev/md4 - LUN5 - ext4 - iSCSI Share (This is my problem Array)
/dev/md5 - LUN6 - ext4 - iSCSI Share

----------------
Logs and pertinant information.

1.) Boot-up Dmesg log:

[ 16.954740] md: md4 stopped.
....
[ 16.960695] md/raid10:md4: not enough operational mirrors.
[ 16.960775] md: pers->run() failed ...

2.) mdadm.conf

:~# cat /etc/mdadm/mdadm.conf
# mdadm.conf
.......
# definitions of existing MD arrays
ARRAY /dev/md0 metadata=1.2 name=hydromediavault:vol1 UUID=1cfbe551:59608320:d05a6c0b:36472514
ARRAY /dev/md1 metadata=1.2 name=hydromediavault:vol2 UUID=92786b08:2998971f:e43b629f:9fae9d5c
ARRAY /dev/md2 metadata=1.2 name=hydromediavault:vol3 UUID=2e027586:e0836061:8c51d19a:25e1de4e
ARRAY /dev/md3 metadata=1.2 name=hydromediavault:vol4 UUID=7142528a:142b2fcf:1864bd51:ab53d7c1
ARRAY /dev/md4 metadata=1.2 name=hydromediavault:vol5 UUID=f8964aaf:801f5634:e358c097:1d146306
ARRAY /dev/md5 metadata=1.2 name=hydromediavault:vol6 UUID=a42c055f:a4c7b1c4:ab83d732:80b2b780

3.) Stopped the array, then checked disk status...

:~# smartctl -d scsi -a /dev/sdaw | grep "Status"
SMART Health Status: OK

:~# smartctl -d scsi -a /dev/sdax | grep "Status"
SMART Health Status: OK

4.) Ran the following to start the array

:~# mdadm --assemble -v --scan --force --run --uuid=f8964aaf:801f5634:e358c097:1d146306

...key results...

mdadm: failed to RUN_ARRAY /dev/md4: Input/output error
mdadm: Not enough devices to start the array.

5.) Results of mdstat...

:~# cat /proc/mdstat for /dev/md4

Personalities : [raid10]
md4 : inactive sdav[2] sdam[11] sdan[10] sdao[9] sdap[8] sdaq[7] sdar[6] sdas[5] sdat[4] sdau[3]
39068875120 blocks super 1.2

6.) Results of mdadm examine showing problem with HD devices /dev/sdaw & sdax...

Drives /dev/sda[mnopqrstuv] - State OK.

mdadm: No md superblock detected on /dev/sdaw.

mdadm: No md superblock detected on /dev/sdax.

7.) Tried to fail both drives then remove then re-add with the following...

mdadm /dev/md4 --fail /dev/sdaw
mdadm: set device faulty failed for /dev/sdaw: No such device

mdadm /dev/md4 --remove /dev/sdaw
mdadm: hot remove failed for /dev/sdaw: No such device or address

mdadm /dev/md4 --fail /dev/sdax
mdadm: set device faulty failed for /dev/sdax: No such device

mdadm /dev/md4 --remove /dev/sdax
mdadm: hot remove failed for /dev/sdax: No such device or address

mdadm --add /dev/md4 /dev/sdaw
mdadm: /dev/md4 has failed so using --add cannot work and might destroy
mdadm: data on /dev/sdaw. You should stop the array and re-assemble it.

mdadm --add /dev/md4 /dev/sdax
mdadm: /dev/md4 has failed so using --add cannot work and might destroy
mdadm: data on /dev/sdax. You should stop the array and re-assemble it.

8.) Second assemble attempt....

mdadm --assemble --force /dev/md4 /dev/sda[utsrqponmvwx]
...
mdadm: no recogniseable superblock on /dev/sdaw
mdadm: /dev/sdaw has no superblock - assembly aborted

9.) Drives show as removed from the array....

mdadm -D /dev/md4
/dev/md4:
Version : 1.2
Creation Time : Tue Mar 31 13:02:06 2015
Raid Level : raid10
Used Dev Size : -1
Raid Devices : 12
Total Devices : 10
Persistence : Superblock is persistent

Update Time : Thu Jun 4 18:24:15 2015
State : active, FAILED, Not Started
Active Devices : 10
Working Devices : 10
Failed Devices : 0
Spare Devices : 0

Layout : near=2
Chunk Size : 512K

Name : hydromediavault:vol5 (local to host hydromediavault)
UUID : f8964aaf:801f5634:e358c097:1d146306
Events : 239307

Number Major Minor RaidDevice State
0 0 0 0 removed
1 0 0 1 removed
2 66 240 2 active sync /dev/sdav
3 66 224 3 active sync /dev/sdau
4 66 208 4 active sync /dev/sdat
5 66 192 5 active sync /dev/sdas
6 66 176 6 active sync /dev/sdar
7 66 160 7 active sync /dev/sdaq
8 66 144 8 active sync /dev/sdap
9 66 128 9 active sync /dev/sdao
10 66 112 10 active sync /dev/sdan
11 66 96 11 active sync /dev/sdam

THE QUESTIONS:

#1: If RAID10 arrays can lose up to 2 drives and still be operational, why will a 12 disk RAID10 array not start up w/ 10 good drives? - Is there a way to force this?

#2: With the 2 drives missing the superblock showing as OK and not failed at hardware level, why can I not fail and remove the drives to re-add them to the array for re-assembly?

Any insight would be very helpful. I have key data on this array that is otherwise unrecoverable. Please Help!

ryecoaaron · 11. Juni 2015

Did you ever stop md4 before assembling: mdadm --stop /dev/md4?
Do sdaw and sdax show up in fdisk -l and blkid

You have a system this large with this many drives and no backup??

lrwaldon · 12. Juni 2015

Hello 'ryecoaaron'

Thank you for your reply. To answer your questions....

1. Did I run mdadm --stop /dev/md4? Yes, but the re-assemble would not even show indication of starting without trying to restart the array with a --scan switch to restart the array as shown in the following command results...

:~# mdadm --assemble -v --scan --force --run --uuid=f8964aaf:801f5634:e358c097:1d146306

...key results...

mdadm: /dev/sdav is identified as a member of /dev/md4, slot 2.
mdadm: /dev/sdau is identified as a member of /dev/md4, slot 3.
mdadm: /dev/sdat is identified as a member of /dev/md4, slot 4.
mdadm: /dev/sdas is identified as a member of /dev/md4, slot 5.
mdadm: /dev/sdar is identified as a member of /dev/md4, slot 6.
mdadm: /dev/sdaq is identified as a member of /dev/md4, slot 7.
mdadm: /dev/sdap is identified as a member of /dev/md4, slot 8.
mdadm: /dev/sdao is identified as a member of /dev/md4, slot 9.
mdadm: /dev/sdan is identified as a member of /dev/md4, slot 10.
mdadm: /dev/sdam is identified as a member of /dev/md4, slot 11.
mdadm: no uptodate device for slot 0 of /dev/md4
mdadm: no uptodate device for slot 1 of /dev/md4
mdadm: added /dev/sdau to /dev/md4 as 3
mdadm: added /dev/sdat to /dev/md4 as 4
mdadm: added /dev/sdas to /dev/md4 as 5
mdadm: added /dev/sdar to /dev/md4 as 6
mdadm: added /dev/sdaq to /dev/md4 as 7
mdadm: added /dev/sdap to /dev/md4 as 8
mdadm: added /dev/sdao to /dev/md4 as 9
mdadm: added /dev/sdan to /dev/md4 as 10
mdadm: added /dev/sdam to /dev/md4 as 11
mdadm: added /dev/sdav to /dev/md4 as 2
mdadm: failed to RUN_ARRAY /dev/md4: Input/output error
mdadm: Not enough devices to start the array.

2. Yes the drive show up under fdisk -l...

:~# fdisk -l

...Key results...

Disk /dev/sdaw: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xb23c4c98

Disk /dev/sdaw doesn't contain a valid partition table

Disk /dev/sdax: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xc99768eb

Disk /dev/sdax doesn't contain a valid partition table

...But not under the blkid command.

:~# blkid

....both drives before and after the "sdaw" & "sdax" assignment....

/dev/sdav: UUID="f8964aaf-801f-5634-e358-c0971d146306" UUID_SUB="2566b614-2e8b-d9b7-325f-1ce452f7a7f0" LABEL="hydromediavault:vol5" TYPE="linux_raid_member"
/dev/sday: UUID="7142528a-142b-2fcf-1864-bd51ab53d7c1" UUID_SUB="90a249cd-942b-fc3b-1cbc-b06e10977137" LABEL="hydromediavault:vol4" TYPE="linux_raid_member"

3. As far as a backup solution? I get that question as I work a lot with EMC's Avamar product.
This system however is not being backed up yet as this is a relatively new install, and the client has not purchased a solution for backups just yet. I.e., I have warned them this could happen, due to the storage server being a single point of failure, so this is damage control under a FUBAR situation.

I do have one interesting command result that looks promising however. Once again, understand I am used to hardware RAID, so am relatively new to MDADM and Linux software RAID. That is the following..

:~# mdadm --monitor /dev/md4
mdadm: Warning: One autorebuild process already running.

...Could this mean the Array is rebuilding before it comes back online?

Please any insight would be enormously helpful. Thanks in advance for the assistance here.

Dropkick Murphy · 12. Juni 2015

What does

Code

mdadm -D /dev/md4

show you?

ryecoaaron · 12. Juni 2015

I never use scan. Try:

mdadm --stop /dev/md4
mdadm --assemble /dev/md4 /dev/sda[mnopqrstuvwx] --verbose --force

lrwaldon · 12. Juni 2015

Hello "Dropkick Murphy"

The command you mentioned shows the following....

mdadm -D /dev/md4
/dev/md4:
Version : 1.2
Creation Time : Tue Mar 31 13:02:06 2015
Raid Level : raid10
Used Dev Size : -1
Raid Devices : 12
Total Devices : 10
Persistence : Superblock is persistent

Update Time : Thu Jun 4 18:24:15 2015
State : active, FAILED, Not Started
Active Devices : 10
Working Devices : 10
Failed Devices : 0
Spare Devices : 0

Layout : near=2
Chunk Size : 512K

Name : hydromediavault:vol5 (local to host hydromediavault)
UUID : f8964aaf:801f5634:e358c097:1d146306
Events : 239307

Number Major Minor RaidDevice State
0 0 0 0 removed
1 0 0 1 removed
2 66 240 2 active sync /dev/sdav
3 66 224 3 active sync /dev/sdau
4 66 208 4 active sync /dev/sdat
5 66 192 5 active sync /dev/sdas
6 66 176 6 active sync /dev/sdar
7 66 160 7 active sync /dev/sdaq
8 66 144 8 active sync /dev/sdap
9 66 128 9 active sync /dev/sdao
10 66 112 10 active sync /dev/sdan
11 66 96 11 active sync /dev/sdam

lrwaldon · 12. Juni 2015

Hello 'ryecoaaron',

I've actually tried this very approach. I tried multiple approaches before I finally caved and submitted to this forum. It's like the system is not behaving as expected. I am logged in as 'root' user, but I even tried these commands w/ 'sudo' and still the same results.

I can stop the array just fine with the '--stop' command, but when I try to start a re-assemble with the array stopped with the following command...

:~# mdadm --assemble /dev/md4 /dev/sda[mnopqrstuvwx] --verbose --force

....It returns an error that the array is currently stopped.

But when I added scan like in the following.....

:~# mdadm --assemble /dev/md4 /dev/sda[mnopqrstuvwx] --verbose --force

.....It will seem to start things up but again, returns the result that I showed in my original post...

mdadm: failed to RUN_ARRAY /dev/md4: Input/output error
mdadm: Not enough devices to start the array.

However, I wonder if I'm being impatient. I see the following when I run the '--monitor' switch....

:~# mdadm --monitor /dev/md4
mdadm: Warning: One autorebuild process already running.

...Could this mean the Array is rebuilding before it comes back online?

lrwaldon · 15. Juni 2015

Hello All,

In the interest of following the instruction given in the "Degraded raid array questions" sticky note for the RAID forum, here are the results of the commands requested for down array issues...
NOTE: Due to post character limitations, I cannot submit the entire results of some of these commands, so only the key results are given...

1. First - cat /proc/mdstat

Code

root@hostname:~# cat /proc/mdstat
Personalities : [raid10]
md4 : inactive sdav[2] sdam[11] sdan[10] sdao[9] sdap[8] sdaq[7] sdar[6] sdas[5] sdat[4] sdau[3]
      39068875120 blocks super 1.2


md5 : active (auto-read-only) raid10 sdal[0] sdaa[11] sdab[10] sdac[9] sdad[8] sdae[7] sdaf[6] sdag[5] sdah[4] sdai[3] sdaj[2] sdak[1]
      23441323008 blocks super 1.2 512K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]


md3 : active raid10 sdbj[0] sday[11] sdaz[10] sdba[9] sdbb[8] sdbc[7] sdbd[6] sdbe[5] sdbf[4] sdbg[3] sdbh[2] sdbi[1]
      23441323008 blocks super 1.2 512K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]


md2 : active raid10 sdbq[0] sdbk[11] sdbl[10] sdbm[9] sdbn[8] sdbo[7] sdbp[6] sdbr[5] sdbs[4] sdbt[3] sdbu[2] sdbv[1]
      23441323008 blocks super 1.2 512K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]


md1 : active raid10 sdn[0] sdc[11] sdd[10] sde[9] sdf[8] sdg[7] sdh[6] sdi[5] sdj[4] sdk[3] sdl[2] sdm[1]
      23441323008 blocks super 1.2 512K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]


md0 : active raid10 sdz[0] sdo[11] sdp[10] sdq[9] sdr[8] sds[7] sdt[6] sdu[5] sdv[4] sdw[3] sdx[2] sdy[1]
      23441323008 blocks super 1.2 512K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]

Alles anzeigen

2. Second - fdisk -l (Including only results for 2 affected disks.)

Code

root@hostname:~# fdisk -l


Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0003ceba


   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048  1874483199   937240576   83  Linux
/dev/sda2      1874485246  1953523711    39519233    5  Extended
/dev/sda5      1874485248  1953523711    39519232   82  Linux swap / Solaris


WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdaw: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xb23c4c98


Disk /dev/sdaw doesn't contain a valid partition table


Disk /dev/sdax: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xc99768eb


Disk /dev/sdax doesn't contain a valid partition table

Alles anzeigen

3. Third - mdadm -D /dev/md4 (To demonstrate the issue at hand)

Code

root@hostname:~# mdadm -D /dev/md4
/dev/md4:
        Version : 1.2
  Creation Time : Tue Mar 31 13:02:06 2015
     Raid Level : raid10
  Used Dev Size : -1
   Raid Devices : 12
  Total Devices : 10
    Persistence : Superblock is persistent


    Update Time : Thu Jun  4 18:24:15 2015
          State : active, FAILED, Not Started
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0


         Layout : near=2
     Chunk Size : 512K


           Name : hydromediavault:vol5  (local to host hydromediavault)
           UUID : f8964aaf:801f5634:e358c097:1d146306
         Events : 239307


    Number   Major   Minor   RaidDevice State
       [color=#B22222]0       0        0        0      removed
       1       0        0        1      removed[/color]
       2      66      240        2      active sync   /dev/sdav
       3      66      224        3      active sync   /dev/sdau
       4      66      208        4      active sync   /dev/sdat
       5      66      192        5      active sync   /dev/sdas
       6      66      176        6      active sync   /dev/sdar
       7      66      160        7      active sync   /dev/sdaq
       8      66      144        8      active sync   /dev/sdap
       9      66      128        9      active sync   /dev/sdao
      10      66      112       10      active sync   /dev/sdan
      11      66       96       11      active sync   /dev/sdam

Alles anzeigen

4. Fourth - blkid (Showing results indicating these 2 drives do not show in the results of this command)

Code

root@hydromediavault:~# blkid
/dev/sdb1: UUID="0cf5d1c8-aca3-4dfd-8f20-da117f20ad26" TYPE="ext4"
/dev/sda1: UUID="27c09dc8-81de-44eb-ae06-da3cd733a157" TYPE="ext4"
/dev/sda5: UUID="329a497a-0808-4837-8552-2c9dc083b85d" TYPE="swap"


/dev/sdav: UUID="f8964aaf-801f-5634-e358-c0971d146306" UUID_SUB="2566b614-2e8b-d9b7-325f-1ce452f7a7f0" LABEL="hydromediavault:vol5" TYPE="linux_raid_member"
[color=#800000][i](Both ./sdaw & ./sdax Drives should be listing here but do not)[/i][/color]
/dev/sday: UUID="7142528a-142b-2fcf-1864-bd51ab53d7c1" UUID_SUB="90a249cd-942b-fc3b-1cbc-b06e10977137" LABEL="hydromediavault:vol4" TYPE="linux_raid_member"


/dev/md0: UUID="RvyJVq-yO2n-U1c7-LoJe-xibi-m3sc-WJlRgi" TYPE="LVM2_member"
/dev/md1: UUID="MQMkME-6gER-2k6y-HSev-NRaN-2Xwn-F4aJ0E" TYPE="LVM2_member"
/dev/md2: UUID="9guLDw-evIw-45XV-omZv-P3K1-o6bd-eQ2CWP" TYPE="LVM2_member"
/dev/md3: UUID="xrE9yG-FpWn-ONY8-cQii-UFK2-pYRQ-Mv1Ttd" TYPE="LVM2_member"
/dev/md5: UUID="QCneDY-SvD0-twcj-RwQa-V6Tu-Ra59-keMnk9" TYPE="LVM2_member"
/dev/mapper/Lun6-Lun6: UUID="9rSdgi-RCze-dS9f-6CTi-hsYO-lz7X-Vb3inU" TYPE="LVM2_member"
/dev/mapper/Lun4-Lun4: LABEL="LUN4" UUID="b5ffa606-a54f-46f3-bfe2-1127406e3c7e" TYPE="ext4"
/dev/mapper/Lun3-Lun3: LABEL="LUN3" UUID="1af8667f-6ac9-41b6-8bdb-5ec97ba897a3" TYPE="ext4"
/dev/mapper/Lun2-Lun2: LABEL="LUN2" UUID="b2d7e2d7-8052-4acb-a2af-e94c51ffcff0" TYPE="ext4"
/dev/mapper/Lun1-Lun1: LABEL="LUN1" UUID="19fd5749-23b3-4b1f-bd33-1f933282bbbc" TYPE="ext4"

Alles anzeigen

5.) As noted in previous post to this thread, the 2 HDD's in question show as status OK by SMART.

6.) For the past few days now MDADM monitor shows the following results...

Code

root@hostname:~# mdadm --monitor /dev/md4
mdadm: Warning: One autorebuild process already running.

Could the array be auto-rebuilding?

I know it took 3 to 4 days for the array to initialize when I created it because it's 12x4TB=23Tb in size.

Thanks again in advance for any help. (I am not clear on which code box I should be using. The one I chose gave line breaks, where the other did not.)

Dropkick Murphy · 15. Juni 2015

Pls try

Code

mdadm --monitor

on one of your healthy raids.
Why? See here: http://www.webhostingtalk.com/showthread.php?t=1170493
We can so figure out, if your md4 is rebuilding...

HTH

lrwaldon · 17. Juni 2015

Thanks for the reply. I tried the mdadm -monitor against one of my good RAIDS and it showed the same result as show....

Code

root@hydromediavault:~# mdadm --monitor /dev/md5
mdadm: Warning: One autorebuild process already running.

....Any thoughts on what can I do to get the RAID array up in a crippled state?

This is It's RAID 10 for an iSCSI array if that is helpful, but I find it hard to understand why 10 out of 12 drives would not be enough to bring the array at least back up.

Help Please.

lrwaldon · 17. Juni 2015

Zitat

Is it possible that this could be unique to this being a RAID 10 array created for iSCS?

Does this change how the array management is handled compared to NFS or SMB shares?

RAID10 Fails w/ “not enough operational mirrors.” w/ 10 of 12 available.

Jetzt mitmachen!

Tags