Raid1 Failure

ColdShoulderMedia · 3. September 2017

I've had a RAID1 array failure; it looks like it happened overnight during a resync. Normally I would try to sort it out on my own, but SMART is disabled on both drives for some reason, so I'm less confident. Please let me know what info you need, or what steps I should take to get my array functioning again.

As requested in ryecoaaron's sticky, "Degraded or missing raid array questions", here is my info:

Code

root@RoachHotel:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdc[0] sdb[1]
      4883639360 blocks super 1.2 [2/2] [UU]
      [===================>.]  check = 96.7% (4723518144/4883639360) finish=25.8min speed=103240K/sec


md1 : active raid1 sdd[0] sde[1]
      1953383360 blocks super 1.2 [2/2] [UU]


md0 : active raid1 sdf[0] sdg[1](F)
      2930135360 blocks super 1.2 [2/1] [U_]


unused devices: <none>

Code

root@RoachHotel:~# blkid
/dev/sda1: UUID="5a75f3af-251a-43c2-b1c6-d8bfb62b53ce" TYPE="ext4"
/dev/sde: UUID="e371445a-e72e-2298-0b4d-f961a7f3b799" UUID_SUB="0c838662-4d40-287a-b137-28a93821195e" LABEL="RoachHotel:Volume2" TYPE="linux_raid_member"
/dev/sdd: UUID="e371445a-e72e-2298-0b4d-f961a7f3b799" UUID_SUB="5b137bcd-d7fd-0a2f-1aab-3dd3c712d87b" LABEL="RoachHotel:Volume2" TYPE="linux_raid_member"
/dev/sdb: UUID="81aa4e20-7495-6f6b-39df-90c3236e8a53" UUID_SUB="e3fd4806-9ec2-f1cf-5c67-0539feec47f5" LABEL="RoachHotel:Volume3" TYPE="linux_raid_member"
/dev/md1: LABEL="Volume2" UUID="c806cd00-887e-4f26-a629-250a06182018" TYPE="ext4"
/dev/md2: LABEL="Volume3" UUID="cc3b5002-20d9-44bf-a583-e76fcbeb8bdd" TYPE="ext4"
/dev/sdc: UUID="81aa4e20-7495-6f6b-39df-90c3236e8a53" UUID_SUB="de60ab9e-0ce8-e732-9008-eada17f76806" LABEL="RoachHotel:Volume3" TYPE="linux_raid_member"

Code

root@RoachHotel:~# fdisk -l | grep "Disk "
Disk /dev/sdb doesn't contain a valid partition table
Disk /dev/sdc doesn't contain a valid partition table
Disk /dev/sdd doesn't contain a valid partition table
Disk /dev/sde doesn't contain a valid partition table
Disk /dev/md1 doesn't contain a valid partition table
Disk /dev/md2 doesn't contain a valid partition table
Disk /dev/sda: 32.0 GB, 32017047552 bytes
Disk identifier: 0x000753d0
Disk /dev/sdb: 5001.0 GB, 5000981078016 bytes
Disk identifier: 0x00000000
Disk /dev/sdc: 5001.0 GB, 5000981078016 bytes
Disk identifier: 0x00000000
Disk /dev/sdd: 2000.4 GB, 2000398934016 bytes
Disk identifier: 0x00000000
Disk /dev/sde: 2000.4 GB, 2000398934016 bytes
Disk identifier: 0x00000000
Disk /dev/md1: 2000.3 GB, 2000264560640 bytes
Disk identifier: 0x00000000
Disk /dev/md2: 5000.8 GB, 5000846704640 bytes
Disk identifier: 0x00000000

Code

root@RoachHotel:~# cat /etc/mdadm/mdadm.conf
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#


# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
# Note, if no DEVICE line is present, then "DEVICE partitions" is assumed.
# To avoid the auto-assembly of RAID devices a pattern that CAN'T match is
# used if no RAID devices are configured.
DEVICE partitions


# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes


# automatically tag new arrays as belonging to the local system
HOMEHOST <system>


# definitions of existing MD arrays
ARRAY /dev/md0 metadata=1.2 name=RoachHotel:Volume1 UUID=734cfdf1:3652f210:cf4a6e9b:f35a4fe0
ARRAY /dev/md1 metadata=1.2 name=RoachHotel:Volume2 UUID=e371445a:e72e2298:0b4df961:a7f3b799
ARRAY /dev/md2 metadata=1.2 name=RoachHotel:Volume3 UUID=81aa4e20:74956f6b:39df90c3:236e8a53


# instruct the monitoring daemon where to send mail alerts
MAILADDR coldshouldermedia@yahoo.com
MAILFROM root

Code

root@RoachHotel:~# mdadm --detail --scan --verbose
ARRAY /dev/md0 level=raid1 num-devices=2 metadata=1.2
   devices=/dev/sdf,/dev/sdg
ARRAY /dev/md1 level=raid1 num-devices=2 metadata=1.2 name=RoachHotel:Volume2 UUID=e371445a:e72e2298:0b4df961:a7f3b799
   devices=/dev/sdd,/dev/sde
ARRAY /dev/md2 level=raid1 num-devices=2 metadata=1.2 name=RoachHotel:Volume3 UUID=81aa4e20:74956f6b:39df90c3:236e8a53
   devices=/dev/sdc,/dev/sdb

ColdShoulderMedia · 3. September 2017

It might be helpful to note that the two drives in question are sdf and sdg.

Also, the DegradedArray event I received says:
md0: active raid1 sdf[0] sdg[1](F)
2930135360 blocks super 1.2 [2/1] [U_]

ColdShoulderMedia · 6. September 2017

It seems like just a single disk failure, so I've ordered a new disk and will work on a rebuild. I'm sure this is obvious to most people, but I was just hoping for a confirmation from more knowledgeable folks here before I started spending money on the process.

ColdShoulderMedia · 9. September 2017

Okay, I added a disk back in and successfully rebuilt the array. Just making a note of this for anyone looking for the same info in the future.
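
For readers looking for the actual steps: the post does not say which commands were used, so the following is only a sketch of a common mdadm sequence for swapping a failed RAID1 member. The function name `replace_member_plan` and the device names in the usage line are placeholders, not taken from this system. It prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Print (do not run) a typical mdadm sequence for replacing a RAID1 member.
# Arguments are placeholders: array device, old member, new member.
replace_member_plan() {
    array=$1
    old=$2
    new=$3
    # Mark the old member as failed, in case the kernel has not already done so.
    echo "mdadm --manage $array --fail $old"
    # Detach the failed member from the array.
    echo "mdadm --manage $array --remove $old"
    # Add the replacement; the resync onto it starts automatically.
    echo "mdadm --manage $array --add $new"
    # Rebuild progress is visible in /proc/mdstat.
    echo "cat /proc/mdstat"
}

# Usage (hypothetical device names):
#   replace_member_plan /dev/md0 /dev/sdg /dev/sdh
```

If the old drive has already vanished from the bus, the `--fail` and `--remove` steps may report that the device does not exist; mdadm's man page describes the keywords `failed` and `detached` for `--remove` for that situation.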

The most important thing to note is that OMV's default behavior seems to be to drop both drives of a failed array, including their SMART diagnostics.

Sc0rp · 13. September 2017

Re,

Quote from ColdShoulderMedia

It seems like just a single disk failure,

Yeah, you're right:
md0 : active raid1 sdf[0] sdg[1](F) <- the (F) indicates a failed drive
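
The same check can be scripted. This is a minimal sketch, assuming the `/proc/mdstat` layout shown earlier in the thread; the function name `failed_members` is made up for illustration. It reads stdin, so it works on saved output as well as on the live file.

```shell
#!/bin/sh
# Print "<array> <device>" for every member flagged (F) in mdstat-style text.
failed_members() {
    awk '
        # Array lines look like: "md0 : active raid1 sdf[0] sdg[1](F)"
        $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*/, "", dev)   # "sdg[1](F)" becomes "sdg"
                    print $1, dev
                }
            }
        }
    '
}

# Usage on a live system:
#   failed_members < /proc/mdstat
```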

Anyway, you can use SMART both daemonized (periodically checking the drives) and "standalone" with
smartctl -a /dev/sd_

Sc0rp

ColdShoulderMedia · 15. September 2017

Thanks, Sc0rp!

I've been getting notifications for a missing spare since I did the rebuild, so I need to sort that out now.

tkaiser · 15. September 2017

Quote from ColdShoulderMedia

It seems like just a single disk failure

Maybe, maybe not. You did not provide dmesg or /var/log/syslog output, so I find it pretty hard to guess what might have happened. In the fdisk output above, /dev/sdg is missing entirely; this can be the result of a completely dead drive or of $whatever. I've already seen drives kicked out of arrays for various reasons (cable/contact issues, underpowering, vibrations, sometimes even a disk problem or a 'dead disk').

@ryecoaaron and others: Wouldn't it be a nice feature to allow OMV to send such 'debug logs' as Armbian already does (example)? At least dmesg output, SMART health data (and attribute 199!) and HBA info should be collected and then submitted to an online pasteboard service as above.
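
As a rough illustration of the attribute 199 check: SMART attribute 199 is usually reported as UDMA_CRC_Error_Count, and a rising raw value is commonly read as a hint at cable or contact trouble rather than a failing disk. The sketch below assumes the usual ten-column attribute table printed by `smartctl -A`, with the raw value in the last column; the function name `crc_error_count` is hypothetical.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 199 from `smartctl -A` style text on stdin.
crc_error_count() {
    # Attribute rows start with the numeric ID; column 10 is RAW_VALUE.
    awk '$1 == 199 { print $10 }'
}

# Usage on a live system (device name is an example):
#   smartctl -A /dev/sdf | crc_error_count
```

Nothing is printed when the drive does not report attribute 199 at all.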

ryecoaaron · 16. September 2017

Quote from tkaiser

Wouldn't it be a nice feature to allow OMV to send such 'debug logs' as Armbian already does

omv-extras used to have a way to submit a report that was stored on the omv-extras server, but not many people used it. People can also download a system information report from the Diagnostics -> System Information -> Report tab. But I admit that armbianmonitor works pretty well. What would you suggest: fork armbianmonitor, just use it as is on more systems, or other ideas?

ColdShoulderMedia · 16. September 2017

tkaiser, thanks for the response.

I would be happy to give you more information if it might help me determine what has gone wrong. I posted all the info that ryecoaaron asks for in his sticky.

Just let me know what else you need; I'm not well versed in a Debian environment.

tkaiser · 16. September 2017

Quote from ryecoaaron

I admit that armbianmonitor works pretty well. What would you suggest: fork armbianmonitor, just use it as is on more systems, or other ideas?

Well, I think having a tool like omv-diag ready that can be used directly via SSH (and maybe later in some way through the web UI) to

- upload a brief system overview (including at least the last 200 dmesg lines)
- collect a more complete report to be pasted manually to an online pasteboard service

would already be nice for supporting issues like this one. After all, why does a disk disappear from the bus? If the reason is a faulty cable or a contact issue, then sure, a replacement disk will help, but only because swapping it also fixes the real issue ('connection loss') by accident.
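
To make the idea concrete, here is a minimal sketch of what the "brief overview" half of such a tool could look like. No omv-diag tool exists in this thread; the function name `omv_diag_brief` and the section headers are invented for illustration, and the choice of what to collect is only a guess at a useful minimum.

```shell
#!/bin/sh
# Hypothetical brief report: kernel info, md array state, last 200 dmesg lines.
omv_diag_brief() {
    echo "### system"
    uname -a
    echo "### mdstat"
    # /proc/mdstat only exists when the md driver is loaded.
    cat /proc/mdstat 2>/dev/null || echo "(no /proc/mdstat)"
    echo "### dmesg (last 200 lines)"
    # dmesg may need root; print nothing rather than fail if it is refused.
    dmesg 2>/dev/null | tail -n 200 || true
}

# Usage: review locally first, then upload to a pasteboard service if wanted.
#   omv_diag_brief > /tmp/omv-diag.txt
```

Writing to a file first gives the user a chance to look over the report before anything leaves the machine.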

And armbianmonitor is not the right tool to fork (since in Armbian the main logging happens at every startup and goes to /var/log/armhwinfo.log, where it is collected later when armbianmonitor uploads stuff), so maybe start with another fork that already took care of this? E.g. https://github.com/ayufan-rock…bin/rock64_diagnostics.sh

Quote from ColdShoulderMedia

Just let me know what else you need

In case you haven't rebooted yet, the output from the following command would be useful:

Code

dmesg | curl -F 'sprunge=<-' http://sprunge.us