RAID 10 clean, degraded

  • Hi, I have an issue with my RAID10 on my OMV 4.1.36-1 system. I don't know what happened, but I have noticed that only 3 out of 4 disks are working, and in RAID Management the status is active/degraded. I use 4 x 12TB WD Red disks.


    Any help from you is appreciated.


    Thank you very much.

    Code
    root@pandora:~# cat /proc/mdstat
    Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
    md0 : active raid10 sda[0] sde[2] sdb[1]
          23437508608 blocks super 1.2 512K chunks 2 near-copies [4/3] [UUU_]
          bitmap: 82/175 pages [328KB], 65536KB chunk
    
    unused devices: <none>
    Code
    root@pandora:~# blkid
    /dev/sda: UUID="8b767a7d-c52c-068d-c04f-1a3cfd8d4c5f" UUID_SUB="3904f2f1-fe1f-bde3-a965-d9dbe0074f66" LABEL="pandora:Raid4x12TBWdRed" TYPE="linux_raid_member"
    /dev/md0: LABEL="REDRAID4X12" UUID="5fd65f52-b922-45e3-a940-eb7c75460446" TYPE="ext4"
    /dev/sdb: UUID="8b767a7d-c52c-068d-c04f-1a3cfd8d4c5f" UUID_SUB="a6bb8aa8-4e9b-7f90-b105-45a9301acbce" LABEL="pandora:Raid4x12TBWdRed" TYPE="linux_raid_member"
    /dev/sdc1: UUID="2218-DC43" TYPE="vfat" PARTUUID="09f69470-ba7b-4b6b-9456-c09f4c6ad2ee"
    /dev/sdc2: UUID="87bfca96-9bee-4725-ae79-d8d7893d5a49" TYPE="ext4" PARTUUID="3c45a8f0-3106-4ba8-89bc-b15d22e81144"
    /dev/sdc3: UUID="856b0ba6-a0a9-49f2-81ef-27e24004aa98" TYPE="swap" PARTUUID="fda4b444-cf82-4ae8-b916-01b8244acee3"
    /dev/sde: UUID="8b767a7d-c52c-068d-c04f-1a3cfd8d4c5f" UUID_SUB="6c9c5433-6838-c39f-abfa-7807205a3238" LABEL="pandora:Raid4x12TBWdRed" TYPE="linux_raid_member"
    Code
    root@pandora:~# fdisk -l | grep "Disk "
    Disk /dev/sda: 10,9 TiB, 12000138625024 bytes, 23437770752 sectors
    Disk /dev/sdb: 10,9 TiB, 12000138625024 bytes, 23437770752 sectors
    Disk /dev/md0: 21,8 TiB, 24000008814592 bytes, 46875017216 sectors
    Disk /dev/sdc: 28,7 GiB, 30752636928 bytes, 60063744 sectors
    Disk identifier: 51328880-3F36-4C4F-A18D-76E5CF56DD7D
    Disk /dev/sdd: 100 MiB, 104857600 bytes, 204800 sectors
    Disk /dev/sde: 10,9 TiB, 12000138625024 bytes, 23437770752 sectors
    Code
    root@pandora:~# mdadm --detail --scan --verbose
    ARRAY /dev/md0 level=raid10 num-devices=4 metadata=1.2 name=pandora:Raid4x12TBWdRed UUID=8b767a7d:c52c068d:c04f1a3c:fd8d4c5f
       devices=/dev/sda,/dev/sdb,/dev/sde

    OMV 6.9.16-1 (Shaitan) - Debian 11 (Bullseye) - Linux 6.1.0-0.deb11.21-amd64

    OMV Plugins: backup 6.1.1 | compose 6.11.3 | cputemp 6.1.3 | flashmemory 6.2 | ftp 6.0.7-1 | kernel 6.4.10 | nut 6.0.7-1 | omvextrasorg 6.3.6 | resetperms 6.0.3 | sharerootfs 6.0.3-1

    ASRock J5005-ITX - 16GB DDR4 - 4x WD RED 12TB (Raid10), 2x WD RED 4TB (Raid1) [OFF], Boot from SanDisk Ultra Fit Flash Drive 32GB - Fractal Design Node 304

  • Dr. geaves will be here shortly

    Have you checked the SMART information for your missing disk?

    It is no longer in the device list. Under Scheduled tests it shows "Capacity: n/a".


  • Output of mdadm --detail /dev/md0
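
    For reference, that command prints the overall array state plus a per-device status list (active sync / faulty / removed), which is the quickest way to see which slot is empty. A minimal sketch, run as root:

    Code
    # show the array state and the status of each member device
    mdadm --detail /dev/md0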


  • :/ I'm trying to work out which is the failed drive from the array. I'm guessing it's /dev/sdd, as it's the only one listed in fdisk that is not listed in blkid. You're going to need to replace that drive (a quick shell cross-check is sketched below).

    The day after tomorrow I will have the disk available for replacement. At that point what procedure should I follow exactly? Thank you very much.
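
    One way to cross-check that guess from the shell before the new disk arrives (a sketch; the /dev/sdd name is taken from the fdisk output above and may differ after a reboot):

    Code
    # list block devices with model and serial number to match against the RAID members
    lsblk -o NAME,SIZE,MODEL,SERIAL
    # a healthy member carries an mdadm superblock; a dropped or dead disk will not show one
    mdadm --examine /dev/sdd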


    • Official Post

    At that point what procedure should I follow exactly? Thank you very much

    You need to identify the failed drive. To do this, look in Storage -> Disks and take a screenshot for reference; it will list the drives with their reference in OMV, i.e. /dev/sd? (? being a, b, c, etc.), and it will also show the make and serial number.


    Shut down the server and locate the failed drive using the information from the screenshot. WARNING!!! Before you remove the failed drive from the server, check and double-check that it's the right one. Remove the failed drive, install the new one, and make a note of its serial number.


    Start the server, go to Storage -> Disks, select the new drive and click Wipe on the menu, select Short, click OK, and wait until it has completed.


    In RAID Management, select the RAID and click Recover on the menu; a dialog will open showing the new drive. Select it and click OK, and the array should now rebuild. BTW, due to the size of the drives this will take a llllooooonnnnngggg time :) (A rough command-line equivalent is sketched at the end of this post.)


    WARNING!! The last time I dealt with a RAID10 for a user, he lost all his data; how or why, I have no idea. A RAID10 is a striped RAID built on two RAID1 mirrors, and I assume that's how OMV sets it up, as I've never used RAID10 myself.


    Due to the size of your drives, and therefore the size of the array, I assume you don't have a backup in case this goes wrong.
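
    For anyone who prefers the shell, a rough command-line equivalent of the steps above (a sketch only; /dev/sdX stands for the new drive, so double-check the device name against its serial number before running anything):

    Code
    # confirm which device name the new drive came up as (check model/serial)
    lsblk -o NAME,SIZE,MODEL,SERIAL
    # clear any old signatures from the new drive (roughly what the quick wipe does)
    wipefs -a /dev/sdX
    # add the new drive to the degraded array; the rebuild starts automatically
    mdadm /dev/md0 --add /dev/sdX
    # watch the rebuild progress
    cat /proc/mdstat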

  • You need to identify the failed drive. To do this, look in Storage -> Disks and take a screenshot for reference; it will list the drives with their reference in OMV, i.e. /dev/sd? (? being a, b, c, etc.), and it will also show the make and serial number.

    Schermata-2021-12-20-alle-12.04.03.png


    Could there be a problem inside the case? A problem with the SATA cable? Thank you.


  • VERY IMPORTANT NEWS:

    You need to identify the failed drive. To do this, look in Storage -> Disks and take a screenshot for reference; it will list the drives with their reference in OMV, i.e. /dev/sd? (? being a, b, c, etc.), and it will also show the make and serial number.

    In Storage -> Disks I clicked "Scan" and the disk appeared as /dev/sdf! (A shell equivalent of that rescan is sketched at the end of this post.)


    Schermata-2021-12-20-alle-12.29.23.png


    I can also see it under S.M.A.R.T. -> Devices (and Scheduled tests) and under File Systems, but not under RAID Management (I assume that to appear there it should have come back as /dev/sdd... not /dev/sdf...).


    What would be the best thing to do now? The system has evidently been running for some time on just three drives.


    Thank you.
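
    Side note: the rescan that the Scan button triggers can also be done from the shell; a sketch, run as root, assuming the disk hangs off a SATA/SCSI host adapter:

    Code
    # ask every SCSI host adapter to rescan its bus for newly attached devices
    for h in /sys/class/scsi_host/host*/scan; do echo "- - -" > "$h"; done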


  • Possibly, but the other question is what your hardware is; searching for asmt109x config suggests it's a USB bridge!!

    Maybe it's the USB connection to the UPS?
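
    Whether any of the disks (or just the UPS) actually sits behind a USB bridge can be checked from the shell; a sketch:

    Code
    # list USB devices; an ASMedia ASM109x-type bridge would show up here
    lsusb
    # disks attached through USB carry a usb- prefix in their by-id names
    ls -l /dev/disk/by-id/ | grep -i usb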


    • Official Post

    In Storage -> Disks I clicked "Scan" and the disk appeared as /dev/sdf

    Then that might suggest there is a faulty SATA cable.

    but not under RAID Management (I assume that to appear there it should have come back as /dev/sdd... not /dev/sdf...).

    If you select the RAID under RAID Management, does /dev/sdf show in the dialog?

  • Then that might suggest there is a faulty SATA cable.

    If you select the RAID under RAID Management, does /dev/sdf show in the dialog?


    It is not shown. The fourth disk still shows as "removed".


  • :/ Have you tried Recover on the menu after selecting the RAID? If there's no drive available, the dialog will be empty.

    There is only /dev/sdd... I don't know why, but I have a feeling that a reboot might fix the situation. Or could that be very dangerous?



  • :) Are you a Windows user?

    Mac! :saint: But my first server ran FreeBSD in 1995. Mac for everyday use, Linux for servers, and Windows only in a VM, and only when there's no way around it ;)

    :) Are you a Windows user?


    As /dev/sdf shows up under SMART, run a test on it; you're looking at the output for attributes 197 and 198, particularly their values (sketched below).

    Thank you :)
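
    A sketch of how that looks from the shell, assuming smartmontools is installed and the disk is still /dev/sdf:

    Code
    # start a short SMART self-test (takes a few minutes)
    smartctl -t short /dev/sdf
    # afterwards, read the attributes; 197 = Current_Pending_Sector, 198 = Offline_Uncorrectable
    smartctl -A /dev/sdf | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'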


    • Official Post

    Dr. Geaves, if you'll allow me to interrupt ...

    That disk seems to be fine (although a little hot, 40°C). I would do the same with the others and compare the number of starts and/or the number of hours of operation. Assuming they were all purchased at the same time, if this disk is significantly lower it could be the result of a hardware failure preventing it from coming up at boot.
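
    A sketch of that comparison from the shell (attribute 9 = Power_On_Hours, 12 = Power_Cycle_Count; the device list is an assumption based on the outputs above):

    Code
    # print hours of operation and power cycle count for each array member plus the returned disk
    for d in /dev/sda /dev/sdb /dev/sde /dev/sdf; do
        echo "== $d =="
        smartctl -A "$d" | grep -E 'Power_On_Hours|Power_Cycle_Count'
    done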
