RAID 5 after extension: clean, degraded

  • Hi all,


I'm on 6.0.28-3 (Shaitan). I extended my RAID 5 with a new 4 TB HDD, having 3x 4 TB already in the system. During the extension, the system shut down (for some unknown reason, maybe overheating), but when I started it again, it continued with the extension. When it finished (seemingly successfully), I was in a hurry and just quickly extended the file system, which worked.


    Now having a closer look at the RAID, it tells me it's in the state "clean, degraded":

    Code
Version : 1.2
Creation Time : Sun Sep 2 04:05:15 2018
Raid Level : raid5
Array Size : 11720661504 (11177.69 GiB 12001.96 GB)
Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
Raid Devices : 4
Total Devices : 3
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Fri Jan 27 07:00:56 2023
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 512K

Consistency Policy : bitmap

Name : openmediavault:RAIDAR (local to host openmediavault)
UUID : e98b7abd:4f328c81:40a102c3:1824afcf
Events : 79896

Number Major Minor RaidDevice State
   0      8     16       0     active sync /dev/sdb
   1      8     32       1     active sync /dev/sdc
   2      8     48       2     active sync /dev/sdd
   -      0      0       3     removed

Googling suggested the mdadm --add command to add the missing drive back to the array. However, I would have expected the "Recover" option in the GUI to do the same, but I cannot select a device there:



    Does anyone have experience with this? Can I safely execute the mdadm --add command or do I need to do something else?


Here is some detailed information:

    cat /proc/mdstat

    Code
    Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
    md127 : active raid5 sdc[1] sdb[0] sdd[2]
    11720661504 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
    bitmap: 20/30 pages [80KB], 65536KB chunk

    blkid

    Code
    /dev/sda1: UUID="64ae1488-3bd9-4236-8742-9ea44db6f56c" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="76aa5ac0-01"
    /dev/sda5: UUID="c2b0cb47-aeec-4b5a-8285-857b1c56da54" TYPE="swap" PARTUUID="76aa5ac0-05"
    /dev/sdb: UUID="e98b7abd-4f32-8c81-40a1-02c31824afcf" UUID_SUB="a36eadb0-2348-fb83-ec76-65c9fa5df48b" LABEL="openmediavault:RAIDAR" TYPE="linux_raid_member"
    /dev/sdc: UUID="e98b7abd-4f32-8c81-40a1-02c31824afcf" UUID_SUB="94cf7512-43e5-3957-7060-0e6cc0cdd526" LABEL="openmediavault:RAIDAR" TYPE="linux_raid_member"
    /dev/sdd: UUID="e98b7abd-4f32-8c81-40a1-02c31824afcf" UUID_SUB="f1ae8b96-55da-2541-bc00-7be870687109" LABEL="openmediavault:RAIDAR" TYPE="linux_raid_member"
    /dev/md127: LABEL="Raidar" UUID="5d21dac9-d7ba-4831-9d29-e6d9d8de5b3b" BLOCK_SIZE="4096" TYPE="ext4"
    /dev/sde: UUID="e98b7abd-4f32-8c81-40a1-02c31824afcf" UUID_SUB="e80184c3-5dc3-17b4-1f73-a6f95f5fb718" LABEL="openmediavault:RAIDAR" TYPE="linux_raid_member"
    /dev/sdf1: UUID="b533ba9f-52ff-9d49-8092-a954a53881e4" BLOCK_SIZE="4096" TYPE="ext4" PTUUID="d433308c" PTTYPE="dos" PARTUUID="d433308c-01"

    fdisk -l | grep "Disk "

    cat /etc/mdadm/mdadm.conf

    mdadm --detail --scan --verbose

    Code
    ARRAY /dev/md/openmediavault:RAIDAR level=raid5 num-devices=4 metadata=1.2 name=openmediavault:RAIDAR UUID=e98b7abd:4f328c81:40a102c3:1824afcf
    devices=/dev/sdb,/dev/sdc,/dev/sdd


    Any help is appreciated, thank you.

• Official Post

    However, I would have expected the "recover" option in the GUI to do the same, but I cannot select a device there:

    That is because the drive /dev/sde has a raid signature on it according to blkid


    I assume that /dev/sde is the drive you added to grow the array, if that's the case then mdadm --add /dev/md127 /dev/sde should add the drive back to the array
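From the shell that would look roughly like this (a sketch; it's worth confirming with mdadm --examine first that /dev/sde really carries the array's superblock before adding it):

Code
# the Array UUID shown here should match e98b7abd:4f328c81:40a102c3:1824afcf
mdadm --examine /dev/sde
# add the disk back to the array and then watch the rebuild
mdadm --add /dev/md127 /dev/sde
cat /proc/mdstat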

  • That is because the drive /dev/sde has a raid signature on it according to blkid


    I assume that /dev/sde is the drive you added to grow the array, if that's the case then mdadm --add /dev/md127 /dev/sde should add the drive back to the array

    Thank you for the quick response. OK, I did that. Unfortunately it's still "clean, degraded":


    cat /proc/mdstat

    Code
    Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
    md127 : active raid5 sde[3](F) sdc[1] sdb[0] sdd[2]
    11720661504 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
    bitmap: 11/30 pages [44KB], 65536KB chunk

    Looks like the drive is faulty :(

    Do you know if this must be a hardware error or do I have any (software-wise) recovery options from here on?

• Official Post

    Looks like the drive is faulty

Is the drive new or repurposed? Have you run a long SMART test on it? It might not be the drive; it could be the motherboard connection, the SATA cable, or intermittent power.

    Do you know if this must be a hardware error or do I have any (software-wise)

    This is hardware related

    recovery options from here on

A backup :) The current array is still accessible, but I always advise restricting access to it as much as possible.
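If you want to see why md kicked the drive before deciding anything, the kernel log is the first place to look; roughly along these lines (a sketch, assuming the messages from the failed rebuild are still in the ring buffer):

Code
# look for ATA/SATA errors or md kicking the disk out
dmesg | grep -i -E 'sde|md127'
# show the current state of the array and its members
mdadm --detail /dev/md127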

• The drive is brand-new. I can only imagine that the shutdown during the extension broke something :(

    I would like to run a long SMART test, but again no device is listed:



    Am I doing something wrong?


    EDIT: Solved it by

    sudo smartctl -t long /dev/sde


    It will be running for 10 hours...
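I'll keep an eye on it from the shell with something like this (progress shows up under "Self-test execution status", and the result lands in the self-test log once it's done):

Code
# check progress of the running test
sudo smartctl -c /dev/sde
# once finished, read the self-test log
sudo smartctl -l selftest /dev/sde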

• Official Post

    The drive is brand-new

Then you would not expect mdadm to mark it as failed, but then again nothing is guaranteed. If the SMART output shows nothing relevant, then I would suggest a wipe of the drive (Storage -> Disks -> Wipe) and run a secure wipe.
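For reference, the shell equivalent of clearing the old RAID signature before re-adding looks roughly like this (a sketch, and destructive, so be absolutely sure you target the right disk; it does not replace a full secure wipe):

Code
# remove the failed disk from the array first
mdadm --manage /dev/md127 --remove /dev/sde
# clear the old md superblock and any remaining signatures (destructive!)
mdadm --zero-superblock /dev/sde
wipefs -a /dev/sde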


If the drive again fails to be added to the array, then you're looking at hardware; alternatively, if you're able to connect it to a Windows machine, run the manufacturer's diagnostic tools.


To get an RMA on the drive, you need to ensure that it is the drive that is at fault and not something else.

• The test is still running, but out of curiosity I looked at sudo smartctl -a /dev/sde and found some errors in the log:



    It doesn't seem like these error codes (800-804) are the official ones from WD, but maybe someone knows how to interpret them...
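For anyone who wants to dig into them, the raw logs can also be pulled out on their own (the error log is included in the -a output I quoted above), roughly like this:

Code
# just the ATA error log
sudo smartctl -l error /dev/sde
# everything smartctl can report, including vendor-specific sections
sudo smartctl -x /dev/sde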

  • Skullchuck

Added the label "Solved".
