RAID5 - OS crash & HDD replacement

nomax · 24. August 2020

Hi guys, problem is the following - my NAS is a Dell T20 server running currently 5x 2TBs disks in RAID5. This has been set up a while back with OMV4, running on an USB stick. In the last week I noticed a couple of issues (mainly I couldn't log into the webinterface) so I just restarted the server. Noting that there seemed to be some consistency issues on the filesystem (it was booting default to initramfs which I was able to fix with fsck). Indepedently, I had a look into my webinterface once it was back up online and could see that there were some issues with one of the HDDs (red alarm signal in SMART - having some bad sectors) and the RAID was graded as 'clean, degraded' - files were still accessible.

So I did get a new HDD and wanted to replace it over the weekend - unfortunately there hit a power outage on Thursday/Friday and the system went off. Rebooting the machine the USB thumb drive seemed unreadable so I did set up a new OMV installation (v5) on a different thumb drive. After successfully setting it up I replaced the drive that was claimed to have bad sectors with a new one - unfortunately I do not see the RAID now in the webinterface. I thought software RAID would be hot swappable, especially because 4 drives are still operating fine, looks like that is now a bit of an issue...I have now put the old (bad sectors) HDD back in and seem to have issues to get the RAID back into active mode:

Code

root@openmediavault:~# cat /proc/mdstat                                         Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : inactive sdd[3] sda[0] sde[5] sdc[4]
      7813542240 blocks super 1.2

unused devices: <none>

This is the defective drive:

Code

root@openmediavault:~# mdadm -E /dev/sdb
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 7b1be9be:d75af1ec:637002de:39001c58
           Name : Nas:RAID5
  Creation Time : Sat Dec 20 15:15:32 2014
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 3906771120 (1862.89 GiB 2000.27 GB)
     Array Size : 7813533696 (7451.57 GiB 8001.06 GB)
  Used Dev Size : 3906766848 (1862.89 GiB 2000.26 GB)
    Data Offset : 258048 sectors
   Super Offset : 8 sectors
   Unused Space : before=257968 sectors, after=4272 sectors
          State : active
    Device UUID : 82f55d04:1e521e01:6af8d94b:6c2aa6cf

    Update Time : Wed May 27 10:34:45 2020
       Checksum : 534bf9e5 - correct
         Events : 12972

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)
root@openmediavault:~#

Alles anzeigen

And these are the ones that operate correctly:

Code

root@openmediavault:~# mdadm -E /dev/sda /dev/sdc /dev/sdd /dev/sde
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 7b1be9be:d75af1ec:637002de:39001c58
           Name : Nas:RAID5
  Creation Time : Sat Dec 20 15:15:32 2014
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 3906771120 (1862.89 GiB 2000.27 GB)
     Array Size : 7813533696 (7451.57 GiB 8001.06 GB)
  Used Dev Size : 3906766848 (1862.89 GiB 2000.26 GB)
    Data Offset : 258048 sectors
   Super Offset : 8 sectors
   Unused Space : before=257968 sectors, after=4272 sectors
          State : active
    Device UUID : c762fde1:9e8f0b91:cca4ab0c:6ae35729

    Update Time : Fri Aug 21 17:59:45 2020
       Checksum : dbcc9920 - correct
         Events : 63606

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : A.AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 7b1be9be:d75af1ec:637002de:39001c58
           Name : Nas:RAID5
  Creation Time : Sat Dec 20 15:15:32 2014
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 3906771120 (1862.89 GiB 2000.27 GB)
     Array Size : 7813533696 (7451.57 GiB 8001.06 GB)
  Used Dev Size : 3906766848 (1862.89 GiB 2000.26 GB)
    Data Offset : 258048 sectors
   Super Offset : 8 sectors
   Unused Space : before=257960 sectors, after=4272 sectors
          State : active
    Device UUID : e392ac97:159df386:b562983e:0ed8929a

    Update Time : Fri Aug 21 17:59:45 2020
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 2a9394ed - correct
         Events : 63606

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : A.AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 7b1be9be:d75af1ec:637002de:39001c58
           Name : Nas:RAID5
  Creation Time : Sat Dec 20 15:15:32 2014
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 3906771120 (1862.89 GiB 2000.27 GB)
     Array Size : 7813533696 (7451.57 GiB 8001.06 GB)
  Used Dev Size : 3906766848 (1862.89 GiB 2000.26 GB)
    Data Offset : 258048 sectors
   Super Offset : 8 sectors
   Unused Space : before=257968 sectors, after=4272 sectors
          State : active
    Device UUID : 9590adb7:f274c2eb:b2f9b504:1dae9a1a

    Update Time : Fri Aug 21 17:59:45 2020
       Checksum : f580cbde - correct
         Events : 63606

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : A.AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 7b1be9be:d75af1ec:637002de:39001c58
           Name : Nas:RAID5
  Creation Time : Sat Dec 20 15:15:32 2014
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 3906771120 (1862.89 GiB 2000.27 GB)
     Array Size : 7813533696 (7451.57 GiB 8001.06 GB)
  Used Dev Size : 3906766848 (1862.89 GiB 2000.26 GB)
    Data Offset : 258048 sectors
   Super Offset : 8 sectors
   Unused Space : before=257960 sectors, after=4272 sectors
          State : active
    Device UUID : 70f0181e:be31b75e:e78f498a:7379f630

    Update Time : Fri Aug 21 17:59:45 2020
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 6ad84a5a - correct
         Events : 63606

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 4
   Array State : A.AAA ('A' == active, '.' == missing, 'R' == replacing)

Alles anzeigen

Now I am feeling a bit hopeless - I am not sure how to interpret this?

Code

root@openmediavault:~# mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Sat Dec 20 15:15:32 2014
        Raid Level : raid5
     Used Dev Size : 1953383424 (1862.89 GiB 2000.26 GB)
      Raid Devices : 5
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Fri Aug 21 17:59:45 2020
             State : active, FAILED, Not Started
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : unknown

              Name : Nas:RAID5
              UUID : 7b1be9be:d75af1ec:637002de:39001c58
            Events : 63606

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       -       0        0        1      removed
       -       0        0        2      removed
       -       0        0        3      removed
       -       0        0        4      removed

       -       8       64        4      sync   /dev/sde
       -       8       32        2      sync   /dev/sdc
       -       8        0        0      sync   /dev/sda
       -       8       48        3      sync   /dev/sdd

Alles anzeigen

Though it looks like OMV can still see all HDDs and correctly considers them being in a RAID:

Code

root@openmediavault:~# blkid
/dev/sdf1: UUID="fe602a13-21c8-49b7-a9b0-de6c397bd6a9" TYPE="ext4" PARTUUID="ad5bf966-01"
/dev/sdf5: UUID="83f6ba28-0283-47b2-98a4-f832ad8486c1" TYPE="swap" PARTUUID="ad5bf966-05"
/dev/sde: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="70f0181e-be31-b75e-e78f-498a7379f630" LABEL="Nas:RAID5" TYPE="linux_raid_member"
/dev/sda: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="c762fde1-9e8f-0b91-cca4-ab0c6ae35729" LABEL="Nas:RAID5" TYPE="linux_raid_member"
/dev/sdc: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="e392ac97-159d-f386-b562-983e0ed8929a" LABEL="Nas:RAID5" TYPE="linux_raid_member"
/dev/sdd: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="9590adb7-f274-c2eb-b2f9-b5041dae9a1a" LABEL="Nas:RAID5" TYPE="linux_raid_member"
/dev/sdb: UUID="7b1be9be-d75a-f1ec-6370-02de39001c58" UUID_SUB="82f55d04-1e52-1e01-6af8-d94b6c2aa6cf" LABEL="Nas:RAID5" TYPE="linux_raid_member"

It says still active, but FAILED/not started - does anyone have an idea here how to circumvent this maybe?

Code

root@openmediavault:~# cat /sys/block/md127/md/array_state
inactive

Thanks in advance!!

geaves · 24. August 2020

Zitat von nomax

I thought software RAID would be hot swappable

I wished it was

The post is clear but also confusing, I'm guessing that the cat /proc/mdstat was when you added the new drive and the output from blkid is from when you put the failing drive back in.

To get the array active it needs to be stopped then reassembled you may also have to create an mdadm conf, but first things first;

mdadm --stop /dev/md127

mdadm --assemble --force --verbose /dev/md127 /dev/sd[abcde]

this usually works, but no guarantee

nomax · 24. August 2020

doesn't look like this did the trick - it considers them as 'busy'?

Code

root@openmediavault:~# mdadm --assemble --force --verbose /dev/md127 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm: looking for devices for /dev/md127
mdadm: /dev/sda is busy - skipping
mdadm: /dev/sdb is busy - skipping
mdadm: /dev/sdc is busy - skipping
mdadm: /dev/sdd is busy - skipping
mdadm: /dev/sde is busy - skipping
root@openmediavault:~#

geaves · 24. August 2020

Zitat von nomax

doesn't look like this did the trick - it considers them as 'busy

did you stop it first as per my post

nomax · 24. August 2020

Zitat von geaves

did you stop it first as per my post

Weird, I just tried it again, same sequence! Now it shows up in OMV - still as clean, degraded. I suspect I need to decouple the defective HDD now somehow?

Code

root@openmediavault:~# mdadm --stop /dev/md127
mdadm: stopped /dev/md127
root@openmediavault:~# mdadm --assemble --force --verbose /dev/md127 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm: looking for devices for /dev/md127
mdadm: /dev/sda is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdb is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdc is identified as a member of /dev/md127, slot 2.
mdadm: /dev/sdd is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sde is identified as a member of /dev/md127, slot 4.
mdadm: Marking array /dev/md127 as 'clean'
mdadm: added /dev/sdb to /dev/md127 as 1 (possibly out of date)
mdadm: added /dev/sdc to /dev/md127 as 2
mdadm: added /dev/sdd to /dev/md127 as 3
mdadm: added /dev/sde to /dev/md127 as 4
mdadm: added /dev/sda to /dev/md127 as 0
mdadm: /dev/md127 has been started with 4 drives (out of 5).

Alles anzeigen

geaves · 24. August 2020

If mdadm doesn't report the array as stopped the second command will report the array as busy

/dev/sdb is that the drive with the bad sectors

nomax · 24. August 2020

Zitat von geaves

If mdadm doesn't report the array as stopped the second command will report the array as busy

/dev/sdb is that the drive with the bad sectors

Yes it is. It is the one that is missing in the webinterface (under RAID management).

Can I just remove the faulty one, add the functioning one and start to grow the RAID/add to the array or is there anything else I need to do?

geaves · 24. August 2020

Zitat von nomax

Can I just remove the faulty one

No!! you have to tell mdadm what to do, this is the normal procedure,

mdadm /dev/md127 --fail /dev/sdb

mdadm /dev/md127 --remove /dev/sdb

you should then be able to physically remove the drive, install the new drive, then;

Storage -> Disks select the new drive, click wipe on the menu and select short.

Raid Management -> Select the raid, on the menu click recover, a dialog box should display the new drive, select it and click OK, the raid should now rebuild

nomax · 25. August 2020

Zitat von geaves

No!! you have to tell mdadm what to do, this is the normal procedure,

mdadm /dev/md127 --fail /dev/sdb
mdadm /dev/md127 --remove /dev/sdb

you should then be able to physically remove the drive, install the new drive, then;
Storage -> Disks select the new drive, click wipe on the menu and select short.
Raid Management -> Select the raid, on the menu click recover, a dialog box should display the new drive, select it and click OK, the raid should now rebuild

Alles anzeigen

Just checked - /dev/sdb seems to be already removed from the array, could it be that mdadm has done that just by itself? Would it be safe now to physically remove the faulty drive and connect the new one?

Code

root@openmediavault:~# mdadm /dev/md127 --fail /dev/sdb
mdadm: set device faulty failed for /dev/sdb:  No such device
root@openmediavault:~# mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Sat Dec 20 15:15:32 2014
        Raid Level : raid5
        Array Size : 7813533696 (7451.57 GiB 8001.06 GB)
     Used Dev Size : 1953383424 (1862.89 GiB 2000.26 GB)
      Raid Devices : 5
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Fri Aug 21 17:59:45 2020
             State : clean, degraded
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : resync

              Name : Nas:RAID5
              UUID : 7b1be9be:d75af1ec:637002de:39001c58
            Events : 63606

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       -       0        0        1      removed
       4       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd
       5       8       64        4      active sync   /dev/sde
root@openmediavault:~#

Alles anzeigen

geaves · 25. August 2020

Zitat von nomax

Just checked - /dev/sdb seems to be already removed from the array, could it be that mdadm has done that just by itself

Highly probable, the (possibly out of date) error is correctable, whilst mdadm knows about sdb it's effectively removed it due to the error, you should be good to carry on, replace the drive and rebuild.

nomax · 26. August 2020

Thanks, it worked well!! Recovery took a fair bit but now everything is again up and running :)!!

Really appreciate the help and support on here!

geaves · 26. August 2020

You're welcome

Jetzt mitmachen!