Cannot recover 3-disk RAID5 array with two working disks

  • Hello,


    I had a 3-disk RAID5 array. A few weeks ago one of the disks failed, and the array kept working happily in a degraded state with the two remaining disks.


    Today I received a replacement disk and powered off the system. On turning the system back on, my RAID array was missing from the UI. From the command line I can see the following...


    Bash
    $ cat /proc/mdstat
    Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
    md0 : inactive sdc[1](S) sdb[0](S)
    7813772976 blocks super 1.2
    
    
    unused devices: <none>


    The RAID level should be 5, not 0.
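
    For reference, a quick way to see what level is actually recorded is to query both the inactive array device and the members' own superblocks - a minimal sketch, assuming the array is /dev/md0 and the members are /dev/sdb and /dev/sdc as above:

    Bash
    # What the (inactive) array device currently reports
    $ sudo mdadm --detail /dev/md0
    # What each member's superblock says the array should be
    $ sudo mdadm --examine /dev/sdb /dev/sdc | grep -E 'Raid Level|Device Role|Array State'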


    Both disks seem okay, but for some reason sdc is marked as a spare.



    I have tried to reassemble the array, but no joy:

    Bash
    $ sudo /sbin/mdadm /dev/md0 --assemble /dev/sd[bc]
    mdadm: /dev/sdb is busy - skipping
    mdadm: /dev/sdc is busy - skipping
    $ sudo /sbin/mdadm --stop /dev/md0
    mdadm: stopped /dev/md0
    $ sudo /sbin/mdadm /dev/md0 --assemble --force /dev/sd[bc]
    mdadm: /dev/md0 assembled from 1 drive and 1 spare - not enough to start the array.


    I am assuming the issue is either the array being incorrectly reported as RAID 0, or sdc being marked as a spare instead of active - and I cannot find any way to resolve either.


    I'm not sure how best to proceed. I'm at the point of making a fresh array and recovering from backups - but since mdadm --examine /dev/sd[cb] seems to report that both disks are fine, I feel I should be able to recover the array in a degraded state and then add the replacement drive (sdd).
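
    For anyone comparing notes, the --examine fields worth checking before forcing anything are the event counts and device roles - a minimal sketch:

    Bash
    # Members with matching Events and an "Active device" role can usually be force-assembled;
    # a member whose Device Role reads "spare" is the odd one out here
    $ sudo mdadm --examine /dev/sd[bc] | grep -E 'Update Time|Events|Device Role'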


    Any guidance would be much appreciated, thank you!

  • chente

    Approved the thread.
  • I was able to bring back the array with the following commands:

    Bash
    $ sudo /sbin/mdadm --create --level=5 --layout=ls --chunk=512 --raid-devices=3 --assume-clean /dev/md0 /dev/sdb /dev/sdc missing
    mdadm: /dev/sdb appears to be part of a raid array:
           level=raid5 devices=3 ctime=Thu Sep 30 21:10:09 2021
    mdadm: /dev/sdc appears to be part of a raid array:
           level=raid5 devices=3 ctime=Thu Sep 30 21:10:09 2021
    Continue creating array? y
    mdadm: Defaulting to version 1.2 metadata
    mdadm: array /dev/md0 started.

    Data was also recovered. I got some direction from the following post; while the first answer did not work for me, the second answer gave details on how to create a new array while trying to keep the data, and the output from mdadm --examine /dev/sd[cb] gave me the details I needed.
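
    For reference, the parameters used in the --create above (level, layout, chunk size and device order) can be read from the existing superblocks before re-creating anything - a minimal sketch, assuming the members are /dev/sdb and /dev/sdc:

    Bash
    # Dump the superblock fields needed to reconstruct the original --create definition;
    # "left-symmetric" corresponds to --layout=ls, and "Device Role" gives the device order
    $ sudo mdadm --examine /dev/sdb /dev/sdc | grep -E 'Raid Level|Raid Devices|Layout|Chunk Size|Data Offset|Device Role'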


    I then re-added the new drive sdd via the UI and started recovering the array from its degraded state.
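
    For anyone doing this from the command line rather than the UI, a rough equivalent would be something like the following sketch (assuming the replacement is /dev/sdd and carries no old metadata):

    Bash
    # Add the replacement disk to the degraded array and watch the rebuild progress
    $ sudo mdadm /dev/md0 --add /dev/sdd
    $ watch cat /proc/mdstat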


    I think the issue was due to sdc being marked as a spare. I don't think my actions were the best, and there must be a less risky option, so I will leave this unresolved to see if anyone has better suggestions for next time.

    • Official post

    I think the issue was due to sdc being marked as a spare

    Yes it was. This has come up on here before with no clear reason why, and I've had no real-world experience of it.


    I don't think my actions were the best

    You were very, very lucky; the --assume-clean switch is an absolute last resort - it either works or destroys your data.
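
    One way to reduce that risk is to check the result read-only before anything is allowed to write to it - a sketch, assuming an ext4 filesystem sits directly on /dev/md0:

    Bash
    # After a --create --assume-clean, verify the data before any writes
    $ sudo fsck.ext4 -n /dev/md0        # -n: report problems only, never modify
    $ sudo mount -o ro /dev/md0 /mnt    # mount read-only and spot-check some files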

  • geaves Yes it was; the next step was to restore from backup, so I was happy to try.


    Are you aware of any way to change the drive from spare to active?


    I believe the trigger for the issue was a power disruption to this drive. When I added the new drive it was not detected, and I found that some loose power cables to both of these drives were causing the issue.

    • Official post

    Are you aware of any way to change the drive from spare to active

    No, none that I have found.

    I believe the trigger for the issue was a power disruption to this drive

    A lot of mdadm RAID issues appear to stem from power problems, either an actual power loss or a hardware power fault; something must trigger mdadm into marking a drive as a spare when in actual fact the drive was active before the power failure/disruption.


    What you didn't post, and what I forgot to ask for, was the mdadm conf file (cat /etc/mdadm/mdadm.conf); that file should contain information about the array's setup. Did the ARRAY line in that file, under the definitions of existing arrays, contain spares=1? If it did, simply editing that file (which is not recommended) may have corrected the error.
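
    For reference, a sketch of what that section of /etc/mdadm/mdadm.conf typically looks like, and how to regenerate the line from the running array rather than editing it by hand (the name and UUID below are placeholders):

    Bash
    # definitions of existing MD arrays - an unexpected spare shows up as spares=N on the ARRAY line
    # ARRAY /dev/md0 metadata=1.2 spares=1 name=myhost:0 UUID=00000000:00000000:00000000:00000000

    # Regenerate the ARRAY line from the assembled array
    $ sudo mdadm --detail --scan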


    I've never experienced this in a real-world scenario, because in commercial use we had a UPS + backups; at home I would run backups/rsync every night, so in a worst-case situation I would restore from backup or even redeploy, as OMV can be set up and running in a day - it's the restore process that takes the time.
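
    A minimal sketch of that kind of nightly rsync job, with placeholder paths and schedule:

    Bash
    # /etc/cron.d/nightly-backup - mirror the data share at 02:30 every night (paths are examples)
    # 30 2 * * * root rsync -a --delete /srv/data/ /srv/backup/data/

    # The same command run by hand, with a dry run first to see what would change
    $ sudo rsync -a --delete --dry-run /srv/data/ /srv/backup/data/
    $ sudo rsync -a --delete /srv/data/ /srv/backup/data/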


    BTW you don't have to use /sbin to run mdadm commands :)

  • Quote

    Did the ARRAY line in that file, under the definitions of existing arrays, contain spares=1? If it did, simply editing that file (which is not recommended) may have corrected the error.

    I did check this file and remember it all seemed fine; I cannot remember whether it noted there was a spare or not. Thanks for this pointer.


    Quote

    because in commercial use we had a UPS + backups; at home I would run backups/rsync every night, so in a worst-case situation I would restore from backup or even redeploy

    My only trouble is that, to save on backup costs, I did not back up my media (TV, film) collection, as it was not important. Backups are nightly and I run on a UPS; my mistake was poking around inside the case while the server was powered on! :whistling:


    Quote

    BTW you don't have to use /sbin to run mdadm commands :)

    For some reason I have to reference the executable directly on my machine, no idea why; it is easier to do that than to find out why I can't call the program directly in my session ;)
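
    Most likely /sbin and /usr/sbin are simply not on the non-root user's PATH; a quick way to check, and one possible fix (a sketch - adjust the profile file to whatever your shell actually reads):

    Bash
    # See whether the shell can find mdadm and what the PATH contains
    $ echo "$PATH"
    $ command -v mdadm || echo "mdadm not found in PATH"

    # If it is missing, append the sbin directories for future sessions (here via ~/.profile)
    $ echo 'export PATH="$PATH:/usr/sbin:/sbin"' >> ~/.profile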


    Thanks for the feedback; it's a shame mdadm has no way to update the device role.
