RAID 5 Grow Failure - How to Stop It?

  • Hi.


    I have searched the forum and been scouring the internet for weeks and am stuck.


    I am trying to grow my RAID 5 from 7 to 8 drives. The reshape has never progressed past 0.0% complete and is running at 0K/second. The file system has not been mounted during this entire time. The 7-drive array was working great before I tried to add another drive; I was just getting close to running out of space and wanted to grow it.


    I need the commands to stop the grow/reshape and just put the array back to 7 drives. I have rebooted the server several times and it always comes back in this state.


    Drive /dev/sdi is the newest drive.


    Every time, the process gets stuck with a random "### blocked for more than ### seconds" kernel message. It has been sitting for 2 weeks now and nothing has changed.
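    In case it helps with diagnosis, the full messages can be pulled back out of the kernel log; this is just grepping for the kernel's hung-task warning:

    Code
    dmesg -T | grep -i "blocked for more than"
    journalctl -k | grep -i "blocked for more than"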


    Yes, the md0/raid5 process is eating up 100% of my CPU.


    Please help... I have tried my best and just want to get my array back online with the data that is on the drives.


    Thanks in advance if you can help me.


    OMV 6

    2008 MacPro Cheese Grater chassis


    Code
    root@doghouse:~# cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
    md0 : active raid5 sdg[1] sdf[7] sda[4] sdh[2] sdb[8] sdc[3] sdd[6] sdi[5]
          82033511424 blocks super 1.2 level 5, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
          [>....................]  reshape =  0.0% (58364/13672251904) finish=10209618072.6min speed=0K/sec
          bitmap: 13/102 pages [52KB], 65536KB chunk
    
    unused devices: <none>


  • OK so I didn't want to answer earlier because there could be a bunch of reasons... but let me tell you what I would do in this situation:


    1. Make absolutely sure you have a backup; RAID operations are always dangerous. But in a case like this, where something clearly went wrong, this is even more important.


    If you don't have a backup already, make one now. You said that you unmounted the filesystem; this is actually not necessary for RAID operations. So in case you don't have a backup, I would mount the file system and make a backup. (Yes, mount it in the "reshaping" state that it is currently in.)
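    If mounting works, the backup could look something like this. (A sketch only; I am assuming the filesystem sits directly on /dev/md0 and that /mnt/backup is wherever your backup target lives, so adapt both paths.)

    Code
    mkdir -p /mnt/md0
    mount -o ro /dev/md0 /mnt/md0      # read-only, to be extra careful while the reshape is stuck
    rsync -aHAX --progress /mnt/md0/ /mnt/backup/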


    2. Now you said it got stuck on the reshape, so I understand that you did add the drive to the RAID array but did not yet use the

    • mdadm --grow /dev/md0 --array-size []

    command, or used the "grow" button in OMV.

    Please tell me if this assessment is wrong.


    3. The next step I would take is to try to remove the drive you added earlier to the array again.


    In the web interface the action would be: Storage -> Software RAID -> select the array in question -> hit "Remove" at the top -> select the appropriate drive -> confirm.


    If this is locked (greyed out), maybe because a reshape is in progress, try to do it over the CLI. The commands would be:

    • mdadm --manage /dev/mdX --fail /dev/sdX
    • mdadm --manage /dev/mdX --remove /dev/sdX

    If any of the commands fail, maybe try with a -f at the end to force it (perhaps because a reshape is happening right now).


    Note that this is the most dangerous part, as I don't know what state the array is in!!! There was clearly some sort of error, but I don't know what. Usually I would never do something like this during a reshape, but if you have already tried everything else, maybe this is the only way to go. Make sure you have a backup, as I don't know what will happen to the array.

    (This is also why I didn't want to give an answer at first, because I didn't want to be the guy that gave you instructions to brick your RAID array...)


    4. Now check the state of the RAID array; I think it should report:

    • Raid devices = 8
    • Total devices = 7

    If any sort of RAID process (like a recovery or reshape) started automatically, wait for this process to finish; with this many drives this might take a loooong time (also depending on the size).
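    A quick way to check those counts and whether anything is still running (using the /dev/md0 from your output):

    Code
    mdadm --detail /dev/md0 | grep -E 'Raid Devices|Total Devices|State'
    cat /proc/mdstat    # shows any running recovery/reshape and its progress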


    5. Now you want to set the raid devices back to 7; the command is:

    • mdadm --grow /dev/mdX --raid-devices=7

    Wait for any sort of RAID operations (like a recovery or a reshape) to finish.
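    To monitor that without retyping, something like this works (watch simply re-runs the command every few seconds; mdadm --wait blocks until the activity is done):

    Code
    watch -n 10 cat /proc/mdstat
    mdadm --wait /dev/md0    # alternatively, block until any recovery/reshape finishes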


    5.1 Considering that you used the OMV "grow" button, the array might already have been grown (even though it doesn't look like it from the status).

    If there is an error of the sort "this change will reduce the size of the array. use --grow --array-size first to truncate array. e.g. mdadm --grow /dev/md0 --array-size 3906767872", use the:

    • mdadm --grow /dev/md0 --array-size 3906767872

    command. Just use the size that mdadm suggested in its output.

    After doing that,

    • mdadm --grow /dev/mdX --raid-devices=7

    should just work.
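    Putting 5 and 5.1 together, the whole sequence would look roughly like this. (The array-size number here is just the one from the example error above; use whatever size mdadm actually prints for your array.)

    Code
    mdadm --grow /dev/md0 --array-size 3906767872   # only if mdadm asked for it, with the size it printed
    mdadm --grow /dev/md0 --raid-devices=7
    cat /proc/mdstat                                # then wait for the reshape back to 7 drives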


    6. Secure erase your drive. I have heard that drives that were once in a RAID array and are used in a RAID array again at a later time "could" cause catastrophic errors if not secure erased.
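    A full secure erase takes very long on drives this size; as far as I know, what usually matters to mdadm is that the old RAID metadata is gone, which a wipe of the signatures also achieves. (A sketch; this is destructive, so triple-check the device letter.)

    Code
    mdadm --zero-superblock /dev/sdi   # remove the old md metadata (destructive!)
    wipefs -a /dev/sdi                 # remove any other leftover filesystem/RAID signatures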


    7. Try to add the same drive again.
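    That would again be the usual add + grow pair, assuming the drive is still /dev/sdi after the wipe:

    Code
    mdadm --manage /dev/md0 --add /dev/sdi
    mdadm --grow /dev/md0 --raid-devices=8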



    If anything doesn't work as planned I am happy to help, but I can't guarantee anything...

    Edited once, most recently by elio_ (). Reason: Added step 5.1

  • ....thanks so much for at least a place to start.

    Quote

    command, or used the "grow" button in OMV.

    You are correct, I pressed the grow button in OMV.


    Quote

    (This is also why I didn't want to give an answer at first, because I didn't want to be the guy that gave you instructions to brick your RAID array...)


    At this point my server is sitting there not getting used, because the array reshape has locked the whole thing up completely. So I either need to start over or something. Two of my drives have warnings, but not complete failure. So if I do get it working again, I will probably swap those out one at a time and then try to grow it again... that is a big if.


    Thanks to anyone who takes the time to help out. This is a topic I haven't been able to track down anywhere. Sometimes I feel like I know what I am doing and sometimes I don't...

  • Hmm, so while I don't know what kind of warnings the drives have, there might absolutely be a correlation with the stuck reshape... which would be... not great. Especially with a RAID 5 array, which is not recommended at all for this many drives anyway.


    1. Do you have a backup? If so, restoring could be a faster (due to the unbearably long rebuild operations of RAID) and also less stressful option, considering that I don't know what the odds of success are in your case. (But then again, if you already have a backup, you might as well try.)

    Quote

    At this point my server is sitting there not getting used, because the array reshape has locked the whole thing up completely.

    2. In case you don't have a backup:

    So do you mean the RAID operation (the reshape, in this case) is locked up, or can you not mount and use the array at all?

    If you can mount the array and use it you should absolutely try to make a backup (at least of the important stuff).

    I know that you might not have the storage necessary, considering you have an array of over 80TB, but in this situation I don't know if fixing the array is going to work out, especially with 2 drives having some sort of errors, so a backup is crucial.


    3. Concerning your drives with errors: I know this doesn't help, but statistically drives (especially ones that are already failing) fail quite often during RAID operations due to the high load on them.


    4. You said that the array is "locked up". Well, there isn't per se a command to stop the reshape, just like there isn't a command to start it. The reshape just starts after adding the drive, and the only way out I can think of is to undo those steps and hope the array doesn't get bricked by removing the drive or by subsequent drive failures, as the array will certainly be in a "degraded" state.


    5. Once you have a backup, I would start working through my steps above.


    I also made an edit to the post above (step 5.1).


    Maybe some other people also have ideas about what to do...

    • Official Post

    SSH into OMV as root and try this command: echo max > /sys/block/md0/md/sync_max
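    For context, a sketch of what that knob looks like before and after (md0 as in your mdstat):

    Code
    cat /sys/block/md0/md/sync_max     # if this shows a sector count instead of "max", the reshape window is capped
    echo max > /sys/block/md0/md/sync_max
    cat /proc/mdstat                   # the reshape should start moving again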


    But in all honesty, 8 drives in a RAID 5 is suicidal. You know 2 drives have bad sectors and have failed to replace them; there's only one way this can go, and that's tits up!!


    The fact that the rebuild appears to have halted suggests it could be 'stuck' on a bad sector of one of those drives.
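    If you want to check that theory, smartctl from smartmontools will show pending/reallocated sectors; run it against each member drive (sdX is a placeholder):

    Code
    smartctl -a /dev/sdX | grep -Ei 'reallocated|pending|uncorrect'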

    Raid is not a backup! Would you go skydiving without a parachute?


    OMV 6x amd64 running on an HP N54L Microserver

  • Quote

    ssh into omv as root and try this command echo max > /sys/block/md0/md/sync_max


    I'm not sure why this worked, but it saved my hide...


    I had tried to change the bitmap to none after using the grow command, like a big dumb-dumb. Your command got things running again.
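    For anyone who finds this later: the bitmap change I attempted was along these lines (mdadm's grow mode manages the bitmap):

    Code
    mdadm --grow /dev/md0 --bitmap=none       # what I had run; drops the write-intent bitmap
    mdadm --grow /dev/md0 --bitmap=internal   # puts it back once the array is healthy again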
