PVE Kernel nuked RAID

  • Lo


    Due to a powercut my system rebooted.

    When it did, I had no Docker containers, and when I checked, my OMV share drive was missing.


    I have rebooted back to 5.4.106-1-pve (#1 SMP PVE 5.4.106-1) and it works. I picked that as a panic default, but I can see in the logs that I was running 5.4.114-1 before.


    However, under 5.4.119-1-pve, when I boot, it fails to mount the OMV data partition.


    mdadm --assemble --scan then reports /dev/md0 is already in use


    Clues? (also to warn others!)

  • I would guess it isn't the kernel version causing this. Probably a bad initramfs. Fix with sudo update-initramfs -k all -u
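
    If you rebuild it, you can sanity-check that the new image actually carries the md bits with something like this (lsinitramfs comes with initramfs-tools; adjust the kernel version to the one you boot):

    Code
    # list the mdadm pieces inside a given initramfs image
    lsinitramfs /boot/initrd.img-5.4.119-1-pve | grep -E 'mdadm|md_mod|raid'
    # the mdadm.conf copy, the mdadm binary and the raid modules should show up here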


    And if you are running a raid array, you really should have battery backup.

    omv 5.6.13 usul | 64 bit | 5.11 proxmox kernel | omvextrasorg 5.6.2 | kvm plugin 5.1.6
    omv-extras.org plugins source code and issue tracker - github



  • It's softraid, and I have a UPS

    But from what I can see in the default nut config, the shutdown command in /etc/nut/upsmon.conf was /sbin/shutdown

    and shutdown ACTUALLY lives in /usr/sbin/shutdown

    If that's the case, that could affect all OMV users? :)


    So it didn't shut down...
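
    For anyone who wants to check their own box, something along these lines should show what upsmon is configured to run and whether that path actually exists (SHUTDOWNCMD is the relevant upsmon.conf directive):

    Code
    # show the configured shutdown command in NUT's upsmon.conf
    grep -i '^SHUTDOWNCMD' /etc/nut/upsmon.conf
    # see whether both paths resolve on this system
    ls -l /sbin/shutdown /usr/sbin/shutdown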


    Should I try the sudo update-initramfs -k all -u under the new (failed) kernel or from the working one (and then reboot)?

  • It's softraid,

    raid is raid and needs battery backup regardless of type whenever data is written to more than one drive. Good that you have a UPS.


    But from what I can see in default nut, the default command to shutdown was /sbin/shutdown

    and shutdown ACTUALLY lives in /usr/sbin/shutdown

    /sbin is a symlink to /usr/sbin on Debian (https://wiki.debian.org/UsrMerge). If your system is not like this, then you need to install usrmerge.
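
    A quick way to check your own box (plain coreutils, nothing OMV specific) is something like:

    Code
    # on a merged system /sbin shows up as a symlink into /usr
    ls -l / | grep -w sbin

    On a merged system you get output along these lines: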


    lrwxrwxrwx 1 root root 8 Jan 13 2020 sbin -> usr/sbin


    If that's the case, that could affect all OMV users?

    If they have an unmerged system, yes. But I thought the nut plugin didn't even install on an unmerged system - I haven't tried lately, though.


    Should I try the sudo update-initramfs -k all -u under the new (failed) kernel or from the working one (and then reboot)?

    Doesn't matter. The -k all parameter means build a new initramfs for all kernels installed on the system.
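
    If you want to see roughly what "-k all" will cover, listing what's in /boot shows the installed kernels and their current initramfs images, e.g.:

    Code
    # kernels and initramfs images currently installed
    ls -1 /boot/vmlinuz-* /boot/initrd.img-*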

    omv 5.6.13 usul | 64 bit | 5.11 proxmox kernel | omvextrasorg 5.6.2 | kvm plugin 5.1.6
    omv-extras.org plugins source code and issue tracker - github


    Please read this before posting a question.
    Please don't PM for support... Too many PMs!

  • I will try the initramfs - although the kernel updates have all been done via the OMV GUI to date.

    As it should, but a power cut can corrupt files.


    Interestingly, I've also noticed that my "many kernels" have been reduced back to 5 (I had another thread about them racking up)

    So something in the OMV-Extras/reboot IS tidying them up....

    That is all apt. omv-extras and reboot won't change that. 5 is still too many unless you have the 5.4 and 5.11 kernels installed.
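
    If you want to see exactly which pve kernels apt still has installed (and let it remove the ones it no longer considers needed), the usual approach is something like:

    Code
    # list installed pve kernel packages
    dpkg -l 'pve-kernel-*' | grep '^ii'
    # let apt clear out kernels it considers obsolete
    sudo apt autoremove --purge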


  • All I have (after initramfs) is:


    Code
    update-initramfs: Generating /boot/initrd.img-5.4.119-1-pve
    update-initramfs: Generating /boot/initrd.img-5.4.114-1-pve
    update-initramfs: Generating /boot/initrd.img-5.4.106-1-pve
    update-initramfs: Generating /boot/initrd.img-5.4.103-1-pve
    update-initramfs: Generating /boot/initrd.img-5.4.101-1-pve

  • Perfect. Does rebooting into the 119 kernel work now?


  • Right, major issue now

    I could previously reboot to an old version. Now that I've done the update-initramfs -k all -u thing (as suggested), NONE of the kernels load my OMV data drive.... :\


    If I try mdadm --assemble --scan I get

    Code
    mdadm: Fail create md0 when using /sys/module/md_mod/parameters/new_array
    mdadm: /dev/md0 is already in use


    If I cat /sys/module/md_mod/parameters/new_array I get permission denied

    it is u+w only (not readable)

    If I set u+r on new_array and cat it, I get "Operation not permitted" (and I then set u-r on it again)
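
    A quick way to see what is already claiming md0 before trying to reassemble is something like:

    Code
    # arrays the kernel has already started (including inactive ones)
    cat /proc/mdstat
    # what mdadm thinks /dev/md0 currently is
    mdadm --detail /dev/md0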


    Checking....

  • Small update - /etc/mdadm/mdadm.conf listed the arrays thus:


    Code
    ARRAY /dev/md/debian:0 metadata=1.2 name=debian:0 UUID=90dece3a:d04b3040:2c4fa91c:9c57ccb2
    ARRAY /dev/md/debian:1 metadata=1.2 name=debian:1 UUID=e4673560:6b1f8d7a:be5a9a98:fcb7ccc9
    ARRAY /dev/md0 metadata=1.2 name=OMVDataRAID UUID=bbfef1bf:cd1f8aab:2421bb14:c4cc3028


    Given /dev/md/debian:0 symlinks to /dev/md0 - the last line is never going to work....
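
    You can see the name clash with something like this - the entries under /dev/md/ are just symlinks to the real md devices:

    Code
    # show which md device each named symlink points at
    ls -l /dev/md/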


    So, I changed the last line to:


    Code
    ARRAY /dev/md127 metadata=1.2 name=OMVDataRAID UUID=bbfef1bf:cd1f8aab:2421bb14:c4cc3028


    and mdadm --assemble --scan worked

    I could then mount /dev/md127 and things reappeared.

    Trying reboot

  • I can now boot. However, I still see an error on boot about /dev/md0 being in use (and /sys/module/md_mod/parameters/new_array), but it now boots AND mounts my data volume. I suspect this is due to my kludge rather than a "proper" fix, so I don't deem this stable (or at least not without an explanation).


    I'm rather sceptical about what has changed where :\

    Code
    cat messages.1 | grep "Jun 11" | grep pve
    Jun 11 07:48:51 openmediavault cron-apt: pve-headers-5.4.119-1-pve pve-kernel-5.4.119-1-pve
    Jun 11 07:48:51 openmediavault cron-apt: libwebp6 libwebpmux3 pve-headers-5.4 pve-kernel-5.4


    So I think apt or the initramfs rebuild (after the pve kernel download - the last was Jun 11th...) did something


    Forcing the initramfs rebuild per the suggestion broke the older installed kernels too, so I couldn't (easily) recover - until I did the /etc/mdadm/mdadm.conf fix, which is manual and a bit worrying, given the file states (at the top):

    Code
    # This file is auto-generated by openmediavault (https://www.openmediavault.org)
    # WARNING: Do not edit this file, your changes will get lost.


    So what broke my /etc/mdadm/mdadm.conf file? How do I get it back? From what I can see in the OMV saltstack files, all OMV is doing is:

    Code
    mdadm_save_config:
      cmd.run:
        - name: "mdadm --detail --scan >> /etc/mdadm/mdadm.conf"
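
    You can see what that command would append by running it without the redirect:

    Code
    # print the ARRAY lines mdadm generates from the currently running arrays
    mdadm --detail --scan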

    So what caused it to default back to /dev/md0 when I clearly already have a /dev/md0 - given that /dev/md/debian:0 is a symlink to /dev/md0?

    Should I change /etc/mdadm/mdadm.conf to be:

    Code
    ARRAY /dev/md/debian:0 metadata=1.2 name=debian:0 UUID=90dece3a:d04b3040:2c4fa91c:9c57ccb2
    ARRAY /dev/md/debian:1 metadata=1.2 name=debian:1 UUID=e4673560:6b1f8d7a:be5a9a98:fcb7ccc9
    ARRAY /dev/md/OMVDataRAID metadata=1.2 name=OMVDataRAID UUID=bbfef1bf:cd1f8aab:2421bb14:c4cc3028

    ...as I can't see what is clashing with /dev/md0... :\


    If I (or OMV) do a --detail --scan now, the same output will end up back in the file, but I can't see why it changed in the first place.


    Clues?

    Do I need to redo the update-initramfs -k all -u again?

  • No need to keep trying to fix the old kernels. I would:

    sudo omv-salt deploy run mdadm

    sudo update-initramfs -u
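
    After that, a quick sanity check before rebooting might look like this - the regenerated config and the running array should both reference your data RAID:

    Code
    # confirm the regenerated config
    cat /etc/mdadm/mdadm.conf
    # confirm the array is assembled and healthy
    cat /proc/mdstat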

