Beiträge von SignedOne

SignedOne · 28. Oktober 2023

Zitat von ryecoaaron

Hard to say. Most Linux drivers are written generically and some motherboards have buggy implementations of hardware.

That makes sense, but apart from a recent reinstall and the switch from an internal NVME SSD to a SATA SSD nothing regarding my setup really changed. Same software, same architecture (USB HDDs, mergerFS and SnapRAID on top of LUKS encrypted drives), same cables, same powersupplies, same HDDs.

I didn't monitor the log before the reinstall and even on the new setup, they system at least didn't lock up the drives, this all manifested in the last two weeks.

Currently though I'm happy to report that the system is stable. The resets and I/O errors get logged frequently and are somewhat concerning, but the drives and LUKS volumes stay mounted despite some heavy-ish IO activity in the last 24 hours.

I might try to go back to a 6.x kernel later on, since the usbcore.quirks=VID:PID:k hack seems to have been the key, apart from the CMCI storms that for some reason disappeared on the 5.15 kernel or at least don't get reported anymore.

SignedOne · 27. Oktober 2023

Switched to proxmox 5.15; thank you for your continued support, despite this being a generic linux issue. Is it alright if I keep updating this thread or would you prefer me to post somewhere else, since this isn't really an OMV issue?

That alone didn't help, but in my search for the root of these issues, I disabled USB autosuspend via kernel parameters in /etc/default/grub.

Now I get these messages consistently when a reset and I/O error happens:

Code

Okt 27 21:33:51 OMV kernel: usb 2-5: Disable of device-initiated U1 failed.
Okt 27 21:33:56 OMV kernel: usb 2-5: Disable of device-initiated U2 failed.
Okt 27 21:33:56 OMV kernel: usb 2-5: reset SuperSpeed USB device number 4 using xhci_hcd
Okt 27 21:33:56 OMV kernel: sd 4:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=10s
Okt 27 21:33:56 OMV kernel: sd 4:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 06 31 aa 57 f0 00 00 08 00 00 00
Okt 27 21:33:56 OMV kernel: blk_update_request: I/O error, dev sdc, sector 26603050992 op 0x0:(READ) flags 0x80700 phys_seg 78 prio class 0

It looks like the USB HDDs ignore the autosuspend setting, and this may either be the cause or a symptom of the issue. I found a resource suggesting adding "usbcore.quirks=VID:PID:k" as a kernel parameter to try to force the disabling of autosuspend, I'll try that next.

_______________________________________

The resets and I/O errors still happen after adding "usbcore.quirks=VID:PID:k" for all USB HDDs to the kernel parameters, but the operations on the disk don't get interrupted and the disks don't get unmounted. SMART values all seem to be ok and the system seems to be kind of stable now at least. Not ideal in any way and I'm still confused on what brought these issues on in the first place.

SignedOne · 27. Oktober 2023

Unfortunately, the issue persists even on the 6.2 proxmox kernel. I don't know if it's simply bad luck or it somehow corrected itself before, but now even "regular" high I/O, e.g. writing a backup to a USB HDD, results in resets, I/O errors and the vanishing of all USB HDDs.

SignedOne · 27. Oktober 2023

Zitat von ryecoaaron

I don't think there is anything I can do from the snapraid or luks plugin side to help. If usb is disconnecting due to activity, that feels like a hardware issue and/or driver bug in the kernel. I could be wrong but I don't have any other ideas.

Oh I didn't think so, I meant that I do hope that another kernel somehow can mitigate those issues that seem to be a hardware issue. This is very clearly not an issue that is caused by the LUKS or SnapRAID plugin or OMV in general.

SignedOne · 27. Oktober 2023

Zitat von ryecoaaron

Seems like a motherboard issue causing a usb disconnect that locks the drive. Snapraid is probably just causing lots of activity that is causing the disconnect. All I can think is you should try a different kernel like the proxmox 6.2 or 5.19 kernel.

Thank you for the info, I really do hope that this is fixable or at least mitigable via the software route. I'll "abuse" the SnapRAID check for increased I/O for a while longer and monitor the log, so far the CMCI storms, USB resets and I/O errors don't cause the disks to show up as disconnected and the LUKS volumes are still mounted.

If that happens again, I'll try upgrading to the proxmox kernel, retest and report back.

SignedOne · 27. Oktober 2023

Update 1: Connected all 3 USB HDD through an (unpowered) USB HUB to one USB port on the PC, when running SnapRAID check, I got I/O errors and all 3 USB HDDs vanished from the disk page in the OMV webGUI. This showed up in the journal:

kernel.txt

This is causing me serious worry:

Code

Okt 27 12:23:31 OMV kernel: hub 2-4:1.0: over-current condition

Connecting all 3 USB drives to the other 3 ports and retesting, maybe the port in question is simply defective.

With all 3 drives connected to different ports (except the one where I got the over-current condition on) SnapRAID check is currently running for a lot longer than previously, though I still get much more infrequent errors:

Code

Okt 27 12:48:07 OMV kernel: mce: CMCI storm detected: switching to poll mode
Okt 27 12:50:25 OMV kernel: usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
Okt 27 12:50:25 OMV kernel: sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
Okt 27 12:50:25 OMV kernel: sd 2:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 00 d1 dd 58 00 00 00 02 00 00 00
Okt 27 12:50:25 OMV kernel: I/O error, dev sdc, sector 3520944128 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
Okt 27 12:54:26 OMV kernel: usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
Okt 27 12:54:26 OMV kernel: sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
Okt 27 12:54:26 OMV kernel: sd 2:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 00 b7 4d 0a 00 00 00 02 00 00 00
Okt 27 12:54:26 OMV kernel: I/O error, dev sdc, sector 3075279360 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
Okt 27 12:54:55 OMV kernel: perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500

_______________________________________________________________________________________________________________________

The journey continues.

Now I got an USB reset and an I/O error on a different disk. Which makes me think that the issue isn't the one drive, but simply was a symptom due to that specific drive getting a huge I/O spike before the other drives when SnapRAID runs a check:

Code

Okt 27 13:09:55 OMV kernel: mce: CMCI storm subsided: switching to interrupt mode
Okt 27 13:11:42 OMV kernel: usb 2-5: reset SuperSpeed USB device number 4 using xhci_hcd
Okt 27 13:11:43 OMV kernel: sd 4:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
Okt 27 13:11:43 OMV kernel: sd 4:0:0:0: [sdd] tag#0 CDB: Read(16) 88 00 00 00 00 00 1c 36 2e 00 00 00 02 00 00 00
Okt 27 13:11:43 OMV kernel: I/O error, dev sdd, sector 473312768 op 0x0:(READ) flags 0x80700 phys_seg 50 prio class 2

I was thinking that maybe the storage driver, in particular UAS, plays a role here, but this is my current USB HDD setup via lsusb -t:

Code

/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/7p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=usb-storage, 5000M
    |__ Port 2: Dev 3, If 0, Class=Mass Storage, Driver=uas, 5000M
    |__ Port 5: Dev 4, If 0, Class=Mass Storage, Driver=usb-storage, 5000M

Ironically, the only drive not throwing any errors (so far, that is) is the one with the UAS driver.

SignedOne · 27. Oktober 2023

Apparently whenever my weekly SnapRAID sync job configured via the SnapRAID plugin runs, it causes one LUKS volume on a USB HDD to unmount.

The LUKS volume in question (A) is part of a mergerFS pool with another, smaller drive (B). SnapRAID is set up to sync both data drives (A & B) to a third parity drive (C).

I found this thread describing pretty much the same issue with no apparent solution. According to info in this thread, the issue might be that the drive goes to sleep or gets disconnected because of disk I/O from the SnapRAID sync. But since the drive seems to work reliably in regular operation (frequent disk I/O), this doesn't seem very plausible to me.

Similar to their issue, I consistently get the software issue page from OMV after manually reloading the mergerFS pool after manually unlocking the LUKS volumes after a reboot; however, after simply reloading the page the pool works as expected and shows up correctly in the OMV webGUI.

However, I found these log entries:

journalctl -o short-precise -k

Code

Okt 23 13:05:35.305443 OMV kernel: usb 2-4: reset SuperSpeed USB device number 2 using xhci_hcd
Okt 23 13:05:35.329429 OMV kernel: sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
Okt 23 13:05:35.330472 OMV kernel: sd 2:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 02 78 75 65 10 00 00 00 88 00 00
Okt 23 13:05:35.330904 OMV kernel: I/O error, dev sdc, sector 10610894096 op 0x0:(READ) flags 0x80700 phys_seg 17 prio class 2
...
Okt 25 10:17:41.647963 OMV kernel: I/O error, dev sdc, sector 25959428936 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
Okt 25 10:18:15.905742 OMV kernel: usb 2-4: reset SuperSpeed USB device number 2 using xhci_hcd
Okt 25 10:18:15.925459 OMV kernel: sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
Okt 25 10:18:15.926522 OMV kernel: sd 2:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 06 0b 53 55 40 00 00 01 00 00 00
Okt 25 10:18:15.927299 OMV kernel: I/O error, dev sdc, sector 25959814464 op 0x0:(READ) flags 0x80700 phys_seg 18 prio class 2
...
Okt 26 11:14:36.112416 OMV kernel: I/O error, dev sdc, sector 26601962464 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
Okt 27 03:06:34.385435 OMV kernel: mce: CMCI storm detected: switching to poll mode
Okt 27 03:11:35.001719 OMV kernel: mce: CMCI storm subsided: switching to interrupt mode
Okt 27 04:15:55.081580 OMV kernel: usb 2-4: USB disconnect, device number 2
Okt 27 04:15:55.389821 OMV kernel: usb 2-4: new SuperSpeed USB device number 5 using xhci_hcd
Okt 27 04:15:55.409543 OMV kernel: usb 2-4: New USB device found, idVendor=1058, idProduct=25a3, bcdDevice=10.29
Okt 27 04:15:55.411286 OMV kernel: usb 2-4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Okt 27 04:15:55.412229 OMV kernel: usb 2-4: Product: Elements 25A3
Okt 27 04:15:55.413066 OMV kernel: usb 2-4: Manufacturer: Western Digital
Okt 27 04:15:55.413963 OMV kernel: usb 2-4: SerialNumber: [redacted]
Okt 27 04:15:55.414803 OMV kernel: usb-storage 2-4:1.0: USB Mass Storage device detected
Okt 27 04:15:55.415801 OMV kernel: scsi host5: usb-storage 2-4:1.0
Okt 27 04:15:56.446151 OMV kernel: scsi 5:0:0:0: Direct-Access     WD       Elements 25A3    1029 PQ: 0 ANSI: 6
Okt 27 04:15:56.447874 OMV kernel: sd 5:0:0:0: Attached scsi generic sg2 type 0
Okt 27 04:15:56.448881 OMV kernel: sd 5:0:0:0: [sde] Very big device. Trying to use READ CAPACITY(16).
Okt 27 04:15:56.450000 OMV kernel: sd 5:0:0:0: [sde] 27344699392 512-byte logical blocks: (14.0 TB/12.7 TiB)
Okt 27 04:15:56.451044 OMV kernel: sd 5:0:0:0: [sde] 4096-byte physical blocks
Okt 27 04:15:56.451955 OMV kernel: sd 5:0:0:0: [sde] Write Protect is off
Okt 27 04:15:56.452843 OMV kernel: sd 5:0:0:0: [sde] Mode Sense: 47 00 10 08

Alles anzeigen

Which seem to point to disconnect issues on that drive (sde = A, the drive causing problems). I'm also a bit confused regarding the errors for sdc, my mounts at the moment are sda, sdb, sdd and sde.

LUKS is setup to decrypt sdb, sdd and sde. I'm not sure why one of the drives wasn't automatically mounted as sdc in the first place. After a reboot, the drives mount as sda,b,c and d as expected, so I currently regard this as a side effect of the frequent disconnects and remounts.

I have OMV6 installed on a small x64 PC with 3 externally powered USB HDDs connected and a SATA SSD as boot drive. A few weeks ago, I scrapped my old OMV6 (software) setup due to boot drive failure and started from scratch (complete reinstall, no restoration of backup) with the exact same hardware setup (except a new boot drive). This wasn't an issue before with pretty much the identical setup.

Does anybody have an idea what the issue here could be?

SignedOne · 17. Februar 2023

Here are the steps I took to upgrade my underlying Debian bullseye release to bookworm:

- Make sure all packages are up to date with

Code

sudo apt update && sudo apt upgrade -y

- Optionally backup apt sources, e.g.

Code

sudo cp /etc/apt/sources.list sources.list.backup

- Replace all occurrences of" bullseye" with "bookworm" in /etc/apt/sources.list and all *.list in /etc/apt/sources.list.d, e.g. with

Code

sudo sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list

Using "testing" instead of "bookworm" is also possible, but using the latter automatically transitions the installation into stable, once bookworm reaches that state instead of staying on the testing release branch.

- Update and upgrade again, keeping all current configurations when asked

Code

sudo apt update && sudo apt upgrade -y

- Reboot

- Verify that the new kernel is used, in my case Linux 6.1.0-3-amd64

- See that now the OMV updater/apt upgrade wants to remove OMV, abandon approach for now until I have more time to think it through.

SignedOne · 16. Februar 2023

Thank you for the information!

I simply took a backup of the OS drive and planned to just go for the upgrade, but actually cloning the drive and trying the upgrade on the backup first seems more sensual.

Will do and report back.

SignedOne · 16. Februar 2023

Zitat von Soma

And you want to do this because???
What will a testing kernel provide to you that will make the server run better?

A fix to the i915 driver as mentioned in the post.

SignedOne · 16. Februar 2023

So apparently kernel versions from 5.18 to 6.0.17/6.1.3 introduced a bug, where the i915 driver for Intel onboard graphics can hang, preventing one of my docker containers (jellyfin) to work as expected (tonemapping HDR to SDR while transcoding video files).

I have a fully up to date OMV6 install (6.3.0-2 (Shaitan), that was installed with the official OMV Iso, deployed on bare metal with kernel Linux 6.0.0-0.deb11.6-amd64.

This questions may sound naive, but can I simply upgrade my kernel, e.g. via installing 6.1.8-1 from the testing repo without further modifications and expect OMV6, its plugins, docker and other stuff to still work?

Beiträge von SignedOne

Heavy disk I/O causes LUKS volume to unmount (not an OMV issue, mitigated)

Heavy disk I/O causes LUKS volume to unmount (not an OMV issue, mitigated)

Heavy disk I/O causes LUKS volume to unmount (not an OMV issue, mitigated)

Heavy disk I/O causes LUKS volume to unmount (not an OMV issue, mitigated)

Heavy disk I/O causes LUKS volume to unmount (not an OMV issue, mitigated)

Heavy disk I/O causes LUKS volume to unmount (not an OMV issue, mitigated)

Heavy disk I/O causes LUKS volume to unmount (not an OMV issue, mitigated)

Can I upgrade the kernel from 6.00 to > 6.0.17/6.1.3 with OMV6 still working as expected?

Can I upgrade the kernel from 6.00 to > 6.0.17/6.1.3 with OMV6 still working as expected?

Can I upgrade the kernel from 6.00 to > 6.0.17/6.1.3 with OMV6 still working as expected?

Can I upgrade the kernel from 6.00 to > 6.0.17/6.1.3 with OMV6 still working as expected?