Posts by SignedOne

    Hard to say. Most Linux drivers are written generically and some motherboards have buggy implementations of hardware.

    That makes sense, but apart from a recent reinstall and the switch from an internal NVMe SSD to a SATA SSD, nothing about my setup really changed: same software, same architecture (USB HDDs, mergerFS and SnapRAID on top of LUKS-encrypted drives), same cables, same power supplies, same HDDs.


    I didn't monitor the log before the reinstall, and even on the new setup the system at least didn't lock up the drives at first; all of this manifested in the last two weeks.



    Currently, though, I'm happy to report that the system is stable. The resets and I/O errors still get logged frequently and are somewhat concerning, but the drives and LUKS volumes have stayed mounted despite some fairly heavy I/O activity over the last 24 hours.


    I might try to go back to a 6.x kernel later on, since the usbcore.quirks=VID:PID:k hack seems to have been the key fix, apart from the CMCI storms, which for some reason disappeared on the 5.15 kernel, or at least no longer get reported.

    Switched to the Proxmox 5.15 kernel; thank you for your continued support, despite this being a generic Linux issue. Is it alright if I keep updating this thread, or would you prefer me to post somewhere else, since this isn't really an OMV issue?


    That alone didn't help, but while searching for the root of these issues, I disabled USB autosuspend via kernel parameters in /etc/default/grub.
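For anyone following along, a minimal sketch of how that change could look, assuming GRUB is the bootloader (the exact existing contents of your GRUB_CMDLINE_LINUX_DEFAULT line will differ):

```shell
# Hedged sketch, assuming GRUB: usbcore.autosuspend=-1 disables USB
# autosuspend globally. The parameter goes into the default kernel
# command line in /etc/default/grub; afterwards run `sudo update-grub`
# and reboot for it to take effect.
PARAM="usbcore.autosuspend=-1"
echo "GRUB_CMDLINE_LINUX_DEFAULT=\"quiet ${PARAM}\""
```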

    Now I get these messages consistently when a reset and I/O error happens:


    Code
    Okt 27 21:33:51 OMV kernel: usb 2-5: Disable of device-initiated U1 failed.
    Okt 27 21:33:56 OMV kernel: usb 2-5: Disable of device-initiated U2 failed.
    Okt 27 21:33:56 OMV kernel: usb 2-5: reset SuperSpeed USB device number 4 using xhci_hcd
    Okt 27 21:33:56 OMV kernel: sd 4:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=10s
    Okt 27 21:33:56 OMV kernel: sd 4:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 06 31 aa 57 f0 00 00 08 00 00 00
    Okt 27 21:33:56 OMV kernel: blk_update_request: I/O error, dev sdc, sector 26603050992 op 0x0:(READ) flags 0x80700 phys_seg 78 prio class 0

    It looks like the USB HDDs ignore the autosuspend setting, which may be either the cause or a symptom of the issue. I found a resource suggesting adding "usbcore.quirks=VID:PID:k" as a kernel parameter to force autosuspend off for a specific device; I'll try that next.
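In case it helps anyone, this is roughly how the quirks string is assembled; the vendor/product IDs below are placeholders for illustration only, the real ones come from the "ID VID:PID" column of lsusb output for your enclosure:

```shell
# Placeholder IDs -- substitute the values lsusb reports for your
# USB enclosure (e.g. "Bus 002 Device 003: ID xxxx:yyyy ...").
VID=174c
PID=55aa
# Quirk flag 'k': the kernel will not autosuspend this device.
QUIRK="usbcore.quirks=${VID}:${PID}:k"
echo "$QUIRK"
```

Multiple devices can be listed comma-separated in a single usbcore.quirks= parameter.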


    _______________________________________


    The resets and I/O errors still happen after adding "usbcore.quirks=VID:PID:k" for all USB HDDs to the kernel parameters, but operations on the disks no longer get interrupted and the disks no longer get unmounted. The SMART values all look OK, and the system seems at least somewhat stable now. Not ideal in any way, and I'm still confused about what brought these issues on in the first place.

    I don't think there is anything I can do from the SnapRAID or LUKS plugin side to help. If USB is disconnecting under load, that feels like a hardware issue and/or a driver bug in the kernel. I could be wrong, but I don't have any other ideas.

    Oh, I didn't think so; I meant that I hope another kernel can somehow mitigate what appears to be a hardware issue. This is very clearly not an issue caused by the LUKS or SnapRAID plugin, or by OMV in general.

    Seems like a motherboard issue causing a USB disconnect that locks the drive. SnapRAID is probably just generating lots of activity that triggers the disconnect. All I can think of is that you should try a different kernel, like the Proxmox 6.2 or 5.19 kernel.

    Thank you for the info; I really do hope this is fixable, or at least mitigable, via the software route. I'll "abuse" the SnapRAID check for increased I/O a while longer and monitor the log. So far, the CMCI storms, USB resets and I/O errors don't cause the disks to show up as disconnected, and the LUKS volumes stay mounted.

    If that happens again, I'll try upgrading to the proxmox kernel, retest and report back.

    Update 1: I connected all 3 USB HDDs through an (unpowered) USB hub to one USB port on the PC. When running a SnapRAID check, I got I/O errors and all 3 USB HDDs vanished from the disk page in the OMV web GUI. This showed up in the journal:

    kernel.txt


    This is causing me serious worry:

    Code
    Okt 27 12:23:31 OMV kernel: hub 2-4:1.0: over-current condition


    I'm now connecting all 3 USB drives to the other 3 ports and retesting; maybe the port in question is simply defective.


    With all 3 drives connected to different ports (excluding the one that produced the over-current condition), the SnapRAID check is currently running a lot longer than before, though I still get errors, just much less frequently:


    Code
    Okt 27 12:48:07 OMV kernel: mce: CMCI storm detected: switching to poll mode
    Okt 27 12:50:25 OMV kernel: usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
    Okt 27 12:50:25 OMV kernel: sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
    Okt 27 12:50:25 OMV kernel: sd 2:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 00 d1 dd 58 00 00 00 02 00 00 00
    Okt 27 12:50:25 OMV kernel: I/O error, dev sdc, sector 3520944128 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
    Okt 27 12:54:26 OMV kernel: usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
    Okt 27 12:54:26 OMV kernel: sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
    Okt 27 12:54:26 OMV kernel: sd 2:0:0:0: [sdc] tag#0 CDB: Read(16) 88 00 00 00 00 00 b7 4d 0a 00 00 00 02 00 00 00
    Okt 27 12:54:26 OMV kernel: I/O error, dev sdc, sector 3075279360 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
    Okt 27 12:54:55 OMV kernel: perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500

    _______________________________________________________________________________________________________________________

    The journey continues.


    Now I got a USB reset and an I/O error on a different disk. This makes me think the issue isn't that one drive; it was simply the first to show symptoms because that specific drive gets a huge I/O spike before the others when SnapRAID runs a check:

    Code
    Okt 27 13:09:55 OMV kernel: mce: CMCI storm subsided: switching to interrupt mode
    Okt 27 13:11:42 OMV kernel: usb 2-5: reset SuperSpeed USB device number 4 using xhci_hcd
    Okt 27 13:11:43 OMV kernel: sd 4:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
    Okt 27 13:11:43 OMV kernel: sd 4:0:0:0: [sdd] tag#0 CDB: Read(16) 88 00 00 00 00 00 1c 36 2e 00 00 00 02 00 00 00
    Okt 27 13:11:43 OMV kernel: I/O error, dev sdd, sector 473312768 op 0x0:(READ) flags 0x80700 phys_seg 50 prio class 2


    I was thinking that maybe the storage driver, in particular UAS, plays a role here, but this is my current USB HDD setup via lsusb -t:

    Code
    /:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/7p, 5000M
        |__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=usb-storage, 5000M
        |__ Port 2: Dev 3, If 0, Class=Mass Storage, Driver=uas, 5000M
        |__ Port 5: Dev 4, If 0, Class=Mass Storage, Driver=usb-storage, 5000M

    Ironically, the only drive not throwing any errors (so far, that is) is the one with the UAS driver.
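As a side note for anyone experimenting with the driver assignment: the kernel offers a usb-storage quirk to force a device *off* UAS onto plain usb-storage (the reverse, forcing UAS on, isn't possible via quirks since it depends on the enclosure firmware). A sketch with placeholder IDs:

```shell
# Quirk flag 'u': the usb-storage driver ignores the device's UAS
# capability, so it binds as plain usb-storage. VID/PID below are
# placeholders; use the IDs lsusb reports for your enclosure.
VID=0bc2
PID=ab30
QUIRK="usb-storage.quirks=${VID}:${PID}:u"
echo "$QUIRK"
```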

    Apparently, whenever my weekly SnapRAID sync job (configured via the SnapRAID plugin) runs, it causes one LUKS volume on a USB HDD to unmount.

    The LUKS volume in question (A) is part of a mergerFS pool with another, smaller drive (B). SnapRAID is set up to sync both data drives (A & B) to a third parity drive (C).


    I found this thread describing pretty much the same issue, with no apparent solution. According to information in that thread, the issue might be that the drive goes to sleep or gets disconnected because of the disk I/O from the SnapRAID sync. But since the drive seems to work reliably in regular operation (with frequent disk I/O), this doesn't seem very plausible to me.


    Similar to their issue, after a reboot I consistently get OMV's software-error page when I reload the mergerFS pool after manually unlocking the LUKS volumes; however, after simply reloading the page, the pool works as expected and shows up correctly in the OMV web GUI.



    However, I found these log entries:


    Code
    journalctl -o short-precise -k

    These seem to point to disconnect issues on that drive (sde = A, the drive causing problems). I'm also a bit confused by the errors for sdc; my mounts at the moment are sda, sdb, sdd and sde.

    LUKS is set up to decrypt sdb, sdd and sde. I'm not sure why one of the drives wasn't automatically mounted as sdc in the first place. After a reboot, the drives mount as sda, sdb, sdc and sdd as expected, so for now I regard this as a side effect of the frequent disconnects and remounts.
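Since the sdX letters evidently shuffle after disconnects, one way to sidestep the confusion is to reference the drives by UUID or by-id path instead of device letter; a purely illustrative sketch (all names and UUIDs below are made up):

```shell
# Illustrative /etc/crypttab entries using stable UUID references
# instead of sdX names (mapper names and UUIDs are hypothetical):
#   data-a  UUID=1111aaaa-0000-0000-0000-000000000000  none  luks
#   data-b  UUID=2222bbbb-0000-0000-0000-000000000000  none  luks
# The stable paths for the attached disks can be listed with:
#   ls -l /dev/disk/by-uuid/ /dev/disk/by-id/
```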



    I have OMV6 installed on a small x64 PC with 3 externally powered USB HDDs connected and a SATA SSD as the boot drive. A few weeks ago, I scrapped my old OMV6 (software) setup due to a boot drive failure and started from scratch (complete reinstall, no restoration from backup) with the exact same hardware setup (except the new boot drive). This wasn't an issue before, with a practically identical setup.



    Does anybody have an idea what the issue here could be?

    Here are the steps I took to upgrade my underlying Debian bullseye release to bookworm:


    - Make sure all packages are up to date with


    Code
    sudo apt update && sudo apt upgrade -y


    - Optionally backup apt sources, e.g.


    Code
    sudo cp /etc/apt/sources.list sources.list.backup



    - Replace all occurrences of "bullseye" with "bookworm" in /etc/apt/sources.list and in all *.list files in /etc/apt/sources.list.d, e.g. with

    Code
    sudo sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list
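The same substitution also needs to cover the extra lists under /etc/apt/sources.list.d; here it is demonstrated on a throwaway copy first, so nothing real is touched:

```shell
# Demonstrate the bullseye->bookworm substitution on a temporary file:
tmp=$(mktemp -d)
echo "deb http://deb.debian.org/debian bullseye main" > "$tmp/example.list"
sed -i 's/bullseye/bookworm/g' "$tmp"/*.list
cat "$tmp/example.list"
# On the real system (as root):
#   sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list.d/*.list
```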

    Using "testing" instead of "bookworm" is also possible, but using the latter automatically transitions the installation to stable once bookworm reaches that state, instead of staying on the testing release branch.


    - Update and upgrade again, keeping all current configurations when asked


    Code
    sudo apt update && sudo apt upgrade -y

    - Reboot

    - Verify that the new kernel is used, in my case Linux 6.1.0-3-amd64

    - Discover that the OMV updater/apt upgrade now wants to remove OMV; abandon this approach for now until I have more time to think it through.
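For reference, the kernel verification in the steps above is simply:

```shell
# Print the running kernel release to confirm which kernel booted:
uname -r
```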

    So apparently, kernel versions from 5.18 up to 6.0.17/6.1.3 introduced a bug where the i915 driver for Intel onboard graphics can hang, preventing one of my Docker containers (Jellyfin) from working as expected (tonemapping HDR to SDR while transcoding video files).


    I have a fully up-to-date OMV6 install (6.3.0-2 (Shaitan)) that was installed from the official OMV ISO, deployed on bare metal with kernel Linux 6.0.0-0.deb11.6-amd64.



    This question may sound naive, but can I simply upgrade my kernel, e.g. by installing 6.1.8-1 from the testing repo, without further modifications and expect OMV6, its plugins, Docker and everything else to keep working?