OMV hangs, lsof tainted?

  • Hello,


    Every few weeks, for an unknown reason, my OMV freezes. It is no longer accessible through the network, and it's headless so all I can do is hit the reset button.


    This last time, however, it kept logging things so maybe this will help clarify the root cause.


    From zero load, the CPU load jumped to a constant 1 in the afternoon of May 26, and stayed like this for nearly 30 hours. In the evening of May 27, the system stopped logging data - I assume that's when it stopped responding at all. Logging resumed in the evening of May 28 after a cold reboot.


    Here are the messages that came out of the blue, alternating between CPU0 and CPU1 every 30 minutes. Any idea what this is, and what's up with "lsof Tainted"?


    This is a system with a low-power CPU (AMD BE-2350) and a massive, passively cooled heatsink in a case with well-thought-out airflow, to minimize noise from an extremely underloaded system. I'd really like to stop these hangs from ever happening again, since the overheating can't do any good.


    Anything I can try? Any information you need? It's running OMV 1.19 with all the updates to date, with OMVExtras and Transmission as extra plugins, and only SMB/SSH/Torrent services running.


    Thanks in advance!



    • Official Post

    Thoughts...


    - Check your memory with memtest
    - Switch to backports 3.16 kernel (button to install in omv-extras)
    - What type of media is OMV installed on?

    omv 7.4.8-1 sandworm | 64 bit | 6.8 proxmox kernel

    plugins :: omvextrasorg 7.0 | kvm 7.0.14 | compose 7.2.5 | k8s 7.3.1-1 | cputemp 7.0.2 | mergerfs 7.0.5 | scripts 7.0.9


    omv-extras.org plugins source code and issue tracker - github - changelogs


    Please try ctrl-shift-R and read this before posting a question.

    Please put your OMV system details in your signature.
    Please don't PM for support... Too many PMs!

  • Memory is fine :| I'll run memtest on it in a loop again, just to be sure.


    OMV is installed on an IDE hard drive, the only drive connected to the motherboard. All other (SATA) controllers are disabled. Storage is on a hardware RAID controller on PCI Express.


    Backports? What does it do?

    • Official Post

    You are running the 3.2 kernel. The backports 3.16 is much newer (same kernel that Debian 8 Jessie uses) and may have better drivers for your motherboard. I use it on all of my systems.


  • Gotcha. Done. Although it's not very logical to me how a newer kernel can have improved support for a really old motherboard (2nd half of 2007) :D


    And now, we wait.


    Is there a monitoring/watchdog solution I can use, to alert me when CPU load stays in an elevated state for more than a few minutes?


    Any other way to observe what's going on with the box when it freezes - would remote logging help?

    • Official Post

    Things do get better with time sometimes :)


    OMV actually emails you when load is high if you have notifications set up. If it freezes quickly, it may not have time to send them though. I would look through the logs to see if any errors are showing.
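
For an independent check alongside OMV's own notifications, a sustained-load watcher can be sketched in a few lines of Python. This is only a minimal sketch, not how OMV does it: the names `is_sustained` and `watch`, the 0.9 threshold, and the five one-minute samples are all assumptions chosen for illustration, and the `alert` callback just prints unless you swap in your own mailer.

```python
import os
import time
from collections import deque


def is_sustained(samples, threshold, window):
    """Return True when the last `window` samples all exceed `threshold`."""
    recent = list(samples)[-window:]
    return len(recent) == window and all(s > threshold for s in recent)


def watch(threshold=0.9, window=5, interval=60, alert=print):
    """Poll the 1-minute load average and alert while it stays elevated.

    All defaults are illustrative assumptions; tune them for your box.
    """
    history = deque(maxlen=window)
    while True:
        # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
        history.append(os.getloadavg()[0])
        if is_sustained(history, threshold, window):
            alert(f"load above {threshold} for {window} samples: {list(history)}")
        time.sleep(interval)
```

Calling `watch()` from a small script started at boot would flag a load that sits at 1 for several minutes. Like the built-in emails, it cannot report anything once the machine has locked up completely.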


  • Alright, with monitoring enabled it crashed a couple more times until I could catch this red-handed.


    At 9:30pm sharp tonight, my box spiked up to 100% CPU load and stayed there for half an hour until I rebooted it. It did send me an alert that the CPU was way up, and then two other interesting mails with identical content:


    From: Cron Daemon
    Message: Segmentation fault
    Subject: Cron <root@openmediavault> [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)


    Indeed, the process list showed quite a few processes with >80% CPU load, all related to generating graphs.


    I'd point my finger at the cron job generating graphs, however... as I'm typing this, the system stopped responding again over the network - no web interface or SSH access. It did this as I was browsing through the pages of system logs.


    Any clues? Is it a software bug somewhere? Is it a faulty drive?

    • Official Post

    A drive can fail or start failing without bad blocks.


  • I did an upgrade to 2.1, but I still can't pinpoint the issue or trust the existing system disk. It's an old 20GB IDE drive, and I would love to replace it. Question is - how to back up my settings/users/shares etc. so that I can easily restore them on a clean install on a new drive?


    Drat, the box, powered up about 15 minutes ago, just froze and did a cold reboot by itself. Not sure if it's linked to Transmission traffic; a torrent was downloading, and the new Gigabit internet connection might be too much for it. (Yeah, 1Gbps internet connection, gigabit router with hardware PPPoE offloading, speed test returns between 930-970Mbps... All for €12/month. Wanna move here? :) )

  • The box froze again just after midnight, and has been staying at full load for 7 hours. What now, shut down the system every time I am not using it? I am really, really pissed off. Even Windows would be a more stable option than this.


    Please provide instructions on how to back up the settings regarding users, shares and permissions, so that I can do a clean install from the ISO. I'm at the point where this is the last chance I'm giving OMV before moving on.

  • I did an upgrade to 2.1, but I still can't pinpoint the issue or trust the existing system disk. It's an old 20GB IDE drive, and I would love to replace it. Question is - how to back up my settings/users/shares etc. so that I can easily restore them on a clean install on a new drive?

    There is no way to back up and restore a config yet.
    You can back up your config yourself and use it, just for reference, to reconfigure your newly installed OMV. However, you can still back up your whole system disk, with Clonezilla for example, and restore it onto your new HDD.
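
Keeping a reference copy of the config could look like the sketch below. It is only an illustration under stated assumptions: `backup_config` is a hypothetical helper, the source path `/etc/openmediavault/config.xml` is assumed to be where your install keeps its configuration, and the destination directory is a placeholder. The copy is for reading back by hand, not for restoring onto a new install.

```python
import shutil
import time
from pathlib import Path


def backup_config(src, dest_dir):
    """Copy `src` into `dest_dir` under a timestamped name and return the new path."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.name}.{stamp}"
    # copy2 also preserves modification time and permission bits.
    shutil.copy2(src, dest)
    return dest
```

For example, `backup_config("/etc/openmediavault/config.xml", "/path/to/backups")` run as root before a reinstall, with the destination on a different disk than the system drive.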

  • Apologies for the frustration @tekkb and all. It was just incredibly aggravating to not figure out where the issue was coming from, nor to be able to get some expert support on the matter. In hindsight, nobody else would have been able to pinpoint the root cause anyway :D


    To close the loop, here's what happened meanwhile:

    • disabled the on-board NIC (Realtek) and installed an Intel add-on Gigabit Ethernet card with less CPU load - nada;
    • replaced the suspect drive with another one, connected to on-board SATA, then tried installing CentOS and Fedora Server - neither would work in graphical mode, and in low res graphical mode both froze during the install;
    • reduced the shared memory allocated to the graphics - no change;
    • tested the RAM exhaustively, passed 5 times over 9 hours - no issue here;
    • had a rare moment of divine inspiration :D and removed the HP Smart Array P410 hardware RAID controller from the PCI Express slot - installation of Fedora Server completed without glitches;
    • found out that HP had actually released an advisory for this card: with a particular combination of firmware, SATA drives and status polling, the board could cause the system to freeze or reboot unexpectedly;
    • HP had published updated firmware to correct the issue, and fortunately I had just installed Fedora, which is on HP's list of supported OSes, so I could apply the RPM patch easily; (last time I had to install a Windows Server trial just to apply the firmware, because it was delivered as an executable that would run only under Windows Server)
    • as I'm not familiar with either Fedora or CentOS and could not get Cockpit to work for remote server management, I gave up on that and returned to Debian;
    • I could not get OMV installed on top of Debian (a screenful of dependencies which it couldn't handle), so I switched the hard drive again and performed a clean install of OMV 2.x.


    It's been 12 hours now and the new system is still working, all settings/shares/users/plugins were manually added in like half an hour, and I'm keeping an eye on it to see if it freezes again. New hard drive has less than 1,000 hours of uptime and the long test has passed.


    Sorry again for venting here. The issue was caused by a firmware bug in the hardware RAID controller, corrected by the vendor, and not related to lsof or another software component of OMV or Debian. I need to subscribe to support alerts for this piece of hardware.
