[solved] OMV 4 hangs randomly with high load > 100

  • Hi,


    I have been reading here for a while; now I've run into a problem that (for me) is unsolvable.


    My system randomly hangs and becomes unresponsive (all Docker containers / services are unreachable). SSH login works, but if I run "top", for example, it hangs/freezes as well.



    I think (I haven't verified whether this is always the case) I can make the system responsive again by logging in to the web interface (OMV GUI). Once I'm logged in, the system returns to an idle state and stays at low load.


    So far I haven't found a single cause for this; I checked the process logs, iotop, etc.


    Is it possible that the GUI somehow "locks" the system? And how can I get logs if the system hangs?


    I am currently running an OMV backup with fsarchiver, which utilizes the CPU quite a bit but runs smoothly at a load of 2.2-2.4. I don't know whether this is the cause of the initial high load and the freezing of all other processes.


    My system:


    8GB RAM
    Intel Celeron 1.6Ghz Quad-Core
    2x6TB Data (SATA3)
    1x128GB SSD System (SATA3)


    OS


    No LSB modules are available.
    Distributor ID: Debian
    Description: Debian GNU/Linux 9.6 (stretch)
    Release: 9.6
    Codename: stretch


    OMV


    Release: 4.1.17-1
    Codename: Arrakis


    System


    Linux aries 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 GNU/Linux


    ps auxf (with normal load): https://pastebin.com/raw/PvWR8v80

  • Never heard of this.. To me it would seem that if the system is "locked", you would not be able to reach the web UI at all...


    What file systems are you using? Powersaving? Are the drives showing any other weird activity?

  • Yeah, all Docker services are unavailable when the system "locks", but SSH is accessible (with the issues described above), and the web UI seems to clear the lock from the system.


    The file system is ext4 on both the storage and the system disks.
    The machine is running 24/7, no powersaving on the machine itself.
    The disks have the following modes:
    Storage Raid1 Disk1: spindown
    Storage Raid1 Disk2: disabled (don't know why, maybe set this to spindown, too?)
    System Disk: disabled (SSD)


    The drives seem fine - what would you describe as "weird"? Temperature is okay, SMART is okay.


    EDIT: Added list of docker containers..

    Just a quick hint: the concept of 'load average' in Linux is widely misunderstood because it is confusing. It is not 'CPU utilization'; load also increases when the system is stuck waiting on I/O. Full details: http://www.brendangregg.com/bl…/linux-load-averages.html


    If a system is in such a state (waiting for I/O requests to finish), it usually behaves as if it were almost frozen. I would check the logs and the health of the storage.
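

    For a quick health check, something like this should do (device names are examples; smartmontools required):


    Code
    # overall SMART health of each disk
    smartctl -H /dev/sda
    smartctl -H /dev/sdb
    # kernel messages about resets, timeouts or I/O errors
    dmesg | grep -iE 'ata|error|reset' | tail -n 50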

  • Thanks for the hint. I think I understand the concept - I wrote a little check script that runs whenever high load is detected and gives me the iostat output, free memory and the process list of my system.
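

    Roughly, it does something like this (a minimal sketch; the threshold of 15 and the log path are just placeholders, meant to be run from cron every minute or so):


    Code
    #!/bin/bash
    # append a snapshot to a log whenever the 1-minute load exceeds a threshold
    THRESHOLD=15                     # example value, adjust to taste
    LOG=/var/log/load-check.log      # example path
    LOAD=$(cut -d ' ' -f1 /proc/loadavg)
    if awk -v l="$LOAD" -v t="$THRESHOLD" 'BEGIN{exit !(l > t)}'; then
        {
            date
            echo "load: $LOAD"
            iostat -x
            free -m
            ps auxf
        } >> "$LOG"
    fi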


    I am confused by the spike to a load of > 100, which is why I mentioned it.


    Both disks are brand new, the SSD as well; nothing is older than one month. But I will investigate whether the disks are somehow producing errors.

  • Okay, today I got something new. I tried to access my Nextcloud remotely and got the following logs, plus a lot of emails telling me that nginx (which serves the OMV GUI) had crashed, etc.



    And after that I get around 400 lines of this:


    And after that, mysqld complains:


    This is my `iostat -x` now with everything running:


    I now suspect Docker is the bottleneck. Since I have all containers running on the default bridge, could that be causing delays? Any recommended write-up I could check for such errors?
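

    In case it matters, this is how I would check what is attached to the default bridge and try moving containers to a user-defined one (the network name "apps" and the container name are placeholders):


    Code
    # list containers currently attached to the default bridge
    docker network inspect bridge --format '{{range .Containers}}{{.Name}} {{end}}'
    # create a user-defined bridge and move one container onto it
    docker network create apps
    docker network disconnect bridge my-container
    docker network connect apps my-container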

  • One logging problem seems to be that, if the system is clogged, the logs are not generated.


    In other words: looking into the provided logs makes no sense, then?


    Anyway: if this is really the output of iostat 120, then what is going on with your storage? Two disks show constant read activity of 8 MB/s at more than 130 transactions per second, while the md0 device shows far less but also constant utilization. Disclaimer: no idea what 'normal behavior' should look like here, since I don't use mdraid's RAID1 - I consider it close to useless.
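

    One thing that could explain identical constant reads on both members is a running resync or the periodic consistency check; whether one is in progress shows up here:


    Code
    # state of the array and any running resync/check
    cat /proc/mdstat
    mdadm --detail /dev/md0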

  • This is correct, iostat 120 gives me the output posted (I added a new iostat, too).


    This is the output of iotop -oPa -d 2 (running for 30 min or so):



    It shows me a lot of writes coming from just the journaling service. Could this be the reason for the constant traffic?
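

    To find out which files (and thus which services) are actually behind those journal commits, I could try something like this (fatrace has to be installed separately; flags as far as I understand its man page):


    Code
    # record all write (W) events system-wide for 60 seconds
    fatrace -f W -s 60 -o /tmp/writes.log
    # count which paths are written most often
    awk '{print $NF}' /tmp/writes.log | sort | uniq -c | sort -rn | head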



    iostat 120 from the last hour or so: https://pastebin.com/g885FkAc



    UPDATE:


    The high I/O from the journaling turned out to be caused by MySQL. If I follow the guide here: https://medium.com/@n3d4ti/i-o…l-import-data-a06d017a2ba I get it down to 0.x% and almost no traffic. But as pointed out in the post, this is not always a good setting for production. So what do you think? I'll leave it like this for 1-2 days to see if it stops the spikes.
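

    For reference, I believe the switch the guide tunes is InnoDB's per-commit flush (please verify against the article before copying); roughly, in a my.cnf drop-in:


    Code
    # e.g. /etc/mysql/conf.d/tuning.cnf -- path and value are examples
    [mysqld]
    # 2 = write the log on every commit but flush it to disk only once per second;
    # far less fsync pressure on the ext4 journal, at the cost of losing up to
    # ~1 second of transactions on a power failure
    innodb_flush_log_at_trx_commit = 2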


    Also, I was thinking about my Docker containers: some services log into SQLite databases which are stored on my RAID. Is it good practice to add another SSD to the system for the appdata folder? Since my RAID is constantly written to, it never spins down.


    UPDATE 2:
    I've found that using the Nextcloud app on iOS to scroll through a bunch of images causes very high load on my system; mostly the Apache process spikes. I run the official Docker image and will try to investigate further.
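

    My current guess is on-the-fly preview/thumbnail generation while scrolling; I'll check roughly like this (the container name "nextcloud" is an assumption):


    Code
    # see whether previews are enabled (if nothing is printed, the default of true applies)
    docker exec -u www-data nextcloud php occ config:system:get enable_previews
    # watch the container's CPU/memory while scrolling through images in the app
    docker stats nextcloud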

  • Hello,


    First post here, but I thought I would share my experience.


    I upgraded from OMV3 to OMV4 yesterday and have been randomly experiencing the same issues as the OP.


    After doing some research, reading the post here and getting the hint from post #2, I came up with the same result.


    PHP5 was still installed after the OMV4 upgrade.


    To check whether PHP5 is still installed:


    Code
    dpkg -l | grep php

    To remove leftovers of PHP5:


    Code
    apt-get purge 'php5*'


    Use at your own risk!


    All kudos go to the posters in the mentioned post.


    My system has been running smoothly since then: no errors, no high CPU (100% and crashes before), no huge load average (20+ before, 0.03 currently).


    Fingers crossed that everything is back to normal, at least for my system.

  • Great to see you have similar issues. Not great, but good to know there's someone else ;-) Since my system is currently clogged, I started investigating again..


    No high CPU usage, no processes eating RAM or CPU - BUT CPU wait (iowait) is high, at around 24.
    Now, after 10 minutes of being clogged (Docker containers are not processing), I noticed my light go off (it is controlled by my smart home...). So the system is "free" again and the load drops immediately. I've done nothing but open top and iotop.


    top (clogged)


    Healthy top:


    How can the waiting be investigated further? I checked iotop (nothing special, not much writing). I'll let iostat 120 run overnight to see if there is something useful in it.
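

    One idea for pinning the wait down further (pidstat comes with the sysstat package):


    Code
    # per-process disk I/O, sampled every 5 seconds
    pidstat -d 5
    # tasks currently in uninterruptible sleep (state D = blocked on I/O)
    ps -eo pid,state,wchan:30,cmd | awk '$2 == "D"'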



    UPDATE:


    I ran this overnight: while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 30; done (no clogging that night)
    and got some blocking processes (the rsync backup job) in the D state for about 15 minutes, but no high-load alert this time, so that seems uncritical.


    iostat 120 was running as well and shows output like this: https://pastebin.com/g85TLKgj

  • Tried about 1000 different things to resolve the referenced issue.


    I am happy to report that the last 3 days have been back to normal.


    However, I am concerned since I have not actually pinned down what I did to fix the issue.


    One of the things I disabled that may be relevant is the SMB service.

  • My /var/log/messages has a lot of these:



    and these:
