[solved] OMV 4 hangs randomly with high load > 100

  • Hi,


    I have been reading here for a while; now I have run into a problem I can't solve.


    My system randomly hangs and becomes unresponsive (all Docker containers/services are unreachable). SSH login works, but if I then run "top", for example, it freezes as well.



    I think (I haven't verified that this is always the case) I can make the system responsive again by logging in to the web interface (OMV GUI). Once I'm logged in, the system returns to an idle state and stays at low load.


    So far I haven't been able to pinpoint a single cause; I checked the process logs, iotop etc.


    Is it possible that the GUI somehow "locks" the system? How can I get logs when the system hangs?
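    A sketch of one way to keep logs across a hang (my assumption: Debian 9's journald stores its journal in RAM unless /var/log/journal exists, so a persistent journal would preserve the minutes before a freeze):

```shell
#!/bin/sh
# Sketch, not from this thread: make systemd-journald persist its log to
# disk so the messages leading up to a hang can be read after a reboot.
# Assumption: Debian 9 journald defaults to a RAM-only journal unless
# /var/log/journal exists. The commands only act when run as root.
JOURNAL_DIR=/var/log/journal

if [ "$(id -u)" -eq 0 ] && command -v systemctl >/dev/null 2>&1; then
    mkdir -p "$JOURNAL_DIR"
    systemctl restart systemd-journald || echo "restart journald manually"
fi

# After the next hang and reboot, read the previous boot's messages:
#   journalctl -b -1 -p warning
echo "persistent journal dir: $JOURNAL_DIR"
```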


    I currently run an OMV backup with fsarchiver, which uses the CPU quite a bit but runs smoothly at a load of 2.2-2.4. I don't know whether it is the cause of the initial high load and the freezing of the other processes.


    My system:


    8GB RAM
    Intel Celeron 1.6 GHz quad-core
    2x6TB Data (SATA3)
    1x128GB SSD System (SATA3)


    OS


    No LSB modules are available.
    Distributor ID: Debian
    Description: Debian GNU/Linux 9.6 (stretch)
    Release: 9.6
    Codename: stretch


    OMV


    Release: 4.1.17-1
    Codename: Arrakis


    System


    Linux aries 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 GNU/Linux


    ps auxf (with normal load): https://pastebin.com/raw/PvWR8v80

  • Yeah, all Docker services are unavailable when the system "locks up", but SSH is accessible (with the issues described above), and logging in to the WebUI seems to clear the lock.


    The file system on all disks (storage and system) is ext4.
    The machine runs 24/7, with no power saving on the machine itself.
    The disks have the following spindown modes:
    Storage RAID1 disk 1: spindown
    Storage RAID1 disk 2: disabled (not sure why; maybe I should set this to spindown too?)
    System disk: disabled (SSD)


    The drives seem fine; what do you mean by "weird"? Temperature is okay, SMART is okay.


    EDIT: Added a list of my Docker containers.

  • Just a quick hint: the concept of 'load average' in Linux is widely misunderstood because it is confusing. It is not 'CPU utilization'; load also increases when the system is stuck waiting on I/O. Full details: http://www.brendangregg.com/bl…/linux-load-averages.html


    If a system is in such a state (waiting for I/O requests to finish), it usually behaves as if it were almost frozen. I would check the logs and the health of your storage.

  • Thanks for the hint. I think I understand the concept. I wrote a little check script that runs whenever high load is detected and gives me iostat, free memory and the process list of my system.


    I was confused by the spike to a load > 100, which is why I mentioned it.


    Both data disks are brand new, the SSD is also new; nothing is older than a month. But I will investigate whether the disks are somehow causing errors.
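    For reference, a minimal sketch of such a check script (variable names, threshold and log path are mine, not the exact script I run; intended to be called from cron every minute):

```shell
#!/bin/sh
# Sketch of a load-watch script: if the 1-minute load average exceeds
# THRESHOLD, append iostat, memory and process snapshots to a log file.
# THRESHOLD, LOG and the values below are illustrative assumptions.
THRESHOLD=${THRESHOLD:-8}
LOG=${LOG:-/tmp/loadwatch.log}

# first field of /proc/loadavg is the 1-minute load average
load1=$(awk '{print $1}' /proc/loadavg)

# awk does the floating-point comparison that plain sh cannot
if awk -v l="$load1" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
    {
        echo "=== $(date) load=$load1 ==="
        iostat -x 1 3 2>/dev/null || true   # needs sysstat; skipped if missing
        free -m
        ps auxf
    } >> "$LOG"
fi
```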

  • Okay, today I got something new. I tried to access my Nextcloud remotely and got the following logs, plus a lot of emails telling me nginx (which serves the OMV GUI) crashed, etc.



    And after that I got around 400 lines of this:


    And after that mysqld complained:


    This is my `iostat -x` now with everything running:


    I now suspect Docker is the bottleneck. Since all my containers run on the default bridge, could that cause delays? Any recommended write-up I can check for such errors?

  • One logging problem seems to be that when the system is clogged, the logs are not generated.


    In other words: looking into the provided logs makes no sense?


    Anyway: if this is really the output of iostat 120, then what's going on with your storage? Two disks show constant read activity of 8 MB/s with 130+ transactions per second, while the md0 device shows far less, but also constant, utilization. Disclaimer: I have no idea what 'normal behavior' should look like, since I don't use mdraid's RAID1 (I consider it close to useless).

  • This is correct; iostat 120 gives me the output posted (I added a new iostat, too).


    This is the output of iotop -oPa -d 2 (after running for 30 min or so):



    It shows a lot of writes coming from just the journaling service. Could this be the reason for the constant traffic?



    iostat 120 from the last hour or so: https://pastebin.com/g885FkAc



    UPDATE:


    The high journaling I/O turned out to be caused by MySQL. Following the guide here: https://medium.com/@n3d4ti/i-o…l-import-data-a06d017a2ba I got it down to 0.x% and almost no traffic. But as the post points out, this is not always a good setting for production. So what do you think? I'll leave it like this for 1-2 days to see if it stops the spikes.
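    The linked guide is truncated above, but the usual MySQL knob for this symptom (constant small synchronous flushes of the InnoDB redo log) is innodb_flush_log_at_trx_commit; a hedged sketch of the my.cnf change, assuming that is what the guide adjusts:

```ini
# Assumption: this (or something like it) is the setting such guides change.
# 2 = write and flush the InnoDB redo log about once per second instead of
# on every commit; up to ~1 s of committed transactions can be lost on a
# crash, which is why it is not always suitable for production.
[mysqld]
innodb_flush_log_at_trx_commit = 2
```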


    I was also thinking about my Docker containers having services that log into SQLite databases stored on my RAID. Is it good practice to add another SSD to the system for the appdata folder? Since my RAID is constantly written to, it never sleeps.


    UPDATE 2:
    I've found that scrolling through a bunch of images in the Nextcloud app on iOS causes very high load on my system; mostly the Apache process spikes. I run the official Docker image and will try to investigate further.

    Edited 6 times, last by raws99 () for the following reason: added update2

  • Hello,


    This is my first post here, but I thought I would share my experience.


    I upgraded yesterday from OMV3 to OMV4 and have been randomly experiencing the same issues as the OP.


    After doing some research, reading this thread and following the hint from post #2, I came to the same conclusion:


    PHP5 was still installed after the OMV4 upgrade.


    To check if PHP5 is still installed:


    Code
    dpkg -l | grep php

    To remove leftovers of PHP5:


    Code
    apt-get purge 'php5*'


    Use at your own risk!


    All kudos go to posters in the mentioned post.


    My system has been running smoothly since then: no errors, no high CPU (it was at 100% and crashing before), no huge load average (20+ before, 0.03 now).


    Fingers crossed everything is back to normal, at least on my system.

  • Unfortunately it just happened again. As soon as I manage to log in to the WebUI, the system starts responding again and the load average drops from 10+ to 0.x.


    I will keep investigating and let you know.

  • Great to see you have similar issues. Well, not great, but good to know there's someone else ;) Since my system is currently clogged, I started investigating again.


    There is no high CPU usage and no processes eating RAM or CPU, BUT CPU iowait is high, at around 24.
    After the system had been clogged for 10 minutes (the Docker containers weren't processing anything), I noticed my light going off (it is controlled by my smart home). So the system is "free" again and the load drops immediately. I did nothing but open top and iotop.


    top (clogged)


    Healthy top:


    How can the waiting be investigated further? I checked iotop (nothing special, not much writing). I'll let iostat 120 run overnight to see if there is anything useful in it.
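    One triage sketch I could run to narrow the waiting down (iostat and pidstat assume the sysstat package is installed; the wrapper skips any tool that is missing):

```shell
#!/bin/sh
# iowait triage sketch (my suggestion, not from the thread). Each tool
# runs for 3 one-second samples; missing tools are skipped, not fatal.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@" || echo "failed: $1"
    else
        echo "skip: $1 (not installed)"
    fi
}

run vmstat 1 3       # "wa" column = % of CPU time stalled on I/O
run iostat -x 1 3    # per-device utilization (%util) and average wait (await)
run pidstat -d 1 3   # per-process read/write rates, to name the culprit
```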



    UPDATE:


    I ran this overnight (no clogging that night..):

    Code
    while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 30; done

    It caught some processes stuck in the D state (the rsync backup job) for about 15 minutes, but there was no high-load alert this time, so that seems uncritical.
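    For reference, the D-state watcher above can be made bounded and log to a file (a sketch; ITERATIONS, INTERVAL and LOG are names I made up):

```shell
#!/bin/sh
# Bounded, logging variant of the D-state one-liner. Field 8 of `ps auxf`
# is STAT; a state starting with "D" means uninterruptible (usually I/O)
# sleep, exactly the processes that push the load average up without
# using CPU. ITERATIONS/INTERVAL/LOG are illustrative names.
ITERATIONS=${ITERATIONS:-1}     # e.g. 960 for 8 h at 30 s intervals
INTERVAL=${INTERVAL:-30}
LOG=${LOG:-/tmp/dstate.log}

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    date >> "$LOG"
    ps auxf | awk '$8 ~ /^D/' >> "$LOG"
    i=$((i + 1))
    if [ "$i" -lt "$ITERATIONS" ]; then
        sleep "$INTERVAL"
    fi
done
```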


    iostat 120 was running and is showing some output like this: https://pastebin.com/g85TLKgj

    Edited 2 times, last by raws99 () for the following reason: added update

  • My /var/log/messages has a lot of these:



    and these:
