How to debug "random" system hangs?

  • Hi,


    I'm running the latest OMV6 headless on a J1900 CPU.


    Everything runs fine, except that every 5-15 days the system hangs.


    When this happens, I hear the attached USB HDD suddenly stop, then I can't SSH in and there is no display output from the DVI or HDMI ports on the machine. The power light on the PC stays on, and I need to power cycle it to get going again.


    Any tips on how I can debug what might be causing this hang? I suspect it might have something to do with qBittorrent: when I stopped that Docker container in Portainer, I didn't see a hang for 20 days, but that doesn't prove it's the cause. It could also happen when qB accesses a certain area of the USB HDD. Not sure.


    Is there any debug log (in the kernel or OMV) that might record what the system was doing right before the hang? It seems very hard to debug, as this only happens once every two weeks or so.


    I have a messy workaround for now: a smart plug attached to the system, and a Raspberry Pi on the network that pings the OMV server every 5 minutes. If there's no response, it uses the smart plug to power cycle the OMV server. Not ideal.
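The ping-and-power-cycle workaround described above could be sketched roughly as follows. This is only an illustration, not the poster's actual script: the `power_cycle_plug` function is a hypothetical placeholder (smart plug APIs differ by vendor), and requiring two consecutive missed pings before cutting power is an added assumption to avoid reacting to a single dropped packet. Printing a timestamp on each power cycle also gives a record of when the hangs occurred.

```python
import subprocess
import time
from typing import Callable


def host_is_up(host: str, timeout_s: int = 5) -> bool:
    """Return True if a single ICMP echo request to `host` gets a reply."""
    # Linux ping: -c 1 sends one packet, -W sets the reply timeout in seconds.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watchdog_step(
    failures: int,
    is_up: bool,
    max_failures: int,
    power_cycle: Callable[[], None],
) -> int:
    """Process one ping result and return the new consecutive-failure count."""
    if is_up:
        return 0
    failures += 1
    if failures >= max_failures:
        power_cycle()
        return 0  # start counting again after the power cycle
    return failures


def power_cycle_plug() -> None:
    """Hypothetical placeholder: replace with your smart plug's real control call."""
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "no reply, power cycling server")


def run_watchdog(host: str, interval_s: int = 300, max_failures: int = 2) -> None:
    """Ping `host` every `interval_s` seconds, forever."""
    failures = 0
    while True:
        failures = watchdog_step(failures, host_is_up(host), max_failures, power_cycle_plug)
        time.sleep(interval_s)
```

To use it, call `run_watchdog(...)` with the server's address (whatever hostname or IP your OMV box has) from a script started at boot on the Pi.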

  • Lots of things can cause that. Anything in the logs? Syslog or the journal may help. Possibly look for something like mce (machine check exceptions), which typically indicate hardware errors. However, whenever I had problems with system crashes, it always turned out to be a RAM problem in the end. So from practical experience I would personally recommend buying a new RAM kit and seeing whether the problem goes away.
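As a rough illustration of the log search suggested above, the following sketch filters log lines for hardware-error hints. The search patterns are an assumption (a case-insensitive match on "mce", "machine check" and "hardware error"), not an exhaustive list of what the kernel can print, and the log path in the usage note assumes a Debian-style `/var/log/syslog`.

```python
import re
from typing import Iterable, List

# Assumed patterns; extend them to whatever your kernel actually logs.
HW_ERROR_PATTERN = re.compile(r"\bmce\b|machine check|hardware error", re.IGNORECASE)


def find_hardware_errors(lines: Iterable[str]) -> List[str]:
    """Return the lines that mention a machine check or hardware error."""
    return [line.rstrip("\n") for line in lines if HW_ERROR_PATTERN.search(line)]


def scan_log_file(path: str) -> List[str]:
    """Scan one log file, replacing undecodable bytes instead of failing."""
    with open(path, "r", errors="replace") as handle:
        return find_hardware_errors(handle)
```

For example, `scan_log_file("/var/log/syslog")` returns the matching lines, if any. An empty result does not rule out a hardware fault, since a hard hang may leave no time for anything to be written.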

  • Thanks. I haven't heard of mce before. Is that something I should be looking for in the syslog or journal?


    Good idea about the RAM, I have 2 x 4GB sticks in there, I'll try with just one of them at a time to see if it's caused by one of those sticks.


    The nature of this problem means it could be a month before I can rule out the RAM (waiting for the hang to happen, or not :) )

  • I'm not absolutely sure at the moment, but yes, I think if you search for mce in the syslog you should find information about critical hardware errors.


    If you use the flashmemory plugin, it may be necessary to disable it. (It moves logs to RAM, so they are lost on a crash. After a reboot you can only find logs from before the last normal shutdown.)
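One way to check whether logs currently live in RAM is to look at what is mounted on the log directory. The sketch below assumes the plugin mounts a RAM-backed filesystem (`tmpfs` or `ramfs`) directly on `/var/log`; if it works differently on your system, this check will not detect it.

```python
from typing import Optional


def filesystem_type(mounts_text: str, mount_point: str) -> Optional[str]:
    """Return the filesystem type mounted at `mount_point`, or None if not a mount point.

    `mounts_text` is expected in /proc/mounts format:
    device, mount point, filesystem type, options, ...
    """
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fs_type = fields[2]  # a later mount shadows an earlier one
    return fs_type


def logs_are_in_ram(mounts_text: str, log_dir: str = "/var/log") -> bool:
    """True if `log_dir` itself is a RAM-backed mount (assumption: tmpfs or ramfs)."""
    return filesystem_type(mounts_text, log_dir) in ("tmpfs", "ramfs")


def read_proc_mounts() -> str:
    """Read the kernel's current mount table (Linux only)."""
    with open("/proc/mounts") as handle:
        return handle.read()
```

On the server, `logs_are_in_ram(read_proc_mounts())` returning `True` would suggest that anything logged just before a hang is lost when the machine is power cycled.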
