Raspberry Pi 4 Watchdog Not Working?

    • Raspberry Pi 4 Watchdog Not Working?

      I am running OMV 5 on my RPi 4 (4 GB), installed per the setup guide.

      I am running into random crashes, mostly when I use Portainer to start or stop a container stack.
      I wanted to verify that the watchdog timer is running, so that it can reboot the system after a hang (the hang leaves the system totally unresponsive; it does not even answer pings).

      Below is what is in the watchdog config:
      /etc/watchdog.conf
      # This file is auto-generated by openmediavault (https://www.openmediavault.org)
      # WARNING: Do not edit this file, your changes will get lost.
      watchdog-device = /dev/watchdog
      # This greatly decreases the chance that watchdog won't be scheduled before
      # your machine is really loaded
      realtime = yes
      priority = 1

      The watchdog is not reacting to the hang, and I am forced to manually power cycle the unit to recover.

      What does 'realtime = yes' mean, and is there a way to configure the watchdog, given that the file says OMV will overwrite any changes?


      -------------------------------------------------------------------------------------------------------------------------------------------
      EDIT:
      It seems the watchdog service itself is failing to start, which is the problem.

      Attempted to start the watchdog via ssh, here is the output:

      Job for watchdog.service failed because the control process exited with error code.
      See "systemctl status watchdog.service" and "journalctl -xe" for details.

      This is the output of 'systemctl status watchdog.service'

      watchdog.service - watchdog daemon
      Loaded: loaded (/lib/systemd/system/watchdog.service; enabled; vendor preset: enabled)
      Active: failed (Result: exit-code) since Sun 2019-12-22 13:09:49 GMT; 29s ago
      Process: 7656 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modpro
      Process: 7658 ExecStopPost=/bin/sh -c [ $run_wd_keepalive != 1 ] || false (code=exited, status=0/SUCCESS)


      Dec 22 13:09:49 raspberrypi systemd[1]: Starting watchdog daemon...
      Dec 22 13:09:49 raspberrypi sh[7656]: modprobe: FATAL: Module softdog not found in directory /lib/modules/4.19.75-v7l+
      Dec 22 13:09:49 raspberrypi systemd[1]: watchdog.service: Control process exited, code=exited, status=1/FAILURE
      Dec 22 13:09:49 raspberrypi systemd[1]: watchdog.service: Failed with result 'exit-code'.
      Dec 22 13:09:49 raspberrypi systemd[1]: Failed to start watchdog daemon.
      Dec 22 13:09:49 raspberrypi systemd[1]: watchdog.service: Triggering OnFailure= dependencies.


      This is the output of 'journalctl -xe'

      --
      -- A start job for unit wd_keepalive.service has begun execution.
      --
      -- The job identifier is 1507.
      Dec 22 13:07:25 raspberrypi sh[6662]: modprobe: FATAL: Module softdog not found in directory /lib/modules/4.19.75-v7l+
      Dec 22 13:07:25 raspberrypi systemd[1]: wd_keepalive.service: Control process exited, code=exited, status=1/FAILURE
      -- Subject: Unit process exited
      -- Defined-By: systemd
      -- Support: debian.org/support
      --
      -- An ExecStartPre= process belonging to unit wd_keepalive.service has exited.
      --
      -- The process' exit code is 'exited' and its exit status is 1.
      Dec 22 13:07:25 raspberrypi systemd[1]: wd_keepalive.service: Failed with result 'exit-code'.
      -- Subject: Unit failed
      -- Defined-By: systemd
      -- Support: debian.org/support
      --
      -- The unit wd_keepalive.service has entered the 'failed' state with result 'exit-code'.
      Dec 22 13:07:25 raspberrypi systemd[1]: Failed to start watchdog keepalive daemon.
      -- Subject: A start job for unit wd_keepalive.service has failed
      -- Defined-By: systemd
      -- Support: debian.org/support
      --
      -- A start job for unit wd_keepalive.service has finished with a failure.
      --
      -- The job identifier is 1507 and the job result is failed.
      Dec 22 13:07:27 raspberrypi monit[1483]: 'filesystem_srv_dev-disk-by-id-usb-SABRENT_SABRENT_DB9876543214E-0-0-part1' spa
      Dec 22 13:07:57 raspberrypi monit[1483]: 'filesystem_srv_dev-disk-by-id-usb-SABRENT_SABRENT_DB9876543214E-0-0-part1' spa


    • I've been researching this all day.
      Apparently the fault is due to Raspbian not shipping the 'softdog' module, which Debian Linux does ship.

      The answers I found in several threads were essentially all the same: "this is not OMV's problem and therefore it should be brought up with the people that develop Raspbian. OMV devs will not dedicate time to fix it."

      softdog is a software watchdog. The Raspberry Pi has a hardware watchdog that can very easily be enabled, so I can understand why softdog was never included. Enabling the hardware watchdog provides the reboot on system hang I was hoping for.

      Solution:
      Edit the file /etc/systemd/system.conf and set the following options:

      RuntimeWatchdogSec=10 (the Pi's hardware allows a maximum of 15 seconds here; I set mine to 14)
      ShutdownWatchdogSec=10min

      Then I verified with a fork bomb test:
      : (){ :|:& };:

      After a few seconds, I saw the desired effect. I had two command prompts open: one with an SSH session and another running a persistent ping test. The pings never stopped being answered, but the SSH session was kicked. After a few seconds I tried to SSH into the device again and it worked like a charm. (Previously, all my attempts at enabling the watchdog ended with a hard power cycle, physically unplugging the Pi 4 after the fork bomb test. During that test the pings kept working, just as they do with the watchdog active, but SSH connections always timed out. The fact that I can now SSH in after a fork bomb seems to indicate this is working fine.)
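
      For anyone repeating this, a small sketch for checking that the setting is actually uncommented in the file before testing. The function name runtime_watchdog_setting is made up for this example, and it assumes the file uses plain KEY=VALUE lines as /etc/systemd/system.conf does:

```shell
#!/bin/sh
# Hypothetical helper: print the active (uncommented) RuntimeWatchdogSec
# value from a systemd manager config file. Prints nothing if the option
# is missing or only present as a "#RuntimeWatchdogSec=..." comment.
runtime_watchdog_setting() {
    conf=${1:-/etc/systemd/system.conf}
    # Strip "RuntimeWatchdogSec =" from matching lines and keep the last
    # match, since a later assignment overrides an earlier one.
    sed -n 's/^[[:space:]]*RuntimeWatchdogSec[[:space:]]*=[[:space:]]*//p' "$conf" | tail -n 1
}
```

      Running runtime_watchdog_setting with no argument reads /etc/systemd/system.conf. After a reboot, the value systemd is really using can be compared with the output of systemctl show -p RuntimeWatchdogUSec.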


    • RFBomb wrote:

      The answers I found in several threads were essentially all the same: "this is not OMV's problem and therefore it should be brought up with the people that develop Raspbian. OMV devs will not dedicate time to fix it."
      Maybe you should submit a pull request with your change to this file - github.com/openmediavault/open…ploy/watchdog/default.sls

      I wasn't trying to be snarky; I was just summarizing the various responses I found. That said, thank you for linking to that file. I kept seeing the 'auto-generated by openmediavault' header, but was never able to find where OMV keeps track of that and pushes the changes out. (I also don't know the interval, so I'm not sure if or when my changes would be wiped out.)

      I've continued to experiment, and found that although the systemd approach from my original reply was mostly functional, it did not give a full reboot. It got the OS back, but did not restart OMV or Portainer.

      The following changes are what allowed the watchdog to work as expected, restoring the full system after a hang.

      Undo the systemd changes I noted in my original post -- they cause systemd to open /dev/watchdog, which prevents the watchdog daemon from accessing it. (Though I suppose you could set the daemon to use /dev/watchdog0 if you wanted to.)

      Additional lines to /etc/watchdog.conf
      watchdog-timeout=14
      max-load-1 = 24

      Changes to /etc/default/watchdog
      watchdog_options="softboot" -- This calls for a full reboot of the system. Same as if power was cycled (without actually cycling power).

      watchdog_module="bcm2835_wdt" -- Changed from 'softdog' to look at the hardware device on the RPI.
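
      Since the original failure was modprobe not finding softdog, here is a rough sketch for checking the configured module before starting the service. The function names wd_module and check_wd_module are made up for this example, and it assumes /etc/default/watchdog contains only plain shell variable assignments:

```shell
#!/bin/sh
# Hypothetical helper: print the watchdog_module value from a defaults file.
# The file is sourced in a subshell so its variables do not leak out.
wd_module() {
    ( . "$1" && printf '%s\n' "${watchdog_module:-none}" )
}

# Hypothetical helper: report whether the configured module can be resolved
# for the running kernel. "modprobe --dry-run" resolves the module without
# actually loading it.
check_wd_module() {
    mod=$(wd_module "${1:-/etc/default/watchdog}") || return 2
    if [ "$mod" = "none" ]; then
        echo "no watchdog module configured"
    elif modprobe --dry-run "$mod" >/dev/null 2>&1; then
        echo "$mod: available"
    else
        echo "$mod: not found for kernel $(uname -r)"
        return 1
    fi
}
```

      On my setup I would expect check_wd_module to complain about softdog and accept bcm2835_wdt, but I have not tested this exact script.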




      I will try to submit that change over on GitHub. Once again, thanks for linking me to it; I was scratching my head over whether my changes would actually be erased or if it was just a warning.


      Edit: I submitted it as an 'Issue'. (I don't know how to perform a 'pull request', and when I tried, it covered a whole branch of changes rather than the single one I'm recommending.)


    • If someone wants to use a different watchdog module simply do the following:

      Add OMV_WATCHDOG_WATCHDOGMODULE="foobar" to /etc/default/openmediavault and execute omv-salt deploy run watchdog.

      See github.com/openmediavault/open…/watchdog/default.sls#L25 for more environment variables.

      P.S.: This is the OMV way to customize and override defaults that are under control of OMV.
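
      As a rough sketch, the edit to /etc/default/openmediavault could be scripted so that running it twice does not duplicate the line. The helper name set_default_var is made up for this example, and it assumes GNU sed (for -i) and values that contain no '|', '&' or backslash characters:

```shell
#!/bin/sh
# Hypothetical helper: set KEY="VALUE" in a defaults file, replacing an
# existing assignment of KEY or appending a new line if there is none.
set_default_var() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        # KEY is already set: rewrite that line in place.
        sed -i "s|^${key}=.*|${key}=\"${value}\"|" "$file"
    else
        # KEY is not set yet: append it.
        printf '%s="%s"\n' "$key" "$value" >> "$file"
    fi
}

# Intended use (as root), followed by the deploy command from the post above:
#   set_default_var /etc/default/openmediavault OMV_WATCHDOG_WATCHDOGMODULE foobar
#   omv-salt deploy run watchdog
```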

    • That is great, I wish I had known about that.
      Unfortunately, it still does not fully resolve the issue for the Raspberry Pi.

      The watchdog daemon uses a default timeout of 60 s if the conf file does not specify otherwise, but the Pi's hardware watchdog has a maximum timeout of 15 s.
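
      To make that mismatch concrete, a minimal sketch of the check I have in mind. The function name check_wd_timeout is made up for this example, and the 15 s default is simply the Pi limit mentioned above, passed in rather than read from the hardware:

```shell
#!/bin/sh
# Hypothetical helper: check a watchdog timeout (in seconds) against a
# hardware maximum. The maximum defaults to 15, the Pi limit quoted above.
check_wd_timeout() {
    timeout=$1
    max=${2:-15}
    # Reject anything that is not a plain non-negative integer.
    case $timeout in
        ''|*[!0-9]*)
            echo "not a whole number of seconds: '$timeout'"
            return 2
            ;;
    esac
    if [ "$timeout" -lt 1 ] || [ "$timeout" -gt "$max" ]; then
        echo "timeout ${timeout}s is outside 1..${max}s"
        return 1
    fi
    echo "timeout ${timeout}s is within the ${max}s limit"
}
```

      With this, check_wd_timeout 60 fails while check_wd_timeout 14 passes, which is why the daemon's default does not suit the Pi.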

      I propose adding a new parameter to the base salt file to have a custom timer.

      {% set watchdog_timeout = salt['pillar.get']('default:OMV_WATCHDOG_WATCHDOGTIMER', '60') %}

      then under 'realtime' option, specify:
      watchdog-timeout= {{ watchdog_timeout }}

      Then, a user would be able to customize their watchdog timer for their hardware with ease, using the method you suggest:
      Add / Modify the lines in /etc/default/openmediavault
      OMV_WATCHDOG_WATCHDOGMODULE="bcm2835_wdt"
      OMV_WATCHDOG_WATCHDOGTIMER="14"
      then execute omv-salt deploy run watchdog

      Without the timeout setting, the watchdog daemon will still fail to start on the Pi.
      Unfortunately, after some additional testing I noticed that the watchdog daemon isn't starting on my device unless I start it manually. My workaround is going to be a cron job that starts it once OMV finishes booting.
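
      A sketch of what that cron workaround might look like. The function name write_wd_cron, the file name /etc/cron.d/start-watchdog and the 120 second delay are all assumptions of mine; the delay is only a guess at how long OMV needs to finish booting:

```shell
#!/bin/sh
# Hypothetical helper: write an /etc/cron.d style entry that starts the
# watchdog service a fixed delay after boot. Files in /etc/cron.d carry a
# user field ("root") between the schedule and the command.
write_wd_cron() {
    target=${1:-/etc/cron.d/start-watchdog}
    delay=${2:-120}   # seconds to wait after boot (assumed value)
    printf '@reboot root sleep %s && /bin/systemctl start watchdog.service\n' \
        "$delay" > "$target"
}
```

      Calling write_wd_cron as root with no arguments would create the file with the assumed defaults; a longer delay can be passed as the second argument.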