Server Crashing - Need help debugging


    • Server Crashing - Need help debugging

      Lately my server (a Dell T1700) has started crashing about once a day, so hard that it takes my entire network down until I reboot it. When I look at the screen I see a variety of error messages, but the ones I see most often and that stand out to me are:

      watchdog BUG: soft lockup - CPU#3 stuck for 22s!
      rcu INFO: rcu_sched detected stall on CPU
      (see attachment)

      My friend Google wasn't much help. It sounds like a wide variety of things could cause problems like this. I had a few ideas and was wondering if people could help me narrow it down or suggest something else:
      Failing HDD
      - Not likely, as I recently replaced it and have scanned it several times with no issues.
      Failing RAM
      - Ran memtest86+ for over 24 hours with no issues.
      Failing MB/CPU
      - Not sure, but not likely, and if it is then I'm more or less SOL.
      Possible failing PSU
      - Some people online said replacing it fixed the problem for them, others said it didn't make any difference. It would probably be good to get rid of the OEM power supply (which probably isn't great quality), but I hate spending money when I don't have to.
      Corrupted kernel files
      - Possible, but the only solution would be a complete reinstall, which would take a really long time (I don't have a backup that I know is clean).

      Anything else I haven't thought of?

      Thanks!
      Images
      • IMG_20190610_163632.jpg

    • As you said, it could be any of the above, and Google searches are not definitive; nothing new there. It could also be related to load.
      One thing does stand out in that screenshot (albeit hard to read): Plex Script, whatever that is. It could suggest that it cannot run because of load on the system. The first thing to try is to disable/stop Plex and reboot. Does the error stop? If it does, then there is something wrong with the Plex install. Are you running the plugin, or is it installed via the CLI? If it's the plugin, that is deprecated with no further support, and most users have Plex deployed via Docker.
      Another option is to unplug all your drives and reboot. Is the error still there? If not, that could be hardware or software.
      If this is hardware related then it's going to be difficult to find.

      The last resort is a clean install.
    • Yes, that particular instance was Plex, but it's usually just a random process that fails, and it seems to be a different one every time. And yes, I run just about everything in Docker.
      It could be load related, but it doesn't always seem to be, as it has also failed when not under load.
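      One way to check whether it really is a different process (or always the same CPU) each time is to tally the soft-lockup lines from saved kernel logs. The sketch below is only an illustration: the function name `summarize_lockups` is made up, and the regex assumes the usual kernel message shape with an optional trailing `[process:pid]`, which may not match every kernel version.

```python
import re
from collections import Counter

# Assumed message shape (the trailing [process:pid] part is optional):
#   watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [Plex Script Hos:1234]
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
    r"(?:\s*\[(?P<proc>[^\]]*):(?P<pid>\d+)\])?"
)


def summarize_lockups(lines):
    """Count soft-lockup reports per CPU number and per process name.

    `lines` is any iterable of log lines (a list, an open file, sys.stdin).
    Lines that do not contain a soft-lockup message are ignored.
    """
    by_cpu = Counter()
    by_proc = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match is None:
            continue
        by_cpu[int(match.group("cpu"))] += 1
        # The process name is missing when the log line was cut off.
        by_proc[match.group("proc") or "unknown"] += 1
    return by_cpu, by_proc
```

      Feeding it the kernel log from before the crash (for example a saved `kern.log`, or the previous boot's kernel messages if the journal is persistent) gives two counters. If one CPU or one process dominates, that is a hint; if the counts are spread out, it supports the "random every time" impression.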

      It's really tough because it doesn't happen right away; it takes a while to reproduce. So I'll change one thing and might not know for 24+ hours whether it fails again.
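      Since each test takes a day or more, it may help to leave a small heartbeat logger running, so that after a crash the last line shows roughly when the box stopped responding and what the load was. This is a minimal sketch, not a proper monitoring tool: the names `heartbeat_line` and `run` and the log path are hypothetical, `os.getloadavg()` is Unix-only, and in a hard lockup the last few writes can still be lost despite the `fsync`.

```python
import os
import time
from datetime import datetime


def heartbeat_line(now=None, load=None):
    """Format one log line with a timestamp and the 1/5/15-minute load averages."""
    if now is None:
        now = datetime.now()
    if load is None:
        load = os.getloadavg()
    return "{} load1={:.2f} load5={:.2f} load15={:.2f}\n".format(
        now.isoformat(timespec="seconds"), *load
    )


def run(path, interval=10.0, beats=None):
    """Append a heartbeat line to `path` every `interval` seconds.

    `beats=None` means run until killed; a number stops after that many lines.
    """
    written = 0
    with open(path, "a") as log:
        while beats is None or written < beats:
            log.write(heartbeat_line())
            log.flush()
            # Push the line to disk so it survives a sudden crash if possible.
            os.fsync(log.fileno())
            written += 1
            time.sleep(interval)
```

      Calling `run("heartbeat.log")` before each experiment and reading the tail of the file after the reboot gives a rough crash time and the load just before it, which makes it easier to compare one change against the next.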