System freezes/crashes and reboots without obvious reason

  • I had no problems running OMV5 but nonetheless a few weeks ago I did a fresh install of OMV 7 and it's working okay-ish:

    Sometimes, the system freezes or crashes and then reboots without an obvious reason. I'd rule out hardware problems since nothing like that happened with OMV5.

    I've brought this up on the Discord server but that only led to changing the Jellyfin container because of an “Out of memory: Killed process 8052 (jellyfin)” message in the logs.


    After installing Immich and letting it scan GBs of images, the system also became unresponsive and rebooted a few times. I've stopped the container for now.

    I've also suspected the MiniDLNA service since it has to scan a few gigabytes (and that scan was still running when the most recent reboot happened).


    I could not find anything particular in the syslog. Looking at the Performance Graphs, it seems to happen with a high Wait-IO. Four of my six hard disks are connected via a PCIe SATA card; the other two and the system SSD are connected directly to the motherboard. Still, I don't see how this should be such a bottleneck that the system just gives up …

    I think it also happened outside of high Wait-IO, but this is not always clear since the graph is very compressed for events that are too long ago.


    First aid's items 7 and 8 return “The configuration status file is valid.” and “All RRD database files are valid.”


    The most recent reboot happened tonight; about 20 minutes of syslog are missing. Before the system stops recording log entries, I'm seeing a lot of messages like

    Code
    2024-10-11T04:00:30+0200 OMV collectd[907]: plugin_read_thread: read-function of the `rrdcached' plugin took 16.404 seconds, which is above its read interval (10.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
    2024-10-11T04:00:38+0200 OMV collectd[907]: plugin_read_thread: read-function of the `df' plugin took 25.177 seconds, which is above its read interval (10.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.

    The rrdcached entries dominate, though.
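To check which plugin really dominates, the warnings can be tallied per plugin. The following is a minimal sketch, assuming the collectd messages always follow the wording quoted above; the function name `summarize_slow_reads` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches collectd warnings of the form quoted above, e.g.
# "read-function of the `df' plugin took 25.177 seconds".
SLOW_READ = re.compile(
    r"read-function of the `(?P<plugin>[^']+)' plugin took "
    r"(?P<seconds>\d+(?:\.\d+)?) seconds"
)


def summarize_slow_reads(lines):
    """Return {plugin: (warning_count, slowest_read_in_seconds)}."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = SLOW_READ.search(line)
        if match is None:
            continue  # not a slow-read warning
        entry = stats[match.group("plugin")]
        entry[0] += 1
        entry[1] = max(entry[1], float(match.group("seconds")))
    return {plugin: (count, worst) for plugin, (count, worst) in stats.items()}


if __name__ == "__main__":
    # The two log lines quoted in the post, shortened after the timing.
    sample = [
        "2024-10-11T04:00:30+0200 OMV collectd[907]: plugin_read_thread: "
        "read-function of the `rrdcached' plugin took 16.404 seconds, which is above its read interval",
        "2024-10-11T04:00:38+0200 OMV collectd[907]: plugin_read_thread: "
        "read-function of the `df' plugin took 25.177 seconds, which is above its read interval",
    ]
    for plugin, (count, worst) in summarize_slow_reads(sample).items():
        print(f"{plugin}: {count} warning(s), slowest read {worst:.3f} s")
```

Fed with the full syslog instead of the two sample lines, the per-plugin counts show whether rrdcached or df is the one that stalls most often.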


    I've attached the log starting at 12pm. The crash/reboot happens between 4am and 4:23am (line 488/489).
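The window in which logging stopped can also be located automatically by looking for the longest pause between consecutive entries. This is a rough sketch, assuming every relevant line starts with a timestamp in the format shown in the log excerpt; `largest_gap` is a hypothetical helper name.

```python
from datetime import datetime

# Timestamp layout used in the syslog excerpt, e.g. 2024-10-11T04:00:30+0200
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def largest_gap(lines):
    """Return (gap, last_entry_before, first_entry_after), or None if fewer
    than two timestamped lines are found."""
    previous = None
    best = None
    for line in lines:
        stamp_text = line.split(" ", 1)[0]
        try:
            stamp = datetime.strptime(stamp_text, TIMESTAMP_FORMAT)
        except ValueError:
            continue  # line does not start with a timestamp
        if previous is not None:
            gap = stamp - previous
            if best is None or gap > best[0]:
                best = (gap, previous, stamp)
        previous = stamp
    return best


if __name__ == "__main__":
    sample = [
        "2024-10-11T04:00:30+0200 OMV collectd[907]: ...",
        "2024-10-11T04:00:38+0200 OMV collectd[907]: ...",
        # Illustrative only: stands in for the first entry after the reboot.
        "2024-10-11T04:23:00+0200 OMV (hypothetical first entry after reboot)",
    ]
    gap, before, after = largest_gap(sample)
    print(f"longest silence: {gap} between {before} and {after}")
```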


    Running last reboot shows:

    I remember turning on SMART monitoring later, which makes me think this thread might be related: hard reboots with Samsung 850 pro SSD and SMART monitoring since OMV 6.9.6

    But it's an Intenso SSD. I'm inclined to disable SMART monitoring and, well, wait for a week until it happens again or not …


    Motherboard: Gigabyte B450 Aorus M.

    CPU: AMD Ryzen 3 2200G with Radeon Vega Graphics


  • Since it was at night, it wasn't user-caused. I have two Rsync tasks: one at 4am, one at 5am. Neither had to copy anything, though, since nothing changed in the sources. In the first image one can see small spikes every five minutes; I believe these are caused by the Arrs. Jellyfin also has some scheduled tasks, but looking at the overview, they've all run again after the mentioned reboot, so it isn't something that causes it every time.


    I'll attach the images I posted on Discord here as crash123. Note that the first memory usage image has a different scale than the CPU usage one above. It shows that the memory grew until the crash. The second column shows that it can happen without this growth.


    I just now noticed that Prowlarr says that an indexer proxy is not available. I let it test the proxy twice, which led to the images titled freeze1. I couldn't repeat this a third time because the test was successful. I believe I could see kswapd0 having a high load during the unsuccessful tests (which didn't terminate on their own). However, for the crash at 4am from my first post, flaresolverr doesn't show anything in the log.


    Interestingly, before and during yesterday's crash, I can see that Prowlarr logged something:

    It appears that the memory usage graphs stop at 6GB, which surprises me a bit because I've installed 2x 4GB. The System Information tab reports about 6.72G(i)B. Running free -b -t returns

    Code
                   total        used        free      shared  buff/cache   available
    Mem:      7211257856  5782593536   190672896   262074368  1810407424  1428664320
    Swap:     1023406080  1023406080           0
    Total:    8234663936  6805999616   190672896

    The 7211257856 bytes correspond to 6.72 GiB. I don't remember if it ever showed 8GB. Is the Radeon Vega Graphics card reserving some for itself? Do I have a memory problem? Should I run memtest?
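The conversion and the size of the gap can be worked out as follows. This is a small sketch, assuming each 4GB stick provides exactly 4 GiB; the result only quantifies the difference and does not show where the memory went.

```python
GIB = 1024 ** 3  # bytes per GiB


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


reported_total = 7211257856   # total memory as printed by free -b -t
installed = 2 * 4 * GIB       # assumption: two sticks of exactly 4 GiB each

missing = installed - reported_total

print(f"reported by the OS:    {to_gib(reported_total):.2f} GiB")
print(f"not visible to the OS: {to_gib(missing):.2f} GiB")
```

A difference of this order would be consistent with memory set aside for the integrated graphics and firmware, but that is an interpretation rather than something the numbers prove.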


    I've now checked the RAM sticks and put them back in, and it reports the same. While booting up I could see that it initializes a RAM disk. Is that the 1GB swap it shows?

    Booting up I could also see the following messages:

    Code
    [    0.151386] ACPI BIOS Error (bug): Failure creating named object [_SB.SMIC], AE_ALREADY_EXISTS (20220331/dswload2-326)
    [    0.151394] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20220331/psobject-220)
    [    0.151399] ACPI BIOS Error (bug): Failure creating named object [_SB.SMIB], AE_ALREADY_EXISTS (20220331/dsfield-637)
    • Official Post

    Did you port the jellyfin database to OMV7? That result could be a jellyfin scan of your entire library to create a new database. Depending on the size of your library and the speed of your hardware, this task could take up to several days.

    • Official Post

    I'd rule out hardware problems since nothing like that happened with OMV5.

    Before ruling out hardware, we need to look at a few realities.


    - Hardware can, and does, develop issues over time. (As it is with most hardware, vehicles included, it works until it doesn't.)

    - Just because a failure occurred after upgrading doesn't mean, definitively, that the upgrade is responsible. It could be a coincidence. (However, I'll admit that it is an indicator.)

    - As an example, power supplies are known for developing issues over time.

    - Nearly any hardware component of a PC can trigger a random BIOS reboot, including hard drives and SSDs.

    - Have you looked at CPU temps? (Obviously, a processor that's overworked and getting hot can trigger a reboot.) Recently, on one of my clients, I found that I needed to refresh the thermal paste between the heat sink and the CPU. (Why? The CPU was getting hot and I was observing odd behavior.) The old paste was thoroughly dried out.

    The 7211257856 bytes correspond to 6.72 GiB. I don't remember if it ever showed 8GB. Is the Radeon Vega Graphics card reserving some for itself? Do I have a memory problem? Should I run memtest?

    Yes because, if for no other reason, it's easy to do. (It appears that you've already reseated the modules. Along with being easy and quick, that's a good thing to do.) Bad mem modules are known to cause bizarre behavior.

    There are stress testing distros; you might look at something like -> Stress Linux to see if it can trigger a reboot.


    Four of my six hard-disks are connected via a PCIe SATA card,

    This is a wildcard. Who made the "PCIe SATA card"? Is this a generic Aliexpress item like -> this? When was it added? Have you tested with this card unplugged?

    I remember turning on SMART later which might make me think this problem is related: hard reboots with Samsung 850 pro SSD and SMART monitoring since OMV 6.9.6

    But it's an Intenso SSD. I'm inclined to disable SMART monitoring and, well, wait for a week until it happens again or not …

    The above ref is a very specific issue that the OP did not report back on. In any case, your idea of disabling SMART and waiting a week is not a bad one. (Again, free and easy.) On the other hand, generally speaking, SMART is your friend. I would closely examine all SMART stats on all drives to get an idea of their general health.

    Since "Intenso" is an off-brand SSD (most likely rebranded from another SSD OEM), you might consider testing it with one of the many available -> drive tools.

    _______________________________________________________________________________


    Looking at the software end:


    - You said you did a fresh rebuild. Did you run a hash on the ISO after downloading it? (To check for corruption during the download.)

    - Have you disabled ALL Dockers and other server add-ons? (While Debian and its packages are thoroughly tested, Dockers and other add-ons usually do not get the same level of testing.)
    - If you have a backup of the old OMV5 build, it would be worth trying again for a test.
    - Because it's easy and free, doing another OMV7 build (after checking the hash after the download) is in order. (In a second build of OMV7, you might consider building it on a USB thumbdrive, replacing the Intenso SSD, so that all of your hard drives and / or SSDs can be removed one at a time.)
    - You could try another Debian-based Live distro like -> Knoppix to see how it reacts to your hardware.
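The hash check from the first item can be sketched as follows. This assumes the download page publishes a SHA-256 checksum; the helper names and any file name passed in are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks so
    that a large ISO does not have to fit into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare the file's digest with the checksum copied from the
    download page (whitespace and letter case are ignored)."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Usage (hypothetical file name and checksum):
#   matches_published_hash("openmediavault.iso", "<checksum from the site>")
```

If the project publishes a different algorithm, `hashlib.sha256` would need to be swapped for the matching `hashlib` constructor.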

    _______________________________________________________________________________

    When we're talking about random reboots, here's the bottom line:
    Other than memory testing, a CPU burn-in test, and maybe combing through dmesg for under / over voltage situations (and other hardware issues) that may exist and are detected, you're looking at a process of elimination. That could become painful so it makes sense to do the easiest things first.
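Combing through dmesg can be partly automated with a keyword filter. The sketch below is only a starting point: the keyword list is an assumption, not a complete catalogue of hardware-related kernel messages, and `filter_suspicious` is a made-up helper name.

```python
# Assumed keywords; extend or trim the list to suit the hardware in question.
DEFAULT_KEYWORDS = ("error", "voltage", "thermal", "machine check")


def filter_suspicious(lines, keywords=DEFAULT_KEYWORDS):
    """Return the lines that contain any keyword, ignoring letter case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        line for line in lines
        if any(keyword in line.lower() for keyword in lowered)
    ]


if __name__ == "__main__":
    # Sample lines shortened from messages quoted earlier in this thread.
    sample = [
        "[    0.151386] ACPI BIOS Error (bug): Failure creating named object",
        "OMV collectd[907]: plugin_read_thread: read-function of the `df' plugin took 25.177 seconds",
    ]
    for line in filter_suspicious(sample):
        print(line)
```

In practice the input would be the saved output of dmesg or the syslog, read line by line from a file.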

  • Did you port the jellyfin database to OMV7? That result could be a jellyfin scan of your entire library to create a new database. Depending on the size of your library and the speed of your hardware, this task could take up to several days.

    No, it was a fresh install. The library has been scanned by now. I set up the system on August 13.


    Before ruling out hardware, we need to look at a few realities.

    Yeah, I was hoping it is something else before I go through the whole spiel of meticulously disabling or removing parts of the build and waiting a week or more until maybe something happens again. Maybe it is something obvious to the experts on this forum or I could run quick tests to check various configurations …

    This is a wildcard. Who made the "PCIe SATA card"? Is this a generic Aliexpress item like -> this? When was it added? Have you tested with this card unplugged?

    The card is a German Inline 76617F. The model seems to be discontinued as I can't find it on their website but the manual is available (includes a photo of the card even): https://www.inline-info.com/do…te/01_Handbuch/76617G.pdf
    It's a PCIe 2.0 card. A user on the OMV Discord server worried that this is bottlenecking my system as it only supports a throughput of 0.5GB/s. But I don't see why this should freeze or crash the system.

    No, I haven't tested it without the card yet, since without it the system could barely perform the task it is meant for.
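The 0.5GB/s figure matches the usable rate of a single PCIe 2.0 lane (5 GT/s with 8b/10b encoding). A short worked calculation, assuming the card uses one lane and that all four attached disks are busy at once:

```python
def pcie2_lane_throughput_mb_s():
    """Usable throughput of one PCIe 2.0 lane in MB/s."""
    transfers_per_second = 5_000_000_000          # PCIe 2.0: 5 GT/s per lane
    payload_bits = transfers_per_second * 8 // 10  # 8b/10b line encoding
    return payload_bits / 8 / 1_000_000            # bits -> bytes -> MB


lanes = 1   # assumption: the card is a single-lane (x1) design
disks = 4   # disks attached to the card, as described in the first post

total_mb_s = pcie2_lane_throughput_mb_s() * lanes
per_disk_mb_s = total_mb_s / disks

print(f"card total: {total_mb_s:.0f} MB/s")
print(f"per disk when all {disks} are busy: {per_disk_mb_s:.0f} MB/s")
```

These are theoretical ceilings before protocol overhead. They put a number on the suspected bottleneck but say nothing about stability.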

    Stress Linux to see if it can trigger a reboot.

    I will consider this.


    I've started the Immich job again, and Jellyfin, which was being used to stream music, was killed nine times, each time because of "Out of memory". I don't know whether this indicates a RAM problem or a RAM management problem. (Who decides Jellyfin gets killed so that Immich can get more? Just give Immich less!)
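On Linux the choice of victim is made by the kernel's OOM killer, not by either container. To see which processes it picked and how often, the kill messages can be counted per process name. This is a minimal sketch, assuming the kernel message reads "Out of memory: Killed process <pid> (<name>)"; `count_oom_kills` is a hypothetical helper.

```python
import re
from collections import Counter

# Kernel OOM-killer message, e.g. "Out of memory: Killed process 8052 (jellyfin)"
OOM_KILL = re.compile(r"Out of memory: Killed process (\d+) \(([^)]+)\)")


def count_oom_kills(lines):
    """Count OOM kills per process name."""
    kills = Counter()
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            kills[match.group(2)] += 1
    return kills


if __name__ == "__main__":
    sample = [
        "kernel: Out of memory: Killed process 8052 (jellyfin)",
        "collectd[907]: plugin_read_thread: unrelated entry",
    ]
    for name, count in count_oom_kills(sample).most_common():
        print(f"{name}: killed {count} time(s)")
```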


    I am currently running Memtest, which has now passed five tests (four in the attached image). It shows two 4GB sticks with 6.92GB; I'm guessing the missing 1GB is still the part reserved for the GPU. The sticks are installed in slots 1 and 2 as per the manual for dual channel support. I don't know why Memtest labels these slot 2 and 3.


    Looking at the software end:


    - You said you did a fresh rebuild. Did you run a hash on the ISO after downloading it? (To check for corruption during the download.)

    - Have you disabled ALL Dockers and other server add-ons? (While Debian and its packages are thoroughly tested, Dockers and other add-ons usually do not get the same level of testing.)
    - If you have a backup of the old OMV5 build, it would be worth trying again for a test.
    - Because it's easy and free, doing another OMV7 build (after checking the hash after the download) is in order. (In a second build of OMV7, you might consider building it on a USB thumbdrive, replacing the Intenso SSD, so that all of your hard drives and / or SSDs can be removed one at a time.)
    - You could try another Debian-based Live distro like -> Knoppix to see how it reacts to your hardware.

    _______________________________________________________________________________

    When we're talking about random reboots, here's the bottom line:
    Other than memory testing, a CPU burn-in test, and maybe combing through dmesg for under / over voltage situations (and other hardware issues) that may exist and are detected, you're looking at a process of elimination. That could become painful so it makes sense to do the easiest things first.

    – I don't remember.

    – No, of course not. I want to use the system. I've already disabled/paused Immich and Minidlna as these seemed to be responsible, but that is just guessing.

    – I ran a backup process in OMV5 but I don't think I want to go through the process of restoring and configuring it again.

    – I'm inclined to buy an M.2 SSD to install a fresh (and hash-checked) OMV7 build.


    I was worried about the bottom line because process of elimination (i.e. stripping the system of its elements) is as far as I got on my own.

    Thank you for your answer.
