OMV crashing/rebooting

  • Hi,

    I am trying to figure out why OMV is randomly rebooting. I'm not sure if it is rebooting or crashing and then restarting. It happened on August 12th and 15th at midnight for no apparent reason. It appears that the system is offline for 6 minutes each time, or at least there are no logs in that time.

    OMV version 6.4.8-1 (Shaitan) | 64 bit | Kernel: Linux 6.1.0-0.deb11.7-amd64

    System running with i7-4770K and 16 GB RAM.


    The syslog for the first reboot looks like:

    And for the second reboot (shortened for brevity):


    Are there other logs which hold more info? I don't see anything being logged as to why the system is restarting.

    Thank you for any help you can provide.

  • crashtest


    You should check the logs in journalctl just before the boot.


    In the journal the boot process starts with something like

    Code
    -- Boot 27844393d3ad45d3b9c679a6c241968c --

    Here are some pages that explain how to use journalctl:

    Using journalctl - The Ultimate Guide To Logging
    Journalctl is a utility for querying and displaying logs from journald, systemd’s logging service. Since journald stores log data in a binary format instead of…
    www.loggly.com

    journalctl
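
    For example, to find the boot before a crash and look at the last messages it logged (a sketch; the boot offset depends on how many reboots have happened since):

    Code
    # List all boots the journal knows about (newest last).
    journalctl --list-boots

    # Show the end of the previous boot's log, i.e. the messages written
    # just before the crash/reboot.
    journalctl -b -1 -e

    # Note: if the journal is not persistent (no /var/log/journal directory
    # and Storage not set to "persistent" in /etc/systemd/journald.conf),
    # older boots may not be kept at all.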

  • Unfortunately I don't see any more information in the journalctl logs. It just restarted at 23:00 again today, and this is what the journalctl log looks like:

    Is there any way of changing the verbosity of the logs?

  • I'm not sure why it's not there, or what that means. I used the same method to print the journalctl logs, and the log from the previous incident did have the boot line; this is what it looked like:

    It is from the corresponding syslog2 incident posted above.


  • There's a SMART message in the 2nd syslog:


    Aug 15 23:40:19 node smartd[1483]: Device: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_500GB_S598NJ0NB02829P [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71


    71 °C is too hot; the 860 EVO is only rated for a 0 to 70 °C operating temperature. It might go into a self-protect mode to cool down and prevent permanent damage, potentially causing the random crashes.


    1. What is causing the high temperature: is it heavy r/w load, bad cooling, or both?
    2. What is its SMART health status, and what does the log show? (See the sketch just below for one way to pull both.)
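
    Something like this should answer question 2 (a sketch; /dev/sdX is a placeholder, replace it with the SSD's actual device node):

    Code
    # Overall health self-assessment (PASSED/FAILED).
    sudo smartctl -H /dev/sdX

    # SMART error log and self-test log.
    sudo smartctl -l error -l selftest /dev/sdX

    # Or everything at once: identity, attributes and logs.
    sudo smartctl -a /dev/sdX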
  • Aug 15 23:40:19 node smartd[1483]: Device: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_500GB_S598NJ0NB02829P [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71

    Am I missing something obvious here? I just checked the logs and this is what they look like:

    Code
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-ST10000NE0008-2JM101_ZHZ386NL [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-ST10000NE0008-2JM101_ZHZ386NL [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 38
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-ST8000DM004-2CX188_ZCT16A1G [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 68 to 69
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-ST8000DM004-2CX188_ZCT16A1G [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 68 to 69
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-WDC_WD120EDAZ-11F3RA0_5PJBZ4HF [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 180 to 175
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-WDC_WD80EMAZ-00WJTA0_7HJPDM1F [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 191 to 185
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S599NE0MA38750N [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 74
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-Samsung_SSD_860_EVO_500GB_S598NJ0NB02829P [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 74 to 75
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-ADATA_SX900_7D1920007493 [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29

    Am I right to assume this is saying that one drive (WDC_WD80EMAZ...) is reading 191 Celsius and the same SSD you mentioned (EVO_500GB) is reading 75 Celsius?

    I ask because I checked the page Drives > SMART > Devices and those two drives report 36 and 25 degrees respectively. I also checked the temperature manually in the terminal with sudo smartctl -a /dev/sdi | grep "Temperature" and it also says 25 (190 Airflow_Temperature_Cel 0x0032 075 055 000 Old_age Always - 25).

    (Sorry if these questions seem ignorant, I am no expert at these things.)


    Quote
    1. What is causing the high temperature: is it heavy r/w load, bad cooling, or both?

    I'm not sure. I have a Node 804 case, and the SSD in question is mounted in the front panel with a 120 mm fan about an inch from it pulling air.

    I also don't see excessive load in the performance statistics graph for that drive.

    Quote
    2. What is its SMART health status, and what does the log show?

    Here is the SMART data; from what I see it is saying the drive is at 25 Celsius.

    And some more stats which I believe show the drive has never gone above 45 Celsius:

    Why the discrepancy between the SMART logs and the syslog?


    And thank you for your help.

  • Agreed, a 191 °C temperature makes no sense; it seems the temperature in the syslog does not correlate with smartctl. /dev/sdi looks healthy to me, so temperature is likely not the root cause.
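
    An aside that may explain the mismatch: smartd normally reports changes in the normalised attribute value rather than the raw reading, and in the smartctl line quoted above the normalised value (075) and the raw value (25) add up to 100, so "changed from 74 to 75" would be the scaled number, not degrees. One way to see both columns side by side (device path taken from the post above):

    Code
    # Print all SMART attributes; the VALUE/WORST columns are normalised,
    # vendor-scaled numbers, while RAW_VALUE is the actual reading
    # (e.g. the temperature in degrees Celsius).
    sudo smartctl -A /dev/sdi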


    However, syslog shows 2 issues for a different drive, which seem to be more serious:

    Code
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-ST8000DM004-2CX188_ZCT16A1G [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 68 to 69
    Aug 26 19:57:13 node smartd[1466]: Device: /dev/disk/by-id/ata-ST8000DM004-2CX188_ZCT16A1G [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 68 to 69


    What is the SMART assessment for that specific drive?


    Also, I suggest double-checking in the OMV GUI under System => Power management => Scheduled Tasks that there is no daily reboot set up.
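
    The same check can be done from the shell if that is easier (a sketch; the cron locations listed are just the usual ones):

    Code
    # Look for any scheduled reboot/shutdown in the common cron locations.
    grep -riE 'reboot|shutdown' /etc/crontab /etc/cron.d/ /var/spool/cron/crontabs/ 2>/dev/null

    # Also list systemd timers, in case something is scheduled there.
    systemctl list-timers --all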

  • What is the SMART assessment for that specific drive?


    Also, I suggest double-checking in the OMV GUI under System => Power management => Scheduled Tasks that there is no daily reboot set up.

    Checked, there are no scheduled tasks set up.

    The SMART assessment says that the drive quality is good, but it seems to have a large number for 'Hardware_ECC_Recovered'.

    Here is the full assessment:

    SMARTExtended.txt

    Last week that value was lower: "195 Hardware_ECC_Recovered  -O-RC-  80  64  0  -  107826065"

    Is this serious?

  • My understanding of 195 Hardware_ECC_Recovered is that bigger and increasing values are better.


    1 Raw_Read_Error_Rate seems to be Seagate-specific, with no real meaning for the user.


    And the health assessment is PASSED in the SMART report, so this specific drive can also be considered OK and not related to the random reboots.


    Unfortunately, that leaves nothing suspect to analyze as the cause of the random reboots ...

  • Unfortunately, I still have not figured this issue out.

    This is what my uptime looks like for the past month.

    It just rebooted again and here are some images from the diagnostics pages:

    It looks like the Wait-IO spikes for about 10 minutes before the reboot occurs, and I think it is the sdi disk that is spiking, but I am not sure what exactly this means. The syslog is attached, with the first part shown below. The disk sdi is the SSD that OMV is installed on.

    According to the graphs, it looks like the machine was off from 19:49 - 18:06. As you can see below, there was a segfault; I am not sure if that is related.

    Here is the full log: syslog14.09.2023.txt

    I would appreciate any help you could provide.

  • The mounting is suspicious and should not be necessary. Did you set up the MegaRAID CLI so you can monitor the card? There are instructions around for installing the drivers on Debian, and then you can use storcli or connect from the MegaRAID UI running on another desktop. From there you can watch the temps and check all the settings. Maybe the drive is disconnecting for some reason that you will see in the MegaRAID log. Even cabling can be finicky.


    If it's not a cabling or disk issue, the controller could be overheating. These HW RAID cards are notorious for this. LSI/Avago says they can run super hot, but it really seems like a bad idea long term to me. I have a SAS LSI RAID card, a 9361-8i, and what I did was 1) reseat the heatsink on the RAID controller chip with new thermal paste, 2) use screws to tighten it down instead of the spring-clip screws that were on there, and 3) pick up a small Noctua fan. Find some small screws the right size and you can actually attach the fan to the metal fins on the heatsink; just screw between the fins. With that, my controller is running at ~50 °C. I didn't just make this up; it's what many people do with these cards. Even 50 °C sounds hot, but that damn controller chip reported 95 °C before the changes.
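
    For reference, once the CLI is installed, something like this can be used to keep an eye on the controller (a sketch; it assumes the binary is storcli64 and the controller is /c0, adjust for your setup):

    Code
    # Show the ROC (RAID-on-chip) temperature, if the card reports one.
    sudo storcli64 /c0 show temperature

    # Dump the full controller state, including events and drive status.
    sudo storcli64 /c0 show all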


    steve

  • The mounting is suspicious and should not be necessary. Did you set up the MegaRAID CLI so you can monitor the card? There are instructions around for installing the drivers on Debian, and then you can use storcli or connect from the MegaRAID UI running on another desktop. From there you can watch the temps and check all the settings. Maybe the drive is disconnecting for some reason that you will see in the MegaRAID log. Even cabling can be finicky.

    Thank you for the suggestions, I will take a look at these monitoring solutions.

    If it's not a cabling or disk issue, the controller could be overheating. These HW RAID cards are notorious for this. LSI/Avago says they can run super hot, but it really seems like a bad idea long term to me. I have a SAS LSI RAID card, a 9361-8i, and what I did was 1) reseat the heatsink on the RAID controller chip with new thermal paste, 2) use screws to tighten it down instead of the spring-clip screws that were on there, and 3) pick up a small Noctua fan. Find some small screws the right size and you can actually attach the fan to the metal fins on the heatsink; just screw between the fins. With that, my controller is running at ~50 °C. I didn't just make this up; it's what many people do with these cards. Even 50 °C sounds hot, but that damn controller chip reported 95 °C before the changes.

    The SSD is connected directly to the motherboard, but I do have an LSI card for some of the spinning disks. It is one that doesn't have any onboard temperature sensors, but it always feels very hot. I will try to do something similar to what you did when I have more time.

    The crashes only started happening frequently during the last two months, so perhaps it is less likely that the issue stems from cabling or temperatures, as those haven't really changed much over the last year.

  • I had another crash last night, and this one lasted a lot longer: according to the performance statistics, no data was logged for about 1 hour and 45 minutes (around 1:30-3:15 AM).

    From the log, it looks like nginx kept crashing for about 30 minutes before the entire OS crashed.

    And then for most of that hour and 45 minutes it looped with a message about 'oom-kill' killing a process and printing the call trace. The long loop ends when, instead of printing out the entire stack trace, the output changes from:

    Code
    Sep 18 02:47:51 node kernel: [283338.673771] RIP: 0033:0x45466e
    Sep 18 02:47:51 node kernel: [283338.674297] Code: 9c 00 00 00 83 3d 45 92 72 02 00 0f 8f 85 00 00 00 48 89 44 24 20 48 8b 88 38 01 00 00 48 89 c8 e8 37 3d 00 00 48 85 c0 74 10 <80> 78 24 06 75 0a 48 8b 6c 24 10 48 83 c4 18 c3 48 8b 44 24 20 48
    Sep 18 02:47:51 node kernel: [283338.674884] RSP: 002b:00007f049a7fbea8 EFLAGS: 00010206

    to an error about being unable to access opcode bytes, followed by a microcode update message as the system boots again:

    Code
    Sep 18 02:52:52 node kernel: [283629.875750] RIP: 0033:0x440ddb
    Sep 18 02:52:52 node kernel: [283629.876268] Code: Unable to access opcode bytes at 0x440db1.
    Sep 18 02:52:52 node kernel: [283629.876784] RSP: 002b:000000c003471d48 EFLAGS: 00010202
    Sep 18 03:13:13 node systemd-modules-load[360]: Inserted module 'macvlan'
    Sep 18 03:13:13 node kernel: [    0.000000] microcode: microcode updated early to revision 0x28, date = 2019-11-12
    Sep 18 03:13:13 node blkmapd[379]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory
    Sep 18 03:13:13 node kernel: [    0.000000] Linux version 6.1.0-0.deb11.7-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-2~bpo11+1 (2023-04-23)
    Sep 18 03:13:13 node kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.1.0-0.deb11.7-amd64 root=UUID=e3f751f8-a0bf-4a6d-b6d3-65283a6aaf28 ro quiet

    I'll copy some of the log that could be relevant here, and attach the entire log at the end.

  • How is your swap configured? Is it possible the system is starting to swap and swap is either not available or not working? When it comes to memory, though, it's hard to tell who is the victim and who is the guilty party.
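
    A quick way to check both (standard tools, nothing OMV-specific):

    Code
    # Show the configured swap devices/files and how much is in use.
    swapon --show

    # Overall memory and swap usage in human-readable units.
    free -h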


    The OOM killer has a scoring algorithm to pick a victim, and it's picking the process running QtWebEngine. Try shutting down whatever Docker container is using the Qt library and see if the system gets stable.


    Also, maybe change the mount -a to mount only the exact disk that was problematic. The mount -a is causing Docker to remount some things every time you run the command, and that might be stressing a little-used code path, leading to a leak somewhere.
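
    Something like this in the cron job instead of mount -a (a sketch; the mount point and UUID below are placeholders, take the real ones from blkid or /etc/fstab):

    Code
    # List filesystems and their UUIDs to find the one that needs remounting.
    blkid

    # If the filesystem is already in /etc/fstab, mounting it by its mount
    # point is enough (path is a placeholder):
    mount /srv/dev-disk-by-uuid-EXAMPLE

    # Otherwise mount it explicitly (UUID and mount point are placeholders):
    mount UUID=0000-EXAMPLE /srv/example-mount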


    steve

  • I haven't made any changes to the swap; it looks like it is 1 GB, and here is the output from cat /proc/meminfo:

    It looks like this could be related to some kind of memory leak, because according to the diagnostics graph the memory used is slowly increasing until the server crashes.

    I've removed the mount -a from the cron job, but that didn't stop the memory from increasing.

    I'm not sure exactly how to see which Docker container was the one killed by the oom-killer, or if that one even is the culprit; I only have basic knowledge in this respect.

    Would you know how to find the container that was running this task?

    Code
    task_memcg=/system.slice/docker-6719f68064d6f9a44987c5431b136fcad3fe58beb4321c90be7ec8e29beb44bb.scope,task=QtWebEngineProc,pid=9696,uid=1000

    Would I have to open each container and search all of its processes for a QtWebEngine process?

    Thanks for the assistance!

  • That slice is just sort of the big bucket the resources are coming out of. You could create a smaller slice for Docker and limit it; then you would just have Docker crashing, but it would be better to figure out what is leaking. Just run "docker stats" and watch it. Hopefully it will be obvious.
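
    For example (the long container ID below is the one from the task_memcg line in the oom-kill message above):

    Code
    # Watch per-container CPU and memory usage live; look for one whose
    # memory keeps climbing.
    docker stats

    # Map the container ID from the oom-kill line back to a name and image.
    docker inspect --format '{{.Name}} {{.Config.Image}}' 6719f68064d6f9a44987c5431b136fcad3fe58beb4321c90be7ec8e29beb44bb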


    steve
