Sporadic shutdowns.

  • Hello,

    for some weeks now my installation has been shutting down sporadically (every 1 to 3 days).

    Before that it had been running for several years without problems, being upgraded along the way up to V5.

    When I leave a terminal window open, it shows kernel warnings about "critical temperature".

    But in a second open window running "htop", the CPU temperature looks normal.

    As a first reaction, having read other threads, I have removed the flash-to-ram plugin

    (Odroid XU4/mSDHC). Does anybody have experience with this behaviour?


    brds, mopedfahrer

    OMV 5.x , Odroid XU4 , external USB3 - Xystec-4-bay ,

  • Hello,


    just an update.

    I removed folder2ram and reinstalled.

    Since then the XU4 has been running for 3 days without a shutdown...


    I will wait some more time and then close the thread.


  • OK, the same evening it happened again...

    And the same situation:

    syslogd-message - "kernel:[277332.279851] thermal thermal_zone1: critical temperature reached (115 C), shutting down"

    while "armbian-setup->monitoring" shows at the same time:


    Time       big.LITTLE   load  %cpu  %sys  %usr  %nice  %io  %irq  CPU     C.St.
    22:55:27:  600/ 600MHz  0.15  0%    0%    0%    0%     0%   0%    56.0°C  0/3


    Does anybody have an idea? Any hint is welcome...
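
The kernel message quoted above already names the thermal zone and the temperature that triggered the shutdown. As a minimal sketch, the helper below pulls those two values out of log lines of exactly that form, so that repeated shutdowns can be compared. The name parse_critical is made up for this example, and the log sources in the usage comment are assumptions (a persistent journal or a syslog file that survives the reboot).

```shell
#!/bin/sh
# Print "zone temperature" for every kernel line of the form
#   thermal thermal_zone1: critical temperature reached (115 C), shutting down
# parse_critical is a hypothetical helper name, not part of any package.
parse_critical() {
    sed -n 's/.*thermal \(thermal_zone[0-9][0-9]*\): critical temperature reached (\([0-9][0-9]*\) C).*/\1 \2/p'
}

# Usage (assumptions: the journal is persistent, or syslog is kept on disk):
#   journalctl -k -b -1 | parse_critical
#   grep "critical temperature" /var/log/syslog | parse_critical
```

If the same zone shows up with the same value every time while the load is near zero, that supports the suspicion of a sensor fault rather than real heat.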


  • Hello macom,

    the hint about the Hardkernel/Odroid forum was probably the most valuable one.

    It seems to be a hardware failure of the sensors in the SoC.

    In one thread of that forum I found the following:


    https://forum.odroid.com/viewtopic.php?f=99&t=33676


    where a workaround is mentioned


    With thanks to "rooted"

    ________________________________

    Re: ODROID Xu4 Suddenly Heats up and Shuts Down

    Post by rooted » Mon Feb 04, 2019 4:54 am

    You can do this:

    echo 60000 | sudo tee /sys/devices/virtual/thermal/thermal_zone0/emul_temp

    Thermal will not work correctly but as it isn't already at least it will let your device function.


  • There is more than one thermal sensor. Usually every core has one, and then there are several more. Even PCIe units and hard drives can have temperature sensors. So a monitoring window that shows only one temperature cannot give the whole picture. If you run


    find /sys -name "temp*_input"


    you'll probably find more temperature sensors. To demonstrate: my laptop has 13 temperature sensors alone ... for CPU cores, graphic chips, the harddrives, even the wifi module has a sensor.


    You should find out which sensor produces the data that leads the kernel to shut down, instead of applying commands you found in a forum. Please check the output of the sensors command. If it is not available, install the lm-sensors package and run sudo sensors-detect to load all necessary modules. Then you can find out which sensor goes crazy. Sometimes it is necessary to tweak the output received from the sensor or adjust the offset as shown here: https://wiki.archlinux.org/ind…sting_temperature_offsets
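
To see which zone misbehaves, it may help to print every thermal zone with its name and current reading side by side. The following is only a sketch: it assumes the zones are exposed under /sys/class/thermal (the usual location on Linux), and both list_zones and the THERMAL_BASE override are names invented for this example.

```shell
#!/bin/sh
# List every thermal zone with its type and temperature in degrees Celsius.
# THERMAL_BASE is a hypothetical override for testing; by default the
# standard sysfs location is used.
list_zones() {
    base="${THERMAL_BASE:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # Some zones refuse to be read; skip them instead of aborting.
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        # sysfs reports millidegrees Celsius; print one decimal place.
        printf '%s %s %d.%d\n' "${zone##*/}" "$ztype" \
            "$((milli / 1000))" "$((milli % 1000 / 100))"
    done
}
```

Calling list_zones (or running it under watch) shows at a glance whether one zone reports a value far away from all the others.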

  • Hello dleidert , thank you for your comments.

    The problem is that I am not a Linux guy. If I have a problem, I look around.

    It seems that you don't know the Odroid forum?

    It is the official Hardkernel forum, as you can see on the Hardkernel site.


    The sensors that are in trouble can be seen in the picture I attached at the top: zones 0 and 1.

    A temperature of 115 degrees is reported even if the load is near 0, as in the picture...


    So I can now follow commands that somebody (you) has published in a forum, to dig into the problem.

    Nevertheless I will do that, to understand a bit more of Linux.


    But as a workaround I have also ordered a new XU4


  • Maybe I need to explain a bit more. Your kernel reports that one or more critical temperature thresholds have been reached and that it is going to shut down your system as a protective measure. You compare these reports with a temperature that is labelled "CPU temp". What I'm telling you is that you cannot compare these values. There is not one temperature sensor for all cores. Your device has 5 different thermal zones. So the value shown here can be from any sensor in the system. It could be completely unrelated, or it could be the average value of all core temperatures. Even after reading the forum entry I cannot tell you what exactly it is. So it cannot be deduced that the kernel reports wrong temperature values.


    Now by applying the "workaround" all you do is hide the real problem by hiding the symptom. Even the person suggesting the command says "Thermal will not work correctly." What your command does is overwrite the real temperature with the value you've chosen. And hiding the symptom can lead to your system overheating and getting damaged. Maybe the cooling is not working correctly because your device is faulty, the heatsink is not placed properly, the chosen casing is suboptimal, the throttling is not applied, or because you need to tweak the fan settings. Maybe a process goes wild in an endless loop and causes the cores to heat up? There are plenty of possible explanations.


    So, in my opinion, the right thing to do is to check the values of the thermal sensors and see which ones go up and when. What do the other cores do? Are there any processes stressing the cores? Usually I recommend installing the lm-sensors package, if it is not already installed, and using it to get more information about the thermal output. But after reading some articles it seems that it does not support the devices in this piece of hardware. You can still get the information from the /sys file system. I found plenty of articles about it:


    https://dietpi.com/phpbb/viewtopic.php?t=1463

    https://www.winstonyin.com/en/…-monitoring-with-netdata/

    https://www.hardkernel.com/blo…l-odroid-xu4-cooling-fan/

    https://wiki.odroid.com/odroid…/manually_control_the_fan
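
Reading the values from /sys over time could look roughly like the sketch below, which writes one timestamped line with all zone readings so that the last entries before a shutdown can be inspected after the reboot. The function name log_line, the THERMAL_BASE override, the 10 second interval and the log path are all assumptions made for this example. The log should live on real storage, not in a RAM-backed folder, or it is lost with the shutdown.

```shell
#!/bin/sh
# Print one line: timestamp followed by zone=millidegrees for every zone.
# THERMAL_BASE is a hypothetical override for testing.
log_line() {
    base="${THERMAL_BASE:-/sys/class/thermal}"
    line=$(date '+%Y-%m-%d %H:%M:%S')
    for zone in "$base"/thermal_zone*; do
        [ -d "$zone" ] || continue
        # A zone that cannot be read is recorded as NA instead of being dropped.
        milli=$(cat "$zone/temp" 2>/dev/null) || milli=NA
        line="$line,${zone##*/}=$milli"
    done
    echo "$line"
}

# Usage (path and interval are assumptions; pick persistent storage):
#   while true; do log_line >> /srv/thermal.log; sync; sleep 10; done
```

A slow climb in the log before the shutdown would point to real heat, while a sudden jump from a normal value to 115000 on one zone would look more like a sensor glitch.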

  • So a workaround preventing damage would be to "control the fan speed manually and set it to full speed".

    Copied from the blog entry https://www.hardkernel.com/blo…l-odroid-xu4-cooling-fan/

    # Set fan to manual mode
    $ echo 0 | sudo tee /sys/devices/platform/pwm-fan/hwmon/hwmon0/automatic

    # Set speed to 100%
    $ echo 255 | sudo tee /sys/devices/platform/pwm-fan/hwmon/hwmon0/pwm1
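
The pwm1 value runs from 0 to 255, so a speed below 100% needs a small conversion. A sketch under assumptions: the sysfs paths are the ones quoted from the blog entry, the hwmon number may differ from hwmon0 on other kernels (hence the wildcard), the automatic file may not exist on every kernel, and pct_to_pwm and set_fan_pct are names invented here.

```shell
#!/bin/sh
# Convert a percentage (clamped to 0..100) to the 0..255 PWM range.
pct_to_pwm() {
    pct=$1
    [ "$pct" -lt 0 ] && pct=0
    [ "$pct" -gt 100 ] && pct=100
    echo $(( pct * 255 / 100 ))
}

# Switch the fan to manual mode and set the given speed. Must run as root.
# The hwmon index is globbed because it is not guaranteed to be hwmon0.
set_fan_pct() {
    for dir in /sys/devices/platform/pwm-fan/hwmon/hwmon*; do
        [ -w "$dir/pwm1" ] || continue
        [ -w "$dir/automatic" ] && echo 0 > "$dir/automatic"
        pct_to_pwm "$1" > "$dir/pwm1"
    done
}

# Usage: set_fan_pct 80
```

These settings do not survive a reboot, so they would have to be reapplied at startup.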

    omv 5.6.20-1 (usul) on RPi4/4GB with Kernel 5.10.63 and WittyPi 3 V2 RTC HAT

    2x 6TB HDD formatted with ext4 in Icy Box IB-RD3662-C31 / hardware supported RAID1

    For Read/Write performance of SMB shares hosted on this hardware see forum here

  • Hi,

    dleidert, mi-hol, it is quite interesting to read your comments.


    Let me still refer to my picture at the top of the thread. Unfortunately the armbian monitoring is not included, but the "htop" output is there. It shows (in the background of the error messages) that the big cores are all at zero and the little cores are at zero, except one with 2.6% load. There is also a "load average" value of 0.33 0.17 0.15. Since the XU4 is only used as a NAS, and the shutdown happened in the morning, there was no reason to overheat.

    Mechanically there have been no changes, and an external desktop fan is blowing in addition to the SoC-mounted fan.


    What I also assume from the shutdown messages is that thermal zones 0 and 1 point to the faulty sensors, don't they? If I understand correctly, zones 0 and 1 belong to the big.LITTLE cores...


    (see appendix from kernel coders)[PATCH] ARM_ dts_ exynos_ Exynos5422 Odroid-XU_ incomplete thermal-zones definition.pdf

    (but let's say, I do not really understand the kernel patch discussion)


    My goal is to have a working controller for my NAS that does not shut down every 1 to 2 days.

    The workaround of just disabling the thermal zone sensors


    echo disabled > /sys/devices/virtual/thermal/thermal_zone1/mode


    and using the last hints from mi-hol to manually adjust the internal fan (plus an additional external fan) seems to be a good solution.


    Mechanically it could be even better to remove the internal fan and use a passive cooling block with an external fan (like the XU4Q); I do not trust that little fan in long-term usage...


    dleidert: as you said yourself, your hints about the sensor reading were not really usable, but I have tried...


    Thank you for all your time so far. Pollin sent a message that my workaround is on the way...


  • Let me still refer to my picture at the top of the thread. Unfortunately the armbian monitoring is not included, but the "htop" output is there. It shows (in the background of the error messages) that the big cores are all at zero and the little cores are at zero, except one with 2.6% load. There is also a "load average" value of 0.33 0.17 0.15. Since the XU4 is only used as a NAS, and the shutdown happened in the morning, there was no reason to overheat.

    The picture actually doesn't say much. It shows the state of the system when sshd was shut down. The processes that caused the cores to overheat may have been killed a second before that, and then they won't appear in the picture. The system doesn't shut down all at once; it shuts down gradually. So you would have to monitor your system (services, processes, core usage, etc.) while trying to reproduce the issue. This could be done by watching the thermal zones and creating snapshots of the running processes when the cores start to overheat (just one way of many).
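
One possible way to take such snapshots is sketched below: when the hottest zone crosses a limit, the current top CPU consumers are appended to a file. This is an illustration, not a tested monitoring setup. The names hottest_milli and snapshot_if_hot, the THERMAL_BASE override, the 90000 limit and the output path are assumptions, and the ps options are those of the procps version shipped with Debian-based systems.

```shell
#!/bin/sh
# Print the highest reading (millidegrees C) across all thermal zones.
# THERMAL_BASE is a hypothetical override for testing.
hottest_milli() {
    base="${THERMAL_BASE:-/sys/class/thermal}"
    max=0
    for zone in "$base"/thermal_zone*; do
        t=$(cat "$zone/temp" 2>/dev/null) || continue
        [ -n "$t" ] || continue
        if [ "$t" -gt "$max" ]; then max=$t; fi
    done
    echo "$max"
}

# Append a process snapshot to file $2 if any zone is at or above limit $1.
snapshot_if_hot() {
    limit=$1
    out=$2
    hot=$(hottest_milli)
    [ "$hot" -ge "$limit" ] || return 0
    {
        date '+%Y-%m-%d %H:%M:%S'
        echo "hottest zone: $hot millidegrees C"
        ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 15
    } >> "$out"
    sync   # flush to disk, the shutdown may follow within seconds
}

# Usage (limit and path are assumptions):
#   while true; do snapshot_if_hot 90000 /srv/hot-snapshots.log; sleep 2; done
```

If a reading jumps straight to the critical value, there may be no time for a snapshot at all, which would itself be a hint that no process is heating the cores.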


    Here is an interesting script to stress the CPUs: (seems it needs a script called burnCortexA too)

  • Thanks for the script.

    burnCortexA9 is part of the package "cpuburn" ...

    Out of interest, I will check that as soon as my new XU4 has replaced my old XU4 in the NAS.

    For that, the second part of usbimager will be tested by restoring the dd image.


  • Hi,

    as I mentioned already, Linux is not really my world...

    But of course OMV is worth digging into Linux in a bit more detail.

    Nevertheless, it is not easy to get info about that "burnCortexA9". It is said to be part of "cpuburn", but "cpuburn" is not among the normal packages. So where can I find it?

    Next problem: I found cpuburn on GitHub as program code for several ARM cores.

    cpuburn-A9 and cpuburn-A53 are available. Of course the code has to be compiled first, which is the next problem.


    But I got it. Executing cpuburn-A9 loads 4 cores, while cpuburn-A53 loads all 8 cores (4× Cortex-A15, 4× Cortex-A7) of the Exynos5422.


    The script mentioned by dleidert then has to be adapted to 8 cores and to the (so far) unknown burnCortexA9, which I did not find.


    Is it possible to get that easier?
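
A possibly easier route, without compiling anything, is to keep every core busy with plain shell tools. The sketch below starts one "yes" loop per core for a fixed time and then stops them. It is a swapped-in technique, not the cpuburn approach: such loops load the cores far less aggressively than cpuburn, so they produce less heat. The function name burn and its arguments are invented for this example.

```shell
#!/bin/sh
# Keep N cores busy for a number of seconds with simple busy loops.
#   $1 = duration in seconds
#   $2 = number of loops (defaults to the number of online cores)
burn() {
    seconds=$1
    cores=${2:-$(nproc)}
    pids=""
    i=0
    while [ "$i" -lt "$cores" ]; do
        yes > /dev/null &          # one busy loop per core
        pids="$pids $!"
        i=$((i + 1))
    done
    sleep "$seconds"
    kill $pids                     # unquoted on purpose: one argument per PID
    wait $pids 2>/dev/null
    return 0
}

# Usage: burn 60        (all cores for one minute)
```

While it runs, the zone temperatures can be watched in a second terminal, for example with watch -n1 cat /sys/class/thermal/thermal_zone*/temp.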

