NVMe disk failures

  • Dear All,

    After several days of using OMV7, my Apacer NVMe module(s) started to "disappear" from the GUI (Storage - Disks), and my RAID1 md0 array shows "clean, degraded" in the GUI (Storage - Multiple Device).

    At first I thought this was an NVMe hardware failure, but after a reboot the Apacer module(s) reappeared, the RAID array showed "clean" and all data was OK again. I did not even have to rebuild the md0 RAID1 array, as it showed "clean" straight away!


    But after another hour or two, sometimes 5 minutes, sometimes a day, the situation repeats: missing Apacer module(s), a degraded RAID1 array and unreachable data... until the next reboot :( Sometimes only one Apacer NVMe module disappears, sometimes both.


    My modules are not overheating, all 4 blue slot diodes always show as active.

    I don't generate any significant load either. On the Apacer modules (md0 RAID) I just run Docker, the mapped Docker data and qbittorrent in Docker Compose.


    What should I do as part of root cause analysis? Could anybody give me hints please?


    Details follow:

    I run OMV7 on an RPi5 with the Suptronics X1011 M.2 NVMe 4 SSD shield, and I have the following NVMe modules (status BEFORE the Apacer modules disappear):


    Status AFTER Apacer module(s) start to disappear:

    mdadm -D /dev/md0:

    Both Apacer modules missing until next restart:

    DMESG

    https://pastebin.com/SgpcfnHM (problems seem to start at [ 738.038272])
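    For anyone chasing the same symptom, one way to put a timestamp on the moment the array degrades is to poll /proc/mdstat and compare it with the kernel log. The following is only a minimal sketch: the function name is made up, and the sample text is a hypothetical illustration of the usual mdstat layout, not output from this system.

```python
import re

# "md0 : active raid1 ..." starts the block for one array.
ARRAY_RE = re.compile(r"^(md\S+)\s*:")
# "[2/1] [U_]" means 2 members expected, 1 active.
STATE_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array_name: (active, expected)} for arrays missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        state = STATE_RE.search(line)
        if state and current:
            expected, active = int(state.group(1)), int(state.group(2))
            if active < expected:
                degraded[current] = (active, expected)
    return degraded


# Hypothetical sample, for illustration only (not taken from this thread).
SAMPLE_DEGRADED = """\
Personalities : [raid1]
md0 : active raid1 nvme0n1[0]
      976630464 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    with open("/proc/mdstat") as handle:
        print(degraded_arrays(handle.read()))
```

    Run from cron or a loop, this would only tell you when a member dropped out. It does not cover the case where both members vanish and the array goes inactive.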

    • Official Post

    That problem makes me think of a power failure, maybe the power supply is not powerful enough. Units may become unavailable at some point.

  • That problem makes me think of a power failure, maybe the power supply is not powerful enough. Units may become unavailable at some point.

    Hello chente, thank you for your reply.

    I am using the original RPi5 USB-C 27W power supply.

    According to the shield manufacturer's web page this setup is compatible, and the alternative is a 5V DC / 5A (25W) source via the DC power jack.


    As 27W is more than 25W, I thought I was safe here. But it is worth trying the DC power jack alternative through the X1011 NVMe shield.

    • Official Post

    nvme sticks can use up to 10W writing. So, Raid is about the worst thing you can do when trying to keep power consumption under 27W. If you have four sticks plus the rpi5, you could easily be trying to pull 48W. Hence why that hat is rated to deliver 10A to the nvme drives.
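    To make the arithmetic behind those numbers explicit, here is a rough sketch of the power budget. The 10W per stick and the 27W supply come from this thread. The 8W for the RPi5 itself is only an assumption (it is what is left of 48W after 4 x 10W), and the function names are made up for illustration.

```python
NVME_PEAK_W = 10.0   # per-stick peak while writing, as quoted above
NVME_COUNT = 4       # M.2 slots on the X1011 shield
BOARD_W = 8.0        # ASSUMPTION: RPi5 share implied by 48 W - 4 x 10 W
USB_C_SUPPLY_W = 27.0


def peak_demand_w(nvme_count=NVME_COUNT, nvme_peak_w=NVME_PEAK_W, board_w=BOARD_W):
    """Worst case: every stick writing at its peak at the same time."""
    return nvme_count * nvme_peak_w + board_w


def headroom_w(supply_w, demand_w):
    """Positive means spare capacity, negative means the supply is too small."""
    return supply_w - demand_w


if __name__ == "__main__":
    for sticks in (2, 4):
        demand = peak_demand_w(nvme_count=sticks)
        spare = headroom_w(USB_C_SUPPLY_W, demand)
        print(f"{sticks} sticks: demand {demand:.0f} W, headroom {spare:+.0f} W")
```

    With these assumed figures, even two mirrored sticks writing at peak (as RAID1 does) would already exceed a 27W supply. Real drives rarely sit at their peak, so treat this as an upper bound, not a measurement.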

    omv 7.7.9-1 sandworm | 64 bit | 6.11 proxmox kernel

    plugins :: omvextrasorg 7.0.2 | kvm 7.1.7 | compose 7.6.7 | cterm 7.8.5 | cputemp 7.0.2 | mergerfs 7.0.5 | scripts 7.2


    omv-extras.org plugins source code and issue tracker - github - changelogs


    Please try ctrl-shift-R and read this before posting a question.

    Please put your OMV system details in your signature.
    Please don't PM for support... Too many PMs!

  • nvme sticks can use up to 10W writing. So, Raid is about the worst thing you can do when trying to keep power consumption under 27W. If you have four sticks plus the rpi5, you could easily be trying to pull 48W. Hence why that hat is rated to deliver 10A to the nvme drives.

    I see. Well, I had no idea that the X1011 NVMe shield with 4 NVMe modules could actually draw that much power from an RPi5!


    In that case I will definitely set aside the original RPi5 USB-C power supply for now and replace it with a custom 5V/10A DC source plugged directly into the shield. According to the manufacturer that should also power the RPi5, as using both the 5V DC input on the shield and the RPi5 USB-C input at the same time is forbidden.


    I will report the outcome to this group. Thanks.

  • It seems like power supply issues might be at the core of your NVMe disks disappearing. Since you're already considering switching to a 5V/10A DC source for the X1011 NVMe shield, that could help ensure a stable power supply for all devices. Keep us posted on how that works out!

    • Official Post

    It seems like power supply issues might be at the core of your NVMe disks disappearing. Since you're already considering switching to a 5V/10A DC source for the X1011 NVMe shield, that could help ensure a stable power supply for all devices. Keep us posted on how that works out!

    You sound like a bot.


  • So allow me to report back to this group regarding the progress:


    1. As indicated, I bought an extra 50W (5V, 10A) power source.
    2. I stopped using the original RPi5 USB-C 27W power source, as advised by several colleagues here.
    3. I now power my RPi5 + Suptronics X1011 M.2 NVMe 4 SSD shield setup solely via the shield's 5V DC input, using the 50W source.
      • Remark: I do not have any USB / HDMI devices connected to the RPi5, just an Ethernet cable.


    OUTCOME

    1. nvme disks no longer spontaneously disappear
    2. raid arrays stay "clean", data writes verified, consistency ok, speed/data transfers ok

    - SO FROM THIS POINT OF VIEW, PROBLEM SOLVED!


    HOWEVER, the whole setup has become totally unstable, showing this in dmesg and in the SSH console:

    Code
    [  601.112845] hwmon hwmon7: Undervoltage detected!
    [  603.128847] hwmon hwmon7: Voltage normalised

    The system also RESTARTS from time to time, which never happened before.
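    To see how often and for how long the voltage dips, the "Undervoltage detected!" / "Voltage normalised" pairs can be matched up by their timestamps. This is a minimal sketch that assumes the log keeps the exact wording and timestamp format shown above. The function name is invented for illustration.

```python
import re
import sys

# "[  601.112845] message" -> seconds since boot, message text
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def undervoltage_windows(log_text):
    """Return (start, end) pairs in seconds since boot; end is None if the
    log stops before the voltage is reported as normal again."""
    windows = []
    start = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if "Undervoltage detected" in message:
            if start is None:
                start = stamp
        elif "Voltage normalised" in message and start is not None:
            windows.append((start, stamp))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# The two lines quoted above.
SAMPLE = """\
[  601.112845] hwmon hwmon7: Undervoltage detected!
[  603.128847] hwmon hwmon7: Voltage normalised
"""

if __name__ == "__main__":
    # Intended use: pipe the output of dmesg into this script.
    for begin, end in undervoltage_windows(sys.stdin.read()):
        print(begin, end)
```

    For the two lines above this gives a single dip of roughly two seconds. Many such windows, or one that never closes before a restart, would point at the supply or the cabling rather than at the drives.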


    So I am back at the beginning :(

  • OK, so my deepest apologies: after several hours the problem with spontaneously disappearing NVMe module(s) is back. It had just never taken THAT long before.

    Anyway, I think I may have found the real root cause, and it is neither about power/voltage nor does it have anything to do with OMV.


    When I run

    Code
    sudo lspci

    I get

    and YES - one of my nvme modules is actually equipped with a Phison controller.


    And here we go: according to this and this, not only should I not use modules based on Phison, I should also not use RAID 1 (I do have RAID 1, and one of the modules is Phison). My bad, my fault, mea culpa.


    So I will replace the Phison module with a Crucial one and let you know.

  • So getting rid of the NVMe modules based on a Phison controller solved the problem. My dmesg is now clean as elven ar*e.

    Even the RAID1 arrays (mdX) stay steady, no more degraded events.


    Code
    0000:03:00.0 Non-Volatile memory controller: Shenzhen Longsys Electronics Co., Ltd. Lexar NM790 NVME SSD (DRAM-less) (rev 01)
    0000:04:00.0 Non-Volatile memory controller: Micron/Crucial Technology P3 Plus NVMe PCIe SSD (DRAM-less) (rev 01)
    0000:05:00.0 Non-Volatile memory controller: Silicon Motion, Inc. SM2263EN/SM2263XT SSD Controller (rev 03)
    0000:06:00.0 Non-Volatile memory controller: Micron/Crucial Technology P3 Plus NVMe PCIe SSD (DRAM-less) (rev 01)
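    If you want to check your own modules the same way, the lspci output can be filtered for NVMe controllers and searched for a vendor keyword. This is only a sketch: the function names are made up, and matching on the word "phison" in the description is an assumption about how such a controller would be labelled, since the original Phison line is not shown in this thread.

```python
import sys


def nvme_controllers(lspci_text):
    """Return (pci_address, description) for each NVMe controller line."""
    found = []
    for line in lspci_text.splitlines():
        line = line.strip()
        if "Non-Volatile memory controller" not in line:
            continue
        address, _, rest = line.partition(" ")
        _, _, description = rest.partition(": ")
        found.append((address, description))
    return found


def flagged(controllers, keyword="phison"):
    """Controllers whose description mentions the keyword (case-insensitive)."""
    return [(addr, desc) for addr, desc in controllers if keyword in desc.lower()]


# The listing above, after the swap.
LSPCI_SAMPLE = """\
0000:03:00.0 Non-Volatile memory controller: Shenzhen Longsys Electronics Co., Ltd. Lexar NM790 NVME SSD (DRAM-less) (rev 01)
0000:04:00.0 Non-Volatile memory controller: Micron/Crucial Technology P3 Plus NVMe PCIe SSD (DRAM-less) (rev 01)
0000:05:00.0 Non-Volatile memory controller: Silicon Motion, Inc. SM2263EN/SM2263XT SSD Controller (rev 03)
0000:06:00.0 Non-Volatile memory controller: Micron/Crucial Technology P3 Plus NVMe PCIe SSD (DRAM-less) (rev 01)
"""

if __name__ == "__main__":
    # Intended use: pipe the output of lspci into this script.
    for address, description in flagged(nvme_controllers(sys.stdin.read())):
        print(address, description)
```

    On the listing above it finds four NVMe controllers and flags none of them, which matches the state after the swap.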



    You can consider this thread as SOLVED.

