OMV hangs, lsof tainted?

puterfixer · May 28, 2015

Hello,

Every few weeks, for an unknown reason, my OMV freezes. It is no longer accessible through the network, and it's headless so all I can do is hit the reset button.

This last time, however, it kept logging things so maybe this will help clarify the root cause.

From zero load, the CPU load jumped to a constant 1 in the afternoon of May 26, and stayed like this for nearly 30 hours. In the evening of May 27, the system stopped logging data - I assume that's when it stopped responding at all. Logging resumed in the evening of May 28 after a cold reboot.

Here are the messages that came out of the blue, alternating between CPU0 and CPU1 every 30 minutes. Any idea what this is, and what's up with "lsof Tainted"?

This is a system with a low-power CPU (AMD BE-2350) and a massive heatsink with passive cooling, in a case with well-thought-out airflow, to minimize noise from an extremely underloaded system. I'd really like to stop these hangs from happening ever again, since the overheating can't do any good.

Anything I can try? Any information you need? It's running OMV 1.19 with all the updates to date, with OMVExtras and Transmission as extra plugins, and only SMB/SSH/Torrent services running.

Thanks in advance!

Code

May 26 13:45:03 openmediavault kernel: [764753.605325] CPU 0 
May 26 13:45:03 openmediavault kernel: [764753.605355] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables cpuid softdog quota_v2 quota_tree nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop snd_hda_codec_hdmi radeon snd_hda_intel snd_hda_codec snd_hwdep snd_pcm ttm drm_kms_helper snd_page_alloc snd_timer drm snd power_supply i2c_algo_bit sp5100_tco soundcore psmouse i2c_piix4 powernow_k8 edac_mce_amd i2c_core pcspkr serio_raw evdev mperf k8temp shpchp edac_core processor button thermal_sys ext4 crc16 jbd2 mbcache dm_mod md_mod microcode ata_generic sg sd_mod crc_t10dif pata_atiixp ohci_hcd ehci_hcd r8169 mii libata hpsa scsi_mod usbcore usb_common [last unloaded: scsi_wait_scan]
May 26 13:45:03 openmediavault kernel: [764753.606290] 
May 26 13:45:03 openmediavault kernel: [764753.606317] Pid: 1242, comm: sh Not tainted 3.2.0-4-amd64 #1 Debian 3.2.65-1+deb7u2 Gigabyte Technology Co., Ltd. GA-MA69G-S3H/GA-MA69G-S3H
May 26 13:45:03 openmediavault kernel: [764753.606482] RIP: 0010:[<ffffffff8107d737>]  [<ffffffff8107d737>] acct_collect+0x4b/0x165
May 26 13:45:03 openmediavault kernel: [764753.606588] RSP: 0018:ffff8800bf155e98  EFLAGS: 00010282
May 26 13:45:03 openmediavault kernel: [764753.606655] RAX: bfff8800bf1ed608 RBX: 0000000000000000 RCX: 000000000000ac17
May 26 13:45:03 openmediavault kernel: [764753.606743] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88011fc928a0
May 26 13:45:03 openmediavault kernel: [764753.606831] RBP: ffff88012018ca00 R08: 0000000000000000 R09: ffffffffffffffa8
May 26 13:45:03 openmediavault kernel: [764753.606918] R10: 0000000000000202 R11: 0000000000000202 R12: 00000000003f0000
May 26 13:45:03 openmediavault kernel: [764753.607005] R13: ffff8801193411c0 R14: 0000000000000001 R15: 0000000000000094
May 26 13:45:03 openmediavault kernel: [764753.607093] FS:  00007f72d731a700(0000) GS:ffff880127c00000(0000) knlGS:0000000000000000
May 26 13:45:03 openmediavault kernel: [764753.607190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 26 13:45:03 openmediavault kernel: [764753.607263] CR2: 00007fb1d4b77e02 CR3: 00000000c13bf000 CR4: 00000000000006f0
May 26 13:45:03 openmediavault kernel: [764753.607350] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 26 13:45:03 openmediavault kernel: [764753.607438] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 26 13:45:03 openmediavault kernel: [764753.607527] Process sh (pid: 1242, threadinfo ffff8800bf154000, task ffff8801193411c0)
May 26 13:45:03 openmediavault kernel: [764753.607653]  ffff8801193411c0 0000000000000000 ffff88011fc92840 ffff88012018ca00
May 26 13:45:03 openmediavault kernel: [764753.607757]  0000000000000001 ffffffff81049fca 0000000000000000 000000000000007d
May 26 13:45:03 openmediavault kernel: [764753.607861]  ffff8801193416e0 ffffffff810fc23e ffff8800bf155fd8 ffff880123247280
May 26 13:45:03 openmediavault kernel: [764753.608004]  [<ffffffff81049fca>] ? do_exit+0x210/0x713
May 26 13:45:03 openmediavault kernel: [764753.609136]  [<ffffffff810fc23e>] ? fput+0x17a/0x1a1
May 26 13:45:03 openmediavault kernel: [764753.609136]  [<ffffffff8104a74d>] ? do_group_exit+0x74/0x9e
May 26 13:45:03 openmediavault kernel: [764753.609136]  [<ffffffff8104a786>] ? sys_exit_group+0xf/0xf
May 26 13:45:03 openmediavault kernel: [764753.609136]  [<ffffffff81355f92>] ? system_call_fastpath+0x16/0x1b
May 26 13:45:03 openmediavault kernel: [764753.609136]  RSP <ffff8800bf155e98>
May 26 13:45:03 openmediavault kernel: [764753.650173] ---[ end trace c8c796530e792d0f ]---


May 26 14:09:01 openmediavault kernel: [766191.784020] CPU 1 
May 26 14:09:01 openmediavault kernel: [766191.784020] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables cpuid softdog quota_v2 quota_tree nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc loop snd_hda_codec_hdmi radeon snd_hda_intel snd_hda_codec snd_hwdep snd_pcm ttm drm_kms_helper snd_page_alloc snd_timer drm snd power_supply i2c_algo_bit sp5100_tco soundcore psmouse i2c_piix4 powernow_k8 edac_mce_amd i2c_core pcspkr serio_raw evdev mperf k8temp shpchp edac_core processor button thermal_sys ext4 crc16 jbd2 mbcache dm_mod md_mod microcode ata_generic sg sd_mod crc_t10dif pata_atiixp ohci_hcd ehci_hcd r8169 mii libata hpsa scsi_mod usbcore usb_common [last unloaded: scsi_wait_scan]
May 26 14:09:01 openmediavault kernel: [766191.784020] 
May 26 14:09:01 openmediavault kernel: [766191.784020] Pid: 1394, comm: lsof Tainted: G      D      3.2.0-4-amd64 #1 Debian 3.2.65-1+deb7u2 Gigabyte Technology Co., Ltd. GA-MA69G-S3H/GA-MA69G-S3H
May 26 14:09:01 openmediavault kernel: [766191.784020] RIP: 0010:[<ffffffff8113f980>]  [<ffffffff8113f980>] show_map_vma+0x14/0x200
May 26 14:09:01 openmediavault kernel: [766191.784020] RSP: 0018:ffff8800b6553df8  EFLAGS: 00010292
May 26 14:09:01 openmediavault kernel: [766191.784020] RAX: ffff8801207877e0 RBX: bfff8800bf1ed608 RCX: ffff8801207877e0
May 26 14:09:01 openmediavault kernel: [766191.784020] RDX: ffff8800b6553ed0 RSI: bfff8800bf1ed608 RDI: ffff880120439e40
May 26 14:09:01 openmediavault kernel: [766191.784020] RBP: bfff8800bf1ed608 R08: 0000000000000bca R09: 00000000fffffff6
May 26 14:09:01 openmediavault kernel: [766191.784020] R10: 0000000000040b97 R11: 0000000000040b97 R12: ffff8801193411c0
May 26 14:09:01 openmediavault kernel: [766191.784020] R13: ffff880120439e40 R14: bfff8800bf1ed608 R15: 0000000000000000
May 26 14:09:01 openmediavault kernel: [766191.784020] FS:  00007f28e5449700(0000) GS:ffff880127c80000(0000) knlGS:0000000000000000
May 26 14:09:01 openmediavault kernel: [766191.784020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 26 14:09:01 openmediavault kernel: [766191.784020] CR2: 00007f28e5452000 CR3: 00000000cfbc9000 CR4: 00000000000006e0
May 26 14:09:01 openmediavault kernel: [766191.784020] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 26 14:09:01 openmediavault kernel: [766191.784020] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 26 14:09:01 openmediavault kernel: [766191.784020] Process lsof (pid: 1394, threadinfo ffff8800b6552000, task ffff88011fa5c840)
May 26 14:09:01 openmediavault kernel: [766191.784020]  ffff88010000002d ffffffff00000070 000000000001f000 ffffffff00000008
May 26 14:09:01 openmediavault kernel: [766191.936077]  ffff880100000011 0000000000040b97 ffff8800b6553e44 ffffffff8134fb64
May 26 14:09:01 openmediavault kernel: [766191.936077]  0000000000001000 0000003581350784 ffff88011fc92840 ffff880120439e40
May 26 14:09:01 openmediavault kernel: [766191.936077]  [<ffffffff8134fb64>] ? _cond_resched+0x7/0x1c
May 26 14:09:01 openmediavault kernel: [766191.936077]  [<ffffffff8113fd2e>] ? show_map+0x17/0x44
May 26 14:09:01 openmediavault kernel: [766191.936077]  [<ffffffff81113aa0>] ? seq_read+0x266/0x34c
May 26 14:09:01 openmediavault kernel: [766191.936077]  [<ffffffff810fb4bf>] ? vfs_read+0x9f/0xe6
May 26 14:09:01 openmediavault kernel: [766191.936077]  [<ffffffff810fb54b>] ? sys_read+0x45/0x6b
May 26 14:09:01 openmediavault kernel: [766191.936077]  [<ffffffff81355f92>] ? system_call_fastpath+0x16/0x1b
May 26 14:09:01 openmediavault kernel: [766191.936077]  RSP <ffff8800b6553df8>
May 26 14:09:01 openmediavault kernel: [766192.032727] ---[ end trace c8c796530e792d10 ]---


ryecoaaron · May 28, 2015

Thoughts...

- Check your memory with memtest
- Switch to backports 3.16 kernel (button to install in omv-extras)
- What type of media is OMV installed on?
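
On the "lsof Tainted" part of the question: "Tainted" describes the state of the kernel, not of lsof. In the log above, the first trace (sh, 13:45) says "Not tainted", while the second (lsof, 14:09) shows the flags "G D". D is the letter the kernel sets once it has already hit an oops, so the lsof trace is most likely a follow-on crash rather than a sign that lsof itself is at fault. As a rough sketch, the header line of each trace can be decoded like this (the function names and the shortened flag table are illustrative assumptions, not part of any existing tool):

```python
import re

# Meanings of a few taint letters used by Linux kernels of this era.
# Deliberately not exhaustive; unknown letters are reported as such.
TAINT_MEANINGS = {
    "P": "a proprietary module was loaded",
    "G": "all loaded modules are GPL-compatible",
    "F": "a module was force-loaded",
    "M": "a machine check exception occurred",
    "B": "a bad memory page was detected",
    "D": "the kernel already hit an oops or BUG since boot",
    "W": "the kernel issued a warning earlier",
    "O": "an out-of-tree module was loaded",
}

# Matches e.g. "comm: lsof Tainted: G      D      3.2.0-4-amd64"
# and "comm: sh Not tainted 3.2.0-4-amd64".
HEADER = re.compile(r"comm: (\S+) (?:Not tainted|Tainted: ([A-Z ]+?)) +\d")


def parse_oops_header(line):
    """Return (process name, list of taint letters) from a 'Pid: ...' line."""
    match = HEADER.search(line)
    if match is None:
        raise ValueError("not an oops 'Pid: ..., comm: ...' line")
    comm = match.group(1)
    flags = match.group(2) or ""
    return comm, [c for c in flags if c != " "]


def explain(line):
    """Return a one-line, human-readable summary of the taint state."""
    comm, flags = parse_oops_header(line)
    if not flags:
        return f"{comm}: kernel not tainted"
    reasons = "; ".join(TAINT_MEANINGS.get(f, f"unknown flag {f}") for f in flags)
    return f"{comm}: {reasons}"
```

Feeding it the two "Pid:" lines from the log should report sh as untainted and lsof as running on a kernel that had already oopsed.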

puterfixer · May 28, 2015

Memory is fine, but I'll run memtest on it in a loop again, just to be sure.

OMV is installed on an IDE hard drive, the only drive on-board the motherboard. All other (SATA) controllers are disabled. Storage is on a hardware RAID controller on PCI Express.

Backports? What does it do?

ryecoaaron · May 28, 2015

You are running the 3.2 kernel. The backports 3.16 is much newer (same kernel that Debian 8 Jessie uses) and may have better drivers for your motherboard. I use it on all of my systems.

puterfixer · May 29, 2015

Gotcha. Done. Although it's not very logical to me how a newer kernel can have improved support for a really old motherboard (2nd half of 2007).

And now, we wait.

Is there a monitoring/watchdog solution I can use, to alert me when CPU load stays in an elevated state for more than a few minutes?

Any other way to observe what's going on with the box when it freezes - would remote logging help?

ryecoaaron · May 29, 2015

Things do get better with time sometimes

OMV actually emails you when load is high if you have notifications set up. If it freezes quickly, it may not have time to send them though. I would look through the logs to see if any errors are showing.
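
Independently of OMV's own notifications, the "load stays elevated for more than a few minutes" check can be sketched in a few lines. This is only an illustration under assumptions: the function names, the 0.9 threshold, and the five one-minute samples are all made up for the example, and `alert` defaults to printing rather than sending mail.

```python
import os
import time
from collections import deque


def sustained_high(samples, threshold, needed):
    """True if the last `needed` samples all exceed `threshold`."""
    recent = list(samples)[-needed:]
    return len(recent) == needed and all(s > threshold for s in recent)


def watch(threshold=0.9, interval_s=60, needed=5, alert=print):
    """Poll the 1-minute load average and alert when it stays high.

    Runs until interrupted. `alert` is any callable taking a message string.
    """
    history = deque(maxlen=needed)
    while True:
        history.append(os.getloadavg()[0])  # 1-minute load average
        if sustained_high(history, threshold, needed):
            alert(f"load above {threshold} for {needed} samples: {list(history)}")
        time.sleep(interval_s)
```

Calling `watch()` from a script started at boot would print a line once the load has stayed above the threshold for about five minutes. Like the built-in emails, it cannot report anything once the box has frozen completely.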

puterfixer · June 18, 2015

Alright, with monitoring enabled it crashed a couple more times until I could catch this red-handed.

At 9:30pm sharp tonight, my box spiked up to 100% CPU load and stayed there for half an hour until I rebooted it. It did send me an alert that the CPU was way up, and then two other interesting mails with identical content:

From: Cron Daemon
Message: Segmentation fault
Subject: Cron <root@openmediavault> [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)

Indeed, the process list showed quite a few processes with >80% CPU load, all about generating graphs.

I'd point my finger at the cron job generating graphs, however... as I'm typing this, the system stopped responding again over the network - no web interface or SSH access. It did this as I was browsing through the pages of system logs.

Any clues? Is it a software bug somewhere? Is it a faulty drive?

puterfixer · June 18, 2015

*update: badblocks -v /dev/sdb returns 0 errors.

ryecoaaron · June 18, 2015

A drive can fail or start failing without bad blocks.
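
One way to look beyond bad blocks is the drive's SMART attributes, for example the text printed by `smartctl -A /dev/sdb` from smartmontools. The sketch below only parses such text; it does not run smartctl itself. It assumes the usual ten-column ATA attribute table, and the function name and choice of watched attributes are illustrative.

```python
# Attributes that commonly hint at a failing disk when their raw value is non-zero.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def suspicious_attributes(smartctl_output, watched=WATCHED):
    """Return {attribute name: raw value} for watched attributes above zero."""
    findings = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in watched:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                findings[fields[1]] = int(raw)
    return findings
```

An empty result does not prove the drive is healthy; it only means none of the watched counters has moved.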

puterfixer · July 2, 2015

I did an upgrade to 2.1, but I still can't pinpoint the issue or trust the existing system disk. It's an old 20GB IDE drive, and I would love to replace it. Question is - how to back up my settings/users/shares etc. so that I can easily restore them on a clean install on a new drive?

Drat, the box, powered up about 15 minutes ago, just froze and did a cold reboot by itself. Not sure if it's linked to Transmission traffic; a torrent was downloading, and the new Gigabit internet connection might be too much for it. (Yeah, 1Gbps internet connection, gigabit router with hardware PPPoE offloading, speed test returns between 930-970Mbps... All for €12/month. Wanna move here?)

puterfixer · July 7, 2015

The box froze again just after midnight, and has been staying at full load for 7 hours. What now, shut down the system every time I am not using it? I am really, really pissed off. Even Windows would be a more stable option than this.

Please provide instructions on how to back up the settings regarding users, shares and permissions, so that I can do a clean install from the ISO. I'm at the point where this is the last chance I'm giving OMV before moving on.

tekkb · July 7, 2015

I find your attitude most presumptuous and illogical. And lest you forget, the mods do not work here.

WastlJ · July 7, 2015

Quote from puterfixer

I did an upgrade to 2.1, but I still can't pinpoint the issue or trust the existing system disk. It's an old 20GB IDE drive, and I would love to replace it. Question is - how to back up my settings/users/shares etc. so that I can easily restore them on a clean install on a new drive?

There is no way to back up and restore a config yet.
You can back up your config yourself and use it, just for reference, to reconfigure your newly installed OMV. However, you can still back up your whole system disk, via Clonezilla for example, and restore it onto your new HDD.
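
For the "keep a copy for reference" route, a minimal sketch could look like the following. It assumes the OMV configuration lives in /etc/openmediavault/config.xml; the destination directory and the function name are hypothetical. The copy is only meant to be read while reconfiguring by hand, not restored over a fresh install.

```python
import shutil
import time
from pathlib import Path


def backup_config(source="/etc/openmediavault/config.xml",
                  dest_dir="/root/omv-config-backups"):
    """Copy the config file to dest_dir under a timestamped name.

    Both default paths are assumptions; adjust them to the actual system.
    Returns the path of the copy.
    """
    src = Path(source)
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, target)  # copy2 also preserves modification times
    return target
```

Run as root before reinstalling, then move the copy off the system disk.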

puterfixer · July 24, 2015

Apologies for the frustration @tekkb and all. It was just incredibly aggravating to not figure out where the issue was coming from, nor to be able to get some expert support on the matter. In hindsight, nobody else would have been able to pinpoint the root cause anyway.

To close the loop, here's what happened meanwhile:

- disabled the on-board NIC (Realtek) and installed an Intel add-on Gigabit Ethernet card with less CPU load - nada;
- replaced the suspect drive with another one, connected to on-board SATA, then tried installing CentOS and Fedora Server - neither would work in graphical mode, and in low-res graphical mode both froze during the install;
- reduced the shared memory allocated to the graphics - no change;
- tested the RAM exhaustively, passed 5 times over 9 hours - no issue here;
- had a rare moment of divine inspiration and removed the HP Smart Array P410 hardware RAID controller from the PCI Express slot - installation of Fedora Server completed without glitches;
- found out that HP actually released an advisory related to this card: with a particular combination of firmware + SATA drives + status polling, the board would cause the system to freeze or reboot unexpectedly;
- HP had published an updated firmware to correct the issue, and fortunately I had just installed Fedora, which is on HP's list of supported OSes, so I could apply the RPM patch easily (last time I had to install a Windows Server trial just to apply the firmware delivered as an executable, because it would run only under a Windows Server environment);
- as I'm not familiar with either Fedora or CentOS and could not get Cockpit to work for remote server management, I gave up on that and returned to Debian;
- I could not get OMV installed on top of Debian (a screenful of dependencies which it couldn't handle), so I switched the hard drive again and performed a clean install of OMV 2.x.

It's been 12 hours now and the new system is still working, all settings/shares/users/plugins were manually added in like half an hour, and I'm keeping an eye on it to see if it freezes again. New hard drive has less than 1,000 hours of uptime and the long test has passed.

Sorry again for venting here. The issue was caused by a firmware bug in the hardware RAID controller, corrected by the vendor, and not related to lsof or another software component of OMV or Debian. I need to subscribe to support alerts for this piece of hardware.

tekkb · July 24, 2015

Anyone who uses computers long enough is going to get into one of these nasty issues. Hopefully you are done with it.
