Hi guys,
tonight my openmediavault server alerted me that the load average matched the resource limit. I got around 30 emails.
So I googled around and found this thread: "monit alert -- Resource limit matched/succeeded localhost". But that thread doesn't really help, because my problem is a high load average, not high disk usage.
I suspected that a scrub job might be running. zpool status gives the following output:
root@omv4:~# zpool status
pool: mediatank
state: ONLINE
scan: scrub in progress since Sun Oct 14 00:24:01 2018
20,5T scanned out of 34,7T at 601M/s, 6h54m to go
0B repaired, 58,93% done
config:
NAME STATE READ WRITE CKSUM
xxxxxxxxx ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx ONLINE 0 0 0
ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx ONLINE 0 0 0
ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx ONLINE 0 0 0
ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx ONLINE 0 0 0
ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx ONLINE 0 0 0
ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx ONLINE 0 0 0
ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx ONLINE 0 0 0
ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx ONLINE 0 0 0
errors: No known data errors
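So the scrub has been running since midnight and still has a good chunk to go. The numbers are consistent: 34.7T - 20.5T leaves about 14.2 TiB, and at 601M/s that is roughly 14.2 x 1024 x 1024 / 601 = ca. 24,800 seconds, which matches the reported 6h54m.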
BINGO! top gives me the following output:
Tasks: 299 total, 2 running, 212 sleeping, 0 stopped, 12 zombie
%Cpu(s): 0,5 us, 29,0 sy, 0,0 ni, 70,4 id, 0,0 wa, 0,0 hi, 0,1 si, 0,0 st
KiB Mem : 65921912 total, 25970544 free, 37144352 used, 2807016 buff/cache
KiB Swap: 67054588 total, 67054588 free, 0 used. 28025340 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1149 root 20 0 0 0 0 D 29,7 0,0 176:48.10 txg_sync
2 root 20 0 0 0 0 S 12,9 0,0 83:35.78 kthreadd
919 root 0 -20 0 0 0 S 12,5 0,0 83:31.55 z_rd_int_0
920 root 0 -20 0 0 0 S 12,5 0,0 83:38.08 z_rd_int_1
926 root 0 -20 0 0 0 S 12,5 0,0 83:34.13 z_rd_int_7
921 root 0 -20 0 0 0 S 12,2 0,0 83:34.81 z_rd_int_2
922 root 0 -20 0 0 0 R 12,2 0,0 83:37.02 z_rd_int_3
925 root 0 -20 0 0 0 S 12,2 0,0 83:34.12 z_rd_int_6
924 root 0 -20 0 0 0 S 11,6 0,0 83:31.96 z_rd_int_5
923 root 0 -20 0 0 0 S 10,9 0,0 83:33.24 z_rd_int_4
448 root 0 -20 0 0 0 S 6,6 0,0 46:19.79 spl_dynamic_tas
23019 openmed+ 20 0 373432 16060 8316 S 3,3 0,0 0:17.77 php-fpm7.0
449 root 0 -20 0 0 0 S 1,3 0,0 7:03.54 spl_kmem_cache
4076 htpc 20 0 6759376 378792 63004 S 1,0 0,6 76:52.37 EmbyServer
19166 root 20 0 43088 3984 3092 R 1,0 0,0 0:00.09 top
5113 htpc 20 0 1801304 913232 34100 S 0,7 1,4 97:39.90 mongod
5746 root 20 0 0 0 0 I 0,7 0,0 0:05.67 kworker/2:1-eve
10087 root 20 0 0 0 0 I 0,7 0,0 0:05.59 kworker/7:2-eve
10342 root 20 0 0 0 0 I 0,7 0,0 0:05.63 kworker/1:1-eve
28159 root 20 0 0 0 0 I 0,7 0,0 0:02.69 kworker/0:1-eve
28248 root 20 0 0 0 0 I 0,7 0,0 0:02.46 kworker/5:2-eve
48 root 20 0 0 0 0 S 0,3 0,0 0:24.61 ksoftirqd/6
54 root 20 0 0 0 0 S 0,3 0,0 0:25.80 ksoftirqd/7
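What strikes me here: the CPUs are about 70% idle, yet the load average sits around 9. If I understand it correctly, the Linux load average counts not only runnable tasks but also tasks in uninterruptible disk sleep (state D, like txg_sync above), so a disk-bound scrub can push the load up without using much CPU. A quick way to list those tasks (plain ps, nothing OMV-specific):
root@omv4:~# ps -eo state,pid,comm | awk '$1=="D"'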
I have not used "top" very often in the past, because I never had resource problems. I also have not changed the applications I run on my openmediavault: a bit of home automation (fhem server), docker, tvheadend, unifi and emby.
But I did update the kernel, ZFS and openmediavault. I use the latest kernel from backports. Maybe something changed there.
root@omv4:~# uname -a
Linux omv4 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 GNU/Linux
root@omv4:~# dpkg -l | grep zfs
ii libzfs2linux 0.7.11-1~bpo9+1 amd64 OpenZFS filesystem library for Linux
ii openmediavault-zfs 4.0.4 amd64 OpenMediaVault plugin for ZFS
ii zfs-dkms 0.7.11-1~bpo9+1 all OpenZFS filesystem kernel modules for Linux
ii zfs-zed 0.7.11-1~bpo9+1 amd64 OpenZFS Event Daemon
ii zfsutils-linux 0.7.11-1~bpo9+1 amd64 command-line tools to manage OpenZFS filesystems
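One thing I still want to verify is that the running kernel module actually matches the updated packages, since after a dkms update the old module stays loaded until a reboot. If I'm not mistaken, this shows the loaded version:
root@omv4:~# cat /sys/module/zfs/version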
Here is a grep of the monit messages from my syslog:
Oct 14 06:24:43 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:25:13 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:25:43 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:26:13 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:26:43 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:27:13 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:27:44 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:28:14 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.0 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:28:44 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:29:14 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.5 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:29:44 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:30:14 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.6 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:30:44 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:31:14 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:31:45 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:32:15 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:32:45 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:33:15 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.6 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:33:45 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.5 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:34:15 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:34:45 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:35:15 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:35:46 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:36:16 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:36:46 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:37:16 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:37:46 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.0 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:38:16 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.5 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:38:46 omv4 monit[2656]: 'omv4' loadavg(5min) of 10.0 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:39:16 omv4 monit[2656]: 'omv4' loadavg(5min) of 10.0 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:39:47 omv4 monit[2656]: 'omv4' loadavg(5min) of 10.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:40:17 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:40:47 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:41:17 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:41:47 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:42:17 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:42:47 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:43:18 omv4 monit[2656]: 'omv4' loadavg(5min) check succeeded [current loadavg(5min)=8.0]
Oct 14 06:43:18 omv4 postfix/smtp[8230]: 25C121001DF: replace: header Subject: monit alert -- Resource limit succeeded omv4: Subject: [omv4.localdomain] monit alert -- Resource limit succeeded omv4
Here are some screenshots of the load average:
So, the high load average started tonight at midnight. In the "by year" graph you can see a high load average every month, because a scrub job is scheduled once a month. In June 2018 I replaced all the disks in my ZFS pool with larger ones, which also consumed a lot of system resources.
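The start time also fits the cron job that, if I remember correctly, Debian's zfsutils-linux package ships: a scrub of every pool on the second Sunday of each month, shortly after midnight, which matches the "scrub in progress since Sun Oct 14 00:24:01" above. This should show the schedule:
root@omv4:~# cat /etc/cron.d/zfsutils-linux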
What do you think I should do?
- Increase the load average limit to reduce the notification emails. Where can I do that? (See my monit guess below.)
- Limit the resources the scrub job is allowed to use. Where can I do that? (See the module parameter sketch below.)
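For the first point, my guess is that the limit comes from the monit configuration that openmediavault generates. The path is a guess on my part (something like /etc/monit/conf.d/openmediavault-system.conf), but judging by the syslog messages the rule should look roughly like this:
check system omv4
    if loadavg (5min) > 8.0 then alert
Since openmediavault generates these files, I assume a manual edit would be overwritten the next time the configuration is applied. Is there a proper place (environment variable or GUI setting) to raise the threshold?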
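For the second point, if I read the zfs-module-parameters man page correctly, a scrub on ZFS on Linux 0.7.x can be throttled via module parameters when the pool sees other I/O, for example zfs_scrub_delay (default 4 ticks; a higher value means a slower, gentler scrub). Untested on my side, just a sketch:
root@omv4:~# echo 8 > /sys/module/zfs/parameters/zfs_scrub_delay
Is that the recommended way, or does openmediavault offer something for this?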
Thanks for helping!
Regards Hoppel