ZFS scrub job uses a lot of system ressources -> email notifications -> monit alert -- Resource limit matched

hoppel118 · 14. Oktober 2018

Hi guys,

tonight my openmediavault server informed/alerted me about load avarage matching the resource limit. I got round about 30 Emails.

So I googled around and found this thread: monit alert -- Resource limit matched/succeeded localhost But that thread doesn't really help, because I have a high load avarage and not high space usage.

I had the feeling that there may be a running scrub job. zpool status gives me the following output:

Code

root@omv4:~# zpool status
  pool: mediatank
 state: ONLINE
  scan: scrub in progress since Sun Oct 14 00:24:01 2018
        20,5T scanned out of 34,7T at 601M/s, 6h54m to go
        0B repaired, 58,93% done
config:


        NAME                                    STATE     READ WRITE CKSUM
        xxxxxxxxx                               ONLINE       0     0     0
          raidz2-0                              ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx  ONLINE       0     0     0
            ata-WDC_WD100EFAX-68LHPN0_xxxxxxxx  ONLINE       0     0     0


errors: No known data errors

Alles anzeigen

BINGO! top gives me the following output:

Code

Tasks: 299 total,   2 running, 212 sleeping,   0 stopped,  12 zombie
%Cpu(s):  0,5 us, 29,0 sy,  0,0 ni, 70,4 id,  0,0 wa,  0,0 hi,  0,1 si,  0,0 st
KiB Mem : 65921912 total, 25970544 free, 37144352 used,  2807016 buff/cache
KiB Swap: 67054588 total, 67054588 free,        0 used. 28025340 avail Mem


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1149 root      20   0       0      0      0 D  29,7  0,0 176:48.10 txg_sync
    2 root      20   0       0      0      0 S  12,9  0,0  83:35.78 kthreadd
  919 root       0 -20       0      0      0 S  12,5  0,0  83:31.55 z_rd_int_0
  920 root       0 -20       0      0      0 S  12,5  0,0  83:38.08 z_rd_int_1
  926 root       0 -20       0      0      0 S  12,5  0,0  83:34.13 z_rd_int_7
  921 root       0 -20       0      0      0 S  12,2  0,0  83:34.81 z_rd_int_2
  922 root       0 -20       0      0      0 R  12,2  0,0  83:37.02 z_rd_int_3
  925 root       0 -20       0      0      0 S  12,2  0,0  83:34.12 z_rd_int_6
  924 root       0 -20       0      0      0 S  11,6  0,0  83:31.96 z_rd_int_5
  923 root       0 -20       0      0      0 S  10,9  0,0  83:33.24 z_rd_int_4
  448 root       0 -20       0      0      0 S   6,6  0,0  46:19.79 spl_dynamic_tas
23019 openmed+  20   0  373432  16060   8316 S   3,3  0,0   0:17.77 php-fpm7.0
  449 root       0 -20       0      0      0 S   1,3  0,0   7:03.54 spl_kmem_cache
 4076 htpc      20   0 6759376 378792  63004 S   1,0  0,6  76:52.37 EmbyServer
19166 root      20   0   43088   3984   3092 R   1,0  0,0   0:00.09 top
 5113 htpc      20   0 1801304 913232  34100 S   0,7  1,4  97:39.90 mongod
 5746 root      20   0       0      0      0 I   0,7  0,0   0:05.67 kworker/2:1-eve
10087 root      20   0       0      0      0 I   0,7  0,0   0:05.59 kworker/7:2-eve
10342 root      20   0       0      0      0 I   0,7  0,0   0:05.63 kworker/1:1-eve
28159 root      20   0       0      0      0 I   0,7  0,0   0:02.69 kworker/0:1-eve
28248 root      20   0       0      0      0 I   0,7  0,0   0:02.46 kworker/5:2-eve
   48 root      20   0       0      0      0 S   0,3  0,0   0:24.61 ksoftirqd/6
   54 root      20   0       0      0      0 S   0,3  0,0   0:25.80 ksoftirqd/7

Alles anzeigen

I did not use "top" very often in the past, because I never had resource problems. I did not change the applications I use my openmediavault for. A little bit of home automation (fhem server), docker, tvheadend, unifi and emby.

But I updated the kernel, the zfs and the openmediavault versions. I use the latest kernel from backport. Maybe something changed.

Code

root@omv4:~# uname -a
Linux omv4 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1 (2018-09-13) x86_64 GNU/Linux

Code

root@omv4:~# dpkg -l | grep zfs
ii  libzfs2linux                        0.7.11-1~bpo9+1                amd64        OpenZFS filesystem library for Linux
ii  openmediavault-zfs                  4.0.4                          amd64        OpenMediaVault plugin for ZFS
ii  zfs-dkms                            0.7.11-1~bpo9+1                all          OpenZFS filesystem kernel modules for Linux
ii  zfs-zed                             0.7.11-1~bpo9+1                amd64        OpenZFS Event Daemon
ii  zfsutils-linux                      0.7.11-1~bpo9+1                amd64        command-line tools to manage OpenZFS filesystems

Here a grep of monit messages in my syslog:

Code

Oct 14 06:24:43 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:25:13 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:25:43 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:26:13 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:26:43 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:27:13 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:27:44 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:28:14 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.0 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:28:44 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:29:14 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.5 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:29:44 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:30:14 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.6 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:30:44 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:31:14 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:31:45 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:32:15 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:32:45 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:33:15 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.6 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:33:45 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.5 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:34:15 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:34:45 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:35:15 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:35:46 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:36:16 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:36:46 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:37:16 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.4 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:37:46 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.0 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:38:16 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.5 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:38:46 omv4 monit[2656]: 'omv4' loadavg(5min) of 10.0 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:39:16 omv4 monit[2656]: 'omv4' loadavg(5min) of 10.0 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:39:47 omv4 monit[2656]: 'omv4' loadavg(5min) of 10.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:40:17 omv4 monit[2656]: 'omv4' loadavg(5min) of 9.8 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:40:47 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:41:17 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.9 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:41:47 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.7 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:42:17 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:42:47 omv4 monit[2656]: 'omv4' loadavg(5min) of 8.2 matches resource limit [loadavg(5min)>8.0]
Oct 14 06:43:18 omv4 monit[2656]: 'omv4' loadavg(5min) check succeeded [current loadavg(5min)=8.0]
Oct 14 06:43:18 omv4 postfix/smtp[8230]: 25C121001DF: replace: header Subject: monit alert --  Resource limit succeeded omv4: Subject: [omv4.localdomain] monit alert --  Resource limit succeeded omv4

Alles anzeigen

Here are some Screenshots from the load avarage:

So, the high load avarage started tonight at 0 o'clock. And in the graphic "by year" you can see that there is an high load avarage every month, because there is a planned scrub job once a month. In June 2018 I replaced all disks of my zfs pool to larger ones, which also took a lot of system resources.

What do you think, what should I do?

Increase the load avarage value to reduce the notification emails. Where is this possible?
Decrease the resources the scrub job can use. Where is this possible?

Thanks for helping!

Regards Hoppel

hoppel118 · 14. Oktober 2018

Some more information about my system and it's usage:

My Xeon is not the most powerful, but it should be enough for my use cases.

There is also a lot of free memory:

In my opinion the cpu and the memory is not the problem.

Regards Hoppel

hoppel118 · 14. Oktober 2018

There are a lot of tunables for zol:

Code

root@omv4:~# ls /sys/module/zfs/parameters
dbuf_cache_hiwater_pct                 spa_load_verify_metadata       zfs_arc_sys_free                zfs_max_recordsize                    zfs_scan_ignore_errors                         zfs_vdev_queue_depth_pct
dbuf_cache_lowater_pct                 spa_slop_shift                 zfs_autoimport_disable          zfs_mdcomp_disable                    zfs_scan_min_time_ms                           zfs_vdev_raidz_impl
dbuf_cache_max_bytes                   zfetch_array_rd_sz             zfs_checksums_per_second        zfs_metaslab_fragmentation_threshold  zfs_scrub_delay                                zfs_vdev_read_gap_limit
dbuf_cache_max_shift                   zfetch_max_distance            zfs_compressed_arc_enabled      zfs_metaslab_segment_weight_enabled   zfs_send_corrupt_data                          zfs_vdev_scheduler
dmu_object_alloc_chunk_shift           zfetch_max_streams             zfs_dbgmsg_enable               zfs_metaslab_switch_threshold         zfs_send_queue_length                          zfs_vdev_scrub_max_active
ignore_hole_birth                      zfetch_min_sec_reap            zfs_dbgmsg_maxsize              zfs_mg_fragmentation_threshold        zfs_sync_pass_deferred_free                    zfs_vdev_scrub_min_active
l2arc_feed_again                       zfs_abd_scatter_enabled        zfs_dbuf_state_index            zfs_mg_noalloc_threshold              zfs_sync_pass_dont_compress                    zfs_vdev_sync_read_max_active
l2arc_feed_min_ms                      zfs_abd_scatter_max_order      zfs_deadman_checktime_ms        zfs_multihost_fail_intervals          zfs_sync_pass_rewrite                          zfs_vdev_sync_read_min_active
l2arc_feed_secs                        zfs_admin_snapshot             zfs_deadman_enabled             zfs_multihost_history                 zfs_sync_taskq_batch_pct                       zfs_vdev_sync_write_max_active
l2arc_headroom                         zfs_arc_average_blocksize      zfs_deadman_synctime_ms         zfs_multihost_import_intervals        zfs_top_maxinflight                            zfs_vdev_sync_write_min_active
l2arc_headroom_boost                   zfs_arc_dnode_limit            zfs_dedup_prefetch              zfs_multihost_interval                zfs_txg_history                                zfs_vdev_write_gap_limit
l2arc_noprefetch                       zfs_arc_dnode_limit_percent    zfs_delay_min_dirty_percent     zfs_multilist_num_sublists            zfs_txg_timeout                                zfs_zevent_cols
l2arc_norw                             zfs_arc_dnode_reduce_percent   zfs_delay_scale                 zfs_nocacheflush                      zfs_vdev_aggregation_limit                     zfs_zevent_console
l2arc_write_boost                      zfs_arc_grow_retry             zfs_delays_per_second           zfs_nopwrite_enabled                  zfs_vdev_async_read_max_active                 zfs_zevent_len_max
l2arc_write_max                        zfs_arc_lotsfree_percent       zfs_delete_blocks               zfs_no_scrub_io                       zfs_vdev_async_read_min_active                 zil_replay_disable
metaslab_aliquot                       zfs_arc_max                    zfs_dirty_data_max              zfs_no_scrub_prefetch                 zfs_vdev_async_write_active_max_dirty_percent  zil_slog_bulk
metaslab_bias_enabled                  zfs_arc_meta_adjust_restarts   zfs_dirty_data_max_max          zfs_object_mutex_size                 zfs_vdev_async_write_active_min_dirty_percent  zio_delay_max
metaslab_debug_load                    zfs_arc_meta_limit             zfs_dirty_data_max_max_percent  zfs_pd_bytes_max                      zfs_vdev_async_write_max_active                zio_dva_throttle_enabled
metaslab_debug_unload                  zfs_arc_meta_limit_percent     zfs_dirty_data_max_percent      zfs_per_txg_dirty_frees_percent       zfs_vdev_async_write_min_active                zio_requeue_io_start_cut_in_line
metaslab_fragmentation_factor_enabled  zfs_arc_meta_min               zfs_dirty_data_sync             zfs_prefetch_disable                  zfs_vdev_cache_bshift                          zio_taskq_batch_pct
metaslab_lba_weighting_enabled         zfs_arc_meta_prune             zfs_dmu_offset_next_sync        zfs_read_chunk_size                   zfs_vdev_cache_max                             zvol_inhibit_dev
metaslab_preload_enabled               zfs_arc_meta_strategy          zfs_expire_snapshot             zfs_read_history                      zfs_vdev_cache_size                            zvol_major
metaslabs_per_vdev                     zfs_arc_min                    zfs_flags                       zfs_read_history_hits                 zfs_vdev_max_active                            zvol_max_discard_blocks
send_holes_without_birth_time          zfs_arc_min_prefetch_lifespan  zfs_free_bpobj_enabled          zfs_recover                           zfs_vdev_mirror_non_rotating_inc               zvol_prefetch_bytes
spa_asize_inflation                    zfs_arc_pc_percent             zfs_free_leak_on_eio            zfs_recv_queue_length                 zfs_vdev_mirror_non_rotating_seek_inc          zvol_request_sync
spa_config_path                        zfs_arc_p_dampener_disable     zfs_free_max_blocks             zfs_resilver_delay                    zfs_vdev_mirror_rotating_inc                   zvol_threads
spa_load_verify_data                   zfs_arc_p_min_shift            zfs_free_min_time_ms            zfs_resilver_min_time_ms              zfs_vdev_mirror_rotating_seek_inc              zvol_volmode
spa_load_verify_maxinflight            zfs_arc_shrink_shift           zfs_immediate_write_sz          zfs_scan_idle                         zfs_vdev_mirror_rotating_seek_offset

Alles anzeigen

I found this issue from 2012 at github, where the developer @behlendorf listet some of the relevant tunables for scrub io performance:

Zitat von behlendorf

There are tunables for this, however we haven't gone to any great lengths to tune each to the exact right value. The current setting were brought over from OpenSolaris and may not be exactly right for Linux. And feedback you can proved on what the default should be would be helpful.
int zfs_top_maxinflight = 32; /* maximum I/Os per top-level */ int zfs_resilver_delay = 2; /* number of ticks to delay resilver */ int zfs_scrub_delay = 4; /* number of ticks to delay scrub */ int zfs_scan_idle = 50; /* idle window in clock ticks */ int zfs_scan_min_time_ms = 1000; /* min millisecs to scrub per txg */ int zfs_free_min_time_ms = 1000; /* min millisecs to free per txg */ int zfs_resilver_min_time_ms = 3000; /* min millisecs to resilver per txg */ int zfs_no_scrub_io = B_FALSE; /* set to disable scrub i/o */ int zfs_no_scrub_prefetch = B_FALSE; /* set to disable srub prefetching */

But I do not understand what I are these tunables for. So, at first I had a look for my configuration:

Code

root@omv4:~# cat /sys/module/zfs/parameters/zfs_top_maxinflight
32
root@omv4:~# cat /sys/module/zfs/parameters/zfs_resilver_delay
2
root@omv4:~# cat /sys/module/zfs/parameters/zfs_scrub_delay
4
root@omv4:~# cat /sys/module/zfs/parameters/zfs_scan_idle
50
root@omv4:~# cat /sys/module/zfs/parameters/zfs_scan_min_time_ms
1000
root@omv4:~# cat /sys/module/zfs/parameters/zfs_free_min_time_ms
1000
root@omv4:~# cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
3000
root@omv4:~# cat /sys/module/zfs/parameters/zfs_no_scrub_io
0
root@omv4:~# cat /sys/module/zfs/parameters/zfs_no_scrub_prefetch
0

Alles anzeigen

Since behlendorf commented the issue in 2012 nothing changed for the default configuration values. My configured values use the default values for all of these tunables.

Do you know some kind of documentation where these tunables are described in more detail?

How did you tune your zpool?

Regards Hoppel

geaves · 14. Oktober 2018

I haven't used ZFS for some time now, not since I used nas4free, however I did find some info based upon your title, and the problem could be related to the ZFS ARC Cache.

But I did find this site which has further links within the article which may be of some use.

crashtest · 15. Oktober 2018

That's a huge pool, so I guess you wouldn't want to hear anything about restoring from backup? (Just kidding )
___________________________________________________________________

On the E-mails, I simply unchecked the boxes under System, Notification, Notifications Tab, System. The rational behind that decision was the sporadic nuisance E-mails, along with what was being monitored. If something in the system itself is critical, (software or hardware) it's unlikely that an E-mail would give enough heads up to prevent an actual failure.
I set notifications for storage only, where noting hard drive SMART errors might be useful, before a drive fails completely.
__________________________________________________________________

What I find notable is the amount of memory in use on your server. In my experience, ZFS will runs a large Page cache but actual memory usage is not high. (I'm not running dedup.)
But,, having no experience with a pool the size of yours, I don't know. Perhaps memory usage scales with the size of the pool or you have other functions running.
___________________________________________________________________

I just set up a ZFS mirror (with 1.5TB of data), a few weeks back, on box with a 64 bit Atom processor and a mere 4GB of ram. It's fully up-to-date with OMV4, kernel 4.18.0 and the latest ZFS plugin. (ZOL v0.7.11-1)
I'm running a scrub manually, and will leave it up for a couple weeks or so to collect some performance stat's for comparison.

BTW: I don't think your CPU has anything to do with this either.
When I ran the first scrub on the Atom box, it actually ran faster than the i3 CPU on my primary server, which has 3 times the RAM. Since the pool size is identical between the two, I attributed the obvious difference in speed, to the speed of the drives involved. The Atom box has 7200 RPM drives, versus the i3 with 5400rpm drives.
While this is just a single side-by-side comparison, CPU speed and quantity of RAM doesn't have the impact one might think it would. (I could show you performance stat's from the i3 but, since it's still running OMV3, it probably wouldn't apply.)
____________________________________

Edit: the Atom processor box (OMV4), with 4GB ram, completed a scrub of 1.46TB in 3h16m. The same scrub was done with an i3 (OMV3) and 12GB RAM in 5h00m. Disk speed appears to be more of a factor than CPU or RAM.

tkaiser · 15. Oktober 2018

Zitat von hoppel118

In my opinion the cpu and the memory is not the problem

Sure, it's the concept of average load in general on Linux: http://www.brendangregg.com/bl…/linux-load-averages.html

In other words there is no problem: Simply ignore the notifications and if you want to find out why 'average load' is high I suggest installing sysstat package and then running 'iostat 60' in parallel next time the scrub runs.

crashtest · 18. Oktober 2018

With your pool being raidz, your parallel read/write throughput is much higher than my little mirror (effectively one disk).
While it's obviously not an apples to apples comparison, this is what my scrubs look like.

OM4 (kernel 4.18.0), 64 bit Atom, 4GB of ram, ZOL v0.7.11-1) 1.5TB zmirror.

While not applicable, the following weekly's are from OMV 3.0.99, running on an i3, 12G RAM, with the same mirror (drives are a bit slower). The 15th was when the last scrub kicked off.

hoppel118 · 28. Oktober 2018

Hi guys,

sorry for the late response. Didn't have much time the last weeks. Anyway, I want to thank you for your answers!

Zitat von flmaxey

That's a huge pool, so I guess you wouldn't want to hear anything about restoring from backup? (Just kidding )

Harharrrr

Zitat von flmaxey

On the E-mails, I simply unchecked the boxes under System, Notification, Notifications Tab, System. The rational behind that decision was the sporadic nuisance E-mails, along with what was being monitored. If something in the system itself is critical, (software or hardware) it's unlikely that an E-mail would give enough heads up to prevent an actual failure.
I set notifications for storage only, where noting hard drive SMART errors might be useful, before a drive fails completely.

Ok, but I like the idea, that my systems informs me about any problem. I do not like the idea to to uncheck the boxes under the notifications tab.

Zitat von flmaxey

What I find notable is the amount of memory in use on your server. In my experience, ZFS will runs a large Page cache but actual memory usage is not high. (I'm not running dedup.)
But,, having no experience with a pool the size of yours, I don't know. Perhaps memory usage scales with the size of the pool or you have other functions running.

Yeah, I also recognized that my server uses a high amount of memory and I am also not running dedup. But this was already the case, before I replaced the 8x4TB WD Red to 8x10TB WD Red in my raid-z2 in June of this year. On the other side it's only about 50% of my ram:

Code

root@omv4:~# free -h
              total        used        free      shared  buff/cache   available
Mem:            62G         29G         30G         43M        3,5G         33G
Swap:           63G          0B         63G

Do you know any command to check what uses the ram exactly?

Zitat von flmaxey

With your pool being raidz, your parallel read/write throughput is much higher than my little mirror (effectively one disk).
While it's obviously not an apples to apples comparison, this is what my scrubs look like.

Again I want to thank you for all the work you investigated into this.

My wd reds are also 5400rpm disks, but I have 8 of them. So, yes... my parallel read/write speed is much higher than with a mirror pool and it's not really comparable.

Zitat von tkaiser

Sure, it's the concept of average load in general on Linux: brendangregg.com/blog/2017-08-08/linux-load-averages.html

In other words there is no problem: Simply ignore the notifications and if you want to find out why 'average load' is high I suggest installing sysstat package and then running 'iostat 60' in parallel next time the scrub runs.

I will have a look at the link you posted and sysstat. I also understood that there is not really a problem. But my omv informed me with 30 emails about the high load average. This is simply to much. As I understand it now, I have to disable the check box and I won't see any emails about this again, but it's not possible to reduce that amount of emails.

It would be great if anybody has the answers to the following both questions:

Zitat von hoppel118

What do you think, what should I do?

- Increase the load avarage value to reduce the notification emails. Where is this possible?

- Decrease the resources the scrub job can use. Where is this possible?

Alles anzeigen

Thanks and regards Hoppel

hoppel118 · 15. Februar 2019

Hi guys,

this issue got solved by it self. The only thing changed is a kernel update from 4.18 to 4.19. I didn't get any emails regarding "ressource limit succeeded" from monit. There were already two scrubs with kernel 4.19 installed.

Thanks for all your suggestions and your help.

Regards Hoppel

Jetzt mitmachen!