NFS crash with exported mergerfs pool

    • OMV 4.x

      BACKGROUND INFORMATION

      Latest stable version of OMV, running this kernel:
      Linux omv 4.19.0-0.bpo.4-amd64 #1 SMP Debian 4.19.28-2~bpo9+1 (2019-03-27) x86_64 GNU/Linux

      Mergerfs pool containing 3x drives w/ ext4 filesystems - here's the fstab entry:
      /srv/dev-disk-by-label-WD6TBBAY1:/srv/dev-disk-by-id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1:/srv/dev-disk-by-label-WD3TBBAY2 /srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764 fuse.mergerfs defaults,allow_other,direct_io,use_ino,noforget,category.create=eplfs,minfreespace=10M 0 0
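The fstab entry above is dense, so here is a minimal sketch that splits it into its branches, mountpoint and options. The function name `parse_mergerfs_fstab` is made up for illustration; it is not part of mergerfs or OMV, and it assumes a well-formed six-field fstab line.

```python
def parse_mergerfs_fstab(line):
    """Split a mergerfs fstab line into (branches, mountpoint, options)."""
    source, mountpoint, fstype, opts, _dump, _passno = line.split()
    if fstype != "fuse.mergerfs":
        raise ValueError(f"not a mergerfs entry: {fstype}")
    options = {}
    for item in opts.split(","):
        # "key=value" options keep their value; bare flags are stored as True
        key, _, value = item.partition("=")
        options[key] = value if value else True
    # mergerfs lists its branches as a colon-separated source field
    return source.split(":"), mountpoint, options


# The fstab entry quoted in this post, rejoined into one string
FSTAB_LINE = (
    "/srv/dev-disk-by-label-WD6TBBAY1:"
    "/srv/dev-disk-by-id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1:"
    "/srv/dev-disk-by-label-WD3TBBAY2 "
    "/srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764 fuse.mergerfs "
    "defaults,allow_other,direct_io,use_ino,noforget,"
    "category.create=eplfs,minfreespace=10M 0 0"
)

if __name__ == "__main__":
    branches, mountpoint, options = parse_mergerfs_fstab(FSTAB_LINE)
    print("mountpoint:", mountpoint)
    for branch in branches:
        print("branch:", branch)
    for key, value in sorted(options.items()):
        print("option:", key, value)
```

Laid out this way, it is easier to compare the options in fstab (direct_io, use_ino, noforget) with the shorter option list that "mount" reports further down.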

      mergerfs (and FUSE) versions:
      • mergerfs version: 2.26.2
      • FUSE library version: 2.9.7-mergerfs_2.26.0
      • fusermount version: 2.9.7
      • using FUSE kernel interface version 7.27


      There's a single shared folder called "media" in the root of the mergerfs pool.

      NFS export:
      /export/media *(fsid=1,rw,subtree_check,insecure,no_root_squash,anonuid=1000)

      relevant bit of df when things are working:

      label-WD6TBBAY1:id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1:label-WD3TBBAY2 12T 5.8T 5.8T 51% /srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764
      /dev/sdb1 2.7T 89M 2.7T 1% /srv/dev-disk-by-label-WD3TBBAY2
      /dev/sdd1 3.6T 3.2T 221G 94% /srv/dev-disk-by-id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1
      /dev/sda1 5.5T 2.6T 2.9T 48% /srv/dev-disk-by-label-WD6TBBAY1

      ...and from "mount":
      label-WD6TBBAY1:id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1:label-WD3TBBAY2 on /export/media type fuse.mergerfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other)

      THE PROBLEM

      NFS has crashed at random twice. Each time I notice it, the mergerfs mountpoint has also disappeared on the server. Here is the relevant snippet of /var/log/syslog from one occurrence:


      May 22 00:06:25 omv kernel: [99245.846518] ------------[ cut here ]------------
      May 22 00:06:25 omv kernel: [99245.846523] nfsd: non-standard errno: -103
      May 22 00:06:25 omv kernel: [99245.846618] WARNING: CPU: 1 PID: 816 at /build/linux-tpKJY9/linux-4.19.28/fs/nfsd/nfsproc.c:820 nfserrno+0x65/0x80 [nfsd]
      May 22 00:06:25 omv kernel: [99245.846620] Modules linked in: msr softdog cpufreq_powersave cpufreq_userspace cpufreq_conservative radeon edac_mce_amd kvm_amd ccp ttm rng_core drm_kms_helper kvm evdev drm irqbypass k10temp ipmi_si pcspkr ipmi_devintf ipmi_msghandler sg i2c_algo_bit sp5100_tco button pcc_cpufreq acpi_cpufreq fuse nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor uas usb_storage sd_mod raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ohci_pci ata_generic ahci libahci pata_atiixp libata tg3 libphy i2c_piix4 scsi_mod ohci_hcd ehci_pci xhci_pci ehci_hcd xhci_hcd usbcore usb_common
      May 22 00:06:25 omv kernel: [99245.846735] CPU: 1 PID: 816 Comm: nfsd Not tainted 4.19.0-0.bpo.4-amd64 #1 Debian 4.19.28-2~bpo9+1
      May 22 00:06:25 omv kernel: [99245.846737] Hardware name: HP ProLiant MicroServerr, BIOS O41 10/01/2013
      May 22 00:06:25 omv kernel: [99245.846762] RIP: 0010:nfserrno+0x65/0x80 [nfsd]
      May 22 00:06:25 omv kernel: [99245.846767] Code: 13 05 00 00 b8 00 00 00 05 74 02 f3 c3 48 83 ec 08 89 fe 48 c7 c7 7a 15 83 c0 89 44 24 04 c6 05 1c 13 05 00 01 e8 0b e8 47 c8 <0f> 0b 8b 44 24 04 48 83 c4 08 c3 31 c0 c3 0f 1f 00 66 2e 0f 1f 84
      May 22 00:06:25 omv kernel: [99245.846770] RSP: 0018:ffffb8c9c1607d98 EFLAGS: 00010282
      May 22 00:06:25 omv kernel: [99245.846774] RAX: 0000000000000000 RBX: ffff9572947e1008 RCX: 0000000000000006
      May 22 00:06:25 omv kernel: [99245.846777] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff957297c966a0
      May 22 00:06:25 omv kernel: [99245.846780] RBP: ffff9572947e1168 R08: 0000000000000001 R09: 000000000000032f
      May 22 00:06:25 omv kernel: [99245.846782] R10: ffffb8c9c63efd60 R11: 0000000000000000 R12: ffff95726f5020c0
      May 22 00:06:25 omv kernel: [99245.846784] R13: ffff957272bc5cc0 R14: 00000000ffffff99 R15: ffff957294027780
      May 22 00:06:25 omv kernel: [99245.846788] FS: 0000000000000000(0000) GS:ffff957297c80000(0000) knlGS:0000000000000000
      May 22 00:06:25 omv kernel: [99245.846791] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      May 22 00:06:25 omv kernel: [99245.846794] CR2: 00007fa4b2c55000 CR3: 000000014020a000 CR4: 00000000000006e0
      May 22 00:06:25 omv kernel: [99245.846796] Call Trace:
      May 22 00:06:25 omv kernel: [99245.846828] nfsd_rename+0x1ca/0x2b0 [nfsd]
      May 22 00:06:25 omv kernel: [99245.846855] nfsd3_proc_rename+0x9b/0x130 [nfsd]
      May 22 00:06:25 omv kernel: [99245.846878] nfsd_dispatch+0xb1/0x240 [nfsd]
      May 22 00:06:25 omv kernel: [99245.846930] svc_process_common+0x3bf/0x780 [sunrpc]
      May 22 00:06:25 omv kernel: [99245.846969] svc_process+0xe9/0x100 [sunrpc]
      May 22 00:06:25 omv kernel: [99245.846991] nfsd+0xe3/0x150 [nfsd]
      May 22 00:06:25 omv kernel: [99245.846999] kthread+0xf8/0x130
      May 22 00:06:25 omv kernel: [99245.847021] ? nfsd_destroy+0x60/0x60 [nfsd]
      May 22 00:06:25 omv kernel: [99245.847026] ? kthread_create_worker_on_cpu+0x70/0x70
      May 22 00:06:25 omv kernel: [99245.847032] ret_from_fork+0x22/0x40
      May 22 00:06:25 omv kernel: [99245.847037] ---[ end trace b2717fa65f13ab36 ]---
      May 22 00:06:32 omv collectd[1236]: statvfs(/srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764) failed: Transport endpoint is not connected
      May 22 00:06:42 omv collectd[1236]: statvfs(/srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764) failed: Transport endpoint is not connected
      May 22 00:06:46 omv monit[1226]: 'filesystem_srv_dev-disk-by-id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1' space usage 88.9% matches resource limit [space usage>85.0%]
      May 22 00:06:46 omv monit[1226]: Device /srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764 not found in /etc/mtab
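To make the numbers in the trace above easier to read, here is a small sketch that decodes them. The errno table is hard-coded from the Linux generic errno headers rather than taken from the running system. Treating R14 as the register that holds the error code is an assumption, based only on its low 32 bits matching the "-103" in the warning.

```python
# Linux errno values (asm-generic errno headers); hard-coded so the result
# does not depend on the platform this script runs on.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    103: ("ECONNABORTED", "Software caused connection abort"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    116: ("ESTALE", "Stale file handle"),
}


def to_signed32(value):
    """Interpret the low 32 bits of a register value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def describe(err):
    """Return a readable description of a (possibly negative) errno value."""
    name, text = LINUX_ERRNO.get(abs(err), ("UNKNOWN", "not in this table"))
    return f"{err}: {name} ({text})"


if __name__ == "__main__":
    r14 = 0x00000000FFFFFF99  # R14 from the register dump
    print(describe(to_signed32(r14)))  # -103: ECONNABORTED (Software caused connection abort)
    print(describe(-107))              # the error collectd reports afterwards
```

So nfsd received ECONNABORTED from the underlying filesystem during a rename, and a few seconds later collectd sees ENOTCONN on the pool. One possible reading, not a confirmed diagnosis, is that the FUSE connection to the mergerfs process was torn down at that moment.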


      This leads to "transport endpoint is not connected" errors on the OMV server, and clients attempting to read or write get input/output and/or stale file handle errors. Rebooting the OMV server restores service for a while, but this has already happened twice within 72 hours.
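A rough sketch of how the failed state could be detected without waiting for a client to complain: it probes the pool with statvfs (the same call collectd makes) and looks for the mountpoint in /proc/mounts. The function names are made up for this example, and the path is the pool mountpoint from this post.

```python
import errno
import os


def probe_mount(path):
    """Return 'ok', 'disconnected', 'missing' or 'error:<NAME>' for path."""
    try:
        os.statvfs(path)
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            # "Transport endpoint is not connected": FUSE daemon is gone
            return "disconnected"
        if exc.errno == errno.ENOENT:
            return "missing"
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


def is_mounted(path, mounts_file="/proc/mounts"):
    """Check whether path appears as a mountpoint in a mounts-style file."""
    with open(mounts_file) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2 and fields[1] == path:
                return True
    return False


if __name__ == "__main__":
    pool = "/srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764"
    print("statvfs:", probe_mount(pool))
    print("listed in /proc/mounts:", is_mounted(pool))
```

Run from cron or a systemd timer with its output logged, something like this would at least pin down the exact time the pool drops, so it can be matched against the syslog entries.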

      I noticed a similar problem reported on Unraid's forums: forums.unraid.net/bug-reports/…60-nfs-kernel-crash-r199/

      I don't know whether the kernel/NFS, FUSE or mergerfs is at fault here.

      How can I debug this? mergerfs is about the only thing built into OMV that I can find that suits my needs, and I HAVE to be able to export the pool over NFS: it offers the best performance for my Kodi clients, and other Linux software/services elsewhere on my network also read from and write to the pool.