NFS crash with exported mergerfs pool

  • BACKGROUND INFORMATION


    Latest stable version of OMV:
    Linux omv 4.19.0-0.bpo.4-amd64 #1 SMP Debian 4.19.28-2~bpo9+1 (2019-03-27) x86_64 GNU/Linux


    mergerfs pool containing three drives with ext4 filesystems. Here's the fstab entry:
    /srv/dev-disk-by-label-WD6TBBAY1:/srv/dev-disk-by-id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1:/srv/dev-disk-by-label-WD3TBBAY2 /srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764 fuse.mergerfs defaults,allow_other,direct_io,use_ino,noforget,category.create=eplfs,minfreespace=10M 0 0
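
    To make the entry above easier to read, here is a minimal Python sketch that splits it into its branches, mount point and options. The function name `parse_mergerfs_fstab` is made up for illustration; it only does plain string splitting and does not validate anything against mergerfs itself.

```python
def parse_mergerfs_fstab(line):
    """Split one fstab line into branches, mount point, type and options."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 whitespace-separated fstab fields")
    spec, mountpoint, fstype, opts = fields[:4]

    options = {}
    for opt in opts.split(","):
        key, sep, value = opt.partition("=")
        # Flags without "=" (e.g. noforget) are stored as True.
        options[key] = value if sep else True

    return {
        "branches": spec.split(":"),  # mergerfs branches are colon-separated
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": options,
    }


# The fstab entry from this post, split over several string literals.
entry = (
    "/srv/dev-disk-by-label-WD6TBBAY1:"
    "/srv/dev-disk-by-id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1:"
    "/srv/dev-disk-by-label-WD3TBBAY2 "
    "/srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764 fuse.mergerfs "
    "defaults,allow_other,direct_io,use_ino,noforget,"
    "category.create=eplfs,minfreespace=10M 0 0"
)

parsed = parse_mergerfs_fstab(entry)
for branch in parsed["branches"]:
    print("branch:", branch)
print("mountpoint:", parsed["mountpoint"])
print("options:", parsed["options"])
```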


    mergerfs (and FUSE) versions:

    • mergerfs version: 2.26.2
    • FUSE library version: 2.9.7-mergerfs_2.26.0
    • fusermount version: 2.9.7
    • using FUSE kernel interface version 7.27


    There's a single shared folder called "media" in the root of the mergerfs pool.


    NFS export:
    /export/media *(fsid=1,rw,subtree_check,insecure,no_root_squash,anonuid=1000)


    relevant bit of df when things are working:


    label-WD6TBBAY1:id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1:label-WD3TBBAY2 12T 5.8T 5.8T 51% /srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764
    /dev/sdb1 2.7T 89M 2.7T 1% /srv/dev-disk-by-label-WD3TBBAY2
    /dev/sdd1 3.6T 3.2T 221G 94% /srv/dev-disk-by-id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1
    /dev/sda1 5.5T 2.6T 2.9T 48% /srv/dev-disk-by-label-WD6TBBAY1


    ...and from "mount":
    label-WD6TBBAY1:id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1:label-WD3TBBAY2 on /export/media type fuse.mergerfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other)


    THE PROBLEM


    I have had two seemingly random NFS crashes. By the time I notice, the mergerfs mount point has also disappeared on the server. Here is the relevant snippet of /var/log/syslog from one of them:



    May 22 00:06:25 omv kernel: [99245.846518] ------------[ cut here ]------------
    May 22 00:06:25 omv kernel: [99245.846523] nfsd: non-standard errno: -103
    May 22 00:06:25 omv kernel: [99245.846618] WARNING: CPU: 1 PID: 816 at /build/linux-tpKJY9/linux-4.19.28/fs/nfsd/nfsproc.c:820 nfserrno+0x65/0x80 [nfsd]
    May 22 00:06:25 omv kernel: [99245.846620] Modules linked in: msr softdog cpufreq_powersave cpufreq_userspace cpufreq_conservative radeon edac_mce_amd kvm_amd ccp ttm rng_core drm_kms_helper kvm evdev drm irqbypass k10temp ipmi_si pcspkr ipmi_devintf ipmi_msghandler sg i2c_algo_bit sp5100_tco button pcc_cpufreq acpi_cpufreq fuse nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor uas usb_storage sd_mod raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ohci_pci ata_generic ahci libahci pata_atiixp libata tg3 libphy i2c_piix4 scsi_mod ohci_hcd ehci_pci xhci_pci ehci_hcd xhci_hcd usbcore usb_common
    May 22 00:06:25 omv kernel: [99245.846735] CPU: 1 PID: 816 Comm: nfsd Not tainted 4.19.0-0.bpo.4-amd64 #1 Debian 4.19.28-2~bpo9+1
    May 22 00:06:25 omv kernel: [99245.846737] Hardware name: HP ProLiant MicroServerr, BIOS O41 10/01/2013
    May 22 00:06:25 omv kernel: [99245.846762] RIP: 0010:nfserrno+0x65/0x80 [nfsd]
    May 22 00:06:25 omv kernel: [99245.846767] Code: 13 05 00 00 b8 00 00 00 05 74 02 f3 c3 48 83 ec 08 89 fe 48 c7 c7 7a 15 83 c0 89 44 24 04 c6 05 1c 13 05 00 01 e8 0b e8 47 c8 <0f> 0b 8b 44 24 04 48 83 c4 08 c3 31 c0 c3 0f 1f 00 66 2e 0f 1f 84
    May 22 00:06:25 omv kernel: [99245.846770] RSP: 0018:ffffb8c9c1607d98 EFLAGS: 00010282
    May 22 00:06:25 omv kernel: [99245.846774] RAX: 0000000000000000 RBX: ffff9572947e1008 RCX: 0000000000000006
    May 22 00:06:25 omv kernel: [99245.846777] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff957297c966a0
    May 22 00:06:25 omv kernel: [99245.846780] RBP: ffff9572947e1168 R08: 0000000000000001 R09: 000000000000032f
    May 22 00:06:25 omv kernel: [99245.846782] R10: ffffb8c9c63efd60 R11: 0000000000000000 R12: ffff95726f5020c0
    May 22 00:06:25 omv kernel: [99245.846784] R13: ffff957272bc5cc0 R14: 00000000ffffff99 R15: ffff957294027780
    May 22 00:06:25 omv kernel: [99245.846788] FS: 0000000000000000(0000) GS:ffff957297c80000(0000) knlGS:0000000000000000
    May 22 00:06:25 omv kernel: [99245.846791] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    May 22 00:06:25 omv kernel: [99245.846794] CR2: 00007fa4b2c55000 CR3: 000000014020a000 CR4: 00000000000006e0
    May 22 00:06:25 omv kernel: [99245.846796] Call Trace:
    May 22 00:06:25 omv kernel: [99245.846828] nfsd_rename+0x1ca/0x2b0 [nfsd]
    May 22 00:06:25 omv kernel: [99245.846855] nfsd3_proc_rename+0x9b/0x130 [nfsd]
    May 22 00:06:25 omv kernel: [99245.846878] nfsd_dispatch+0xb1/0x240 [nfsd]
    May 22 00:06:25 omv kernel: [99245.846930] svc_process_common+0x3bf/0x780 [sunrpc]
    May 22 00:06:25 omv kernel: [99245.846969] svc_process+0xe9/0x100 [sunrpc]
    May 22 00:06:25 omv kernel: [99245.846991] nfsd+0xe3/0x150 [nfsd]
    May 22 00:06:25 omv kernel: [99245.846999] kthread+0xf8/0x130
    May 22 00:06:25 omv kernel: [99245.847021] ? nfsd_destroy+0x60/0x60 [nfsd]
    May 22 00:06:25 omv kernel: [99245.847026] ? kthread_create_worker_on_cpu+0x70/0x70
    May 22 00:06:25 omv kernel: [99245.847032] ret_from_fork+0x22/0x40
    May 22 00:06:25 omv kernel: [99245.847037] ---[ end trace b2717fa65f13ab36 ]---
    May 22 00:06:32 omv collectd[1236]: statvfs(/srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764) failed: Transport endpoint is not connected
    May 22 00:06:42 omv collectd[1236]: statvfs(/srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764) failed: Transport endpoint is not connected
    May 22 00:06:46 omv monit[1226]: 'filesystem_srv_dev-disk-by-id-usb-WD_Elements_25A1_575833314434383746543045-0-0-part1' space usage 88.9% matches resource limit [space usage>85.0%]
    May 22 00:06:46 omv monit[1226]: Device /srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764 not found in /etc/mtab



    This leads to "Transport endpoint is not connected" errors on the OMV server, and clients attempting to read or write get input/output and/or stale file handle errors. Rebooting the OMV server restores service for a while, but this has happened twice within 72 hours.
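
    As a rough aid for reading the log, the sketch below picks out the two kinds of lines that matter in the snippet: the nfsd "non-standard errno" warning and the later "Transport endpoint is not connected" messages. On Linux, errno 103 is ECONNABORTED and 107 is ENOTCONN (the latter is the one whose message reads "Transport endpoint is not connected"). That would be consistent with the FUSE connection behind the mount having gone away, although the log alone does not prove which component failed first. The function name `summarize` and the hardcoded errno table are my own choices for illustration.

```python
import re

# Linux errno values relevant to this trace. Hardcoded on purpose so the
# sketch gives the same names on any platform it is run on.
LINUX_ERRNO = {103: "ECONNABORTED", 107: "ENOTCONN"}

NFSD_ERRNO_RE = re.compile(r"nfsd: non-standard errno: -(\d+)")


def summarize(lines):
    """Return (kind, errno, name) tuples for the interesting log lines."""
    events = []
    for line in lines:
        match = NFSD_ERRNO_RE.search(line)
        if match:
            code = int(match.group(1))
            events.append(("nfsd_errno", code, LINUX_ERRNO.get(code, "unknown")))
        elif "Transport endpoint is not connected" in line:
            # This message text is the strerror() string for ENOTCONN.
            events.append(("fuse_disconnected", 107, "ENOTCONN"))
    return events


# Two lines copied from the syslog snippet in this post.
sample = [
    "May 22 00:06:25 omv kernel: [99245.846523] nfsd: non-standard errno: -103",
    "May 22 00:06:32 omv collectd[1236]: "
    "statvfs(/srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764) failed: "
    "Transport endpoint is not connected",
]

for event in summarize(sample):
    print(event)
```

    Running something like this over the whole of /var/log/syslog would show whether the nfsd warning always precedes the disconnect messages or the other way round.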


    I noticed a similar problem reported on the Unraid forums: https://forums.unraid.net/bug-…60-nfs-kernel-crash-r199/


    I don't know if it's the kernel/NFS, FUSE or mergerfs at fault here.


    How can I debug this? mergerfs is about the only thing built into OMV that I can find that suits my needs, and I HAVE to be able to export it over NFS, since NFS offers the best performance for my Kodi clients (and for other Linux software/services elsewhere on my network that read from and write to the pool).
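
    One small thing that might help narrow down when the pool drops is to poll the mount point the same way collectd does in the log, with statvfs, and record the moment it starts failing with ENOTCONN. This is only a sketch under that assumption; `fuse_mount_state` is a made-up helper name, and the path is the pool mount point from this post.

```python
import errno
import os


def fuse_mount_state(path):
    """Classify a mount point as 'ok', 'disconnected', or another error."""
    try:
        os.statvfs(path)
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            # "Transport endpoint is not connected": the FUSE daemon behind
            # this mount is no longer answering.
            return "disconnected"
        return "error: " + os.strerror(exc.errno)
    return "ok"


# Mount point of the mergerfs pool in this post; adjust for other systems.
POOL = "/srv/ecaf2ef9-aa68-47d9-99ad-ac21d64d8764"
print(POOL, "->", fuse_mount_state(POOL))
```

    Run from cron or a loop with timestamps, this could be lined up against the kernel trace to see whether the mount disappears before or after the nfsd warning.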

  • For anyone interested: I have been running the same setup as above for a good while now, but with these versions:


    kernel 4.19.0-0.bpo.5-amd64
    mergerfs version: 2.27.1
    FUSE library version: 2.9.7-mergerfs_2.27.0


    ...I haven't experienced this issue.


    Thanks trapexit!
