Btrfs, mergerfs, snapraid and shared folders

  • Hi Folks,


    I recently lost quite a bit of data (my fault, and thankfully unimportant files) and snapraid wasn't able to recover much of it as the loss spanned multiple drives.

    This got me thinking about changing my setup and I've been reading about zfs, btrfs and many combinations of options available.


    For me the following requirements exist:

    - recovery from disk failure

    - recovery from accidental deletion (bar the cross-disk issue, which won't be possible)

    - drive pooling

    - different disk sizes


    My initial investigation has been into:

    • ZFS pools
      • I have disks of different sizes, but enough to create multiple vdevs in a single pool, which covers three of the above; however, I have no undelete option (unless I take regular snapshots)
    • btrfs raid
      • I can use mixed disk sizes, but RAID 5/6 isn't stable and, again, undelete isn't there (unless I take regular snapshots)
    • btrfs + mergerfs + snapraid
      • This one popped up in a self-host wiki page ( https://wiki.selfhosted.show/tools/snapraid-btrfs/ ) and seemed like an interesting idea, as my biggest bug with Snapraid has been its sync, which requires everything else to stand still until it's complete; the option to sync from a snapshot whilst other apps do their thing is very useful.
      • This option fulfils every requirement, except the guide isn't OMV-specific and the way it creates/mounts the btrfs disks doesn't align with how OMV now creates shared folders as subvols.


    So I decided to have a play with the latest OMV and btrfs filesystems and shared folders + mergerfs + snapraid.


    The setup:

    Filesystems

    - 3 disks as btrfs (data)

    - 1 disk as ext (parity)



    Shared Folders

    - 3x data on the 3 btrfs disks

    - btrfs(1,2,3)

    - 3x content on the same btrfs disks

    - content



    This aligns with the snapraid-btrfs requirement to have a top-level data subvol and a content subvol, but it no longer aligns with the guide: OMV mounts the disks as whole filesystems rather than as subvolumes the way the guide shows, so I had to diverge from the guide.


    To use mergerFS with this setup I have chosen to use absolute paths to the data subvols which are created by OMV:
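    In practice that means a branch list along these lines, e.g. as an /etc/fstab entry (the UUIDs, subvol names and pool mount point are placeholders, and the option set is just a typical example rather than exactly what I used):

    Code
    /srv/dev-disk-by-uuid-AAAA/btrfs1:/srv/dev-disk-by-uuid-BBBB/btrfs2:/srv/dev-disk-by-uuid-CCCC/btrfs3 /srv/mergerfs/pool fuse.mergerfs defaults,allow_other,cache.files=off,category.create=mfs,minfreespace=20G 0 0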


    Finally, I have set up the SnapRAID config similar to the guide but NOT via the OMV plugin (that isn't possible), so I've done it via the CLI (you need to install snapraid outside OMV as well):
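    For illustration, the resulting /etc/snapraid.conf ends up shaped roughly like this (the UUIDs are placeholders and the exact paths/names will differ on your system):

    Code
    # parity on the ext disk
    parity /srv/dev-disk-by-uuid-PARITY/snapraid.parity
    # content files on the OS drive plus the per-disk "content" shared folders
    content /var/snapraid/snapraid.content
    content /srv/dev-disk-by-uuid-AAAA/content/snapraid.content
    content /srv/dev-disk-by-uuid-BBBB/content/snapraid.content
    content /srv/dev-disk-by-uuid-CCCC/content/snapraid.content
    # the data subvols created by OMV as shared folders
    data d1 /srv/dev-disk-by-uuid-AAAA/btrfs1/
    data d2 /srv/dev-disk-by-uuid-BBBB/btrfs2/
    data d3 /srv/dev-disk-by-uuid-CCCC/btrfs3/
    exclude *.unrecoverable
    exclude lost+found/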


    From here I was able to create files in the mergerFS directory and see them populate the btrfs mounts as you would expect in any mergerFS setup.


    Shared Folders can now be created on top of the mergerFS filesystem as normal, which maintains the single data subvol per disk (as required by snapraid-btrfs) while still giving some flexibility in the shared folders created and the ability to use those for other plugins.


    There is also the option to create other shared folders on the btrfs disks, but these will NOT be covered by snapraid, so you need to snapshot them via OMV and back them up in another way (or use multiple snapraid configs and run the snapraid commands at different times specifying the config required, as sketched below.. untested!).
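    For reference, snapraid's -c/--conf switch is what would make the multi-config idea work; you'd just point each run at the relevant file (the filename here is made up):

    Code
    snapraid -c /etc/snapraid-extra.conf sync
    snapraid -c /etc/snapraid-extra.conf scrub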


    This is merely an exercise to see if it all works, and it does, but I wanted to see if anyone has any thoughts on this: is there an easier way, a smarter way, or are there any pitfalls I've missed, etc.? :)


    This setup doesn't give me a fix for deletion across many disks, but I don't think anything mentioned here is going to do that, so I'm going to look at my folder structure to limit the blast radius if such things happen again.


    CH

  • I suspect people reading this might think there's too much going on outside the normal OMV framework. What's the snapshot system going to be? Is that also outside of the default OMV framework?


    It's an interesting config which kind of promises a "have your cake and eat it" snapraid setup, getting closer to real-time RAID. Personally I'd bite the bullet and go for zfs with regular snapshots. A (poor?) second choice for me is BTRFS RAID1; it's such a pity BTRFS RAID5/6 parity raid is not stable. In OMV6/7 BTRFS RAID does have the advantage of being very simple to set up.


    But I suspect this adaptation of snapraid-btrfs is not going to appeal to the average OMV user or convince them to use BTRFS. Is it possible that the fact that no one else has left a comment on your post in six days is indicative of that?

  • It was primarily a proof of concept and is definitely not something I would recommend to the everyday user, as it requires a reasonable level of confidence with the CLI, Linux and the tools themselves.


    Any snapshots are either automated via snapper/snapraid-btrfs or manual via the CLI. You could also snapshot the shared folder created as a subvol, but that's going to be the whole disk, not the small portions that shared folders would typically be used for.

    Ideally, if btrfs could sort out RAID 5/6 stability and that were then included in OMV, it would remove the need for snapraid and mergerfs altogether, which would be great; but for now this was simply an exercise to teach myself a few things and show what's possible.


    I only recently noticed the addition of btrfs to OMV and then spotted that guide whilst researching options. It works, which is what's important, and hopefully some day it could be entirely GUI-based, or btrfs RAID5/6 becomes the standard.


    CH

    • Official Post

    ZFS pools


    I have disks of different sizes but enough to create multiple vdevs into a single pool which covers 3 of the above but I have no undelete option (unless I take regular snapshots)

    Regular snapshots are not a big deal because they can be fully automated. I've been using zfs-auto-snapshot for years. This -> doc is a walk-through for zfs-auto-snapshot setup AND for recovering deleted files. I've done more than a couple of file / folder recoveries without a single issue.
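    For anyone who doesn't want to open the doc just yet, the recovery itself is basically a copy out of the hidden .zfs/snapshot directory, along these lines (the dataset, snapshot and file names are examples only):

    Code
    zfs list -t snapshot -r tank/documents
    cp /tank/documents/.zfs/snapshot/zfs-auto-snap_daily-2024-01-01-0000/report.odt /tank/documents/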

  • ..... I only recently noticed the addition of btrfs to OMV and then spotted that guide whilst researching options. It works, which is what's important, and hopefully some day it could be entirely GUI-based, or btrfs RAID5/6 becomes the standard.


    CH

    Not available via the WebUI, but I guess there's always BTRFS RAID1C3. Somehow I think you could wait forever for BTRFS RAID5/6 to become stable.

    • Official Post

    Somehow I think you could wait forever for BTRFS RAID5/6 to become stable.

    I have to agree with this. I've been watching BTRFS for around 10 years, with no progress on the RAID 5/6 write hole issue. With that said, as it was with traditional RAID5/6, a UPS adds some protection.

  • So with ZFS I no longer need SnapRAID or mergerfs, correct? The ZFS pool is a single mount like mergerfs would have created?


    With the addition of the auto snapshot I get a regular snapshot of the pool/filesystems that I can roll back, but can I recover single files from it without rolling back?

    Are the snapshots similar to btrfs in that they take no real space overall? And for the filesystems in that wiki, are those created in the OMV GUI?


    My concerns with ZFS are:

    - Management: I've seen a lot of stuff about config, tuning etc. that may not be necessary but can be off-putting when starting out.

    - The risk of zpool loss: obviously that shouldn't be allowed to happen and is based on decisions made by me, but it does concern me. Losing a disk is inevitable, but at least with mergerfs/snapraid the other disks are still just disks with the remaining data easily accessible.


    I'll be losing overall capacity with the following setup compared to mergerfs/snapraid:

    - RAIDZ2 vdev 5*6TB disks

    - RAIDZ2 vdev 4*4TB disks


    In that model I get 26TB usable overall, versus the snapraid/mergerfs layout where it would be 2*4TB in dual split parity and 5*6TB for storage, so 30TB overall (minus the typical loss of actual available space on the disks in both cases).


    I'll have to test the zfs setup and get my head around it more.

    I have not yet looked much into things like enabling compression and other settings, so is there anything I should focus on?


    CH

  • chaosprime In short, the answers to your questions are:


    Yes, set at zpool creation time or later.

    Yes, from a hidden (the default setting) snapshot directory.

    Yes, until the parent starts to deviate from the snapshot.


    Re: concerns:


    One practical aspect in OMV is to follow the recommendation of installing the PVE kernel before installing the zfs plugin. This avoids relying on the Debian DKMS build for zfs and, in the case of the 6.5 PVE kernel series, gets you the latest OpenZFS version, 2.2.2. Beware that the zfs plugin is a 3rd-party plugin that is now maintained by ryecoarron, and few if any new features can be expected. You will need to work out just what zpool/zfs commands the plugin supports, what can be safely done at the CLI, what is cumbersome to achieve via the webUI, and when it's quicker to use the CLI.
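    As a quick sanity check, the kernel module and userland versions actually in use can be confirmed with:

    Code
    zfs version
    modinfo zfs | grep -iw version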


    A combo of snapshots, scrubs, SMART monitoring and regular backups will need to be used - that's no different to a BTRFS RAID pool. A RAIDZ2 remains operational with up to two disk failures. Unlike BTRFS, a RAIDZ2 pool will mount normally in a degraded state. Like BTRFS, replacing a failed disk in a zpool is a CLI activity.


    Think about what properties you want set on the pool when it is created, as they will then be inherited by all datasets, e.g. compression, atime, relatime, mountpoint etc. You can create the pool at the CLI, and it may be better to do so.
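    A minimal sketch of what that might look like at the CLI, assuming a 5-disk RAIDZ2 and 4K-sector drives (the pool name, mount point, ashift value and disk IDs are placeholders to substitute for your own):

    Code
    zpool create -o ashift=12 \
      -O compression=lz4 -O atime=off -O mountpoint=/srv/tank \
      tank raidz2 \
      /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 \
      /dev/disk/by-id/ata-DISK4 /dev/disk/by-id/ata-DISK5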


    OMV is predicated on all native Linux filesystems supporting POSIX.1e ACLs, and ext4, BTRFS and XFS are mounted with ACL and xattr. In contrast, OpenZFS defaults to acltype=off. You will have to think about the consequences of this and whether it's really necessary to use "acltype=posix", which is normally paired with "xattr=sa" (i.e. store ACL info in the "system" xattr namespace). There are two other related dataset properties, "aclmode" and "aclinherit", which govern the behaviour of ACL inheritance in a dataset's files/folders and any child datasets, and determine how chmod and ACLs interact. These can become important when using network protocols to share data held on a zpool. But the general recommendation in OMV is to avoid ACLs.


    The zfs filesystem was never designed for outright speed, hence all the stuff about tuning it to a given workload and using additional vdev types like cache, log and special devices to squeeze the best possible performance out of a given zpool layout. A raidz2 pool is a good compromise for sequential reads/writes, but not so good for highly random I/O. Good for media streaming apps, not so good for VMs, databases and such.


    I'm sure you'll read a lot about ECC vs non-ECC memory and how much memory you need or how memory-hungry zfs is. Much of that is urban myth or just misunderstanding.

  • My concerns with ZFS are:

    1. - Management: I've seen a lot of stuff about config, tuning etc. that may not be necessary but can be off-putting when starting out.

    2. - The risk of zpool loss: obviously that shouldn't be allowed to happen and is based on decisions made by me, but it does concern me. Losing a disk is inevitable, but at least with mergerfs/snapraid the other disks are still just disks with the remaining data easily accessible.

    1. Unless you want to squeeze out the last *Nth* of performance, tuning is really not necessary. A single drive (or a zmirror) can saturate a 1Gb network connection. RAIDZ2 can easily saturate a 2.5Gb network connection. If you add an SSD for a ZIL, or an L2ARC, performance is even better, not that I would do either for a home server. It's simply not needed.

    The only thing I would recommend, after creating a pool, would be to run the following on the CLI before moving data into a ZFS array. This is not "tuning", but it will enable Linux-style ACLs and compression.
    (Substitute your pool name for ZFS1)


    zfs set aclmode=passthrough ZFS1

    zfs set aclinherit=passthrough ZFS1

    zfs set acltype=posixacl ZFS1

    zfs set xattr=sa ZFS1

    zfs set compression=lz4 ZFS1

    zpool set autoreplace=on ZFS1 #(optional - replace a failed drive automatically, when a blank drive is physically swapped in.)
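    If you want to confirm the properties took effect afterwards, something like this will show them (same placeholder pool name):

    Code
    zfs get acltype,xattr,aclmode,aclinherit,compression ZFS1
    zpool get autoreplace ZFS1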


    2. I'm running zmirrors on 2 servers and RAIDZ on a 3rd server. So far, I've swapped out two drives that were part of zmirrors without issue. But realize, on a long enough timeline, data disasters are inevitable. Nothing is perfect, so you should have 100% backup. In my case, the second zmirror is a backup server for the first. To provide the security needed to prevent data loss, backup is a must. (Even server-grade hardware can fail in a way where all hard drive data is lost. All it takes is a bad stick of RAM.)

    I'll be losing overall capacity with the following setup compared to mergerfs/snapraid:

    - RAIDZ2 vdev 5*6 TB disks

    - RAIDZ2 vdev 4*4TB disks

    I really think that worrying about total "disk space" is the wrong approach. My concern would lie with how long it would take to collect and store data in the 26 to 30TB range, what it would take to preserve it AND how long it would take to restore it. Hours, days, longer? When looking at it this way, a couple of extra drives is cheap insurance.

    I'm running mergerfs and snapraid on an R-PI4. Why? MergerFS and SnapRAID work very well with USB-connected drives. There's the "RAID-like" capability to restore an array member drive, a single mount point, and some data integrity abilities ("bit-rot protection", but that requires hands-on work and a good understanding of SnapRAID). To get the most from MergerFS & SnapRAID, the user is required to know some of the details of each package and have some understanding of how the two can be made to work together. With that stated, there's really no comparison between ZFS and MergerFS with SnapRAID.

    As an example: MergerFS and SnapRAID will give you 1 layer of file "undelete" protection. If the "sync" command is run, which is often run automatically with what is called a "diff script", the deleted file is permanently gone. You'd have to retrieve the file from backup (assuming that backup captured it).
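    (For what it's worth, a bare-bones sketch of that kind of pre-sync deletion check might look like the following. The threshold and the parsing of diff's output are assumptions to verify against your snapraid version, not a tested script.)

    Code
    #!/bin/sh
    # Skip the sync if too many deletions are pending (threshold is an example value).
    THRESHOLD=50
    DELETED=$(snapraid diff | grep -c '^remove ')   # assumes "remove <file>" lines in the diff output
    if [ "$DELETED" -lt "$THRESHOLD" ]; then
        snapraid sync
    else
        echo "$DELETED deletions pending - review before running 'snapraid sync'"
    fi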

    On the other hand, ZFS snapshots can be set up to save the states of a filesystem, and all the files and folders within it, at varying intervals, for up to a year (or even longer if customized). That means that many of the previous saved states of a file (a word processing doc, for example) can be recalled and, if needed, the evolutionary differences of the file could be examined and a selected version could be restored. Snapshots also make data impervious to tampering by malware, encrypting viruses, etc. (Simply go back in time and restore untainted data, from before the attack.) Finally, with a "scrub", ZFS automatically restores files that may have bit-rot. In some cases, where checksum errors are noted in a scrub, ZFS may be able to indicate that a drive is beginning to go south before SMART detects it.

    I could go on with the relative differences but, in my opinion, I'd put it like this:
    - MergerFS & SnapRAID are great for the USB-connected drives that are typical with SBCs. (Traditional RAID or RAID variants on an SBC are a no-no.) The feature set, provided by the two packages, is outstanding. They're not just for SBCs; many amd64 users are using MergerFS and SnapRAID.
    - For PCs with SATA or SAS ports, server hardware, etc., ZFS is a very good choice. It's professionally developed and maintained by organizations with DEEP pockets. It will be around and well supported for some time to come.

    ____________________________________________

    On using ZFS with the Debian backports kernel versus the Proxmox kernel:
    (Without talking about the longer build time associated with Debian kernels)


    ZFS works fine with either kernel, "but" the Debian backports kernel is upgraded from time to time, with upgrades pushed out to users. The problem is, the ZOL project (ZFS On Linux) does not necessarily keep up with the latest and greatest Debian kernel upgrades. So, it's possible to upgrade the kernel (via OMV's software updates) and lose contact with the installed pool. (This has happened to me.)
    The fix is to revert to the old kernel. The long-term fix is to disable backports Debian kernels, but even that's not fully bulletproof.

    The best solution is to use the Proxmox kernel because each kernel upgrade will already have ZFS modules built into the kernel and ZFS utilities will be in the Proxmox userland. There's no guesswork. Adding to that, the Proxmox kernel is "thoroughly tested", before it's deployed, for use in commercial servers. (It's safe and very stable.) Note that there are slight differences between Debian and Proxmox userlands. The Proxmox userland is a bit older and more on the conservative side, but I haven't run into any package issues with what I have installed.

    On the memory issue:
    As Krisbee has said, the memory-hog nonsense is a myth. If dedup is off (that's the default), there's nothing to worry about. I have a backup server that's running a zmirror, with an Intel Atom CPU and 4GB of RAM, and it's doing fine. Will ZFS use a LOT of available RAM? Yes it will. ZFS might set up a BIG file cache if you're running (for example) an rsync copy. However, if other processes need RAM, ZFS will give it back. Rather than letting RAM sit idle and unused, this is efficient management of RAM.
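    (If someone really does want to cap the ARC anyway, it's a one-line module option; the 2GiB figure below is only an example value.)

    Code
    # /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_max=2147483648   # cap the ARC at 2 GiB (value in bytes); then run update-initramfs -u and reboot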

  • I am running an almost identical configuration to what you quoted, having just (in the last week) switched from the omv snapraid script to the snapraid-btrfs python driver script used on that page. I currently have 4 data drives and 2 parity drives, and all drives have a /merged subdirectory that is connected with mergerfs. Most of my shares are subdirectories of merged.


    My basic snapraid / mergerfs / share configuration has not changed from what I configured in OMV 5.


    It seems to me that this configuration should be supportable in the snapraid addon or in a second, alternate snapraid addon. Any thoughts on that from existing developers? It is possible that I could be convinced to take this on if the consensus is that it is a good idea and would be accepted into omv-extras.

  • I went through an exercise of replacing a failed disk in the btrfs setup and it went perfectly: snapraid restored all the data and mergerfs was easily updated to point to the new disk. I was very pleased with it.
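    For reference, the recovery side of that boils down to a filtered fix against the replaced disk, roughly like this (d1 being whatever the snapraid config names the disk, and the log path is just an example):

    Code
    snapraid -d d1 -l /var/log/snapraid-fix.log fix
    snapraid -d d1 check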


    In terms of the ZFS setup, I've tested that as well now, or at least the initial setup, which was very smooth through the GUI.
    Adding multiple vdevs was easy and I liked the snapshot restoration option, but I have yet to test single-file restores from it.

    Replacing a degraded disk was also stupidly easy, so overall it was a nice experience.
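    (For reference, my understanding is the CLI equivalent would be along these lines, with placeholder pool and disk names:)

    Code
    zpool replace tank /dev/disk/by-id/ata-OLDDISK /dev/disk/by-id/ata-NEWDISK
    zpool status tank   # watch the resilver progress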


    I did notice that the version of ZFS is 2.1.14; does the version installed by the plugin ever change?
    Asking because I of course would like the 2.3 update when it comes with the new vdev expansion feature.

    • Official Post

    I liked the snapshot restoration option but have yet to test single-file restores from it.

    Restoration of files or folders, from a snapshot, is in the -> zfs-auto-snapshot doc.
    (Along with the processes for setting up automatic, self rotating and self purging snapshots.)

    I did notice that the version of ZFS is 2.1.14; does the version installed by the plugin ever change?

    Yes it does. Generally, updates come with (1) a kernel update and (2) package updates.

    Asking because I of course would like the 2.3 update when it comes with the new vdev expansion feature.

    I don't chase the latest package versions that are outside of the default userland of a kernel. (That may require pinning repos and other high-maintenance tasks.) This policy keeps ongoing server maintenance "drama" to a minimum. (I had a multi-repo "Frankenstein" server at one time. It wasn't fun.)

    Version 2.3 will come when the kernel/userland you're using supports it. You may be prompted to upgrade your pool in the process. That's easy enough. It's a single command.
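    Something along the lines of this (placeholder pool name):

    Code
    zpool upgrade tank   # enables the newer feature flags on an existing pool
    zpool status tank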

    • Official Post

    chaosprime

    Just a few quick notes:
    - When you set up a pool, run the command lines from the post above.
    - Then create individual "filesystems" under the parent pool. Ideally, they'll contain specific data types. Music, Documents, Videos, etc.
    Creating a ZFS filesystem is far better than creating a regular Linux folder at the root of the pool. A filesystem has editable properties AND each filesystem can have its own customizable snapshots (see the sketch after this list).
    - These filesystems will be the "source" of your Shared Folders. They'll appear as if they're partitions of a hard drive so, once one is selected, the relative path will be / (a single slash).
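    A rough sketch of that layout, reusing the placeholder pool name from the earlier commands:

    Code
    zfs create ZFS1/Music
    zfs create ZFS1/Documents
    zfs create ZFS1/Videos
    zfs list -r ZFS1   # each filesystem can now be picked as a Shared Folder source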

    I did notice that the version of ZFS is 2.1.14; does the version installed by the plugin ever change?
    Asking because I of course would like the 2.3 update when it comes with the new vdev expansion feature.


    OMV6 using the pve kernel gives you this:


    See here, and I don't expect either zfs 2.2 or 2.3 to hit bullseye.

    Index of /debian/pve/dists/bullseye/pve-no-subscription/binary-amd64/


    OMV7 with pve kernel 6.5 gets you zfs 2.2.2, and maybe zfs 2.3 in the future.

  • Amazing, thank you guys.

    That's removed some of the fear of ZFS and I think I'm going to go down the ZFS route for my disk update.


    The tricky part is now creating enough space in a ZFS pool to hold my current data so I can migrate the remaining disks over to a second vdev :S

    • Official Post

    The tricky part is now creating enough space in a ZFS pool to hold my current data so I can migrate the remaining disks over to a second vdev

    After this migration, you really (really) should give some thought to full backup. While this is beginner guidance, backup concepts and ideas can be found -> here.

  • For a ZFS pool, the plugin appears to set ashift to 0 if the user doesn't set it at creation time.


    Code
    tmpZFS ashift 0 default


    Should this actually be something like 12 from what I've read?


    Also, out of curiosity, I have 2*1TB NVMes that I planned to mirror and use for things like short-term storage, docker volumes (some containing postgres/redis/sqlite) and then some duplicates of data that is also on the main long-term pool.


    Is ZFS viable for this use case and are those same settings above applicable?
    I've read a few things that mention reduced life on SSD/NVMe with ZFS, but it seems scattered, so I'm not sure if that's old info still floating around?


    • Official Post

    Should this actually be something like 12 from what I've read?

    I went with the default but I'm using spinning hard drives. (I can't speak to settings for SSDs or NVMe drives.) For my use case, which is largely static files, performance has been fine. Again, given that I'm using a 1Gb network, I didn't try to tune up ZFS.

    Also out of curiosity, I have 2* 1TB NVMes that I planned to mirror and use for things like short term storage, docker volumes (some containing postgres/redis/sqlite) and then some duplicates of data that is also on the main long term pool.

    To my way of thinking, the reason to set up a zmirror is for mostly static storage and automatic self-healing of files when scrubbed. For hosting SQL DBs and Docker (which uses a type of overlayfs), I'd go with a utility drive and use EXT4.

    The reasons:
    - SQL can "chatter" a lot. SQL keeps DBs open, which may create what appears to be a ton of discrete file changes that would be captured in snapshots. This can be offset by not using snapshots.

    - Docker's version of overlayfs (some years ago) used to appear as a ZFS "legacy filesystem" that couldn't be deleted. There's supposed to be a Docker driver for ZFS, to get past this issue, but I've never used it. (*Note that this information may be dated - I haven't followed up on this issue.*)

    For these reasons and more, I use a discrete utility drive formatted to EXT4 for these purposes. If you want to back the drive up, you could rsync the utility volume, after hours when SQL is quiet, to another drive.
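    (A minimal example of that kind of after-hours copy, with placeholder paths:)

    Code
    rsync -aHAX --delete /srv/utility/ /srv/backupdisk/utility/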


    I've read a few things that mention reduced life on SSD/NVMe with ZFS but it seems scattered so not sure if that's old info still floating around?

    I use SSDs in desktop clients for good local performance, fast app loading, etc. Again, I have no experience with SSDs or NVMe in a NAS server, for one simple reason: the performance bottleneck, with 1Gb Ethernet, is the network itself. A single spinning hard drive can easily saturate a 1Gb connection. Therefore, the performance gains of SSDs or NVMe will not be fully realized when accessing a NAS server remotely.

    With that said, given how wear leveling works on solid-state drives, I can see where an SQL app may reduce SSD life because it's "chatty" and SQL DBs can be quite large (depending on the application and use case). The net result of constantly rewriting a DB, combined with capturing snapshots of a DB's past states, has the potential to wear out an SSD.

    FYI: A quick explanation of -> Wear Leveling.

  • chaosprime


    Personally, I wouldn't create a zpool via the plugin and certainly would not use the ashift default value of 0. For HDDs, ashift=9 (512-byte sectors) or ashift=12 (4096-byte sectors); for SSDs, ashift=12 or ashift=13 (needs benchmarking); and NVMe needs benchmarking to select the best ashift. ( See for example: https://www.reddit.com/r/zfs/c…_on_nvme_drives_with_zfs/ )
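    As a starting point, the sector sizes the drives report (which inform the ashift choice) can be checked with, for example:

    Code
    lsblk -o NAME,MODEL,LOG-SEC,PHY-SEC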


    As the pool's ashift is immutable and fundamentally important for performance, you need to get this right the first time. As mentioned in a previous post above, I'd use the CLI to create the zpool with all the settings you need, which the plugin will then pick up.


    The (premature) wear of SSD/NVMe is a fact of life when using a copy-on-write filesystem, due to the degree of "write amplification" - the read, modify, write cycle involved when making small file changes, which requires whole blocks to be read in order to modify a small part somewhere within them, calculate a new checksum, and then write the new block. Anything that writes to the pool, and in particular generates sync writes (a kind of double write), will cause wear. This is where choosing the best recordsize for a dataset becomes very important. The default of 128K is not a good choice for a DB that's using 16K or 64K pages. Picking the wrong recordsize will impact both performance and wear.


    People complain of premature wear of SSD/NVMe when using low-endurance consumer-grade products in enterprise settings and/or are simply unaware of how much write amplification their use is generating. It could easily be 7x or higher for applications such as VMs or DBs.


    For your intended use, docker can generate a multitude of snapshots due to its layered filesystem. Supposedly, zfs 2.2.2 has improved on that, but I've not investigated it. Just create datasets with different properties - recordsize, sync=standard or sync=always, etc. - to segregate data by use.
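    A sketch of that kind of segregation (the dataset names and values are only examples to adapt):

    Code
    zfs create -o recordsize=16K -o sync=always tank/db    # e.g. postgres/sqlite volumes
    zfs create -o recordsize=1M tank/media                 # large, mostly sequential files
    zfs get recordsize,sync tank/db tank/media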


    Further Reading, should you want it:


    Hardware — OpenZFS documentation

    Misadventures with ZFS + SSDs (both SATA and NVMe) (www.truenas.com)

    ZFS NVMe performances questions (forums.servethehome.com)
