RAID5 over LVMs for a proper BTRFS handling?

  • Hello,


    First time on this forum, new to OMV but quite used to Linux and servers in general, I decided to give OMV a try on a new server array.
    I am a long-time Synology user, and because they will have to comply with the upcoming "legal Australian backdoor", I preferred to move to an OS that doesn't have anything to sell! This way there is no reason for OMV to comply with these crazy laws. And let's face it, I like new challenges!


    I figure Synology RAID has been using BTRFS for quite some time in their production machines, and they advertise it heavily. I had absolutely no problems with the RAID5-type arrays I've been stress-testing over the past two years: no hiccups, and the monthly integrity reports and build checks showed no errors at all. And anyway, I don't think Synology would advertise it this way without some heavy testing to back up their claims.
    However, the BTRFS kernel project lists RAID5 and 6 as unstable, leaving their implementation to software integrators...


    This contradiction made me wonder why there is such a mismatch in the perceived maturity of this filesystem.
    I started by trying to understand how Synology RAID differs from classic RAID. Then I heard on the Level1 forum that Synology RAID sits on top of LVM!!! I came to the conclusion that Synology configures an LVM volume group on each disk right away, then uses the logical volume from that volume group to build the array.


    So, why not try this approach with OMV?!?
    I have a new system, nothing stored on it yet... Let's do it.
    Last detail, THIS IS A TEST, it is not meant to be used on a production machine. Don't do it!



    Enough introduction. The newly assembled system is built around a C236 Intel chipset with a low-power E3-1260L-V5 processor and 16GB of UDIMM-ECC. There is a RAID controller embedded in the system, but obviously, I won't use it here.
    One major constraint I have is the migration of the data: I have to migrate from a Synology RAID array with a limited number of hard drives. So I figured I would create a degraded RAID5 array. As a side note, / is on a separate SSD.


    To set this up I had to install OMV (V4.1.21), then install the LVM plugin (omv-lvm2 V4.0.7-1 at the time of writing) from the main repository.
    Then I used the following two links to help me with the setup and the mdadm syntax:
    http://blog.mycroes.nl/2009/02…ingle-disk-to-3-disk.html
    https://translate.google.com/t…raid-1-un-raid-5-sin.html


    Creating the LVM for each disk is easily done through the webUI. Very simple: take one disk, for instance /dev/sda, and create a sda-LVM volume group, then a HDD0-LVM logical volume. This way it matches the numbering I usually put on the physical hard drives, a good thing to avoid touching the fan.
    BTW, I am soooo pleased with this project. OMV rocks! Seriously, a new era of open-source-self-hosting is upon us. Thanks to OpenMediaVault, its active community and Volker!


    After that, I had to identify the volumes created using mdadm:


    The two logical volumes created are /dev/mapper/sd$--LVM-HDD#--LVM
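
For readers who prefer the command line over the webUI, here is a minimal sketch of the per-disk LVM step and of how the /dev/mapper names above come about. It assumes the plugin does the equivalent of pvcreate/vgcreate/lvcreate on the whole disk (not confirmed in this thread) and uses the sda-LVM / HDD0-LVM names from the post. The `run`, `make_disk_lv` and `mapper_path` helpers are hypothetical names made up for illustration, and by default the script only prints the commands.

```shell
#!/bin/sh
# Dry-run helper: prints each command instead of executing it.
# Set DRY_RUN=0 (as root, on disks you can afford to wipe) to really run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One volume group and one full-size logical volume per disk.
# Assumption: this mirrors what the OMV LVM plugin does through the webUI.
make_disk_lv() {
    disk=$1 vg=$2 lv=$3
    run pvcreate "$disk"
    run vgcreate "$vg" "$disk"
    run lvcreate -l 100%FREE -n "$lv" "$vg"
}

# device-mapper joins VG and LV names with "-" and doubles every hyphen
# that is part of a name, which explains the "--" in the paths above.
mapper_path() {
    vg=$(printf '%s' "$1" | sed 's/-/--/g')
    lv=$(printf '%s' "$2" | sed 's/-/--/g')
    printf '/dev/mapper/%s-%s\n' "$vg" "$lv"
}

make_disk_lv /dev/sda sda-LVM HDD0-LVM
mapper_path sda-LVM HDD0-LVM
```

Calling `mapper_path sdb-LVM HDD1-LVM` the same way gives the second member path used in the mdadm command.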
    Let's go ahead and create a new array. BTW, I tried to make a RAID6, but it failed, asking for a minimum of 4 disks.
    I could have forced it somehow maybe, but I figured it is not complicated to migrate a RAID5 to a RAID6.
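
As a rough idea of what such a later RAID5 to RAID6 migration could look like, here is a hedged sketch. It was not run on this setup. The extra member paths (sdc/sdd, HDD2/HDD3), the backup file location and the `grow_to_raid6` helper are hypothetical, and whether mdadm accepts this reshape in one step from a 2-device RAID5 is an assumption to verify on test disks first. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper: prints each command instead of executing it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Add the new members as spares, then reshape to RAID6.
# $1 = md device, $2 = total member count after the reshape,
# $3 = backup file (must live outside the array), rest = new members.
grow_to_raid6() {
    md=$1 total=$2 backup=$3
    shift 3
    run mdadm "$md" --add "$@"
    run mdadm --grow "$md" --level=6 --raid-devices="$total" --backup-file="$backup"
}

# Hypothetical example: two more LVM-backed disks, backup file on the system SSD.
grow_to_raid6 /dev/md0 4 /root/md0-reshape.backup \
    /dev/mapper/sdc--LVM-HDD2--LVM /dev/mapper/sdd--LVM-HDD3--LVM
```

The reshape progress would then show up in /proc/mdstat, just like the initial build.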
    Anyhow, this is the syntax and the output:

    Then I let it sit during the rebuilding/building of the array.
    Then I formatted the volume as BTRFS and created a shared folder.
    I am copying some data at the moment, then I will stress these two new disks a little. I am confident about them; they were tested a little beforehand.



    I hope this will help.
    Thanks for reading this! ;)
    Updates coming soon.
    Feel free to throw the hell out of my wrongdoing. :D

  • Thanks for your great answers!

    IIRC this is NOT recommended by the btrfs developers.
    The Raid5 problem with btrfs is the "write hole" and I think this also exists in mdadm.

    Here is the link that triggered my interest. Please take a look at it.
    https://btrfs.wiki.kernel.org/…agers_and_logical_volumes


    I am not trying to fix the "write hole" with this configuration, only attempting to improve the stability of the array during the everyday operation of a NAS-size server. There are so many posts on the forum about RAID arrays getting degraded, failing or becoming problematic that I thought I ought to propose an alternative construction, with the BTRFS filesystem managing LVM block devices. I understand your doubts about how pertinent this proposal is.
    Indeed, I should have mentioned how important it is to use a properly configured UPS for such a small server box, because power loss is inherently the main cause of the "write hole" issue.

    Also this is quite interesting:
    https://lwn.net/Articles/665299/
    An SSD as journal to fix the write hole is supported by mdadm.
    I don't remember, whether btrfs supports it, but I think it was mentioned in the GitHub discussion above.

    Thank you very much! Another step to improve the behavior of the array. The system SSD is way too big for the OMV install, so I can try to dedicate a partition of it to that purpose. Any idea of the implementation/syntax of that feature?
    ;)
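
On the syntax question, a minimal sketch of the mdadm side, assuming an mdadm version that ships the journal options (`--write-journal` at creation time, `--add-journal` for an existing array; check `man mdadm` on your install, older versions may lack them). The journal partition /dev/sde3 and the helper names are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper: prints each command instead of executing it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a RAID5 array with a write journal from the start.
# $1 = md device, $2 = journal device (SSD partition), rest = array members.
create_with_journal() {
    md=$1 journal=$2
    shift 2
    run mdadm --create "$md" --level=5 --raid-devices="$#" \
        --write-journal "$journal" "$@"
}

# Attach a journal to an array that already exists.
# Per the man page this only works while the array is read-only.
add_journal() {
    run mdadm "$1" --add-journal "$2"
}

# Hypothetical example: spare partition on the system SSD as journal.
add_journal /dev/md0 /dev/sde3
```

A journal on a single SSD partition becomes a single point of failure for the array, so this is only a starting point for testing.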

  • Quote

    Here is the link that triggered my interest. Please take a look at it

    I read it, but cannot see, what triggers your thoughts. I see no benefit of LVM beneath btrfs. Rather the contrary. A rebuild with pure btrfs will be way faster if the filesystem is not full, as only the used part is rebuilt.

  • I read it, but cannot see, what triggers your thoughts. I see no benefit of LVM beneath btrfs. Rather the contrary. A rebuild with pure btrfs will be way faster if the filesystem is not full, as only the used part is rebuilt.


    Here is the simple implementation that is suggested:

    • create a single volume group, with two logical volumes (LVs), each backed by separate devices
    • create a btrfs raid1 across the two LVs
    • create subvolumes within btrfs (e.g. for /home/user1, /home/user2, /home/media, /home/software)
    • in this case, any one subvolume could grow to use up all the space, leaving none for other subvolumes
    • however, it performs well and is convenient
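
The layout in the list above can be sketched roughly as follows. This is an illustration, not a tested recipe: the device names, the VG name vg0, the LV names lv_a/lv_b, the mount point and the `wiki_layout` helper are all hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper: prints each command instead of executing it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1, $2 = the two physical devices, $3 = volume group name, $4 = mount point
wiki_layout() {
    dev_a=$1 dev_b=$2 vg=$3 mnt=$4
    run pvcreate "$dev_a" "$dev_b"
    run vgcreate "$vg" "$dev_a" "$dev_b"
    # Pin each LV to one physical device so the two mirror halves
    # never end up on the same disk.
    run lvcreate -n lv_a -l 100%PVS "$vg" "$dev_a"
    run lvcreate -n lv_b -l 100%PVS "$vg" "$dev_b"
    # btrfs raid1 for both data and metadata across the two LVs.
    run mkfs.btrfs -d raid1 -m raid1 "/dev/$vg/lv_a" "/dev/$vg/lv_b"
    run mount "/dev/$vg/lv_a" "$mnt"
    # Subvolumes share the whole pool, so any one of them can fill it.
    for sub in user1 user2 media software; do
        run btrfs subvolume create "$mnt/$sub"
    done
}

wiki_layout /dev/sdX /dev/sdY vg0 /mnt/pool
```

Note that this layout has no mdadm layer at all: redundancy comes from btrfs raid1, and LVM only provides the block devices.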
  • what would be the benefit over pure btrfs?

    And more importantly what's the benefit of putting a degraded RAID5 mdraid layer in between?


    Code
    mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/mapper/sda--LVM-HDD0--LVM /dev/mapper/sdb--LVM-HDD1--LVM

    Adding tons of complexity should serve a purpose worth the efforts...

  • In terms of the SSD journal: be aware that a filesystem is a useless pile of data without its journal. If you need an external journal, make sure it is running on a RAID.
    I just recently had a datacenter losing its BeeGFS storage because of a broken journal that was running on a 20-drive SSD RAID5 configuration.

  • I was implicitly assuming a copy of the journal on the spinning disks.
    But that may defeat the purpose, at least partially:
    It will help against a loss of the SSD if no write hole happens at the same time.


    When having the SSDs as a Raid1, don't we have another write hole problem there?

  • In case of different data on the two drives, how do you determine the right data?


    If the parity (in RAID5) or the mirror copy (in RAID1) is not written correctly, it goes unnoticed until one of the array member disks fails. When a disk fails, you need to replace it and start a RAID rebuild. In that case one of the blocks would be recovered incorrectly.

  • The reason for write holes is that data gets written at different times. This will not happen on RAID1. The question remains what has to be done when the data differs between the two drives because of some disk error. This highly depends on the implementation. In general, a simple hardware controller or mdadm cannot decide which copy is correct. The easiest defence against this is a quorum: if you use 3 disks, 2 should agree. This can be handled by simple RAID controllers.
    There is also the possibility of checksums, as implemented in ZFS or Ceph. Ceph, for example, asks for quorum AND a correct checksum by default.
    A common way to go, if you only have two drives, is to actually check both disks by hand and decide which one is the good one, either by SMART values or by checking files or filesystem integrity. But there will be no scrub-to-death scenario as may happen with RAID 4/5/6.
    Of course, the only thing that may still happen is a total disk failure combined with wrong data that went undetected before. To avoid that you must run disk verifications regularly.
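
To make the quorum argument concrete, here is a toy sketch of a two-out-of-three vote. It is not how mdadm or any controller implements it. The three arguments stand for checksums of the same block read from three mirrors, and the values and the `majority` helper are made up for illustration.

```shell
#!/bin/sh
# Print the value held by at least two of the three copies.
# Return non-zero if all three differ (no quorum).
majority() {
    if [ "$1" = "$2" ] || [ "$1" = "$3" ]; then
        printf '%s\n' "$1"
    elif [ "$2" = "$3" ]; then
        printf '%s\n' "$2"
    else
        return 1
    fi
}

# Two copies agree, so the odd one out can be overwritten.
majority aa11 aa11 ff00

# All three differ: nothing can be decided automatically.
majority aa11 bb22 cc33 || echo "no quorum, manual inspection needed"
```

With only two copies there is nothing to vote on, which is why a mismatch on a two-disk mirror ends in the manual check described above unless the filesystem keeps its own checksums.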

  • With caching disabled, the drive has enough power in its capacitors to finish the writes given to the controller, and the writes should be synchronous enough in that respect if you use the same type of disks. Talking about SSDs here, of course.
    Of course it is as simple as this: every storage can blunder. That's what we have different depths of backup for. It's just a matter of probabilities, and the chance of getting write holes is way, way lower with RAID 1 than with RAID 4, 5 or 6.

  • And more importantly what's the benefit of putting a degraded RAID5 mdraid layer in between?

    Code
    mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/mapper/sda--LVM-HDD0--LVM /dev/mapper/sdb--LVM-HDD1--LVM

    Adding tons of complexity should serve a purpose worth the efforts...

    Hello,


    I added a degraded layer because I don't have all the disks to build the full array right away.


    The complexity is here to solve another fundamental problem with BTRFS: native management of logical volumes.
