RAID5 over LVM for proper BTRFS handling?

    • RAID5 over LVM for proper BTRFS handling?

      Hello,

      This is my first time on this forum. I am new to OMV but quite used to Linux and servers in general, and I decided to give OMV a try on a newly assembled server.
      I have been a long-time Synology user, but because they will have to comply with the upcoming "legal Australian backdoor", I preferred to move to an OS that doesn't have anything to sell! That way there is no reason for OMV to comply with these crazy laws. And let's face it, I like new challenges!

      I gather Synology has been using BTRFS in its production machines for quite some time, and they advertise it heavily. I had absolutely no problems with the RAID5-type arrays I have been stress-testing over the past two years: no hiccups, and the monthly integrity reports and build checks showed no errors at all. And anyway, I don't think Synology would advertise this way without heavy testing to back up their claims.
      However, the BTRFS kernel project lists RAID5 and RAID6 as unstable, leaving their implementation to software integrators...

      This contradiction made me wonder why there is such a mismatch in the perceived maturity of this filesystem.
      I started by looking into how Synology RAID differs from classic RAID. Then I learned on the Level1 forum that Synology RAID sits on top of LVM! I came to the conclusion that Synology configures an LVM volume group on each disk right away, then builds the array from the logical volumes it manages on them.

      So, why not try this approach with OMV?!
      I have a new system, nothing stored on it yet... Let's do it.
      Last detail, THIS IS A TEST, it is not meant to be used on a production machine. Don't do it!


      Enough for the introduction. The newly assembled system is built around an Intel C236 chipset with a low-power E3-1260L-V5 processor and 16 GB of ECC UDIMM. There is a RAID controller embedded in the system, but obviously I won't use it here.
      One major constraint concerns the data migration: I have to migrate from a Synology RAID array with a limited number of hard drives, so I figured I would create a degraded RAID5 array. As a side note, / is on a separate SSD.
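      For later reference, once the data has been migrated off the Synology and one of its disks is freed up, the missing member can be added to the md array created further down and the array grown; a rough sketch, assuming the freed disk ends up as a hypothetical third logical volume named /dev/mapper/sdd--LVM-HDD2--LVM (not tested yet):

      Source Code

      # add the freed disk's logical volume as a new member (hypothetical name)
      mdadm --add /dev/md0 /dev/mapper/sdd--LVM-HDD2--LVM
      # grow from 2 to 3 raid devices; mdadm reshapes in the background
      mdadm --grow /dev/md0 --raid-devices=3
      # watch the reshape progress
      cat /proc/mdstat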

      To set this up, I had to install OMV (V4.1.21), then install the LVM plugin (omv-lvm2 V4.0.7-1 at the time of writing) from the main repository.
      Then I used the following two links to help me with the setup and the mdadm syntax:
      blog.mycroes.nl/2009/02/migrat…ingle-disk-to-3-disk.html
      translate.google.com/translate…raid-1-un-raid-5-sin.html

      Creating the LVM for each disk is easily done through the web UI. Very simple: take one disk, for instance /dev/sda, create an sda-LVM volume group, then an HDD0-LVM logical volume. This way it matches the numbering I usually put on the physical hard drives, which is a good thing, as it avoids having to touch the fan.
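      For reference, the same per-disk layout can also be created from the command line; a minimal sketch of the equivalent steps, assuming whole-disk physical volumes and the naming used in the web UI:

      Source Code

      # mark the whole disk as an LVM physical volume
      pvcreate /dev/sda
      # one volume group per disk, named after the device
      vgcreate sda-LVM /dev/sda
      # one logical volume spanning the whole volume group
      lvcreate -n HDD0-LVM -l 100%FREE sda-LVM
      # repeat for /dev/sdb with sdb-LVM / HDD1-LVM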
      BTW, I am soooo pleased with this project. OMV rocks! Seriously, a new era of open-source-self-hosting is upon us. Thanks to OpenMediaVault, its active community and Volker!

      After that, I had to identify the device-mapper names of the volumes so I could feed them to mdadm:

      Source Code

      root@openmediavault:~# fdisk -l

      Disk /dev/sda: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
      Units: sectors of 1 * 512 = 512 bytes
      Sector size (logical/physical): 512 bytes / 4096 bytes
      I/O size (minimum/optimal): 4096 bytes / 4096 bytes

      Disk /dev/sdb: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
      Units: sectors of 1 * 512 = 512 bytes
      Sector size (logical/physical): 512 bytes / 4096 bytes
      I/O size (minimum/optimal): 4096 bytes / 4096 bytes

      Disk /dev/sdc: 111.8 GiB, 120034123776 bytes, 234441648 sectors
      Units: sectors of 1 * 512 = 512 bytes
      Sector size (logical/physical): 512 bytes / 512 bytes
      I/O size (minimum/optimal): 512 bytes / 512 bytes
      Disklabel type: dos
      Disk identifier: 0xc72f6d50

      Device     Boot     Start       End   Sectors   Size Id Type
      /dev/sdc1  *         2048 217806847 217804800 103.9G 83 Linux
      /dev/sdc2       217808894 234440703  16631810     8G  5 Extended
      /dev/sdc5       217808896 234440703  16631808     8G 82 Linux swap / Solaris

      Disk /dev/mapper/sda--LVM-HDD0--LVM: 3.7 TiB, 4000783007744 bytes, 7814029312 sectors
      Units: sectors of 1 * 512 = 512 bytes
      Sector size (logical/physical): 512 bytes / 4096 bytes
      I/O size (minimum/optimal): 4096 bytes / 4096 bytes

      Disk /dev/mapper/sdb--LVM-HDD1--LVM: 3.7 TiB, 4000783007744 bytes, 7814029312 sectors
      Units: sectors of 1 * 512 = 512 bytes
      Sector size (logical/physical): 512 bytes / 4096 bytes
      I/O size (minimum/optimal): 4096 bytes / 4096 bytes
      The two logical volumes created are /dev/mapper/sd$--LVM-HDD#--LVM.
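      The same mapping can also be double-checked with the standard LVM and block-device tools; a quick sketch, assuming the names above:

      Source Code

      # list logical volumes with their volume group and size
      lvs -o vg_name,lv_name,lv_size
      # show the block-device tree including the device-mapper nodes
      lsblk -o NAME,SIZE,TYPE /dev/sda /dev/sdb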
      Let's go ahead and create a new array. BTW, I first tried to make a RAID6, but it failed, since mdadm requires a minimum of 4 devices for that level.
      I could maybe have forced it somehow, but I figured it is not complicated to migrate a RAID5 to a RAID6 later (see the sketch below).
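      For completeness, a rough sketch of what that later RAID5-to-RAID6 migration could look like once a fourth device is available (hypothetical device name, not tested as part of this experiment):

      Source Code

      # add the fourth member first (hypothetical logical volume)
      mdadm --add /dev/md0 /dev/mapper/sde--LVM-HDD3--LVM
      # change the level and reshape; the backup file protects the critical section
      mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=/root/md0-reshape.bak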
      Anyhow, this is the syntax and the output:

      Source Code

      root@openmediavault:~# mdadm --create /dev/md0 --level=6 --raid-devices=2 /dev/mapper/sda--LVM-HDD0--LVM /dev/mapper/sdb--LVM-HDD1--LVM
      mdadm: at least 4 raid-devices needed for level 6
      root@openmediavault:~# mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/mapper/sda--LVM-HDD0--LVM /dev/mapper/sdb--LVM-HDD1--LVM
      mdadm: Defaulting to version 1.2 metadata
      mdadm: array /dev/md0 started.
      root@openmediavault:~# cat /proc/mdstat
      Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
      md0 : active raid5 dm-1[2] dm-0[0]
            3906883584 blocks super 1.2 level 5, 512k chunk, algorithm 2 [2/1] [U_]
            [>....................]  recovery =  0.3% (12684256/3906883584) finish=430.1min speed=150866K/sec
            bitmap: 0/30 pages [0KB], 65536KB chunk
      unused devices: <none>
      root@openmediavault:~# mdadm --detail /dev/md0
      /dev/md0:
              Version : 1.2
        Creation Time : Mon Apr 15 15:12:28 2019
           Raid Level : raid5
           Array Size : 3906883584 (3725.89 GiB 4000.65 GB)
        Used Dev Size : 3906883584 (3725.89 GiB 4000.65 GB)
         Raid Devices : 2
        Total Devices : 2
          Persistence : Superblock is persistent
        Intent Bitmap : Internal
          Update Time : Mon Apr 15 15:14:38 2019
                State : clean, degraded, recovering
       Active Devices : 1
      Working Devices : 2
       Failed Devices : 0
        Spare Devices : 1
               Layout : left-symmetric
           Chunk Size : 512K
       Rebuild Status : 0% complete
                 Name : openmediavault:0  (local to host openmediavault)
                 UUID : 00000000:11111111:22222222:33333333
               Events : 29
          Number   Major   Minor   RaidDevice State
             0     253        0        0      active sync   /dev/dm-0
             2     253        1        1      spare rebuilding   /dev/dm-1
      Then I let the system sit while the array was building.
      Next, I formatted the volume as BTRFS and created a shared folder (a command-line sketch of the formatting step follows below).
      I am copying some data at the moment, and then I will stress these two new disks a little. I am fairly confident about them, as they were already tested a bit beforehand.
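      The formatting itself was done through the web UI; a minimal command-line sketch of the equivalent, assuming the array device /dev/md0 and a hypothetical label and mount point:

      Source Code

      # create a BTRFS filesystem on top of the md array
      mkfs.btrfs -L data /dev/md0
      # mount it (OMV normally takes care of this via the web UI and fstab)
      mkdir -p /srv/data
      mount /dev/md0 /srv/data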


      I hope this will help.
      Thanks for reading this! ;)
      Updates coming soon.
      Feel free to pick apart whatever I got wrong. :D
    • Thanks for your great answers!

      henfri wrote:

      IIRC this is NOT recommended by the btrfs developers.
      The RAID5 problem with btrfs is the "write hole", and I think this also exists in mdadm.
      Here is the link that triggered my interest. Please take a look at it.
      btrfs.wiki.kernel.org/index.ph…agers_and_logical_volumes

      I am not trying to fix the "write hole" with this configuration, only attempting to improve the stability of the array over the everyday operations of a NAS-sized server. There are so many posts on the forum about RAID arrays getting degraded, failing or becoming problematic that I felt I ought to propose an alternative construction, with the BTRFS filesystem sitting on top of mdadm-managed LVM block devices. I understand your doubts about how pertinent this proposal is.
      Indeed, I should have mentioned how important it is to use a properly configured UPS with such a small server box, because unexpected power loss is the main trigger of the "write hole" issue.

      henfri wrote:

      Also this is quite interesting:
      lwn.net/Articles/665299/
      An SSD as journal to fix the write hole is supported by mdadm.
      I don't remember whether btrfs supports it, but I think it was mentioned in the GitHub discussion above.
      Thank you very much! Another step towards improving the behavior of the array. The system SSD is way too big for the OMV install, so I can try to carve out a partition on it for that purpose. Any idea of the implementation/syntax of that feature?
      ;)
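      For the record, mdadm takes the journal device at array creation time via --write-journal; a rough sketch of what I expect to try, assuming a dedicated SSD partition such as /dev/sdc3 (hypothetical, untested here):

      Source Code

      # create the array with a write journal on an SSD partition (hypothetical /dev/sdc3)
      mdadm --create /dev/md0 --level=5 --raid-devices=2 \
            --write-journal /dev/sdc3 \
            /dev/mapper/sda--LVM-HDD0--LVM /dev/mapper/sdb--LVM-HDD1--LVM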
    • henfri wrote:

      Here is the link that triggered my interest. Please take a look at it
      I read it, but I cannot see what triggered your thoughts. I see no benefit of LVM beneath btrfs; rather the contrary. A rebuild with pure btrfs will be way faster if the filesystem is not full, as only the used part is rebuilt.

      Here is the simple implementation that is suggested (a command-line sketch follows the quote below):

      btrfs.wiki.kernel.org wrote:

      • create a single volume group, with two logical volumes (LVs), each backed by separate devices
      • create a btrfs raid1 across the two LVs
      • create subvolumes within btrfs (e.g. for /home/user1, /home/user2, /home/media, /home/software)
      • in this case, any one subvolume could grow to use up all the space, leaving none for other subvolumes
      • however, it performs well and is convenient
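      A minimal sketch of that recipe on the command line, with hypothetical device, volume group and subvolume names (not something that was run as part of this thread):

      Source Code

      # one volume group spanning both disks, one LV per underlying device
      pvcreate /dev/sda /dev/sdb
      vgcreate data-vg /dev/sda /dev/sdb
      lvcreate -n lv0 -l 100%PVS data-vg /dev/sda
      lvcreate -n lv1 -l 100%PVS data-vg /dev/sdb
      # btrfs raid1 for data and metadata across the two LVs
      mkfs.btrfs -d raid1 -m raid1 /dev/data-vg/lv0 /dev/data-vg/lv1
      mount /dev/data-vg/lv0 /mnt
      # subvolumes as suggested in the quote
      btrfs subvolume create /mnt/user1
      btrfs subvolume create /mnt/media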

    • In the case of different data on the two drives, how do you determine which copy is correct?

      http://www.raid-recovery-guide.com/raid5-write-hole.aspx wrote:

      If the parity (in RAID5) or the mirror copy (in RAID1) is not written correctly, it would be unnoticed until one of the array member disks fails. If the disk fails, you need to replace the failed disk and start RAID rebuild. In this case one of the blocks would be recovered incorrectly
    • The reason for write holes is that data gets written at different times. This will not happen on RAID1. The question remains what has to be done when the data differ between the two drives because of some disk error. This highly depends on the implementation. In general, a simple hardware controller or mdadm cannot decide which copy is correct and which one is not. The easiest defence against this is a quorum: if you use 3 disks, 2 should be the same. This can be handled by simple RAID controllers.
      There is also the possibility of checksums, as implemented in ZFS or Ceph. Ceph, for example, requires quorum AND a correct checksum by default.
      A common way to go, if you only have two drives, is to actually check both disks by hand and decide which one is the good one, either by SMART values or by checking file or filesystem integrity. But there will not be a scrub-to-death scenario as can happen with RAID 4/5/6.
      Of course, what may still happen is a total disk failure combined with wrong data that went undetected before. To avoid that you must do disk verifications regularly.
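      For example, regular verification can be scheduled on both layers; a minimal sketch assuming an md array named /dev/md0 and a BTRFS filesystem mounted at /srv/data (hypothetical paths):

      Source Code

      # trigger an md consistency check (mismatches are reported in mismatch_cnt)
      echo check > /sys/block/md0/md/sync_action
      cat /sys/block/md0/md/mismatch_cnt
      # scrub the BTRFS filesystem and report checksum errors
      btrfs scrub start /srv/data
      btrfs scrub status /srv/data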


    • With caching disabled, the drive has enough power in its capacitors to finish the writes already handed to the controller, and they should be synchronous enough in that respect if you use the same type of disks. Talking about SSDs here, of course.
      Of course it is as simple as this: any storage can fail. That's what we have different depths of backup for. It's just a matter of probabilities, and the probability of hitting a write hole is way, way lower with RAID1 than with RAID 4/5/6.
    • tkaiser wrote:

      henfri wrote:

      what would be the benefit over pure btrfs?
      And more importantly what's the benefit of putting a degraded RAID5 mdraid layer in between?

      Source Code

      mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/mapper/sda--LVM-HDD0--LVM /dev/mapper/sdb--LVM-HDD1--LVM
      Adding tons of complexity should serve a purpose worth the effort...
      Hello,

      I added a degraded layer because I don't have enough disks to start the array at full size right away.

      The complexity is there to address another fundamental limitation of BTRFS: it does not natively manage logical volumes, so that layer is handled by LVM.