HowTo build an SHR – Sliced Hybrid Raid



    Just in case somebody is interested in how to build a Sliced Hybrid Raid: here is what I did on one of my machines.


    What is a Sliced Hybrid Raid (SHR) – and why might you want it?

    A Sliced Hybrid Raid is basically a Raid5 built from disks of different sizes. But instead of using only the capacity of the smallest disk on all the other disks, the advantage of an SHR is that it uses as much of the available disk space as possible. It does so by combining partitions, software raid (using mdadm) and logical volumes (using LVM). Fortunately, all those tools are available in OMV, so we can take advantage of them.


    Note, however, that you still need the largest HDD twice, since you need at least 2 partitions to create a slice. If your largest HDD only exists once, you will “lose” the amount of space by which it is larger than your second-largest drive.


    Credits

    This is not my idea. I’m just documenting what I did to create an SHR on my OMV NAS.


    I saw this for the first time on my Synology NAS. There it is even called “Synology Hybrid Raid”, but it can also be found in several places on the internet under the name “Sliced Hybrid Raid”, which is what I call it here as well. I have no idea who came up with it first, but in the end it’s not a new technology, just a (clever?) combination of a few old ones.



    Guideline to build an SHR on OMV


    Disclaimer

    One important warning upfront:


    This is not a guideline you can just blindly follow. You must understand what all those commands do, since you definitely will have to adapt them to your specific environment. So carefully read and understand(!) the whole guideline before following it. If it contains commands you are not familiar with, read the man page or a similar source of information. Only follow this guideline if you fully understand each and every command. You must be able to detect an error if you see one – or at least get suspicious...


    The commands used in this guideline can destroy all your data if you make a mistake. Therefore I urgently recommend having a backup (on a different machine, or at least on a disk that is NOT attached to the machine where you are going to build the SHR!!) and practicing those steps in a VM before doing it on your real machine. I shall not be held responsible in case you destroy any of your data using this guideline. You have been warned.



    Prerequisites:

    Make sure you have installed the OMV Plugins for md and lvm.

    pasted-from-clipboard.png

    In Debian (on the command line), make sure gdisk and lsblk are installed and available. If they aren’t, install them using apt-get.
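    If they are missing, something like this should install them (lsblk is part of the util-linux package, which is normally already installed):


    Code
    root@ts8-nas:~# apt-get install gdisk util-linux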


    Getting started

    In my case I use 6 HDDs: two of 3 TB, two of 5 TB and two of 10 TB. I have deleted all the partitions that were previously on those disks, so for the start of this guideline they are empty.


    So, this will basically be my setup:


    pasted-from-clipboard.png
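    In text form, the layout will roughly look like this (partition sizes are approximate):


    Code
                 sdc    sdd    sde    sdf    sda    sdb
                 2.7T   2.7T   4.5T   4.5T   9.1T   9.1T
    slice 1:     p1     p1     p1     p1     p1     p1     (~2.7 TiB each, Raid5 over 6 partitions)
    slice 2:                   p2     p2     p2     p2     (~1.8 TiB each, Raid5 over 4 partitions)
    slice 3:                                 p3     p3     (~4.6 TiB each, Raid5 over 2 partitions)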



    This is what the 6 HDDs look like using lsblk:


    Code
    root@ts8-nas:~# lsblk -o name,partlabel,parttypename,size /dev/sd[abcdef]
    NAME PARTLABEL PARTTYPENAME  SIZE
    sda                          9.1T
    sdb                          9.1T
    sdc                          2.7T
    sdd                          2.7T
    sde                          4.5T
    sdf                          4.5T


    Hands on – the partitions

    I start with the smallest HDD I have. In my case that’s the 3 TB HDD. I create a partition that fills the whole space of that disk, and then I create a partition of exactly the same size on all the other disks.



    Code
    root@ts8-nas:~# sgdisk --largest-new=0 /dev/sdc
    Creating new GPT entries in memory.
    The operation has completed successfully.


    Note:

    If you don’t get the message “The operation has completed successfully.”, something went wrong (sgdisk is not very verbose about errors).


    The --info command gives me the size in sectors of my newly created partition.


    Code
    root@ts8-nas:~# sgdisk --info=1 /dev/sdc
    Partition GUID code: 0FC63DAF-8483-4772-8E79-3D69D8477DE4 (Linux filesystem)
    Partition unique GUID: C88F36E9-F2EB-4FB6-9F3B-581B96F4E6C5
    First sector: 2048 (at 1024.0 KiB)
    Last sector: 5860533134 (at 2.7 TiB)
    Partition size: 5860531087 sectors (2.7 TiB)
    Attribute flags: 0000000000000000
    Partition name: ''


    I’m using this value in the command sgdisk --new=0:0:+5860531087 /dev/sdX to create a partition of the same size on all the other disks:
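    Applied to the remaining five disks, that looks roughly like this (adapt the device names and the sector count to your own disks; each command should again report “The operation has completed successfully.”):


    Code
    root@ts8-nas:~# sgdisk --new=0:0:+5860531087 /dev/sdd
    root@ts8-nas:~# sgdisk --new=0:0:+5860531087 /dev/sde
    root@ts8-nas:~# sgdisk --new=0:0:+5860531087 /dev/sdf
    root@ts8-nas:~# sgdisk --new=0:0:+5860531087 /dev/sda
    root@ts8-nas:~# sgdisk --new=0:0:+5860531087 /dev/sdb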



    And that’s the result:



    These are the partitions for the first slice. Now I’m going to create the partitions for the second slice. This works basically the same way as for the first slice:
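    Again I let sgdisk fill the remaining space on the smallest of the remaining disks (sde) and then copy that partition size to the other three disks. A sketch; <SIZE> stands for the “Partition size” in sectors that sgdisk --info=2 /dev/sde reports for the newly created partition:


    Code
    root@ts8-nas:~# sgdisk --largest-new=0 /dev/sde
    root@ts8-nas:~# sgdisk --info=2 /dev/sde
    root@ts8-nas:~# sgdisk --new=0:0:+<SIZE> /dev/sdf
    root@ts8-nas:~# sgdisk --new=0:0:+<SIZE> /dev/sda
    root@ts8-nas:~# sgdisk --new=0:0:+<SIZE> /dev/sdb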



    And this is the result



    These are the partitions for the second slice. Now I’m going to create the partitions for the third slice:
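    Since only the two 10 TB disks are left and they are the same size, the third partition can simply fill the remaining space on both of them. A sketch:


    Code
    root@ts8-nas:~# sgdisk --largest-new=0 /dev/sda
    root@ts8-nas:~# sgdisk --largest-new=0 /dev/sdb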



    And this is the result



    Ok, the first part is now finished. All the needed partitions are created. Now I’m going to create the slices.

  • Creating the Slices - Raid5 Arrays with mdadm

    A slice is a Raid-5 array made from all the partitions of the same size. So the first slice consists of a Raid-5 array with /dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1, /dev/sde1 and /dev/sdf1. The second slice consists of a Raid-5 array with /dev/sda2, /dev/sdb2, /dev/sde2 and /dev/sdf2. And the third slice consists of a Raid-5 array with /dev/sda3 and /dev/sdb3.


    Code
    root@ts8-nas:~# mdadm --create /dev/md1 --name=slice1 --level=5 --raid-devices=6 --consistency-policy=ppl /dev/sd[abcdef]1
    mdadm: Defaulting to version 1.2 metadata
    mdadm: array /dev/md1 started.
    root@ts8-nas:~# mdadm --create /dev/md2 --name=slice2 --level=5 --raid-devices=4 --consistency-policy=ppl /dev/sd[abef]2
    mdadm: Defaulting to version 1.2 metadata
    mdadm: array /dev/md2 started.
    root@ts8-nas:~# mdadm --create /dev/md3 --name=slice3 --level=5 --raid-devices=2 --consistency-policy=ppl /dev/sd[ab]3
    mdadm: Defaulting to version 1.2 metadata
    mdadm: array /dev/md3 started.

    Note:

    the option "--consistency-policy=ppl" is needed, because mdadm Raid-5 suffers from the same write hole that is always claimed to be a problem in BTRFS. This parameter closes it for mdadm.


    //Edit:

    Please note that the parameter "--consistency-policy=ppl" can cost you performance. Combined with the wrong chunk size, the data rate can drop to about 1/10 of its original value. See "Update 1" here.

    If you don't care about the write hole (e.g., because you have a UPS in place), just omit it.



    This is what I have now:
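    From the command line, the new arrays and the progress of the initial sync can be checked like this:


    Code
    root@ts8-nas:~# cat /proc/mdstat
    root@ts8-nas:~# mdadm --detail /dev/md1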



    Our slices are now already visible in the OMV Web-UI, albeit marked as degraded.


    pasted-from-clipboard.png

    (Don’t be surprised if the device files (/dev/md*) have different numbers. They might change after a reboot.)
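    Instead of the numbered device files you can always use the stable names that mdadm creates below /dev/md/ from the --name given above:


    Code
    root@ts8-nas:~# ls -l /dev/md/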



    Now I wait until the initial raid build has finished (in my case that took about 24 hours). Go get some coffee. :)


    Eventually the Raids are up and running…


    pasted-from-clipboard.png


  • Putting it all together - LVM

    The remaining steps are now carried out in the Web-UI of OMV:


    Declare the three slices (those raid-5 arrays) as physical volumes in LVM:


    pasted-from-clipboard.png


    Press the “Create” button.


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    Select one of the raids.


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    Repeat for the other two slices.


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    In some places it is recommended to align the chunks of the raid array with the physical extents of LVM to get better performance. I have tried this with no measurable effect. If anything, the aligned configuration was a little bit slower than the standard configuration.
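    For reference, the physical volumes and the volume group could also be created on the command line; a sketch (the name “shr-vg” is only an example and the /dev/md* numbers must match your system):


    Code
    root@ts8-nas:~# pvcreate /dev/md1 /dev/md2 /dev/md3
    root@ts8-nas:~# vgcreate shr-vg /dev/md1 /dev/md2 /dev/md3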


  • And finally create one logical volume (which is then the SHR):


    pasted-from-clipboard.png


    Press the “Create” button.


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    Select the volume group we just created and assign all the space to it.


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png



    Looking at it from the command line:
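    For example (the names “shr-vg” and “shr” are only examples; the lvcreate line is only needed if you did not already create the volume in the Web-UI):


    Code
    root@ts8-nas:~# lvcreate -l 100%FREE -n shr shr-vg
    root@ts8-nas:~# pvs
    root@ts8-nas:~# vgs
    root@ts8-nas:~# lvs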



  • Now you only have to format it with the file system of your choice. For this example I’ll use XFS.
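    For reference, on the plain command line the formatting step would roughly correspond to this, using the example names from above (formatting via the Web-UI is preferable, since OMV then knows about the new file system):


    Code
    root@ts8-nas:~# mkfs.xfs /dev/shr-vg/shr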


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    Wait some seconds or maybe a few minutes... (In case of Ext4 you’d need to wait for hours)


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


  • Now I have disks of different sizes with one big file system, all the advantages of a traditional Raid5, and no write hole.


    pasted-from-clipboard.png



    Since technically this is an ordinary block device, I could use any file system on it, even BTRFS. But since BTRFS has its own capabilities regarding Raid and volume management, this whole SHR approach wouldn’t make any sense at all if it weren’t for the fact that Raid-5 has been claimed to be unstable in BTRFS for ages. So, if we want to use Raid-5 with BTRFS and still have the advantage of being able to use disks of various sizes, we have to fall back to this SHR as well, but we can shortcut its creation: there is no need to use LVM, because BTRFS can do that part itself.


    For use with BTRFS we still need the 3 slices (i.e. /dev/md1, /dev/md2, /dev/md3). The only thing left to do is to create a BTRFS file system on top of them, like this:


    Press the “Create” button.


    pasted-from-clipboard.png


    Select BTRFS.


    pasted-from-clipboard.png


    You could either select single or Raid0 here. I did some performance tests and in my setup Raid0 was slightly better than single.


    pasted-from-clipboard.png


    Select all the 3 slices.


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    Start formatting and then go through the usual steps…


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png


    pasted-from-clipboard.png
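    For reference, roughly the same can be done on the command line; a sketch, with “shr” as an example label and raid0 for data as selected above:


    Code
    root@ts8-nas:~# mkfs.btrfs -L shr -d raid0 /dev/md1 /dev/md2 /dev/md3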


  • pasted-from-clipboard.png



    pasted-from-clipboard.png


    Now I have BTRFS on top of the 3 slices, with most of the advantages of BTRFS (correction of bit rot is missing in this setup, but I still have the detection).


    Regarding performance, I did some extensive testing on my machine. The short version is:

    For small files (i.e. files smaller than about 4 GB) there is hardly any measurable difference (thanks to the file cache in Linux). For big multi-gigabyte files (larger than 8 GB, and also for backups with btrfs send/receive, since those usually also consist of several GB of data) there is a significant degradation visible for the SHR. Performance of the SHR becomes almost normal if you enable an LVM cache on an SSD.
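    Setting up such an LVM cache could look roughly like this; the SSD device /dev/sdg, the cache size and the names are only examples:


    Code
    root@ts8-nas:~# pvcreate /dev/sdg
    root@ts8-nas:~# vgextend shr-vg /dev/sdg
    root@ts8-nas:~# lvcreate --type cache-pool -L 100G -n shr-cache shr-vg /dev/sdg
    root@ts8-nas:~# lvconvert --type cache --cachepool shr-vg/shr-cache shr-vg/shr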


    The 3 slices with BTRFS Raid-0 are still slower than e.g. a plain Raid-5, but the performance is acceptable.


    I have published the results of my performance measurements in this thread: link


  • Outlook:

    All the information hereafter is untested. I have experimented with it in a VM (which is why names may differ slightly), but I haven’t used it on a real system. Therefore: use with care and double-check.


    What to do if a disk fails:

    Exchange the faulty drive


    Manually create the partitions (as above)


    Add those partitions to the degraded raids with commands like this


    Code
    mdadm --manage /dev/md/ts8-nas\:shr1 --add /dev/sdf1
    mdadm --manage /dev/md/ts8-nas\:shr2 --add /dev/sdf2
    mdadm --manage /dev/md/ts8-nas\:shr3 --add /dev/sdf3


    to see progress


    Code
    cat /proc/mdstat 


    How to grow the SHR by exchanging a small disk for a bigger one:

    Please note:

    For this to work, the new drive must either match the size of an existing drive (to be able to create all the partitions for the slices) or it must be bigger than the largest HDD.


    In case you still have a slot for an additional disk left in your NAS-case:

    add a spare


    Code
    mdadm /dev/$MYRAID --add /dev/sdix


    now replace the smaller disk


    Code
    mdadm /dev/$MYRAID --replace /dev/sdcx


    after copying the data to the new disk, the small one automatically gets marked as “faulty”


    then remove disk


    Code
    mdadm /dev/$MYRAID --remove /dev/sdcx



    Of course, you need to do this for every partition.


    In case there is no slot left for an additional disk and you first need to remove the existing one:

    remove the small disk from the raid(s). Since it’s the small disk, I assume there is only one raid it has to be removed from:


    Code
    mdadm /dev/md/ts8-nas\:shr1 --fail /dev/sdd1 --remove /dev/sdd1


    check the result


    Code
    mdadm --detail /dev/md1


    display the existing partitions of the biggest disk

    (I have some additional partitions here (1-3). For the sake of this example just ignore them)


    Code
    root@ts8-nas:~# sgdisk --print /dev/sdg
    Number  Start (sector)    End (sector)  Size       Code  Name
       1            2048           10239   4.0 MiB     EF02  grub
       2           10240         4204543   2.0 GiB     8200  swap
       3         4204544        37758975   16.0 GiB    8300  system
       4        37758976      9767541134   4.5 TiB     8300
       5      9767542784     27344764894   8.2 TiB     8300
    root@ts8-nas:~#


    plug in the new disk and rebuild the partitions on it (as above). In case the new drive is bigger than the existing ones, remember to create an additional partition for the remaining space.


    Code
    sgdisk --new=1:2048:10239 --change-name=0:grub --typecode=0:0xef02 /dev/sdd
    sgdisk --new=2:10240:4204543 --change-name=0:swap --typecode=0:0x8200 /dev/sdd
    sgdisk --new=3:4204544:37758975 --change-name=0:system --typecode=0:0x8300 /dev/sdd
    sgdisk --new=4:37758976:9767541134 --typecode=0:0x8300 /dev/sdd
    sgdisk --new=5:9767542784:27344764894 --typecode=0:0x8300 /dev/sdd
    sgdisk --largest-new=6 /dev/sdd


    add partitions to the raid(s):


    Code
    mdadm --manage /dev/md/omv-x64\:shr1 --add /dev/sdd4
    mdadm --manage /dev/md/omv-x64\:shr2 --add /dev/sdd5


    wait until rebuild has finished


    Code
    cat /proc/mdstat



    if you are exchanging more than one disk, repeat those steps for the next disk. (Only replace one disk at a time. You would lose your raid (and all your data) otherwise.)


    In case you added more than one disk and both were bigger than your existing disks, we need to create a new Raid5 from the new partitions:


    Code
    mdadm --create /dev/md/shr3 --name=shr3 --level=5 --raid-devices=2 /dev/sd[d-e]6



    Check progress using


    Code
    cat /proc/mdstat
    • Add that new Raid5 as physical volume to LVM
    • Add the new physical volume to the existing Volume Group
    • Grow the size of your logical volume


    If you simply want to add one or more additional disks to your SHR (assuming those disks are larger than those already in the SHR):

    • Partition all new disks with the same partitions as your largest disk
    • Add another partition containing the remaining space on the new disks
    • Add all but the last partition to their corresponding Raids
    • Create a new Raid5 for the last partition on the new disks
    • Add that new Raid5 as physical volume to LVM
    • Add the new physical volume to the existing Volume Group
    • Grow the size of your logical volume


    You should find the necessary commands in the above descriptions; a rough sketch follows below.
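    Untested sketch of the commands for growing one slice and the logical volume after adding a disk (the device names, the raid-device count and the mount point are only examples and must be adapted):


    Code
    mdadm --manage /dev/md1 --add /dev/sdg1
    mdadm --grow /dev/md1 --raid-devices=7
    pvresize /dev/md1
    lvextend -l +100%FREE /dev/shr-vg/shr
    xfs_growfs /path/to/shr-mountpoint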


  • Update 1:

    I realized that "--consistency-policy=ppl" can (and in my case DID) cost performance, especially when combined with the wrong chunk size.


    So in case you don't care about the write hole (e.g., because you have a UPS in place), just omit it.

    In case you want to keep that parameter, make sure you choose the right chunk size.

    For details see "Update 1" here.
