What I learnt regarding SSD+TRIM and (mdadm)RAID5+LVM+ext4


    • What I learnt regarding SSD+TRIM and (mdadm)RAID5+LVM+ext4

      Have re-purposed 4x WDC WDS100T1B0A (1TB) SSD Drives running (mdadm)RAID5+LVM+ext4.

      This is just a collection of what I learned - so I can find it next time I go looking... and it may also be useful for someone else.

      Using re-purposed consumer gear is normally OK for home use - as long as you test, test, test and test some more.
      Left the disks running in a (mdadm)RAID5-only array for a few months on light duties - in that time one drive's controller died; it was replaced under warranty.
      Ran the array for another few months to get past the infant-mortality hump.

      Set up the latest OMV - configuring full-disk (mdadm)RAID5+LVM+ext4 with an 800GB LV for VM images.

      However, when checking TRIM support, fstrim returned "not supported" errors... which is not surprising given the lsblk -D output below (DISC-GRAN/DISC-MAX = 0 means unsupported):

      Source Code

      root@mama:~# lsblk -D
      NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
      sda 0 512B 2G 0
      └─md0 0 256K 2G 0
        └─vg_ssd_raid5_1-lv_raid5_xen_guests_1 0 0B 0B 0
      sdb 0 512B 2G 0
      └─md0 0 256K 2G 0
        └─vg_ssd_raid5_1-lv_raid5_xen_guests_1 0 0B 0B 0
      sdc 0 512B 2G 0
      └─md0 0 256K 2G 0
        └─vg_ssd_raid5_1-lv_raid5_xen_guests_1 0 0B 0B 0
      sdd 0 512B 2G 0
      └─md0 0 256K 2G 0
        └─vg_ssd_raid5_1-lv_raid5_xen_guests_1 0 0B 0B 0
      sde 0 512B 2G 0
      ├─sde1 0 512B 2G 0
      ├─sde2 0 512B 2G 0
      └─sde5 0 512B 2G 0
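      A quick way to spot which layer is blocking discards is to filter that output for a DISC-GRAN of 0B. A minimal sketch, using a shortened copy of the captured output above as sample input so it can be tried anywhere; on a live system you would pipe `lsblk -D` straight into the awk filter:

      ```shell
      # Print every device whose DISC-GRAN (column 3) is 0B, i.e. discards
      # are not supported at that layer of the stack.
      # Live-system form: lsblk -D | awk 'NR > 1 && $3 == "0B" { print $1 }'
      unsupported=$(awk 'NR > 1 && $3 == "0B" { print $1 }' <<'EOF'
      NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
      sda 0 512B 2G 0
      └─md0 0 256K 2G 0
      └─vg_ssd_raid5_1-lv_raid5_xen_guests_1 0 0B 0B 0
      EOF
      )
      echo "$unsupported"
      ```

      Here only the LV is reported, which matches the symptom: the disks and md0 advertise discard support, but nothing survives past the RAID layer to the LV.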

      The WD drives support Discard:

      Source Code

      root@mama:~# !hdparm
      hdparm -I /dev/sd[abcde] | grep TRIM
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
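      As a sanity check, the number of "TRIM supported" lines should equal the number of drives queried (five here, since the boot drive sde is also an SSD). A small sketch, again fed the captured output above as a here-doc so it runs anywhere; on a live system the pipeline would be `hdparm -I /dev/sd[abcde] | grep -c 'TRIM supported'`:

      ```shell
      # Count drives reporting DSM TRIM support in the hdparm output.
      trim_capable=$(grep -c 'TRIM supported' <<'EOF'
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
      * Data Set Management TRIM supported (limit 8 blocks)
      * Deterministic read ZEROs after TRIM
      EOF
      )
      echo "$trim_capable drives report TRIM support"
      ```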
      From Google-fu research:
      LVM has supported Discards since: 2011(?)
      (mdadm)RAID456 has supported Discards since: 2016(?)

      Tried different kernels - lsblk -D displayed different, confusing results, including showing support... but fstrim still gave errors.

      A lot of the guides unearthed by Google-fu say that lvm.conf needs to be modified... in reality it does not. issue_discards is only needed if you want Discards to be issued during lvremove/vgremove; it has no bearing on LVM transparently passing Discards down the stack. Under testing (with my setup) it takes ~1.5hrs to fstrim an 800G LV and ~3.5hrs to lvremove a 1.9TB LV. So it is best to only set issue_discards=1 when you _really_ need it, then disable it again (e.g. immediately after creating the RAID5+LVM stack, for a clean/forced-empty array).

      Source Code

      /etc/lvm/lvm.conf
      issue_discards = 1
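      Since issue_discards only matters for the duration of an lvremove/vgremove, one approach is to flip it on just for that operation and straight back off afterwards. A minimal sketch of the toggle - run here against a scratch copy of the config so it is safe to try anywhere (the devices { } section layout mirrors stock lvm.conf, but check your own file before pointing sed at /etc/lvm/lvm.conf):

      ```shell
      # Scratch copy standing in for /etc/lvm/lvm.conf.
      conf=$(mktemp)
      cat > "$conf" <<'EOF'
      devices {
          issue_discards = 0
      }
      EOF

      # Enable discards just before the lvremove/vgremove...
      sed -i 's/issue_discards = 0/issue_discards = 1/' "$conf"
      grep 'issue_discards' "$conf"

      # ...and disable again once the remove has finished.
      sed -i 's/issue_discards = 1/issue_discards = 0/' "$conf"
      grep 'issue_discards' "$conf"
      ```

      LVM commands also accept a runtime override via --config (e.g. `lvremove --config 'devices { issue_discards = 1 }' vg/lv`), which should avoid editing the file at all - worth testing on your own setup first.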
      After more Google-fu I came across current.workingdirectory.net/posts/2016/ssd-discard/ which points out devices_handle_discard_safely=Y
      I also learned that the raid456 module disables this by default: there is no way (mdadm)RAID5 can test the SSDs to confirm they handle Discards correctly, so enabling it is a manual admin action. Verification/testing is the responsibility of the admin.

      Source Code

      /etc/modprobe.d/raid456.conf
      options raid456 devices_handle_discard_safely=Y
      Any changes to /etc/lvm/lvm.conf and/or /etc/modprobe.d/raid456.conf need the following before the new settings take effect:

      Source Code

      update-initramfs -u
      reboot
      That devices_handle_discard_safely is enabled can be verified with:

      Source Code

      root@mama:~# cat /sys/module/raid456/parameters/devices_handle_discard_safely
      Y
      and

      Source Code

      root@mama:~# lsblk -D
      NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
      sda 0 512B 2G 0
      └─md0 0 256K 2G 0
        └─vg_ssd_raid5_1-lv_raid5_xen_guests_1 0 256K 2G 0
      sdb 0 512B 2G 0
      └─md0 0 256K 2G 0
        └─vg_ssd_raid5_1-lv_raid5_xen_guests_1 0 256K 2G 0
      sdc 0 512B 2G 0
      └─md0 0 256K 2G 0
        └─vg_ssd_raid5_1-lv_raid5_xen_guests_1 0 256K 2G 0
      sdd 0 512B 2G 0
      └─md0 0 256K 2G 0
        └─vg_ssd_raid5_1-lv_raid5_xen_guests_1 0 256K 2G 0
      sde 0 512B 2G 0
      ├─sde1 0 512B 2G 0
      ├─sde2 0 512B 2G 0
      └─sde5 0 512B 2G 0
      After setting up the RAID5 array, doing something like the below will make sure the array is clean/forced empty (it may take a _long_ time to complete, so run it via screen).

      Source Code

      vgcreate vg_realname /dev/md0
      lvcreate -l 100%FREE -n lv_trimtest vg_realname
      lvremove vg_realname/lv_trimtest
      Then test, test, test and test some more before committing real data.
    • heady wrote:

      4x WDC WDS100T1B0A (1TB) SSD Drives running (mdadm)RAID5+LVM+ext4
      This is really dangerous. You can't adopt concepts made for spinning rust to modern flash storage.

      HDDs die for completely different reasons than modern flash storage. An HDD will usually die without warning due to physical damage. Flash storage products will suffer either from firmware bugs, or from wear-out, or from physical damage as well (a power spike or something like that).

      The risk of a firmware bug, and the principle of identical wear-out under identical access patterns, will result in a bunch of identical SSDs dying at (almost) the same time (at least when running identical firmware). And that's something traditional/anachronistic RAID won't protect against. Same with RAID-1 or mirrors made out of identical SSDs --> bad idea. RAID-5 makes no sense at all, in my opinion...
    • tkaiser wrote:

      This is really dangerous. You can't adopt concepts made for spinning rust to modern flash storage.
      HDD die for completely different reasons than modern flash storage.
      Currently, EMC, NetApp, HDS, HP, Huawei & IBM would seem to disagree...

      I'd be interested in any real modeling or actual large scale statistics you may have to substantiate your opinion.

      HDDs have firmware bugs too... they always have.

      I agree that HDDs die for different reasons than SSDs - both, though, follow the same bathtub life-cycle curve, which _can_ be modelled the same way for either.

      I have plugged some numbers into a reliability network and I have a result I'm happy with.
      I'm happy to redo if you have any real numbers to challenge.

      I'll keep doing what I'm doing thanks.
    • heady wrote:

      I'll keep doing what I'm doing
      • tomshardware.com/news/intel-32…g-8mb-firmware,13250.html -- this firmware bug affects SSDs losing power (power loss, UPS failure). Try to imagine what happens if you make up an array out of identical SSDs that are all affected by the same bug
      • forums.crucial.com/t5/Crucial-…-once-an-hour/ta-p/130218 -- this firmware bug affects SSDs running more than 5184 hours then becoming unresponsive until next power cycle. Try to imagine what happens if you make up an array out of identical SSDs that are all affected by the same bug
      • Insert random firmware bug here. Try to imagine what happens if you make up an array out of identical SSDs that are all affected by the same bug
      Up to you to think about. The contractors we work with always ensure that the HDDs we put into our arrays are from different batches (they learned this from two nasty occasions where negotiation problems between Infortrend/EMC RAID controllers and disks led to a bunch of HDDs being kicked out of arrays at the same time). For whatever reason they don't do the same with the caching SSDs (might be one of the areas where people only learn by having to make their own experiences).