SSD Cache (bcache) for OpenMediaVault?

  • There are bcache-tools in Debian Stretch and a bcache module in the Stretch backports (bpo) kernel. There is nothing implemented in the OMV UI or backend to handle bcache.


    Ah I see, thanks!


    Also, I am new to bcache; I have heard of it but still don't know what's so good about it. Can you elaborate on the benefits?


    Frankly, I am new to this too. What I understand so far is that I can just add an SSD to the NAS and use it as a cache for the HDDs so that reads and writes can be faster.


    That means

    • you have an SMB performance problem?
    • you analyzed the various potential bottlenecks already?
    • and came to the conclusion that random IO performance is too low?


    More like I realised that my bottleneck was the HDDs themselves, since I am using 10GbE now (initial burst of 300 to 400 MB/s, then a constant 150 MB/s). Current options include switching from OMV to a Windows Server OS that has SMB improvements like SMB Direct, changing to RAID or ZFS, using bcache, or simply living with this.

  • More like I realised that my bottleneck was the HDDs themselves, since I am using 10GbE now

    Yeah, that explains why you're thinking about bcache, but I doubt it will help that much since by default it recognizes sequential writes and passes them through. It's a great addition if you rely on anachronistic RAID modes (e.g. a RAID6), since such a RAID shows high sequential performance but sucks totally at random IO. This is where bcache can shine, since it accelerates especially random writes.


    Then bcache acts somewhat similar to an L2ARC cache with ZFS, caching the most accessed data that doesn't fit into the ARC (physical memory dedicated as cache) on SSD(s). In other words: this is something that works fine on servers but not that well with most (home) OMV installations, where a different data usage pattern applies.


    Without changing the disk topology, I fear the only area where bcache could help is improving NAS write performance as long as the data fits onto the SSD (as you see it now with the 300-400 MB/s write speed going into the filesystem buffer in RAM and, once this is exceeded, slowing down to disk write performance). But only if you configure bcache to accelerate sequential writes too. Reads will be bottlenecked by the disks if the data is not already in the cache, so it depends on your usage pattern which route to go (SMB Direct won't help, BTW).
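
    In case anyone wants to try that route, here is a minimal sketch of the sysfs knobs meant above, assuming the bcache device shows up as /dev/bcache0 (attribute names as documented for the kernel's bcache):

    echo writeback > /sys/block/bcache0/bcache/cache_mode    # cache writes as well, not only reads
    echo 0 > /sys/block/bcache0/bcache/sequential_cutoff     # 0 disables the sequential bypass, so sequential IO gets cached too
    cat /sys/block/bcache0/bcache/state                      # should report "clean" or "dirty" once a cache is attached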

  • There are bcache-tools in Debian Stretch and a bcache module in the Stretch backports (bpo) kernel. There is nothing implemented in the OMV UI or backend to handle bcache.

    This appears to be accurate. From what I can tell I set up bcache correctly on the CLI, and OMV even recognizes it. As soon as I try to mount it for a shared folder it crashes, though.


    I tried LVM's caching feature before bcache because it seemed more reliable and closer to business grade. OMV can mount the logical volumes set up this way. Unfortunately, it actually performed worse while caching than with a standalone spinning disk, and I could not get documents to promote onto the SSD. This blog appears to be correct regarding performance, based on my own testing: Nikolaus Rath's Website.
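
    For reference, this is roughly what an lvmcache setup looks like; the VG/LV names, sizes and device names below are just placeholders, not the exact commands I used:

    pvcreate /dev/sdb /dev/sdc                                  # sdb = HDD, sdc = SSD
    vgcreate vg0 /dev/sdb /dev/sdc
    lvcreate -n data -L 900G vg0 /dev/sdb                       # data LV placed on the HDD
    lvcreate --type cache-pool -n cpool -L 100G vg0 /dev/sdc    # cache pool placed on the SSD
    lvconvert --type cache --cachepool vg0/cpool vg0/data       # attach the cache pool to the data LV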


    It's a great addition if you rely on anachronistic RAID modes (e.g. a RAID6), since such a RAID shows high sequential performance but sucks totally at random IO. This is where bcache can shine, since it accelerates especially random writes.

    This is exactly why I was also looking into OMV/bcache, and it's how I found this thread. I am working on a cloud migration, but my company works with very large files (including 4K video) and also super small files that need to be built into sprites. We still need a local server for the speed and upload caching (a hybrid solution).


    OMV seems more consumer-targeted but has all of the security features we should need. It also seems much easier to use than ClearOS which is marketed to businesses as a MS Server alternative.


    It's so close to working with bcache... if I were a developer I might look under the hood and assess how much time it would take to resolve the situation. For any curious users who want to look at the stack trace:

    Error #0:
    exception 'OMV\Exception' with message 'Device '/dev/bcache01' is not available after a waiting period of 10 seconds.' in /usr/share/php/openmediavault/system/blockdevice.inc:486
    Stack trace:
    #0 /usr/share/openmediavault/engined/rpc/filesystemmgmt.inc(609): OMV\System\BlockDevice->waitForDevice(10)
    #1 /usr/share/php/openmediavault/rpc/serviceabstract.inc(528): OMVRpcServiceFileSystemMgmt->{closure}('/tmp/bgstatusIY...', '/tmp/bgoutputrj...')
    #2 /usr/share/openmediavault/engined/rpc/filesystemmgmt.inc(642): OMV\Rpc\ServiceAbstract->execBgProc(Object(Closure), NULL, Object(Closure))
    #3 [internal function]: OMVRpcServiceFileSystemMgmt->create(Array, Array)
    #4 /usr/share/php/openmediavault/rpc/serviceabstract.inc(124): call_user_func_array(Array, Array)
    #5 /usr/share/php/openmediavault/rpc/rpc.inc(86): OMV\Rpc\ServiceAbstract->callMethod('create', Array, Array)
    #6 /usr/sbin/omv-engined(536): OMV\Rpc\Rpc::call('FileSystemMgmt', 'create', Array, Array, 1)
    #7 {main}

    I really wish this worked; a couple of us at the office were really excited to do a Linux file-server deployment... now we're a little less excited because we have to sell the value/benefits of very large SSDs to the decision makers, haha.

  • Thank you for having a look!


    Bcache is really simple to set up: about 10 commands, including directory changes, once you have bcache-tools installed (roughly the sequence sketched after the list below). I used these guides nearly verbatim (logged in as root):

    • Tech-G - After setting up with this guide I thought that maybe I did something wrong. Getting the drives disconnected was a real challenge, so I found a second guide
    • Kernel.org (better setup instructions) - Includes instructions for the "probing initialization failed: Device or resource busy" issue. The setup instructions are nearly identical in this guide, but they seem a bit clearer
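
    For anyone who wants the short version, the guides boil down to roughly the following; the device names are placeholders (/dev/sdb as backing HDD, /dev/sdc as caching SSD) and the wipefs step is destructive:

    apt-get install bcache-tools
    wipefs -a /dev/sdb /dev/sdc                          # remove old signatures (destroys data!)
    make-bcache -B /dev/sdb                              # format the backing device
    make-bcache -C /dev/sdc                              # format the cache device
    bcache-super-show /dev/sdc | grep cset.uuid          # note the cache set UUID
    echo <cset.uuid> > /sys/block/bcache0/bcache/attach  # attach the cache set to the backing device
    mkfs.ext4 /dev/bcache0                               # then mount /dev/bcache0 as usual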

    OMV could see my bcache volume after each setup. It only failed when I attempted to create a shared folder; that's the stage where I got the stack trace from.


    Other setup information:

    • OMV installed on Debian 8.9 (Jessie) with the GNOME desktop
    • Virtual machine running in VirtualBox: 2 cores & 2 GB of RAM
    • Very few packages installed (sudo, bcache, lvm, gparted, maybe one or two more)

    I did not have any issues at all mounting the /dev/bcache0 device in OMV. I used two virtual disks of 2 GB. This is Debian 9 + OMV 4 on a server install, without GNOME, using the bcache-tools from the Debian repos. A full GNOME desktop environment should not be installed alongside OMV.



  • Yeah, that explains why you're thinking about bcache, but I doubt it will help that much since by default it recognizes sequential writes and passes them through. It's a great addition if you rely on anachronistic RAID modes (e.g. a RAID6), since such a RAID shows high sequential performance but sucks totally at random IO. This is where bcache can shine, since it accelerates especially random writes.
    Then bcache acts somewhat similar to an L2ARC cache with ZFS, caching the most accessed data that doesn't fit into the ARC (physical memory dedicated as cache) on SSD(s). In other words: this is something that works fine on servers but not that well with most (home) OMV installations, where a different data usage pattern applies.


    Without changing the disk topology, I fear the only area where bcache could help is improving NAS write performance as long as the data fits onto the SSD (as you see it now with the 300-400 MB/s write speed going into the filesystem buffer in RAM and, once this is exceeded, slowing down to disk write performance). But only if you configure bcache to accelerate sequential writes too. Reads will be bottlenecked by the disks if the data is not already in the cache, so it depends on your usage pattern which route to go (SMB Direct won't help, BTW).


    Thanks for this write-up! I was setting up another Windows 10 box when I decided to test this out by borrowing the 10GbE card from one of my NASes and doing a simple Windows 10 to Windows 10 network drive copy test.


    Writing to Workstation (HDD to SSD): 300 to 400 MB/s
    Writing to Test Box (SSD to HDD): 70 MB/s to 120 MB/s


    This is regardless of SMB Direct and Multichannel, as you rightly predicted. RDMA capability detection is pretty iffy, though.


    Looks like maybe I should just get the fastest HDDs (8 TB SSDs are just way too expensive lol) and/or reconfigure the topology?

  • Looks like maybe I should just get the fastest HDDs (8 TB SSDs are just way too expensive lol) and/or reconfigure the topology?

    Depends on the use case I would say :)


    If you want to tune for no reason (nice-looking benchmark numbers), I would start by investigating why you're only getting 400 MB/s NAS network performance (since most probably this is 'just' tweaking settings here and there without additional costs). If you then want to overcome the HDD performance bottleneck, you can achieve this without losing storage capacity with a RAID0 (a stupid idea of course), or you have to weigh features against needs.


    If you're only after high sequential IO numbers, a RAID6 or RAIDZ2 could do, but they're usually pretty slow when it comes to random IO (this is where bcache could jump in).


    If you have a bunch of disks available, don't care much about the money spent on a setup and also need high random IO (write and read IOPS and low latency), then I would at least use ZFS with mirrored vdevs in a single pool (both random and sequential IO performance scale more or less linearly with the count/type of disks, with throughput then limited by host/bus bottlenecks of course, so don't expect to exceed 600 MB/s sequential NAS performance on average servers).
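
    To illustrate what 'mirrored vdevs in a single pool' means in practice, a minimal sketch with placeholder pool and disk names:

    zpool create tank \
      mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
      mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4
    zpool status tank    # each additional mirror vdev adds both sequential and random IO performance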

  • Depends on the use case I would say :)
    If you want to tune for no reason (nice-looking benchmark numbers), I would start by investigating why you're only getting 400 MB/s NAS network performance (since most probably this is 'just' tweaking settings here and there without additional costs). If you then want to overcome the HDD performance bottleneck, you can achieve this without losing storage capacity with a RAID0 (a stupid idea of course), or you have to weigh features against needs.


    If you're only after high sequential IO numbers, a RAID6 or RAIDZ2 could do, but they're usually pretty slow when it comes to random IO (this is where bcache could jump in).


    If you have a bunch of disks available, don't care much about the money spent on a setup and also need high random IO (write and read IOPS and low latency), then I would at least use ZFS with mirrored vdevs in a single pool (both random and sequential IO performance scale more or less linearly with the count/type of disks, with throughput then limited by host/bus bottlenecks of course, so don't expect to exceed 600 MB/s sequential NAS performance on average servers).


    I thought 400 MB/s sustained is on par when the destination disk is an SSD; how fast should I expect it to be when properly tuned?


    Yes, due to this realisation I am now thinking quite hard about what exactly I want. It's a matter of having upgrade flexibility and recovery flexibility with decent but not perfect data checksumming if I continue with SnapRAID (which will spin up only one HDD at a time), or having real-time checksumming and potentially faster speeds but no upgrade or recovery flexibility if I go with ZFS.


    This will also affect whether I should continue using 4 TB drives or start moving up to 8 TB or even 10 TB drives (which have more disk cache).

  • I thought 400 MB/s sustained is on par when the destination disk is an SSD; how fast should I expect it to be when properly tuned?

    The 400 MB/s in your setup are 'peak SMB performance from client to FS buffer (memory)'. If your hardware is somewhat decent then a lot more should be possible.


    I tested a few months ago with new ZFS filer boxes (dual-socket multi-core Xeons, 256 GB DRAM, LSI SAS HBAs and a bunch of 3.5" SAS HDDs as a 'large zpool made of mirrored vdevs'). The clustered storage was exported via NFS and iSCSI and imported into the vSphere cluster using X540 10GbE NICs on all nodes. The first sequential IO test exceeded 600 MB/s in both directions (I stopped immediately since it was 'fast enough'), and the first random IO test showed IOPS known from consumer SSDs (again, I stopped immediately since 'fast enough'). NFS was faster than iSCSI (a little bit weird), but we decided simply not to even start looking into tuning since it was 'fast enough' and spent the time testing failover scenarios instead (EVERYTHING now runs on the storage cluster, including a bunch of virtualized servers, so the only real performance concern was random IO anyway and... behaviour when one of the cluster nodes goes offline).
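
    If you want to reproduce that kind of 'fast enough' check locally, something like the following fio runs would do; the mount point /tank/test is just a placeholder, and O_DIRECT is left out since older ZFS on Linux doesn't support it (so the ARC is part of the result):

    fio --name=seq --filename=/tank/test/fio.bin --size=4G --rw=write --bs=1M --ioengine=libaio --iodepth=8 --runtime=30 --time_based        # sequential throughput
    fio --name=rand --filename=/tank/test/fio.bin --size=4G --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 --runtime=30 --time_based  # random 4k write IOPS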


    So again: it's a matter of use cases and weighing options. And at least I personally prefer storage separation (not storing all the data in the same place using the same topology, unless a whole business runs on the thing and it's as easy as outlined above thanks to ZFS and hardware having gotten that cheap in the meantime).


    If you're able to define different storage locations for 'cold data' and 'hot data', you can e.g. still use SnapRAID for the cold data (with rather poor sequential and random IO performance) and put the hot data on a RAID10 made out of 2 different (!) fast SSDs. Then put a checksumming CoW filesystem on top (be it btrfs or ZFS) and enjoy really high random as well as sequential IO performance (1000+ MB/s even on SATA ports, as long as you take care that both SATA ports do not have to share bandwidth) while being protected against a single device failure, and you get a certain level of data integrity if you back up this data to another place and run scrubs regularly. Of course there are no self-healing capabilities in this scenario; that would require a zmirror or btrfs' RAID1, but then you would need at least 4 SSDs to get both protection and doubled sequential performance (mdraid's RAID10 mode with just 2 devices is my only remaining use case for mdraid in 2017).
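
    The 2-device RAID10 mentioned above could look roughly like this; the device names are placeholders and the far2 layout is what lets reads be striped across both SSDs:

    mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 /dev/sdd /dev/sde
    mkfs.btrfs /dev/md0                      # checksumming CoW filesystem on top
    # run 'btrfs scrub start <mountpoint>' regularly to catch silent corruption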

  • By default, sequential reads don't get cached anywhere, because why would anyone do that?
    You get the most out of a cache on random writes.


    Thanks, I see. Maybe bcache is simply not for me if I am mostly transferring and playing files to and from the NAS?


    The 400 MB/s in your setup are 'peak SMB performance from client to FS buffer (memory)'. If your hardware is somewhat decent then a lot more should be possible.
    I tested a few months ago with new ZFS filer boxes (dual-socket multi-core Xeons, 256 GB DRAM, LSI SAS HBAs and a bunch of 3.5" SAS HDDs as a 'large zpool made of mirrored vdevs'). The clustered storage was exported via NFS and iSCSI and imported into the vSphere cluster using X540 10GbE NICs on all nodes. The first sequential IO test exceeded 600 MB/s in both directions (I stopped immediately since it was 'fast enough'), and the first random IO test showed IOPS known from consumer SSDs (again, I stopped immediately since 'fast enough'). NFS was faster than iSCSI (a little bit weird), but we decided simply not to even start looking into tuning since it was 'fast enough' and spent the time testing failover scenarios instead (EVERYTHING now runs on the storage cluster, including a bunch of virtualized servers, so the only real performance concern was random IO anyway and... behaviour when one of the cluster nodes goes offline).


    So again: it's a matter of use cases and weighing options. And at least I personally prefer storage separation (not storing all the data in the same place using the same topology, unless a whole business runs on the thing and it's as easy as outlined above thanks to ZFS and hardware having gotten that cheap in the meantime).


    If you're able to define different storage locations for 'cold data' and 'hot data', you can e.g. still use SnapRAID for the cold data (with rather poor sequential and random IO performance) and put the hot data on a RAID10 made out of 2 different (!) fast SSDs. Then put a checksumming CoW filesystem on top (be it btrfs or ZFS) and enjoy really high random as well as sequential IO performance (1000+ MB/s even on SATA ports, as long as you take care that both SATA ports do not have to share bandwidth) while being protected against a single device failure, and you get a certain level of data integrity if you back up this data to another place and run scrubs regularly. Of course there are no self-healing capabilities in this scenario; that would require a zmirror or btrfs' RAID1, but then you would need at least 4 SSDs to get both protection and doubled sequential performance (mdraid's RAID10 mode with just 2 devices is my only remaining use case for mdraid in 2017).

    Wow, my hardware is clearly way weaker than what you are testing; the difference is huge, to say the least!


    I do prefer storage separation too; I found it way easier to work with by letting each machine do what it does best instead of trying to shoehorn everything into one box. I have two NASes (two separate machines, built mostly from "consumer" parts): one main NAS that I work with most of the time ("hot") and a backup NAS for weekly backups ("cold"). SnapRAID is definitely what I will use for the backup NAS, since all I need to do is send the rsnapshot from the main NAS to the backup NAS and then do scrubbing on the backup NAS while I continue to work on the main NAS (roughly the flow sketched below). Storage efficiency with SnapRAID is a bonus too. I wanted SnapRAID for the main NAS as well to really ensure data parity, but with this new revelation I am starting to consider ZFS for it instead (your mirrored vdev idea is a good one!). It is not quite as efficient on storage, though, which has implications for which HDDs I go for next and even whether I need to start considering changing the case.
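
    Just to make that flow concrete, it would be something like this; the host and path names are made up:

    rsync -aH --delete /srv/rsnapshot/ backupnas:/srv/backups/    # push the rsnapshot tree to the backup NAS
    ssh backupnas 'snapraid sync && snapraid scrub -p 10'         # update SnapRAID parity, then scrub 10% of the array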

  • I did not have any issues at all mounting the /dev/bcache0 device in OMV. I used two virtual disks of 2 GB. This is Debian 9 + OMV 4 on a server install, without GNOME, using the bcache-tools from the Debian repos. A full GNOME desktop environment should not be installed alongside OMV.

    You are right: the GUI was messing with my ability to properly mount the bcache volume. I did not think it would matter for a prototype "proof of concept", but apparently it did. Thank you for your help; bcache is now set up.

    Wow, my hardware is clearly way weaker than what you are testing; the difference is huge, to say the least!

    It might take an enterprise-grade system to register performance increases even with random IO. I am seeing a substantial performance hit from a single cached disk on my platform (similar to what I experienced with LVM). The problem seems to happen even when transferring large numbers of small files.


    I need to discuss with my head of IT and decide if we want to:

    • "YOLO" it and hope it works -or-
    • Frankenstein old hardware into a semi-realistic test system (probably going to go with this)

    Thank you, subzero79, for your help!

  • You are right: the GUI was messing with my ability to properly mount the bcache volume. I did not think it would matter for a prototype "proof of concept", but apparently it did. Thank you for your help; bcache is now set up.

    It might take an enterprise-grade system to register performance increases even with random IO. I am seeing a substantial performance hit from a single cached disk on my platform (similar to what I experienced with LVM). The problem seems to happen even when transferring large numbers of small files.
    I need to discuss with my head of IT and decide if we want to:

    • "YOLO" it and hope it works -or-
    • Frankenstein old hardware into a semi-realistic test system (probably going to go with this)

    Thank you, subzero79, for your help!


    I wonder what exact enterprise-grade stuff I will need to get such performance, lol...
