I wonder what exact enterprise-grade stuff I will need to get such performance lol...
Well, the following was done on toy-grade hardware (a dual-core ARM SoC with a consumer-grade M.2 SSD connected to one of the SoC's native SATA ports and another consumer-grade 2.5" SSD attached to a Marvell 88SE9215 SATA controller). These are the results in kB/s from an 'iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2' run:
                                                          random    random
        kB  reclen    write  rewrite     read   reread      read     write
    102400       4    65688    80881   110730   118944     37860     77260
    102400      16   161094   185891   241206   257926    104111    167703
    102400     512   281322   280213   324536   364836    348267    289555
    102400    1024   285398   293523   552984   569205    542326    299000
    102400   16384   264123   286763   744096   761679    743206    312469
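For reference, here is the same iozone invocation with the flags annotated (comments only, nothing changed):

# -e   include fsync/fflush times in the measurements
# -I   use O_DIRECT where possible (bypass the page cache)
# -a   auto mode, restricted here by the -s/-r/-i selections below
# -s   size of the test file (100 MB)
# -r   record sizes to test
# -i   tests to run: 0 = write/rewrite, 1 = read/reread, 2 = random read/write
iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2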
This is a RAID10 made out of just two devices, combining the redundancy of mirroring with the performance of striping (mdadm --create /dev/md16 --level=10 --metadata=0.90 --raid-devices=2 --layout=f2 /dev/sda1 /dev/sdb1).
Up to 750 MB/s sequential reads and close to 300 MB/s writes. Reads were bottlenecked by the one SSD sitting behind the Marvell controller (its throughput limit of somewhat below 400 MB/s is what caps the total at 750 MB/s), writes were bottlenecked by the SSDs themselves (cheap consumer crap). And the random IO numbers could be way better with a smaller RAID chunk size, since I just used the defaults (see the sketch after the mdstat output):
root@clearfogpro:/mnt/md16# cat /proc/mdstat
Personalities : [raid10]
md16 : active raid10 sdb1[1] sda1[0]
      117210112 blocks 512K chunks 2 far-copies [2/2] [UU]
      [=========>...........]  resync = 49.5% (58067200/117210112) finish=4.7min speed=205614K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk
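If you want to experiment with the chunk size yourself, something like this should do it (just a sketch: 64K is an arbitrary pick, device names as above, and --create of course destroys the array's contents):

# recreate the array with a smaller chunk -- WIPES all data on it!
mdadm --stop /dev/md16
mdadm --create /dev/md16 --level=10 --metadata=0.90 --raid-devices=2 \
      --layout=f2 --chunk=64 /dev/sda1 /dev/sdb1
mdadm --detail /dev/md16    # should now report 'Chunk Size : 64K' and 'Layout : far=2'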
BTW: I measured again while the resync was still running and still got 100+/500+ MB/s write/read and almost 75% of the random IO numbers. With any x64 box, two SATA 3.0 ports that don't have to share bandwidth, and two SSDs known to saturate the SATA 3.0 throughput limit (550+ MB/s), we would be talking about 1000+ MB/s in both read and write direction, and with a smaller RAID chunk size also about impressive random IO numbers.
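If you want to verify which member is the bottleneck, a quick read-only per-device check does the job (sizes are arbitrary, device names as above):

# raw sequential read from each array member, bypassing the page cache
dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct
dd if=/dev/sdb of=/dev/null bs=1M count=1024 iflag=direct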
That's what I meant above: storage separation. Put all the stuff that needs high performance on devices with an appropriate topology (e.g. two Samsung 850 Pro SSDs in such a RAID10, or even three of them, since mdraid's far layout works with odd device counts too). Or when it's about huge amounts of data, simply throw in a few more disks and use an appropriate storage topology: mirrored vdevs in one large zpool (see the sketch below).
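A minimal sketch of the 'mirrored vdevs in one large zpool' idea (pool name and device names are just placeholders):

# ZFS stripes across vdevs, so two mirror pairs read/write in parallel
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
zpool status tank
# capacity and performance grow by simply adding another mirror pair later:
zpool add tank mirror /dev/sde /dev/sdf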
Somewhat decent x64 boxes have plenty of IO bandwidth, and with a bleeding-edge ZoL version you don't need that much DRAM any more and can also get away with somewhat weak but energy-efficient CPUs as long as they support QAT, Intel's QuickAssist Technology -- see the performance section here: https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.7.0
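As I understand those 0.7.0 release notes, the QAT acceleration there covers gzip compression, so on such a CPU you could afford heavier compression on a dataset (dataset name is a placeholder):

# with gzip offloaded to QAT, heavier compression costs little CPU
zfs set compression=gzip tank/data
zfs get compression,compressratio tank/data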
Boards like Gigabyte's MA10-ST0 or Supermicro's A2SDi-H-TP4F-O rely on the 16-core Denverton Atom C3958 with QAT support built into the SoC, allow for a sufficient amount of ECC memory, and provide a couple of 10GbE ports plus 12 to 16 SATA ports.