I've been testing a ZFS-based NAS. There isn't much about ZFS here, and it was new to me. The system we're testing stumbled out of the gate, at least with our very aggressive B4M Fork growing-file workflow, but after much trial and error we discovered a key setting that unlocked performance.
zfs_txg_synctime_ms sets how often the transaction group cache flushes to disk. It defaulted to 3 seconds, but changing it to 1 second made all the difference.
zfs_vdev_max_pending was changed from 10 to 4.
atime, sync, and compression are all disabled.
Disabling sync may not have been a contributing factor; it seemed to make no difference either way.
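For anyone wanting to reproduce this, here's a sketch of how those settings would be applied. The tunable mechanism is an assumption (on Oracle Solaris, kernel tunables persist via /etc/system; ZFS on Linux uses module parameters instead), and the pool name zpool01 is taken from the mount command later in the thread:

```shell
# Kernel tunables (Oracle Solaris style -- persists across reboot; hypothetical syntax check against your platform's docs)
echo 'set zfs:zfs_txg_synctime_ms = 1000' >> /etc/system   # flush txgs every 1s (was 3s here)
echo 'set zfs:zfs_vdev_max_pending = 4'  >> /etc/system    # per-vdev queue depth: 10 -> 4

# Dataset properties (same syntax on any ZFS platform)
zfs set atime=off zpool01
zfs set sync=disabled zpool01        # made little difference in this testing
zfs set compression=off zpool01
```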
We initially set up the vdevs (LUNs in RAID-speak) as 20 mirrored pairs but found a 6x6 Z2 (RAID6 in RAID-speak) to be superior.
The Oracle server is a beast: 16 cores and 256GB of RAM, with four 10Gig ports and more as an option.
There is no time spent striping vdevs, which can take days with RAID. Also, when a drive fails only the data that was present on the drive is rebuilt, not the entire drive as with RAID.
All parity calculation is done in software and, despite 35 streams of PRSQ reads and writes across 36 disks, the CPU was only 17% busy exporting NFS to 33 Mac 10.9 clients.
I am impressed with the reporting available and how little impact it has on the system. Compared to Xsan this thing is an open book, and I could see when it approached its limits.
There isn't much ZFS-based storage out there. I think it's an option with Small Tree, and Nexsan may use it in their NAS offering. I may be wrong here, as searching "zfs" on either's site returns nothing.
So far I'm a zfs fan.
Interesting stuff, thanks for sharing John.
[John Heagy] "We initially set up the vdevs (LUNS in RAID speak) as 20 mirrored pairs but found a 6x6 Z2 (RAID6 in RAID) to be superior."
Would it be possible to share the test results that showed Z2 to be superior?
(Side note: vdevs in "RAID-speak" are usually called RAID volumes, groups, etc. - don't think they're LUNs.)
[John Heagy] "All parity calculation is done in software and, despite 35 streams of PRSQ reads and writes across 36 disks, the CPU was only 17% busy exporting NFS to 33 Mac 10.9 clients. "
Have you noticed a significant difference in CPU utilization between reads (no parity calculations) and writes (parity calculations), and which processes were using the most cycles? Is it possible most of those cycles were used servicing TCP/IP and file system requests rather than parity calcs?
Agreed that there isn't much info on ZFS out there. Yet it's perhaps worth mentioning that ZFS does have significant performance issues and limitations (no dynamic restriping on online vdev expansion, i.e. no performance increase when adding disks to existing zpools), making it less suitable for applications that need to squeeze every ounce of performance out of the drives.
hah, that sounds awesome John!
In my view ZFS is the future.
Have you started to drill down into how the cache is performing? Video is quite a different kettle of fish to serving up small files (obviously), so it can be interesting to see how ARC handles a working-set that is so unpredictable/large.
I also see you have 256GB RAM! NICE. Have you experimented with L2ARC or ZIL?
+1 on disabling atime, it's one of the first things I would have done too. Also, we've been staying away from dedup.
One of my favourite aspects of ZFS is that you can quite easily import the pool elsewhere if you suffer from HBA failure (for example). We implemented basic support in the indiestor command line for ZFS, although we're playing with ZOL as we're Linux based.
Also a ZFS fan :-)
[alex gardiner] "Have you started to drill down into how the cache is performing?"
Other than seeing only 27GB free of 256GB, not really. It does like RAM!
[alex gardiner] "Have you experimented with L2ARC or ZIL?"
We have both. The ZIL made a huge difference early on; it was nearly unusable without it. The L2ARC did not make any difference until I moved to playback-only testing. Since I was using the same media over and over, it started caching the playback data. The drives went from 300 reads/sec to less than 50, but the L2ARC went to 600! During normal edit activities I don't see that making much difference. It would help during an After Effects session where one is using the same files all day; I'd imagine one's entire project would eventually move to the cache. Of course there's little need for source-file performance while using After Effects, but every little bit helps I suppose.
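For reference, log (ZIL/SLOG) and cache (L2ARC) devices are attached to an existing pool like this; the device names are hypothetical placeholders:

```shell
# Dedicated log device (SLOG) -- absorbs synchronous writes; small is fine (8GB here)
zpool add zpool01 log ssd0

# L2ARC read cache -- larger flash device, warms up as the same data is re-read
zpool add zpool01 cache flash0

# Watch per-device activity to see reads shift from the spindles to the cache device
zpool iostat -v zpool01 1
```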
[Alex Gerulaitis] "Would it be possible to share the test results that showed Z2 to be superior?"
The Z2 6x6 was not only faster with fewer disks (36 vs 40), it had a 63% yield versus the mirrors' 50%.
The 20-mirrored-pair zpool did 14 ingests and 14 playbacks, all PRSQ via B4M's Fork running on 10.9 Xserves. The 40 streams were really pushing it, as the playback buffers were very active trying to stay ahead of the storage.
The 36-disk 6x6 Z2 zpool did 16 ingests and 19 playbacks with calm playback buffers, indicating it still had some headroom. I normally like to keep piling on until it drops, but by that time I had exhausted all available resources.
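A quick back-of-envelope check on the yield figures above (raw parity math only; John's reported 63% is a bit below the theoretical RAIDZ2 number, presumably reflecting metadata/allocation overhead):

```shell
# 20 mirrored pairs: 1 usable disk per 2-disk mirror
awk 'BEGIN { printf "20 mirrored pairs: %.0f%% usable\n", 100 * 20 / 40 }'

# 6x6 RAIDZ2: 4 data + 2 parity disks per vdev, 24 data disks of 36
awk 'BEGIN { printf "6x6 RAIDZ2: %.1f%% usable\n", 100 * 24 / 36 }'
```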
[Alex Gerulaitis] "(Side note: vdevs in "RAID-speak" are usually called RAID volumes, groups, etc. - don't think they're LUNs.)"
Below is what I'm mainly basing my terminology on. Xsan/StorNext also refers to them as stripe groups. These are basically analogous to vdevs, although Xsan supports LUN groups as Storage Pools as well, and then finally Volumes, or Filesystems in StorNext-speak.
[Alex Gerulaitis] "Have you noticed significant difference in CPU utilization between reads (no parity calculations) and writes"
Less than 10%
[Alex Gerulaitis] "ZFS does have significant performance issues and limitations (no dynamic restriping in online vdev expansion - i.e. no performance increase when adding disks to existing zpools)"
No vdev expansion, but one can easily add matching vdevs, which I believe is more valuable performance-wise, though it doesn't increase disk yield the way vdev expansion could.
I'm suspicious of comparing the many open-source ZFS-based systems to 100% kosher Oracle ZFS running on Oracle hardware. The system we have is pure, uncut Oracle ZFS!
John, I was wondering if you could comment on the level of performance increase after you disabled compression. I would expect a slight performance increase with lzjb compression enabled on your ZFS filesystem. Unfortunately, Oracle ZFS does not support lz4 compression, which is even better and faster.
Turning off atime will most certainly bump up the performance. Disabling sync should only affect writes.
Thanks for sharing your findings. ZFS is awesome!
We had compression off from day one, so we never tested with compression. The video files we are recording are already compressed as ProRes, and we've found that compressing already-compressed media yields no space savings.
Btrfs is still under heavy development, so it's not comparable to ZFS for production yet, but it is getting closer to stable. The Btrfs single, raid0, 1, and 10 code has been in the mainline Linux kernel since 2008. The raid5/6 code went in almost a year ago and still needs some work.
File system creation takes less than a second. There is no parity initialization. Additional disks of any size and number can be added online, and the device add completes in less than a second. Online restripe is supported but not required; upon a new device add, the allocator simply writes across all present drives. There's preliminary work on supporting object-level raid, though I'm not sure if the idea is per-file granularity or only at the subvolume level. Devices can also be replaced online, which causes their data to be migrated to the replacement.
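The operations described above map to btrfs commands roughly as follows (device names and mount point are hypothetical):

```shell
# Create a raid10 filesystem across four devices -- near-instant, no parity init
mkfs.btrfs -d raid10 -m raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mount /dev/sdb /mnt/pool

# Add a disk of any size while the filesystem is online
btrfs device add /dev/sdf /mnt/pool

# Optional online restripe across all devices after the add
btrfs balance start /mnt/pool

# Replace a device online, migrating its data to the replacement
btrfs replace start /dev/sdc /dev/sdg /mnt/pool
```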
Nice, useful information on ZFS tuning.
We are building a similar setup i.e. ZFS + XSAN and wanted to know a few more details.
Did you all do any OS level tunings like increasing TCP Buffers, Disk nr_requests, read_ahead etc.?
How many SSDs are you using for ZIL and Cache?
Did you all partition the same SSDs for ZIL and Cache or did you all allocate different disks for each?
Also are your clients on 10G NICs?
Were there any OS tunings on the client side?
Could you give some idea about the client configuration too?
Thanks and Regards,
No TCP tuning on the client.
The ZIL is very small, only 8GB; the cache is much larger and flash-based.
Turns out I was wrong about needing either. While we are using the read cache it's showing no real benefit unless you're using the same files over and over. We ended up disabling sync which means the ZIL is not being used either.
The vast majority of our testing was at 1Gig, and we threw enough traffic to max out the single 10Gig link from the NAS head. We got a sustained 1,222MB/sec from 48 streams and various file copies.
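That sustained figure is essentially the wire speed of a single 10GbE link; a quick sanity check (assuming decimal megabytes, i.e. 1 MB = 10^6 bytes):

```shell
# 10 Gbit/s -> MB/s, and observed throughput as a fraction of line rate
awk 'BEGIN {
  line_rate = 10e9 / 8 / 1e6                      # 1250 MB/s
  printf "%.0f MB/s line rate; 1222 MB/s = %.0f%% of wire speed\n",
         line_rate, 100 * 1222 / line_rate
}'
```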
We deployed a 10Gig client and saw brief processes pull 650MB/sec from that single client. Sustained file processing managed 350MB/sec read/write.
Clients are a mixture of Xserves running 10.9.0 (ingest) and Xserves running 10.8.5 (playback).
This is the mount cmd used:
mount -o resvport,locallocks,intr,soft,proto=tcp -t nfs 10.205.88.21:/zpool01/ingest /Volumes/ingest
Tried NFSv4 but it was crap.
You mentioned zfs + Xsan... separate volumes I assume?