
It is the worst-case scenario under heavy load.
And it is exactly what e-mail and database servers have to cope with under really heavy load.

ZFS performs very well while the synchronous and asynchronous writes, and the reads, are contained in the L1 ARC, or at least the reads in the L2 ARC.
It also performs very well while the synchronous and asynchronous writes occur in bursts that fit in the L1 ARC, with enough time between bursts to flush to the stable storage pool area before the next one arrives.
In summary, ZFS groups and serializes writes in order to hit the disks in the way that is friendliest to rotational disk technology.
But what happens when a heavy multithreaded load of random, small-file synchronous writes is sustained for a long period, exceeding the ZIL VDEV size and the L1 ARC size?
To answer this, you need to understand some of the factors that impact ZFS performance.
You should read the bibliography at the bottom of the article.

We observed during the tests that ZFS stores approximately 1 KB of metadata in the transactional log for each block.
If you use the minimum allowed block size, 512 Bytes, the percentage overhead will be very high and the ZIL VDEV will fill up sooner than you imagined.
If you use the maximum allowed block size, 128 KB, knowing that your files will always be small, the Copy On Write feature will waste throughput.
Enabling compression allows the use of dynamically sized blocks, reducing waste, but it makes the analysis statistical.
We did not execute tests with compression enabled.
http://www.opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf
http://ftp.bruningsystems.com/zfs_ondisk_slides.pdf
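
To visualize this overhead, here is a minimal sketch in Python, assuming the approximately 1 KB of per-block log metadata we observed (the exact figure varies with ZFS version and record contents):

    # Sketch: percentage overhead of ~1 KB per-block transactional log
    # metadata for different fixed record sizes. The 1 KB figure is the
    # approximation observed in our tests, not an exact constant.
    METADATA_PER_BLOCK = 1024  # bytes, approximate

    for recordsize in (512, 4096, 8192, 32768, 131072):
        overhead = METADATA_PER_BLOCK / float(recordsize) * 100
        print("recordsize %6d B -> metadata overhead ~%5.1f%%"
              % (recordsize, overhead))

With 512 Byte records the metadata overhead is around 200%, so the ZIL VDEV fills roughly three times faster than the payload alone suggests; with 128 KB records it is under 1%.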

ZFS serializes synchronous random writes in MOST situations:

We posted at a ZFS technical forum: http://www.nexentastor.org/boards/5/topics/7695
Hello, given that ZFS serializes and groups writes into transaction groups (TXG), even RANDOM synchronous writes of small files, when there is a separate ZIL VDEV (low latency, high IOPS RAM SSD), are actually converted from random sync writes into sequential async disk writes on the stable media pool area, which also reduces fragmentation ( http://wildness.espix.org/index.php?post/2011/06/09/ZFS-Fragmentation-issue-examining-the-ZIL ). It should therefore be expected that, when dimensioning a pool, one could use the disks' sustained SEQUENTIAL throughput to calculate the sustained sequential write IOPS for the stable pool area.
Here, http://dtrace.org/blogs/brendan/2009/06/26/slog-screenshots/ , the author measured such behaviour, but over too short a period to be conclusive.
If the flood of sync writes lasts long enough to fill 50% of the L1 ARC, or the ZIL VDEV fills up completely, ZFS starts I/O throttling and speed becomes limited by the pool disks' sustained SEQUENTIAL write IOPS, depending on the chosen VDEV configuration.

This was mostly true for old ZFS pool versions, but incomplete.
For pool v28, the threshold is 1/8 of physical memory.
Warning: it is not the L1 ARC size.
The ZIL VDEV must have enough free space available, of course.
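
As a worked example, a small Python sketch of the two thresholds described above (the ARC size used for the old rule is an assumed value, for illustration only):

    # Sketch: how much sustained synchronous write data triggers the
    # I/O throttle, per the thresholds described above.
    def throttle_threshold_bytes(phys_mem_bytes, arc_bytes, pool_version):
        if pool_version >= 28:
            return phys_mem_bytes // 8   # 1/8 of physical memory
        return arc_bytes // 2            # 50% of the L1 ARC (old versions)

    GiB = 1024 ** 3
    # Our test machines had 8 GB RAM; assume a 6 GiB L1 ARC for the old rule.
    print(throttle_threshold_bytes(8 * GiB, 6 * GiB, 28) / float(GiB))  # 1.0
    print(throttle_threshold_bytes(8 * GiB, 6 * GiB, 15) / float(GiB))  # 3.0

On our 8 GB test machines, pool v28 starts throttling after about 1 GiB of sustained synchronous writes.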

Our tests, registered below, also show approximately the same behaviour, over a time window big enough to be significant.
MOST OF THE TIME, for bursts of synchronous writes that satisfy the requirements listed ahead, the inference is correct.
In the situations where the ZIL write throttle engages too aggressively, the VDEV random write IOPS specification is what applies.

In this article you will see a synthetic test, using the minimum allowed block size (512 Bytes), in order to maximize the collateral effects of the write throttle action.

The observed effect is that, when the write throttle is fired by sustained small-file synchronous writes, and the ZIL VDEV has very high IOPS (like a RAM SSD) while the stable storage area has low IOPS (SATA 7200 RPM or green 5400 RPM disks), the write throttle can even stop accepting new synchronous write calls for the ZIL VDEV and start writing synchronously to the stable storage area, at the disks' low random write IOPS specification.
https://blogs.oracle.com/partnertech/entry/zfs_write_throttle_observations
Remember that the ZIL is a transaction log, not a cache.
Synchronous writes in BURSTS that occur within the period ( zfs_txg_timeout ) and that can be SERIALIZED inside the maximum allowed time ( zfs_txg_synctime_ms ) DO NOT SUFFER write throttling, and are practically limited by the SEQUENTIAL write throughput of the disk's INTERNAL ELECTRONICS and its INTERFACE TECHNOLOGY, with the rotational speed of the disk having less impact.
That is why we observed during the tests SYNCHRONOUS WRITES being recorded on the stable storage at 840 IOPS on a consumer grade SATA disk.
But these are not all the criteria yet.
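
A sketch of that burst condition follows. The model is our simplification: a burst escapes throttling when it can be flushed sequentially within the TXG sync deadline; zfs_txg_synctime_ms is the real tunable, but its value here is illustrative:

    # Sketch: can a synchronous write burst be serialized to the stable
    # storage pool area within the TXG sync deadline? Simplified model.
    def burst_escapes_throttle(burst_bytes, seq_write_bytes_per_s,
                               zfs_txg_synctime_ms=1000):
        flush_time_ms = burst_bytes / float(seq_write_bytes_per_s) * 1000
        return flush_time_ms <= zfs_txg_synctime_ms

    MiB = 1024 ** 2
    # A consumer SATA disk sustaining ~100 MiB/s sequentially:
    print(burst_escapes_throttle(80 * MiB, 100 * MiB))   # True: flushed in time
    print(burst_escapes_throttle(500 * MiB, 100 * MiB))  # False: throttle engages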

When ZIO decides to bypass the ZIL VDEV and write directly to the stable storage pool area

If the pool has the “logbias” parameter set to “throughput”, then ( immediate_write_sz ) will be equal to ZERO.
If not, ( immediate_write_sz ) will be equal to ( zfs_immediate_write_sz ), default 32 KB.
If the pending write block is greater than ( immediate_write_sz ),
and if the pending write block is less than or equal to the filesystem block size ( zp->z_blksz ), default 128 KB,
and if the pool does not have a separate ZIL VDEV, OR does not have the “logbias” parameter set to “latency”,
THEN it writes directly to the stable storage pool area.

Remark: we did not take into account the dynamic block sizes used with compression, to keep the analysis less complex.
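
The decision above, expressed as a small Python sketch (the tunable names are real; the control flow is our reading of the rules listed above):

    # Sketch: does a pending synchronous write bypass the ZIL VDEV and
    # go directly to the stable storage pool area?
    ZFS_IMMEDIATE_WRITE_SZ = 32 * 1024   # zfs_immediate_write_sz, default 32 KB
    FS_BLOCK_SIZE = 128 * 1024           # zp->z_blksz, default 128 KB

    def writes_directly_to_pool(write_size, logbias, has_separate_zil_vdev):
        immediate_write_sz = 0 if logbias == "throughput" else ZFS_IMMEDIATE_WRITE_SZ
        return (write_size > immediate_write_sz
                and write_size <= FS_BLOCK_SIZE
                and not (has_separate_zil_vdev and logbias == "latency"))

    # A 64 KB sync write, logbias=latency, no separate ZIL VDEV: bypasses.
    print(writes_directly_to_pool(64 * 1024, "latency", False))  # True
    # The same write with a dedicated low-latency ZIL VDEV: goes to the ZIL.
    print(writes_directly_to_pool(64 * 1024, "latency", True))   # False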

I/O load profile of each ZFS pool

You must analyze the I/O load profile of your applications.
Ideally, each different I/O load profile should live in a different pool, configured for that profile.
As a drastic example, a gzip-9 compressed, very wide RAIDZ3 ZFS pool of green disks, meant for backup purposes, should not hold the data of a production e-mail server.
Nor should a ZFS pool configured for a PostgreSQL database server (uncompressed, multiple RAM SSD MIRROR3 logical VDEVs, 8 KB blocks) hold the data of a video streaming server.

Calculating the ZFS sustained write IOPS and VDEV quantity for a RAIDZ* pool

Ex.:
(1500 required IOPS / 80 sustained random write IOPS per logical VDEV) = 19 logical_vdev (rounded up)
Each logical_vdev could be a RAIDZ* or a MIRROR*, built from physical VDEVs.

If you opt to create each logical VDEV as a MIRROR2, then we would need 38 disks building 19 VDEVs.
Remember that, to grow a ZFS pool composed of MIRROR2 logical VDEVs, it is good practice to attach another equal MIRROR2 VDEV.
To grow a ZFS pool composed of RAIDZ3 VDEVs, it is good practice to attach another equal RAIDZ3 VDEV.
A ZFS pool “can” operate, with less performance, if you use asymmetrical logical VDEVs in the same STRIPED pool.
Space and performance will be limited by the lowest-capacity, lowest-performance VDEV.
http://rskjetlein.blogspot.com.br/2009/08/expanding-zfs-pool.html
But you cannot attach a different drive to a MIRROR* or RAIDZ* VDEV without high risk. Watch out for a sector count smaller than that of your previous drive.
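
The arithmetic of the example above, as a small helper (the 1500 and 80 IOPS figures are the example's inputs, not measurements):

    import math

    # Sketch: logical VDEVs and physical disks needed to reach a sustained
    # random write IOPS target, for the worst case where each logical VDEV
    # delivers roughly one drive's worth of random write IOPS.
    def pool_sizing(target_iops, iops_per_logical_vdev, disks_per_vdev):
        vdevs = int(math.ceil(target_iops / float(iops_per_logical_vdev)))
        return vdevs, vdevs * disks_per_vdev

    print(pool_sizing(1500, 80, 2))  # MIRROR2: (19 VDEVs, 38 disks)
    print(pool_sizing(1500, 80, 5))  # 5-disk RAIDZ* VDEVs: (19 VDEVs, 95 disks)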

We are dimensioning the ZFS pool for the worst case, where the I/O load profile does not satisfy the conditions, analyzed above, for being sequentially written to the disks.
IF you can guarantee that each pool, with its different I/O load profile, satisfies all the analyzed conditions, you can use:

Sustained specifications of the drives, so as not to depend on the internal cache, which might not honor write barrier instructions and could lead to data loss.
This is less usual in enterprise grade drives, but evaluate your drive specifications carefully.

We reiterate that it is only viable to use drive_sequential_sustained_write_iops IF you can guarantee that each different I/O load profile, on each ZFS pool configured for it, satisfies ALL the previous conditions.

Or that, at least, you can predictably control WHEN all of those conditions will not be satisfied.
For example, scheduling the database vacuum and analyze cron jobs, or the e-mail server indexing, for a given hour.

Another criterion to take into account when dimensioning is that, to resilver a RAIDZ*, ZFS will have to read ALL the data already written to the pool and rebuild the replacement disk from data and checksums. All pool disks will be kept busy. Depending on the data written to the pool, this could take weeks. So, do not configure too big a RAIDZ* pool if you cannot afford to operate with low performance while resilvering.
Also, disks from the same batch tend to fail at the same time, reducing the MTTDL (Mean Time To Data Loss).
A MIRROR* VDEV resilver is faster, because data is simply copied from the consistent disk to the replacement disk.
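
A back-of-the-envelope sketch of why those resilvers take so long (the effective read rate is an assumption for illustration; real RAIDZ* resilvers are dominated by random reads, far below sequential speed):

    # Sketch: rough estimate of RAIDZ* resilver time, since ZFS has to
    # read all the data already written to the pool to rebuild one disk.
    def resilver_hours(pool_used_bytes, effective_read_bytes_per_s):
        return pool_used_bytes / float(effective_read_bytes_per_s) / 3600

    TiB = 1024 ** 4
    MiB = 1024 ** 2
    # 20 TiB of used data at an optimistic 50 MiB/s effective rate:
    print("%.0f hours" % resilver_hours(20 * TiB, 50 * MiB))  # ~117 hours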

LATENCY

The number of input and output operations per second (IOPS) is important, but not the MOST important.
It is one of the parameters for dimensioning a ZFS pool.
Your data storage server may accomplish many thousands of simultaneous IOPS, given its parallelism. But each individual operation could still be “slow” for the application.
For the final application, what matters most is the LATENCY of each I/O operation.
http://blog.richardelling.com/2012/03/iops-and-latency-are-not-related-hdd.html
So, choose your drives carefully.
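
One way to see why, a short sketch using Little's law (our framing of the relationship; the linked post explores it in depth for HDDs):

    # Sketch: by Little's law, IOPS = outstanding I/Os / latency, so high
    # aggregate IOPS can coexist with high per-operation latency.
    def iops(outstanding_ios, latency_s):
        return outstanding_ios / float(latency_s)

    # 32 outstanding I/Os at 10 ms each: 3200 IOPS, yet every call waits 10 ms.
    print(iops(32, 0.010))    # 3200.0
    # 1 outstanding I/O at 0.1 ms (RAM SSD): 10000 IOPS and low latency too.
    print(iops(1, 0.0001))    # 10000.0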

The tests were executed directly on the host machine, to remove network factors from the analysis.
We used simple Intel i5 machines, with 8 GB RAM and 1 consumer grade SATA disk, to simplify the analysis and observe the effects more easily.

At the end of the article we attached iozone spreadsheets with some graphs that you can reformat.