Using strncpy() to copy strings is suboptimal because strncpy() pads
the remainder of the destination buffer with unnecessary null bytes.
Use snprintf() instead of strncpy(). An additional advantage of
snprintf() is that it guarantees that the output string is
'\0'-terminated.

This patch is an improvement for commit 32e31c8c5f7b ("Fix string copy
compilation warnings").

When fio reports the number of bytes written or read, it rounds the
number to units of MiB or KiB so that it fits within a limited number
of digits. This results in rounding errors in the reported byte counts
and sometimes causes failures of test case #17 in test-zbd-support,
which reports incorrect total I/O bytes when both the write byte count
and the read byte count are rounded up.

To avoid the rounding error, increase the number of significant digits
from the default value of 4 to 10 to keep precision. For example, a
number previously reported as "256MiB" will be reported as
"267911168B" with this change.

If a large request arrives when pool->next_non_full points to free
space that is insufficient to satisfy the request, pool->next_non_full
is inappropriately advanced past that free space when the free space
is followed by fully allocated space. The free space originally
pointed to by pool->next_non_full then becomes unavailable unless a
subsequent sfree() call frees allocated space above it. Resolve this
issue by advancing pool->next_non_full only outside the search loop,
and only while it points to fully allocated space.

If a block size percentage is specified as 0 in bssplit, the block
size it defines is not ignored by the loop in get_next_buflen(). In
particular, if the first (smallest) block size specified has a 0
percentage, the loop is exited and that block size is used as the next
I/O size, resulting in behavior equivalent to specifying 100%. E.g.
using --bssplit=64k/0,1024k/100 results in 100% of issued I/Os being
64KB instead of 1MB.

Fix this by ignoring bssplit entries that have a 0 percentage. This is
safe as the initialization of the bssplit array ensures that the sum
of all percentages is always 100, guaranteeing that a block size will
be chosen for the next I/O size.

The io_uring, libaio, and posixaio ioengines actually carry out
synchronous trim operations, but latency timestamps are recorded as if
the trims were issued asynchronously. This patch fixes how timestamps
are recorded for trim operations issued by these ioengines.

librbd will run up to 20% faster under small I/O workloads when fio is
linked against tcmalloc. By automatically linking against tcmalloc,
fio avoids the need to use LD_PRELOAD to pull in a faster memory
management toolset.

fio may execute concurrently with a block device revalidation by the
kernel. Device revalidation may cause the block device capacity to
shrink (device changed) or even drop to 0 in case of revalidation
failure. In such a case, the BLKREPORTZONE ioctl executed from
read_zone_info() may report success with an empty zone report when the
start sector for the report is above the new capacity of the device.
This leads to an infinite loop inside parse_zone_info(), and the fio
run never terminates.

Fix this problem by returning -EIO from read_zone_info() thus
avoiding the infinite loop in parse_zone_info(). This change does not
affect the normal case with a stable device as read_zone_info() is
always called with a valid start sector that does not result in an
empty zone report.

The Linux specific transparent hugepage memory advisory has potentially
significant implications for how the memory management behaves. If the
platform supports it, add a new mmap ioengine specific option that advises
HUGEPAGE on an mmap'ed range. The option availability is detected during
configure. If the option is set, fio can test THP when used with private
anonymous memory (i.e. mmap /dev/zero).

When fio runs multiple jobs on servers, it is possible for the "All
clients" output to appear in the middle of output for the individual
jobs. This patch puts the "All clients" output into a separate buffer
and displays it after the output for all the individual jobs.

With zonemode=zbd, for a multi-job workload using asynchronous I/O
engines with a deep I/O queue depth setting, a job that is building a
batch of asynchronous I/Os to submit may end up waiting for an I/O
target zone lock held by another job that is also preparing a batch.
For small devices with few zones and/or a large number of jobs, such
prepare phase zone lock contention can be frequent enough to end up in a
situation where all jobs are waiting for zone locks held by other jobs
while no I/O is being executed (and so no zone is ever unlocked).

Avoid this situation by using pthread_mutex_trylock() instead of
pthread_mutex_lock() and by calling io_u_quiesce() to execute queued
I/O units if locking fails. pthread_mutex_lock() is then called to
lock the desired target zone. The execution of io_u_quiesce() forces
I/O execution progress and so zones to be unlocked, avoiding job
deadlock.

For a zoned block device with zonemode=zbd, the lock on the target zone
of an I/O is held from the time the I/O is prepared with
zbd_adjust_block() execution in fill_io_u() until the I/O is queued in
td_io_queue(). For sync I/O engines, this means that the target zone
of an I/O operation is locked throughout the lifetime of an I/O unit,
resulting in the serialization of write request preparation and
execution, as well as serialization of write operations and reset zone
operations for a zone, avoiding error inducing reordering.

However, in the case of an async I/O engine, the engine ->commit()
method falls outside of the zone lock serialization for all I/O units
that will be issued by the method execution. This results in potential
reordering of write requests during issuing, as well as simultaneous
queueing of write requests and zone reset operations resulting in
unaligned write errors.

Fix this by refining the control over zone locking and unlocking.
Locking of an I/O target zone is unchanged and done in
zbd_adjust_block(), but the I/O callback function zbd_post_submit()
which updates a zone write pointer and unlocks the zone is split into
two different callbacks zbd_queue_io() and zbd_put_io().
zbd_queue_io() updates the zone write pointer for write operations and
unlocks the target zone only if the I/O operation was not queued or if
the I/O operation completed during the execution of the engine
->queue() method (e.g. a sync I/O engine is being used). The execution
of this I/O callback is done right after executing the I/O engine
->queue() method. The zbd_put_io() callback is used to unlock an I/O
target zone after completion of an I/O from within the put_io_u()
function.

To simplify the code, the helper functions zbd_queue_io_u() and
zbd_put_io_u(), which respectively call an io_u's zbd_queue_io() and
zbd_put_io() callbacks, are introduced. These helper functions are
conditionally defined only if CONFIG_LINUX_BLKZONED is set.

The test-zbd-support script fails to execute for partition devices with
the error message "Open /dev/sdX1 failed (No such file or directory)"
when libzbc tools are used by the script to open the specified
partition device. This is due to libzbc also opening the partition
holder block device file, which when closed triggers a partition table
revalidation through the BLKRRPART ioctl, causing the partition device
files to be deleted and recreated by udev.

To avoid the failure, default to using blkzone for zone report and
reset if it is supported by the system (util-linux v2.30 and higher),
as this tool does not open the holder device and so avoids the
problem. To obtain the device maximum number of open zones, which is
not advertised by blkzone, use sg_inq for SCSI devices and use the
default maximum of 128 for other device types (e.g. null_blk devices
in zone mode).

Removal of the message "No I/O performed" when fio does not execute
any I/O broke zbd tests 2 and 3, as this message was looked for to
test for success. Fix this by looking for a "Run status" line starting
with "WRITE:" for test 2 and "READ:" for test 3. The run status lines
are not printed when no I/O is performed. Testing for the absence of
these strings thus makes it easy to check whether I/Os were executed.

To allow the t/zbd/test-zbd-support test script to run correctly on
partitions of zoned block devices, fix access to the device properties
through sysfs by referencing the sysfs directory of the holder block
device. Doing so, the "zoned", "logical_block_size" and "mq"
attributes can be correctly accessed.

Getting and setting values in SCSI commands and descriptors, which are
big endian, in the SG driver could use some cleanup. This patch
simplifies the SG driver code by introducing a set of accessor
functions for reading raw big endian values from SCSI buffers and
another set for storing local values as big endian byte sequences.

The patch also adds some missing endianness conversion macros
in os.h.

Some SCSI devices (very large disks or SMR zoned disks in particular)
do not support the READ CAPACITY(10) command and only reply
successfully to the READ CAPACITY(16) command. This patch forces
execution of READ CAPACITY(16) if READ CAPACITY(10) fails with
CHECK CONDITION.

For fio to correctly handle the zonemode=zbd mode with partitions of
zoned block devices, the partition block device file must be identified
as a zoned disk. However, partition block device files do not have
a zoned sysfs file. This patch allows a correct identification of the
device file zone model by accessing the sysfs "zoned" file of the
holder disk for partition devices.

Change the get_zbd_model() function to resolve the symbolic link to
the sysfs path to obtain the canonical sysfs path. The canonical sysfs
path of a partition device includes both the holder device name and
the partition device name. If the given device is a partition device,
cut the partition device name from the canonical sysfs path to access
the "zoned" file in the holder device sysfs path.

In some cases, the 100th percentile latency is not correctly identified
because of problems with double precision floating point arithmetic.
Use long doubles instead in the while loop condition to reduce the
likelihood of encountering this problem.

Also, print an error message when latency percentiles are not
successfully identified.

The error -5 is a Z_BUF_ERROR; references are available at
https://zlib.net/zlib_how.html and https://www.zlib.net/manual.html.
It seems that when decompressing the buffer, if the buffer chunk is
the same size as the remaining data in the buffer, the Z_BUF_ERROR can
safely be ignored. So one idea is to ignore these safe errors, noting
the zlib references:

"inflate() can also return Z_STREAM_ERROR, which should not be possible
here, but could be checked for as noted above for def(). Z_BUF_ERROR
does not need to be checked for here, for the same reasons noted for
def(). Z_STREAM_END will be checked for later.

The way we tell that deflate() has no more output is by seeing that it
did not fill the output buffer, leaving avail_out greater than zero.
However suppose that deflate() has no more output, but just so happened
to exactly fill the output buffer! avail_out is zero, and we can't tell
that deflate() has done all it can. As far as we know, deflate() has
more output for us. So we call it again. But now deflate() produces no
output at all, and avail_out remains unchanged as CHUNK. That deflate()
call wasn't able to do anything, either consume input or produce output,
and so it returns Z_BUF_ERROR. (See, I told you I'd cover this later.)
However this is not a problem at all. Now we finally have the desired
indication that deflate() is really done, and so we drop out of the
inner loop to provide more input to deflate()."