Otherwise, in case of a crash, we might destroy data that isn't
even touched by the guest request.

The failure scenarios are either that the cluster is leaked, in which
case the old version of the data is still present, or that the cluster
is orphaned because the L2 entry is written, in which case the old
version of the data is present.

Hm, how does the latter case work? Or rather, what do you mean by "orphaned"?

Are you referring to a scenario where the cluster is partially written
because the data is present in the write cache and the write cache isn't
flushed on power failure?

The case I'm referring to is a COW. So let's assume a partial write to
an unallocated cluster; we then need to do a COW in pre/postfill. Then
we do a normal write and link the new cluster in the L2 table.
Assume that the write to the L2 table is already on the disk, but the
pre/postfill data isn't yet. At this point we are in a bad state,
because if we crash now we have lost the data that should have been
copied from the backing file.
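The ordering requirement described here can be sketched roughly as follows. This is a hypothetical illustration, not the actual QED code; `cow_allocate_cluster` and its parameters are made up for the example. The point is that the cluster contents (prefill + guest data + postfill) must be made durable before the L2 entry that links the cluster reaches the disk:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch of crash-safe COW allocation ordering.
 * If the L2 update hit the disk before the COW data and we crashed,
 * the L2 entry would point at a cluster whose backing-file data was
 * never actually copied. */
static int cow_allocate_cluster(int fd,
                                off_t cluster_off,  /* new data cluster */
                                const uint8_t *data, size_t len,
                                off_t l2_entry_off) /* location of L2 entry */
{
    /* Step 1: write prefill + guest data + postfill into the new cluster. */
    if (pwrite(fd, data, len, cluster_off) != (ssize_t)len) {
        return -1;
    }

    /* Step 2: sync so the cluster contents are durable before we link it.
     * This is the sync being discussed in this thread. */
    if (fsync(fd) != 0) {
        return -1;
    }

    /* Step 3: only now publish the cluster by writing the L2 entry. */
    uint64_t entry = (uint64_t)cluster_off;
    if (pwrite(fd, &entry, sizeof(entry), l2_entry_off) != sizeof(entry)) {
        return -1;
    }
    return 0;
}
```

Without the fsync in step 2, nothing stops the disk (or the host page cache) from reordering step 3 ahead of step 1, which is exactly the lost-COW-data state described above.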

In this case QED_F_NEED_CHECK is set and the invalid cluster offset
should be reset to zero on open.
However, I think we can get into a state where the pre/postfill data
isn't on the disk yet but another allocation has increased the file
size, making the unwritten cluster "valid". This fools the consistency
check into thinking the data cluster (which was never written to on
disk) is valid.
Will think about this more tonight.

It's fairly simple to add a sync to this path. It's probably worth
checking the prefill/postfill for zeros and avoiding the write/sync if
that's the case. That should optimize the common cases of allocating
new space within a file.
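The zero check suggested here could look something like the sketch below. `buf_is_zero` is a hypothetical stand-in (QEMU has a similar helper for this purpose): if the pre/postfill data read from the backing file is all zeros, a freshly allocated cluster already reads back as zeros, so the COW write and the sync that orders it can be skipped:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return 1 if the buffer is entirely zero.
 * When the prefill/postfill region is all zeros, the COW write (and
 * the accompanying sync) can be skipped, since newly allocated space
 * in the image file reads as zeros anyway. */
static int buf_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

A real implementation would typically vectorize this or compare word-sized chunks, but the byte loop shows the idea.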

My intuition is that we can avoid the sync entirely but we'll need to
think about it further.