If an IO driver is implemented properly then it will batch up requests for
the controller, and get IRQ-notified when a (sub-)batch of buffers has
completed.

If there's any spinning done then it should be NAPI-alike polling: a
single "is stuff completed" polling pass per new block of work submitted,
to opportunistically interleave completion with submission work.
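
To make the above concrete, here is a rough userspace sketch of the
idea, not real driver code. Everything in it (struct sim_ctrl,
poll_completions(), submit_batch(), submitter_idle(), hw_complete()) is a
made-up name for illustration and not a kernel API; the "hardware" is
just a counter. The point is only the shape: one bounded polling pass per
submitted batch, and the completion IRQ gets armed only once the
submitter runs out of work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a controller queue; none of this is a kernel API. */
struct sim_ctrl {
	unsigned submitted;  /* requests handed to the controller */
	unsigned finished;   /* requests the hardware has completed */
	unsigned reaped;     /* completions the driver has processed */
	unsigned irqs;       /* completion IRQs actually taken */
	bool irq_armed;      /* completion IRQ currently unmasked */
};

/* One bounded pass: reap what is already done, never wait for more. */
static unsigned poll_completions(struct sim_ctrl *c, unsigned budget)
{
	unsigned n = 0;

	while (n < budget && c->reaped < c->finished) {
		c->reaped++;
		n++;
	}
	return n;
}

/* Submit a batch, then do a single "is stuff completed" pass. */
static void submit_batch(struct sim_ctrl *c, unsigned nr, unsigned budget)
{
	c->irq_armed = false;   /* we are busy submitting: polling covers us */
	c->submitted += nr;
	poll_completions(c, budget);
}

/* Out of work: reap what is done, arm the IRQ only if requests remain. */
static void submitter_idle(struct sim_ctrl *c)
{
	poll_completions(c, c->finished - c->reaped);
	c->irq_armed = c->reaped < c->submitted;
}

/* Simulated hardware progress: completes up to n outstanding requests. */
static void hw_complete(struct sim_ctrl *c, unsigned n)
{
	unsigned outstanding = c->submitted - c->finished;

	if (n > outstanding)
		n = outstanding;
	c->finished += n;
	assert(c->finished <= c->submitted);

	if (n && c->irq_armed) {
		/* IRQ path: count it, mask it, reap everything finished. */
		c->irqs++;
		c->irq_armed = false;
		poll_completions(c, c->finished - c->reaped);
	}
}
```

In this toy model, a submitter that keeps issuing batches while the
hardware keeps up reaps every completion from the submission path and
takes no completion IRQ at all. An IRQ is only taken when the submitter
went idle with requests still in flight.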

I don't see where active spinning would improve performance compared to a
NAPI-alike technique. Your numbers obviously show a speedup we'd like to
have, I'm just wondering whether the same speedup (or even more) could be
implemented via:

... which would almost never result in IRQ driven completion when we are
close to CPU-bound and while not yet saturating the IO controller's
capacity.

The spinning approach you add has the disadvantage of actively wasting CPU
time, which could be used to run other tasks. In general it's much better
to make sure the completion IRQs are rate-limited and to just schedule.
This (combined with a metric ton of fine details) is what the networking
code does in essence, and it has no trouble reaching very high throughput.