If your ABI provides, say, 8 registers for parameter passing
(as do, for instance, the MIPS 64-bit and n32 ABIs), you can have
functions with eight register-based parameters, rather than just
two when each parameter requires four registers.
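
To make the arithmetic concrete, here is a minimal C sketch. The
helper name params_in_registers is made up for illustration. It
assumes every parameter is the same size and ignores the alignment
and register-pairing rules that real ABIs impose.

```c
#include <assert.h>

/* Hypothetical helper: how many equally sized parameters fit in the
   argument registers before the rest spill to the stack. */
int params_in_registers(int arg_regs, int regs_per_param)
{
    assert(arg_regs >= 0 && regs_per_param > 0);
    return arg_regs / regs_per_param;
}
```

With 8 argument registers, params_in_registers(8, 4) gives 2 and
params_in_registers(8, 1) gives 8.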

Are any of your benchmarks structured as a realistic program, with
function calls going back and forth among separately compiled external
modules (possibly also loaded separately at run-time)?

How much of the leverage in the benchmarks comes from accessing only
those parts of the reference that are relevant to the computation
(such as the principal pointer to the underlying data), and so
avoiding situations where you have to load the whole thing?
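
As a rough illustration of the difference I mean, consider a
hypothetical four-word reference in C. The struct layout and the
names fat_ref, sum_whole and sum_parts are my own assumptions, not
anything taken from your implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fat reference: four words on a typical 64-bit ABI. */
struct fat_ref {
    const int32_t *data;   /* principal pointer to the underlying data */
    size_t length;
    size_t capacity;
    void *owner;
};

/* Receives the whole reference by value, so all four words have to
   reach the callee, however the ABI chooses to pass them. */
int64_t sum_whole(struct fat_ref r)
{
    assert(r.data != NULL || r.length == 0);
    int64_t total = 0;
    for (size_t i = 0; i < r.length; i++)
        total += r.data[i];
    return total;
}

/* Receives only the two words the computation actually uses. */
int64_t sum_parts(const int32_t *data, size_t length)
{
    assert(data != NULL || length == 0);
    int64_t total = 0;
    for (size_t i = 0; i < length; i++)
        total += data[i];
    return total;
}
```

Both functions compute the same result. A benchmark built mostly
from code shaped like sum_parts never pays for the other two words,
which may not be representative of code that has to pass the whole
reference across a module boundary.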

In real-world programs, you do end up with copies of references to heap
data. Even if the heap itself doesn't have multiple references to an
object (i.e. most objects in the heap have a unique parent object in
the heap), the program can still be juggling lots of copies in the
registers, and throughout the call stack, whose frames may be spread
among external functions.

Some copies may not even be semantic references. If a function needs
to save some callee-saved registers to backing memory and re-load
them, a copy is made twice, even though the backing memory has no
semantic role other than holding those registers. Fat references will
create more register pressure of this type.

Say you need a whole bunch of registers to load some fat references
from memory. Oops, you've run out of the call-clobbered (caller-saved)
registers and need to use callee-saved ones. So now your function has
to save those to memory and then restore them before returning. You've
just copied the parent frame's data twice, even though you aren't
working with that data.
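
A toy cost model in C shows how quickly this adds up. It is only a
back-of-the-envelope sketch: the function names and the scratch
register count in the example are hypothetical, and a real register
allocator is far more subtle than this.

```c
#include <assert.h>

/* Toy model: words that do not fit in the call-clobbered (scratch)
   registers must go into callee-saved registers. */
int callee_saved_needed(int live_words, int scratch_regs)
{
    assert(live_words >= 0 && scratch_regs >= 0);
    return live_words > scratch_regs ? live_words - scratch_regs : 0;
}

/* Each callee-saved register used costs one save on entry and one
   restore before returning. */
int extra_memory_ops(int live_words, int scratch_regs)
{
    return 2 * callee_saved_needed(live_words, scratch_regs);
}
```

Assuming 9 scratch registers, three live four-word references
(12 words) need 3 callee-saved registers and so 6 extra memory
operations, while three one-word pointers need none.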