but with source and destination exchanged, we'd have a quick zero-extend of the inputs, fetching only what you need from memory.

I'd like to quote the docs, but copy & paste is "forbidden by DRM", and I'm too lazy to figure out how to circumvent it. Anyhoo, they actually mention that you can use punpck* for this purpose with a source operand of 0.
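For reference, here is a rough sketch of that zero-source trick using the SSE2 intrinsics from `<emmintrin.h>`. The helper name `zero_extend_u8_to_u16` is made up for illustration, and I'm assuming an x86 target with SSE2. `_mm_unpacklo_epi8` maps to punpcklbw: it interleaves the low bytes of its first operand with the low bytes of its second, so a second operand of all zeros puts a zero high byte after each data byte.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: widen 8 unsigned bytes to 8 unsigned 16-bit words. */
static void zero_extend_u8_to_u16(const uint8_t *src, uint16_t *dst)
{
    /* Load 8 bytes into the low half of the register; the upper half is zeroed. */
    __m128i bytes = _mm_loadl_epi64((const __m128i *)src);
    __m128i zero  = _mm_setzero_si128();

    /* punpcklbw: result is b0, 0, b1, 0, ... so each byte becomes a little-endian word. */
    __m128i words = _mm_unpacklo_epi8(bytes, zero);

    /* Unaligned store of the 8 resulting 16-bit words. */
    _mm_storeu_si128((__m128i *)dst, words);
}
```

Note that the data still has to be loaded into a register first and the zero register is the "source" of the unpack, which is exactly the operand order being complained about here.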

This greatly benefits applications which store data in registers and zero-extend results before writing them to memory. By writing to the same memory location, (lossy) data compression of almost 100% is possible, while still being able to correctly reconstruct 50% of the input.