I've been playing a bit with the disassembly made by Jede, and I had a go at restructuring the code a little.
Basically, I made 64 copies of the volume conversion table, so it covers all 256 possible index values and the "and #3" is not necessary anymore. I also moved the code that sets the PSG to accept data on register 8 out of the main loop, and I swapped the bits around to make it more efficient to extract the two-bit values.
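
To illustrate why the replicated table makes the mask redundant, here is a small Python model of the idea. It is only a sketch: the four volume values and the names VOLUME_TABLE, REPLICATED_TABLE, volume_masked and volume_direct are made up for the example and are not taken from the actual routine.

```python
# Hypothetical 4-entry volume table: maps a 2-bit sample (0..3) to a PSG
# volume (0..15). The real values used by the replay routine are not shown
# in this post, so these are placeholders.
VOLUME_TABLE = [0, 5, 10, 15]

# 64 back-to-back copies give exactly 256 entries, one per possible byte value.
REPLICATED_TABLE = VOLUME_TABLE * 64


def volume_masked(value):
    """Original approach: keep only the low two bits, then look up."""
    return VOLUME_TABLE[value & 3]


def volume_direct(value):
    """Replicated approach: index with the unmasked byte (0..255)."""
    return REPLICATED_TABLE[value]


if __name__ == "__main__":
    # Both lookups agree for every possible byte, so the mask can be dropped.
    for value in range(256):
        assert volume_masked(value) == volume_direct(value)
    print("all 256 byte values give the same volume either way")
```

The trade-off is 252 extra bytes of table in exchange for one instruction less per two-bit sample, since whatever is left in the upper bits of the index no longer matters.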

The "nops" are only there to keep the music playing at the same speed (give or take), so this new code sounds the same, but thanks to the changes it is 16 nops faster for each byte replayed.

What that means is that it's probably viable to put the routine in an interrupt, so other things can be done at the same time.

PS: There are other optimizations possible; I just wanted to check the main "bang for the buck" ones.