Personally, I wouldn't use clock cycles for timing, since the clock rate can be changed. The exception is very small intervals where rounding becomes an issue, and in that case the delay has to be integrated with the operations around it.

You can't get in and out of a function in much under a microsecond, so if you need finer resolution than that, the delay will have to be inline. For very short delays you can...

... for as many nops as you need, at one cycle per nop. That could be a significant code size burden for longer delays. If you have a register to spare, there are probably two-cycle instructions that will give better code density. At some point it is better to loop and count.
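
As a rough sketch of the nop approach, assuming a GCC-style compiler and a core where nop costs one cycle: `NOP()`, `cycles_for_ns()` and `delay_4_cycles()` are names made up for illustration, and the 16 MHz figure in the comments is only an assumed clock, not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* One no-op instruction. "volatile" keeps the compiler from dropping it.
 * Assumes GCC-style inline assembly and a target whose mnemonic is "nop". */
#define NOP() __asm__ __volatile__("nop")

/* Number of cycles needed to cover at least `ns` nanoseconds at `f_cpu_hz`.
 * Rounds up, so the delay is never shorter than requested. */
static inline uint32_t cycles_for_ns(uint32_t ns, uint32_t f_cpu_hz)
{
    assert(f_cpu_hz > 0);
    return (uint32_t)(((uint64_t)ns * f_cpu_hz + 999999999ULL) / 1000000000ULL);
}

/* Fixed inline delay of four nops: four cycles if nop takes one cycle,
 * which would be 250 ns at an assumed 16 MHz clock. */
static inline void delay_4_cycles(void)
{
    NOP(); NOP(); NOP(); NOP();
}
```

The cycle count is worked out ahead of time and the nops are written out by hand, so nothing is computed inside the timed section. Past a handful of nops, a counted loop is the denser option.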

If you need this sort of resolution, you will also want to use objdump to disassemble your code and see what the compiler did. The compiler has some amusing reordering rules that can put initialization computations inside your time-critical sections.

How would other ASM code be included? Could I write a byte to a port in one clock, then NOP for a while, and then loop, all in ASM?

I want very precise timing for VGA (I will be generating signals between 30 and 60 kHz, plus 60-120 Hz), and if I have some spare uC time, I'd muck around with other stuff. I just want this stuff to be über efficient.
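
For a rough idea of the cycle budget at those rates, here is a small sketch. The 16 MHz clock used in the comments is an assumption (the thread does not name a clock speed), and `cycles_per_period()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Whole CPU cycles available in one period of a signal at `rate_hz`,
 * given a CPU clock of `f_cpu_hz`. Truncates any fractional cycle. */
static inline uint32_t cycles_per_period(uint32_t f_cpu_hz, uint32_t rate_hz)
{
    assert(rate_hz > 0);
    return f_cpu_hz / rate_hz;
}

/* With an assumed 16 MHz clock:
 *   30 kHz line rate -> 16000000 / 30000 = 533 cycles per line
 *   60 kHz line rate -> 16000000 / 60000 = 266 cycles per line
 *   60 Hz frame rate -> 16000000 / 60    = 266666 cycles per frame */
```

The fractional cycle that gets truncated (533.33 at 30 kHz, for example) is the rounding issue mentioned earlier: the line rate actually produced will be slightly off unless the clock divides evenly.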