Beamhacking: Cycle Counting

We've determined that we can get in sync
with the raster on a TRS-80 Model 3. But to stay in sync there's
nothing for it but to keep track of all the cycles our program uses.
At a high level the process is simple enough. Write up the code that
changes the screen on the fly. Then add code that controls the display
whether that means moving stuff around, checking for user input or
whatever. Then put in some code to use up any remaining T-states
before the whole process must repeat again. Hopefully everything goes
right and you're rewarded with a solid display. Typically there's
some error in calculation and the display "drifts" as the program
slides slowly (or sometimes rapidly) out of sync with the beam.

For the moment let's ignore the detail of counting T-states because
even once we know how many T-states our code uses there's still the
problem of using up the excess. What we really need is a subroutine
to use up any given number of T-states. Z-80 programmers may wish
to stop at this point and try and write such a subroutine — it's
a very fun challenge. You may find this chart
useful.

I went through several variants before my current version which is small
and, most importantly, has relatively little overhead. The subroutine
obviously can't use zero T-states if asked. In fact, there will be
a minimum number of T-states it uses no matter what. What we end up
with is something that burns some passed in number of T-states plus
a fixed overhead. Here's the best I've got so far.

The numbers in angle brackets are a bit of legacy from a semi-automated
cycle counting system, but they also show how the extra cycles get added
as a function of HL. The first column of numbers gives instruction times
if a branch isn't taken, the second column when the branch is taken. Looking
at the first section of wA we see that if bit 0 of A is clear it takes
4+7+4==15 T-states. But if bit 0 is set then we pay 4+12+0==16 T-states.
Similar machinations occur down to wHL_16 to effect A mod 16
extra T-states. At wHL_16 A counts the number of 16 cycle chunks
that remain which we can burn up with a simple loop. We're pretty
lucky to have jr; using jp would use 14 T-states
with no way of balancing it to 16 T-states, since there is no instruction
that uses fewer than 4 T-states. The first column T-state counts from wA
to the end add up to 81. Thus wA will use A + 81 T-states.
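
As a rough illustration of those two building blocks, here are two separate fragments that reproduce the timings quoted above. The labels skip0 and chunk are made up for this example, and the instruction choices are an assumption rather than a copy of the real wA.

```asm
        ; Fragment 1: one-bit stage.
        ; Costs 15 T-states if bit 0 of A is clear, 16 if it is set.
        rrca                ; 4: rotate bit 0 of A into carry
        jr   c,skip0        ; 7 if not taken, 12 if taken
        nop                 ; 4: only executed when the bit was clear
skip0:

        ; Fragment 2: 16 T-state chunks.
        ; Assumes A is non-zero on entry. Every pass but the last
        ; costs 4+12 = 16; the final pass costs 4+7 = 11.
chunk:  dec  a              ; 4
        jr   nz,chunk       ; 12 when taken, 7 on the way out
```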

It's no accident that the entire first column adds up to 100 and represents
the minimal cost of the subroutine. Calling wHL with HL=0
will cost 100 T-states. The last bit of code we need to examine is at the
entry point. If H == 0 it falls through to wA with
A == L. For non-zero H we jump back to wHL256
which counts down H and wastes 256 T-states using wA. The trick
is figuring out the proper value of A that accounts for the overhead
in wA and the other time spent passing through this sub-loop of
the code.

Nothing more to say. wHL first wastes H*256 T-states
and then L T-states which is exactly the desired effect as
HL==H*256+L.
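
For instance, a delay of an exact total length, including the cost of setting up the call, might be written as below. This is only a sketch: delay is a hypothetical constant, and the 10, 17 and 100 are the load, call and minimum wHL costs quoted in this article.

```asm
delay   equ  5000               ; hypothetical: total T-states to burn (at least 127)

        ld   hl,delay-10-17-100 ; 10 T-states for this load
        call wHL                ; 17 for the call, then 100 + HL inside wHL
```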

wHL is really handy for soaking up a few thousand cycles
to pad a program's time to a full frame. It won't do you much good
in balancing the action when the screen is being modified. At
10 T-states to load HL, another 17 to call wHL
and 100 minimum, that's 127 T-states gone and only 1 T-state left per line. The screen
update code (or graphics kernel, if you will) must be crafted by
hand with the odd instruction added here and there for timing
considerations. Put it this way. With each line you'll likely
spend 40 or 50 T-states loading up graphics data and then another
40 or 50 writing it to the screen. Using up a remainder of say,
28 T-states is something best done by consulting a
table of Z-80 instruction timings.
I suggest you use LD (0),HL followed by JR $+2
to burn 16 and 12 T-states respectively.
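
Written out, that suggestion looks like the following. The write to address 0 is assumed to be harmless because that address is ROM on the Model 3; pick another target if that doesn't hold for your setup.

```asm
        ld   (0),hl         ; 16 T-states; the store lands in ROM, so nothing changes
        jr   $+2            ; 12 T-states; a taken jump to the very next instruction
```

Seven NOPs would also come to 28 T-states, but they take 7 bytes where this pair takes 5.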

What about the rest of the code? Here I recommend my
version of zmac which has T-state counting
built in. Each instruction assembled adds its instruction time to a
running total of T-states. The t(mem) operator returns this
total for all instructions assembled just before memory location mem.
You can change this running total using the tstate n pseudo-op
which sets the current total to n. Instruction timing on the
Z-80 is a little complicated because some instructions have fast and slow
paths depending on input conditions. Assembled instructions always add
the minimum number of T-states. Two additional operators help with the accounting:
tilo(mem) and tihi(mem) return the fast and slow
T-state counts of the instruction at memory location mem.

The simplest thing to do is count the T-states of a linear section of code.

code:   ld   a,5            ; 7
        ld   (hl),a         ; 7
        inc  a              ; 4
        inc  hl             ; 6
        ld   (hl),a         ; 7
cost    equ  t($)-t(code)   ; 7+7+4+6+7 = 31

cost will be the number of T-states used by those 5 instructions. We
could define a bunch of symbols like this to measure the various blocks of our
code and add them up later but it's much easier to let the assembler keep its
running total and adjust it when there's something it doesn't know. For
example, here we add in 100 T-states because we know that's the minimum cost
of wHL.

        call wHL
        tstate t($)+100

Obviously a macro that adds to the current T-state count would be preferable
and in this case we'd even want a macro for calling wHL that hides
it all. But for now let's keep everything exposed. Putting these two
techniques together we can now easily write a loop that will execute in exactly the
same time as a Model 3 frame.

Note that t($) accounts for the first iteration of the loop. We
must add one less than the number of iterations to balance things out. A
simpler expression can be obtained by working with the start of the loop as
our basis.

tstate t(lp)+[t($)-t(lp)]*cnt
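
One possible use of that expression, sketched under some assumptions: frame is taken to be 33792 T-states (264 lines of 128 T-states, a figure not stated in this article), cnt and the loop body are placeholders, and the loop body is assumed to contain no instructions with fast and slow paths. Rather than adjusting the running total, this version computes the padding for wHL directly from t().

```asm
frame   equ  33792          ; assumed frame length in T-states
cnt     equ  100            ; hypothetical iteration count

top:    ld   bc,cnt         ; 10
lp:     ; ... fixed-time work for one iteration goes here ...
        dec  bc             ; 6
        ld   a,b            ; 4
        or   c              ; 4
        jp   nz,lp          ; 10 whether or not the jump is taken
        ; Subtract setup, cnt iterations, then ld hl (10), call (17),
        ; the wHL minimum (100) and the closing jp (10).
        ld   hl,frame-[t(lp)-t(top)]-[t($)-t(lp)]*cnt-10-17-100-10
        call wHL
        jp   top
```

The loop body has to be short enough that the value loaded into HL stays positive.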

The same loop using jr instead of jp is a bit trickier. Only
one iteration of the loop incurs the low T-state cost of jr; all the
rest are the maximum. zmac counts the low cost so those cnt-1
iterations must have the difference between the high and low cost added.
We also need to know that the jr instruction is two bytes long in
order to query its high and low costs; that's where the $-2
comes from.
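
A sketch of the adjustment described here, with a placeholder loop body and a hypothetical cnt. It starts from the t(lp)-based expression and adds the jr penalty for the cnt-1 iterations where the jump is taken.

```asm
cnt     equ  20             ; hypothetical iteration count

        ld   b,cnt          ; 7
lp:     ; ... fixed-time work for one iteration goes here ...
        dec  b              ; 4
        jr   nz,lp          ; 7 falling through (last pass), 12 when taken
        ; cnt passes at the low cost, plus (high - low) for the cnt-1
        ; taken jumps; $-2 is the address of the two-byte jr just assembled.
        tstate t(lp)+[t($)-t(lp)]*cnt+[tihi($-2)-tilo($-2)]*[cnt-1]
```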

What about loops that execute a variable number of times? I don't
recommend them. Your program has a fixed amount of time before it must get
back to the business of hammering screen memory. The beam waits for no man.
Any loop that executes a variable number of times must have a maximum and
you can't run out of time even if that maximum is hit. Thus it is just as
well to process the maximum every time to guarantee that the program stays
in sync in that case.

Unless you're getting fancy. And why not? The most aggressive technique I've
used was in Doubled Dancing Demon (and to a lesser
extent in the bload simulator). Even if we take away
sound the demon's animation needs to be updated every frame. He might
need to move to the left, or change to a different frame or do nothing. Making
the various small subroutines take exactly the same amount of time is tedious
(but more feasible since I added assert to zmac). Instead they're
all expected to subtract the number of T-states they use from HL.
We'll assume there's some subroutine which updates animstep to point
to the animation subroutine for this frame. I can level them out to using
3000 T-states by doing something like this.

Since the Z-80 doesn't have CALL (IX) I had to fool around with
pushing the return address I want on the stack and doing a jump. Each
animation subroutine does the T-state accounting at run time. For example:
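
As a hedged sketch of both halves (not the demo's actual code): budget, back, moveleft and xpos are hypothetical names, the subroutine is assumed to be straight-line code that leaves HL alone, and the constants come from standard Z-80 instruction timings plus the 100 T-state wHL minimum.

```asm
budget  equ  3000               ; T-states for dispatch + animation + padding

        ; Dispatcher: 10+20+10+11+8 here, plus 17+100 for wHL, is 176.
        ld   hl,budget-176      ; 10: what is left for the subroutine and padding
        ld   ix,(animstep)      ; 20: address of this frame's subroutine
        ld   de,back            ; 10
        push de                 ; 11: fake a return address...
        jp   (ix)               ; 8:  ...so this jump acts like CALL (IX)
back:   call wHL                ; 17 + 100 + whatever remains in HL
        ; ... rest of the frame code continues here ...

        ; One animation subroutine. It must not disturb HL.
moveleft:
        ld   a,(xpos)           ; 13
        dec  a                  ; 4
        ld   (xpos),a           ; 13
        ; T-states so far, plus ld de (10), add hl,de (11) and ret (10).
        ld   de,-[t($)-t(moveleft)+10+11+10]
        add  hl,de              ; 11: HL = HL - cost of this subroutine
        ret                     ; 10
```

However many T-states the subroutine reports, the dispatcher plus wHL always comes to budget, provided the subroutine never spends more than budget-176.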

These examples adequately describe the basic mechanisms needed to track
T-states. Like any programming problem you may be best off contemplating
the fundamental operations and putting them together in your own way. It all
gets tied up in how the rest of the program operates.