The MIPS R4000, part 15: Code walkthrough

Today we're going to take a relatively small function and watch
what the compiler did with it.
The function is this guy from the C runtime library,
although I've
simplified it a bit to avoid some distractions.

On entry, the parameters to a function are passed in a0
through a3.
This function has only one parameter, so it goes in a0.

We reserve some stack space.
The first 16 bytes of that stack space are going to be used
as home space for the functions we call, so our usable
bytes start at offset 0x10.
We save the s0 register (because we're going to use
it as a local variable) and the return address (because it will
be modified when we call other functions).

We mask off all but the _IOSTRG bit and see if it's nonzero.
If so, then we branch.
This branch uses the l "likely" suffix,
so the instruction in the branch delay slot executes only if the
branch is taken.
Since the true branch of the if is only
one instruction long, the entire contents fit inside the delay slot.
How convenient.
We can put the true branch in the branch delay slot
and jump right to the function exit code.
If the branch is not taken, then the instruction in the branch
delay slot is suppressed.
(This suppression
behavior is the case only for l-type branches.)
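To make the difference between the two branch flavors concrete, here is a minimal C model. It is a sketch for illustration, not the compiler's output: the names `IOSTRG_SKETCH`, `branch_outcome`, `ordinary_branch`, and `likely_branch` are invented here, and the value 0x0040 for the flag bit is an assumption rather than something taken from the listing.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for the string-stream flag bit; the real header may differ. */
#define IOSTRG_SKETCH 0x0040u

/* Result of simulating one branch instruction and its delay slot. */
typedef struct {
    bool branch_taken;
    bool delay_slot_ran;
} branch_outcome;

/* The test itself: mask off everything but one bit, compare against zero. */
static bool flag_bit_is_set(unsigned flag)
{
    return (flag & IOSTRG_SKETCH) != 0;
}

/* Ordinary branch: the delay slot instruction runs whether or not we jump. */
static branch_outcome ordinary_branch(bool condition)
{
    branch_outcome out = { condition, true };
    return out;
}

/* Branch-likely: the delay slot instruction runs only when we jump. */
static branch_outcome likely_branch(bool condition)
{
    branch_outcome out = { condition, condition };
    return out;
}

/* Hypothetical self-check that exercises both paths. */
static void check_branch_model(void)
{
    branch_outcome taken = likely_branch(flag_bit_is_set(0x0042u));
    branch_outcome skipped = likely_branch(flag_bit_is_set(0x0002u));

    assert(taken.branch_taken && taken.delay_slot_ran);
    assert(!skipped.branch_taken && !skipped.delay_slot_ran);
    assert(ordinary_branch(false).delay_slot_ran);
}
```

Calling `check_branch_model()` walks both cases: with the bit set, the likely branch jumps and its delay slot runs; with the bit clear, neither happens.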

To calculate the pointer difference, we need to subtract the raw
pointers, and in order to do that, we need to load the 32-bit
address of the _iob array.
That takes two instructions.
And then we subtract the raw pointers to get the byte difference.
And then we divide by sizeof(FILE) to get the index.
We're lucky that the size of a FILE is a power of 2,
so a shift instruction can be used instead of a full division.
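The arithmetic in this step can be sketched in C. Everything named here is a stand-in: `file_sketch` is a hypothetical 32-byte structure (the actual size of a FILE is an assumption, the text says only that it is a power of 2), `iob_sketch` plays the role of the _iob array, and `load_address_two_steps` models the usual pairing of an instruction that loads the upper 16 bits with one that adds a sign-extended lower 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-byte stand-in for FILE; the real layout is not shown. */
typedef struct {
    unsigned char bytes[32];
} file_sketch;

/* Stand-in for the _iob array. */
static file_sketch iob_sketch[8];

/* Build a 32-bit address in two steps, the way a 16-bit-immediate
   instruction set has to. The low half is sign-extended when added,
   so the high half is bumped by one when bit 15 of the address is set. */
static uint32_t load_address_two_steps(uint32_t address)
{
    uint32_t hi = (address + 0x8000u) >> 16;
    /* Narrowing to int16_t assumes two's complement conversion. */
    int16_t lo = (int16_t)(address & 0xFFFFu);
    uint32_t reg = hi << 16;          /* load upper half            */
    reg += (uint32_t)(int32_t)lo;     /* add sign-extended low half */
    return reg;
}

/* What the C source asks for: element-wise pointer difference. */
static ptrdiff_t index_by_pointer_math(const file_sketch *stream)
{
    return stream - iob_sketch;
}

/* What the generated code does: byte difference, then shift by log2(32). */
static ptrdiff_t index_by_shift(const file_sketch *stream)
{
    ptrdiff_t byte_diff = (const char *)stream - (const char *)iob_sketch;
    return byte_diff >> 5;
}

/* Hypothetical self-check: both routes give the same index. */
static void check_index_math(void)
{
    assert(sizeof(file_sketch) == 32);
    assert(index_by_pointer_math(&iob_sketch[3]) == 3);
    assert(index_by_shift(&iob_sketch[3]) == 3);
    assert(load_address_two_steps(0x1234ABCDu) == 0x1234ABCDu);
}
```

If the structure size were not a power of 2, `index_by_shift` would have no equivalent and a real division (or a multiply-by-reciprocal trick) would be needed.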

Now that we've calculated the index, we set it up as the argument
for the _lock_str function and call it.
But just before we go, we save a1 (which is the
stream parameter) on the stack so we don't lose it.
The saving of a1 goes into the branch delay slot, so
it executes before the branch is taken, even though it
comes after the branch in the instruction stream.

(I don't know why the compiler bothered with a1.
It could have saved a0 on the stack sooner and put
the move a0, s0 in the branch delay slot.)

The next thing to do is to call _fclose_lk,
and in this case, we load its argument in the branch delay slot.
Seeing work happen in the branch delay slot takes getting used to.
It always takes a period of adjustment whenever
I switch to MIPS after
working with some other processor without branch delay slots.

After the _fclose_lk, we call _unlock_str,
and this time we use the branch delay slot to save the return value from
_fclose_lk onto the stack before we lose it.
(Though the compiler could have done a little better and
saved it in s0,
since index is a dead variable at this point.)
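At the C level, the three calls described so far have roughly the following shape. This is a sketch under assumptions: the `_sketch` functions are stubs invented for illustration, the argument to the unlock call is assumed to be the same index that was locked, and `file_sketch` is a hypothetical 32-byte stand-in for FILE.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for FILE and the _iob array. */
typedef struct {
    unsigned char bytes[32];
} file_sketch;

static file_sketch iob_sketch[4];

/* Bookkeeping so the call order can be inspected afterward. */
static char call_log[64];
static int last_locked = -1;
static int last_unlocked = -1;
static int stub_close_result = 0;

static void lock_str_sketch(int index)
{
    last_locked = index;
    strcat(call_log, "lock ");
}

static int fclose_lk_sketch(file_sketch *stream)
{
    (void)stream;
    strcat(call_log, "close ");
    return stub_close_result;
}

static void unlock_str_sketch(int index)
{
    last_unlocked = index;
    strcat(call_log, "unlock ");
}

/* Lock by index, close by pointer, unlock by index, return the close result.
   The result has to survive the unlock call, which is why the generated
   code parks it somewhere safe first. */
static int locked_close_sketch(file_sketch *stream)
{
    int index = (int)(stream - iob_sketch);
    int result;

    lock_str_sketch(index);
    result = fclose_lk_sketch(stream);
    unlock_str_sketch(index);
    return result;
}

/* Hypothetical self-check of the call order and the preserved result. */
static void check_locked_close(void)
{
    call_log[0] = '\0';
    stub_close_result = 5;
    assert(locked_close_sketch(&iob_sketch[2]) == 5);
    assert(strcmp(call_log, "lock close unlock ") == 0);
    assert(last_locked == 2 && last_unlocked == 2);
}
```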

; }
lw v1,0x24(sp) ; recover result so we can return it

After _unlock_str returns, we put result
into v1 because that's where our cleanup code expects it.

Note that in the instruction stream, you see a store immediately
followed by a load from the same location.
This makes no sense at first,
until you realize that there's a function
call in between them, because the store is in the branch delay slot.
Even though the store and load immediately follow each other
in the instruction stream,
there's an entire function call that happens in between!
The store happens before the function call,
and the load happens after.
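The ordering can be modeled with a few lines of C. This is only a sketch: `jal_sketch` is a made-up helper that imitates a call instruction by running the delay slot instruction first and the target second, and the other functions merely record that they ran.

```c
#include <assert.h>
#include <string.h>

/* Records the order in which the modeled instructions execute. */
static char trace[64];

static void record(const char *event)
{
    strcat(trace, event);
    strcat(trace, ";");
}

static void store_result(void)  { record("store"); }
static void load_result(void)   { record("load"); }
static void unlock_sketch(void) { record("call"); }

/* Models a call with a branch delay slot: the instruction written after
   the call executes before control reaches the target. */
static void jal_sketch(void (*target)(void), void (*delay_slot)(void))
{
    delay_slot();
    target();
}

/* In program text the order is: call, store, load.
   In execution the order is: store, call, load. */
static const char *run_store_call_load(void)
{
    trace[0] = '\0';
    jal_sketch(unlock_sketch, store_result);
    load_result();
    return trace;
}

/* Hypothetical self-check of the execution order. */
static void check_delay_slot_order(void)
{
    assert(strcmp(run_store_call_load(), "store;call;load;") == 0);
}
```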

We set the return value to the result,
and then we enter the epilogue.
In the epilogue, we restore the s0 register
we had been using to hold index,
and then we load up the return address and jump back to it.
We destroy the stack frame in the branch delay slot.

This concludes our tour of the MIPS R4000 processor.
I never had to do any significant work with it,
so I probably won't be able to answer interesting questions.
The focus was on learning enough to be able to read valid
compiler output,
with a few extra notes on the architecture to call out
what makes it different.¹

My first trick is to reuse the home space.
The compiler-generated version didn't use the home space
for anything other than saving the stream
parameter.
Look, people, it's free memory!
We need three words of stack, one for the return address,
one to save the preserved register s0,
and one to save the index.
We get four words of home space, so we can just use that.
The actual stack frame needed by our function is just
the home space for the outbound call.
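One way to picture the resulting frame is as a C structure laid over the stack, starting at the new value of sp. This is a sketch under assumptions: 32-bit words, and a slot assignment within the inbound home space that is purely hypothetical (the text says which three values need saving, not which word each one occupies).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* View of the stack starting at sp, after reserving only enough room
   for the home space of the functions we call. */
typedef struct {
    uint32_t outbound_home[4]; /* sp+0x00..0x0F: for the functions we call */
    /* Everything below is the home space our caller reserved for us. */
    uint32_t saved_ra;         /* sp+0x10 (slot choice is hypothetical)    */
    uint32_t saved_s0;         /* sp+0x14                                  */
    uint32_t saved_index;      /* sp+0x18                                  */
    uint32_t spare;            /* sp+0x1C: fourth home word, unused        */
} reused_home_space_view;

/* Bytes our own function has to subtract from sp. */
enum { FRAME_BYTES_SKETCH = offsetof(reused_home_space_view, saved_ra) };

/* Hypothetical self-check of the offsets. */
static void check_frame_layout(void)
{
    assert(FRAME_BYTES_SKETCH == 0x10);
    assert(offsetof(reused_home_space_view, saved_index) == 0x18);
    assert(sizeof(reused_home_space_view) == 0x20);
}
```

Three saved words fit in the four inbound home words with one to spare, so the only stack the function has to allocate itself is the 16 bytes of outbound home space.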

(I wonder whether it's legal to overlap your inbound home
space with your outbound home space.
If our function had needed only two words of stack,
would it have been okay for us to write
addiu sp, sp, -8?)

I'm precalculating the result in anticipation
of the early-out.
This instruction is basically free because it comes in the
load delay slot.
If we had tried to use the value in t6 immediately,
the processor would have stalled for a cycle, so we may as well
use that cycle productively, even if only speculatively.
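A toy cycle counter shows why the extra instruction costs nothing. This is a deliberately simplified sketch: it assumes one cycle per instruction plus a single stall cycle whenever an instruction reads the register that the immediately preceding load wrote. The `insn_sketch` type and the instruction sequences are hypothetical, not the actual listing.

```c
#include <assert.h>

/* Minimal description of an instruction for the purposes of this model. */
typedef struct {
    int is_load;   /* nonzero if the instruction loads from memory */
    int dest;      /* register written, or -1 for none             */
    int src;       /* register read, or -1 for none                */
} insn_sketch;

/* One cycle per instruction, plus one stall cycle when an instruction
   consumes the result of the load right before it. */
static int count_cycles(const insn_sketch *code, int count)
{
    int cycles = 0;
    int i;

    for (i = 0; i < count; i++) {
        if (i > 0 && code[i - 1].is_load && code[i - 1].dest >= 0 &&
            code[i].src == code[i - 1].dest) {
            cycles++; /* load delay stall */
        }
        cycles++;
    }
    return cycles;
}

/* Hypothetical self-check. Register numbers: a0 = 4, v0 = 2, t6 = 14, t8 = 24. */
static void check_load_delay_model(void)
{
    const insn_sketch use_immediately[] = {
        { 1, 14, 4 },   /* load t6 from memory addressed by a0   */
        { 0, 24, 14 },  /* use t6 right away: stalls one cycle   */
    };
    const insn_sketch fill_the_slot[] = {
        { 1, 14, 4 },   /* load t6 from memory addressed by a0   */
        { 0, 2, -1 },   /* independent work in the load delay slot */
        { 0, 24, 14 },  /* use t6: no stall                      */
    };

    assert(count_cycles(use_immediately, 2) == 3);
    assert(count_cycles(fill_the_slot, 3) == 3);
}
```

Under this model, both sequences take three cycles, but the second one gets an additional instruction's worth of work done.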

Calling _fclose_lk is simpler because we can
move the argument from a register rather than from memory.
That way, if the first thing that _fclose_lk does
is try to use the stream, it won't suffer a load delay stall.
The first instruction of the called function executes
immediately after the branch delay slot.
If you put a load instruction in the branch delay slot,
then the first instruction of the called function is executing
in a load delay slot,
and it probably isn't expecting that.

So that thinking tipped the scales in favor of
keeping stream as the register variable.
(Of course, that thinking is also based on the older
MIPS implementation, which was not dual-issue.
The MIPS R4000 processes one instruction every half-cycle.
This alters the micro-optimization considerations for both
branch delays and load delays.)

Now that you’re done with your tour of MIPS (which brings back memories of my university Computer Architecture course), would you consider writing a similar series about x86? I know that it’s the ubiquitous PC ISA and you’ve already written much about it, but having a similar systematic treatment of its register set, memory models, and arithmetic/logic/control-flow instructions, would help illustrate what makes x86 unique.

I’d actually find that really interesting, having never worked with x86/x86-64 assembly myself and always wondered how it differed. The few times I’ve had to dive into assemblyland have mostly been JVM microcode and MIPS assembly.

Hey, you can use arm_now to quickly deploy a MIPS virtual machine within 30 seconds (install + setup + run): https://github.com/nongiach/arm_now
arm_now supports a lot of CPU architectures, so you could write a similar series for whatever architecture you want.

While reading the code, I found that the 40-byte stack frame reserved by the compiler (0x28, or ten words) is excessive. Some slots are left untouched (offsets 0x14 and 0x20), some could be saved (AFAIK, the specification allows you to freely use the slots reserved for non-defined parameters, in this case offsets 0x04 through 0x0c, so you could use them instead of booking 0x24 and 0x28), and finally, some are plainly wasted (offset 0x10 saves sp, but if sp somehow got overwritten, how would you recover it?). All in all, such a simple function would be able to do most of its work in the parameter backup space.

Raymond’s version is much more reasonable: it uses 16 bytes less while still playing by the rules. You would expect modern compilers to do a better job…

What makes you think this is a modern compiler? The other articles in the series have referenced the 1992 and 1995 compilers, so I’d expect it to be one of those (and by previous accounts, the 1992 compiler didn’t do quite a few things that it could have.)