11.8. Buffered Input and Output

We can improve the efficiency of our code by buffering our
input and output. We create an input buffer and read a whole
sequence of bytes at one time. Then we fetch them one by one
from the buffer.

We also create an output buffer. We store our output in it until
it is full. At that time we ask the kernel to write the contents
of the buffer to stdout.

The program ends when there is no more input. But we still need
to ask the kernel to write the contents of our output buffer
to stdout one last time, otherwise some of our output
would make it to the output buffer, but never be sent out.
Do not forget that, or you will be wondering why some of your
output is missing.
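The output side of this scheme can be modelled in a few lines of C. This is only a sketch of the idea, not the assembly program itself: the names `put_byte`, `flush_output`, `ocount` and `out_fd`, and the 2048-byte `BUFSIZE`, are assumptions made for illustration. `ocount` plays the part of the register that counts bytes waiting in the output buffer.

```c
#include <stddef.h>
#include <unistd.h>

#define BUFSIZE 2048                    /* assumed size, chosen for the sketch */

static unsigned char obuffer[BUFSIZE];  /* output waiting to be written */
static size_t ocount;                   /* number of bytes held in obuffer */
static int out_fd = STDOUT_FILENO;      /* hypothetical knob so a test can redirect output */

/* Ask the kernel to write everything collected so far. Returns 0, or -1 on error. */
int flush_output(void)
{
    size_t done = 0;

    while (done < ocount) {
        ssize_t n = write(out_fd, obuffer + done, ocount - done);
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    ocount = 0;
    return 0;
}

/* Store one byte; talk to the kernel only when the buffer is full. */
int put_byte(unsigned char c)
{
    obuffer[ocount++] = c;
    if (ocount == BUFSIZE)
        return flush_output();
    return 0;
}
```

A program built on this would call `put_byte` for every byte it produces and `flush_output` once more just before exiting. Leaving out that last call loses whatever is still sitting in `obuffer`.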

We now have a third section in the source code, named
.bss. This section takes up no space in our
executable file and, therefore, cannot hold initialized data. We use
resb instead of db:
it simply reserves the requested number of bytes of uninitialized
memory for our use.

We take advantage of the fact that the system does not modify the
registers: We use registers for what, otherwise, would have to be
global variables stored in the .data section. This is
also why the UNIX® convention of passing parameters to system calls
on the stack is superior to the Microsoft convention of passing
them in the registers: We can keep the registers for our own use.

We use ESI as a pointer to the next byte to be read from the
input buffer, and EDI as a pointer to the next byte to be written
to the output buffer. We use EBX and ECX to keep count of the
number of bytes in the input and output buffers, respectively,
so we know when to read more input from, or dump the output to,
the system.
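The input side, with its pointer and counter, might look like this when modelled in C. It is a sketch under assumptions: `next_in` stands in for ESI and `in_left` for EBX, while `fetch_byte`, `in_fd` and the 2048-byte `BUFSIZE` are names and values invented for the example.

```c
#include <stddef.h>
#include <unistd.h>

#define BUFSIZE 2048                      /* assumed size, chosen for the sketch */

static unsigned char ibuffer[BUFSIZE];    /* bytes read from the kernel in one go */
static unsigned char *next_in = ibuffer;  /* next byte to hand out (the role of ESI) */
static size_t in_left;                    /* bytes not yet handed out (the role of EBX) */
static int in_fd = STDIN_FILENO;          /* hypothetical knob so a test can redirect input */

/* Return the next input byte, or -1 when there is no more input. */
int fetch_byte(void)
{
    if (in_left == 0) {
        /* Buffer is empty: ask the kernel for a whole new batch. */
        ssize_t n = read(in_fd, ibuffer, BUFSIZE);
        if (n <= 0)
            return -1;
        next_in = ibuffer;
        in_left = (size_t)n;
    }
    in_left--;
    return *next_in++;
}
```

Only one call in every `BUFSIZE` reaches the kernel. All the others just move a pointer and decrement a counter.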

Not what you expected? The program did not print the output
until we pressed ^D. That is easy to fix by
inserting three lines of code to write the output every time
we have converted a new line to 0A. I have marked
the three lines with > (do not copy the > in your
hex.asm).
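The same fix, flushing whenever a line is complete, can be sketched in C. This models the idea rather than reproducing the three assembly lines: `put_byte_line_buffered`, `flush_output`, `out_fd` and the 2048-byte `BUFSIZE` are assumed names and values.

```c
#include <stddef.h>
#include <unistd.h>

#define BUFSIZE 2048                    /* assumed size, chosen for the sketch */

static unsigned char obuffer[BUFSIZE];  /* output waiting to be written */
static size_t ocount;                   /* number of bytes held in obuffer */
static int out_fd = STDOUT_FILENO;      /* hypothetical knob so a test can redirect output */

/* Write out everything collected so far. Returns 0, or -1 on error. */
int flush_output(void)
{
    size_t done = 0;

    while (done < ocount) {
        ssize_t n = write(out_fd, obuffer + done, ocount - done);
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    ocount = 0;
    return 0;
}

/* Store one byte; flush when the buffer is full or a line has ended (0x0A). */
int put_byte_line_buffered(unsigned char c)
{
    obuffer[ocount++] = c;
    if (c == '\n' || ocount == BUFSIZE)
        return flush_output();
    return 0;
}
```

The extra comparison costs little, and an interactive user sees each line of output as soon as it is finished.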

Note:

This approach to buffered input/output still contains a hidden
danger. I will discuss it, and fix it, later, when I talk about
the dark side of buffering.

11.8.1. How to Unread a Character

Warning:

This may be a somewhat advanced topic, mostly of interest to
programmers familiar with the theory of compilers. If you wish,
you may skip to the next section, and perhaps read this later.

While our sample program does not require it, more sophisticated
filters often need to look ahead. In other words, they may need
to see what the next character is (or even several characters).
If the next character is of a certain value, it is part of the
token currently being processed. Otherwise, it is not.

For example, you may be parsing the input stream for a textual
string (e.g., when implementing a language compiler): If a
letter is followed by another letter, or perhaps a digit, the
following character is part of the token you are processing. If
it is followed by white space, or some other value, then that
character is not part of the current token.

This presents an interesting problem: How to return the next
character back to the input stream, so it can be read again
later?

One possible solution is to store it in a character variable,
then set a flag. We can modify getchar to check the flag,
and if it is set, fetch the byte from that variable instead of the
input buffer, and reset the flag. But, of course, that slows us
down.
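Modelled in C, the flag-based approach might look like the sketch below. To keep it short it reads from an in-memory input instead of calling the kernel. `set_input`, `get_byte`, `unget_byte`, `saved` and `have_saved` are all hypothetical names introduced for the example.

```c
#include <stddef.h>

static const unsigned char *next_in;  /* next unread byte of the input */
static size_t in_left;                /* bytes still unread */
static unsigned char saved;           /* the character variable */
static int have_saved;                /* the flag */

/* Hypothetical helper: point the reader at an in-memory input. */
void set_input(const void *data, size_t len)
{
    next_in = data;
    in_left = len;
    have_saved = 0;
}

/* Return the next byte, or -1 when the input is exhausted. */
int get_byte(void)
{
    if (have_saved) {      /* this test is paid on every single call */
        have_saved = 0;
        return saved;
    }
    if (in_left == 0)
        return -1;
    in_left--;
    return *next_in++;
}

/* Hand one byte back; the next get_byte() returns it again. */
void unget_byte(unsigned char c)
{
    saved = c;
    have_saved = 1;
}
```

The slowdown mentioned above is the `have_saved` test at the top of `get_byte`: it runs for every byte, even though a byte is rarely returned.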

The C language has an ungetc() function, just for that
purpose. Is there a quick way to implement it in our code?
I would like you to scroll back up and take a look at the
getchar procedure and see if you can find a nice and
fast solution before reading the next paragraph. Then come back
here and see my own solution.

The key to returning a character back to the stream is in how
we are getting the characters to start with:

First we check if the buffer is empty by testing the value
of EBX. If it is zero, we call the
read procedure.

If we do have a character available, we use lodsb, then
decrease the value of EBX. With the direction flag clear, the
lodsb instruction is effectively identical to:

	mov	al, [esi]
	inc	esi

The byte we have fetched remains in the buffer until the next
time read is called. We do not know when that happens,
but we do know it will not happen until the next call to
getchar. Hence, to "return" the last-read byte back
to the stream, all we have to do is decrease the value of
ESI and increase the value of EBX:

ungetc:
	dec	esi		; step back to the byte we have just read
	inc	ebx		; one more byte is available again
	ret
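The same trick can be modelled in C to see why it works. In this sketch `next_in` stands in for ESI and `in_left` for EBX. `fetch_byte`, `unread_byte`, `in_fd` and the 2048-byte `BUFSIZE` are assumed names and values, and the `assert` is a guard added for the sketch that the assembly version does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define BUFSIZE 2048                      /* assumed size, chosen for the sketch */

static unsigned char ibuffer[BUFSIZE];
static unsigned char *next_in = ibuffer;  /* the role of ESI */
static size_t in_left;                    /* the role of EBX */
static int in_fd = STDIN_FILENO;          /* hypothetical knob so a test can redirect input */

/* Return the next input byte, or -1 when there is no more input. */
int fetch_byte(void)
{
    if (in_left == 0) {
        ssize_t n = read(in_fd, ibuffer, BUFSIZE);
        if (n <= 0)
            return -1;
        next_in = ibuffer;
        in_left = (size_t)n;
    }
    in_left--;
    return *next_in++;
}

/* Undo the last fetch_byte(): the byte is still in the buffer, so just step back. */
void unread_byte(void)
{
    assert(next_in > ibuffer);  /* guard for the sketch only: never step before the buffer */
    next_in--;
    in_left++;
}
```

Nothing is copied. The byte never left `ibuffer`, so moving the pointer back and the counter up is enough to make the next `fetch_byte` return it again.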

But, be careful! We are perfectly safe doing this if our look-ahead
is at most one character at a time. If we are examining more than
one upcoming character and call ungetc several times
in a row, it will work most of the time, but not all the time
(and will be tough to debug). Why?

Because as long as getchar does not have to call
read, all of the pre-read bytes are still in the buffer,
and our ungetc works without a glitch. But the moment
getchar calls read,
the contents of the buffer change.

We can always rely on ungetc working properly on the last
character we have read with getchar, but not on anything
we have read before that.

If your program reads more than one byte ahead, you have at least
two choices:

If possible, modify the program so it only reads one byte ahead.
This is the simplest solution.

If that option is not available, first of all determine the maximum
number of characters your program needs to return to the input
stream at one time. Increase that number slightly, just to be
sure, preferably to a multiple of 16—so it aligns nicely.
Then modify the .bss section of your code, and create
a small "spare" buffer right before your input buffer,
something like this:

section	.bss
	resb	16	; or whatever the value you came up with
ibuffer	resb	BUFSIZE
obuffer	resb	BUFSIZE

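You also need to modify your ungetc so that it stores the byte
to unget, which the caller passes in AL, back into the buffer:

ungetc:
	dec	esi		; step back one byte
	inc	ebx		; one more byte is available
	mov	[esi], al	; put the returned byte in place
	ret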

With this modification, you can call ungetc
up to 17 times in a row safely (the first call will still
be within the buffer, the remaining 16 may be either within
the buffer or within the "spare").
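A C model of the spare-buffer variant is sketched below. C does not guarantee that two separate arrays sit next to each other in memory, so the sketch uses one array and starts the input buffer `SPARE` bytes into it. `storage`, `SPARE`, `fetch_byte`, `unread_byte`, `in_fd` and the 2048-byte `BUFSIZE` are assumptions made for the example, and the `assert` is a guard the assembly version does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define BUFSIZE 2048    /* assumed size, chosen for the sketch */
#define SPARE   16      /* room for returned bytes in front of the input buffer */

static unsigned char storage[SPARE + BUFSIZE];
static unsigned char *const ibuffer = storage + SPARE;  /* input buffer proper */
static unsigned char *next_in = storage + SPARE;        /* the role of ESI */
static size_t in_left;                                  /* the role of EBX */
static int in_fd = STDIN_FILENO;  /* hypothetical knob so a test can redirect input */

/* Return the next input byte, or -1 when there is no more input. */
int fetch_byte(void)
{
    if (in_left == 0) {
        ssize_t n = read(in_fd, ibuffer, BUFSIZE);
        if (n <= 0)
            return -1;
        next_in = ibuffer;
        in_left = (size_t)n;
    }
    in_left--;
    return *next_in++;
}

/* Return byte c to the stream, storing it because the old copy may be gone. */
void unread_byte(unsigned char c)
{
    assert(next_in > storage);  /* guard for the sketch only: spare area exhausted */
    *--next_in = c;
    in_left++;
}
```

Because the byte is written back explicitly, it no longer matters whether a refill has overwritten the buffer since that byte was first read.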