16.4.5.4 Customized Input Parsers

By default, gawk reads text files as its input. It uses the value
of RS to find the end of the record, and then uses FS
(or FIELDWIDTHS or FPAT) to split it into fields (see Reading Files).
Additionally, it sets the value of RT (see Built-in Variables).

If you want, you can provide your own custom input parser. An input
parser’s job is to return a record to the gawk record-processing
code, along with indicators for the value and length of the data to be
used for RT, if any.

To provide an input parser, you must first provide two functions
(where XXX is a prefix name for your extension):

awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);

This function examines the information available in iobuf
(which we discuss shortly). Based on the information there, it
decides if the input parser should be used for this file.
If so, it should return true. Otherwise, it should return false.
It should not change any state (variable values, etc.) within gawk.

awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);

When gawk decides to hand control of the file over to the
input parser, it calls this function. This function in turn must fill
in certain fields in the awk_input_buf_t structure and ensure
that certain conditions are true. It should then return true. If an
error of some kind occurs, it should not fill in any fields and should
return false; then gawk will not use the input parser.
The details are presented shortly.

Your extension should package these functions inside an
awk_input_parser_t structure.

The fields of the awk_input_buf_t structure can be divided into two
categories: those for use (initially, at least) by XXX_can_take_file(),
and those for use by XXX_take_control_of(). The first group of fields
and their uses are as follows:

const char *name;

The name of the file.

int fd;

A file descriptor for the file. If gawk was able to
open the file, then fd is not equal to INVALID_HANDLE.
Otherwise, fd is equal to INVALID_HANDLE.

struct stat sbuf;

If the file descriptor is valid, then gawk will have filled
in this structure via a call to the fstat() system call.

The XXX_can_take_file() function should examine these
fields and decide if the input parser should be used for the file.
The decision can be made based upon gawk state (the value
of a variable defined previously by the extension and set by
awk code), the name of the
file, whether or not the file descriptor is valid, the information
in the struct stat, or any combination of these factors.
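
As an illustration of such a decision, the following is a minimal sketch
of a can-take-file function that claims only directories. The prefix
dirs is hypothetical, and the type declarations at the top are trimmed
stand-ins (an assumption made so that the sketch compiles on its own);
a real extension would get them from gawkapi.h instead.

```c
#include <stddef.h>
#include <sys/stat.h>

/* Stand-ins for declarations a real extension gets from gawkapi.h.
 * Only the fields described above are reproduced here. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    const char *name;   /* name of the file */
    int fd;             /* file descriptor, or INVALID_HANDLE */
    struct stat sbuf;   /* filled in via fstat() when fd is valid */
} awk_input_buf_t;

/* Hypothetical parser "dirs": accept the file only if it is a directory.
 * The function reads iobuf but changes nothing. */
awk_bool_t
dirs_can_take_file(const awk_input_buf_t *iobuf)
{
    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;   /* sbuf is meaningful only for a valid fd */

    return S_ISDIR(iobuf->sbuf.st_mode) ? awk_true : awk_false;
}
```

The function checks the descriptor before looking at sbuf, because the
structure is filled in only when the descriptor is valid.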

Once XXX_can_take_file() has returned true, and
gawk has decided to use your input parser, it calls
XXX_take_control_of(). That function then fills
either the get_record field or the read_func field in
the awk_input_buf_t. It must also ensure that fd is not
set to INVALID_HANDLE. The following list describes the fields that
may be filled by XXX_take_control_of():

void *opaque;

This is used to hold any state information needed by the input parser
for this file. It is “opaque” to gawk. The input parser
is not required to use this pointer.

int (*get_record)(char **out,
                  struct awk_input *iobuf,
                  int *errcode,
                  char **rt_start,
                  size_t *rt_len);

This function pointer should point to a function that creates the input
records. Said function is the core of the input parser. Its behavior
is described in the text following this list.

ssize_t (*read_func)();

This function pointer should point to a function that has the
same behavior as the standard POSIX read() system call.
It is an alternative to the get_record pointer. Its behavior
is also described in the text following this list.

void (*close_func)(struct awk_input *iobuf);

This function pointer should point to a function that does
the “teardown.” It should release any resources allocated by
XXX_take_control_of(). It may also close the file. If it
does so, it should set the fd field to INVALID_HANDLE.

If fd is still not INVALID_HANDLE after the call to this
function, gawk calls the regular close() system call.

Having a “teardown” function is optional. If your input parser does
not need it, do not set this field. Then, gawk calls the
regular close() system call on the file descriptor, so the descriptor
should still be valid at that point.
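
The setup and teardown described above might be sketched as follows.
This is an illustration under stated assumptions, not the code of any
shipped extension: the prefix demo, the demo_state_t type, and the
placeholder record reader are all hypothetical, and the declarations at
the top are trimmed stand-ins for what gawkapi.h provides.

```c
#include <stdio.h>      /* EOF */
#include <stdlib.h>
#include <unistd.h>

/* Stand-ins for declarations a real extension gets from gawkapi.h.
 * Only the fields this sketch uses are reproduced. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf, int *errcode,
                      char **rt_start, size_t *rt_len);
    void (*close_func)(struct awk_input *iobuf);
} awk_input_buf_t;

/* Hypothetical per-file state kept behind the opaque pointer. */
typedef struct {
    char *buf;
    size_t size;
} demo_state_t;

/* Placeholder record reader: always reports end-of-file. */
static int
demo_get_record(char **out, struct awk_input *iobuf, int *errcode,
                char **rt_start, size_t *rt_len)
{
    (void) iobuf; (void) errcode; (void) rt_start;
    *out = NULL;
    *rt_len = 0;
    return EOF;
}

/* Teardown: release what take_control_of() allocated, then close the
 * file and record that fact by setting fd to INVALID_HANDLE. */
static void
demo_close(struct awk_input *iobuf)
{
    demo_state_t *st = iobuf->opaque;

    if (st != NULL) {
        free(st->buf);
        free(st);
        iobuf->opaque = NULL;
    }
    if (iobuf->fd != INVALID_HANDLE) {
        close(iobuf->fd);
        iobuf->fd = INVALID_HANDLE;
    }
}

awk_bool_t
demo_take_control_of(awk_input_buf_t *iobuf)
{
    demo_state_t *st;

    if (iobuf == NULL || iobuf->fd == INVALID_HANDLE)
        return awk_false;

    if ((st = malloc(sizeof *st)) == NULL)
        return awk_false;
    st->size = 4096;
    if ((st->buf = malloc(st->size)) == NULL) {
        free(st);
        return awk_false;
    }

    /* Fill in iobuf only after every step that can fail has succeeded. */
    iobuf->opaque = st;
    iobuf->get_record = demo_get_record;
    iobuf->close_func = demo_close;
    return awk_true;
}
```

Deferring all assignments to iobuf until the end keeps the structure
untouched on every error path, which is the behavior expected of a
function that returns false.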

The XXX_get_record() function does the work of creating
input records. The parameters are as follows:

char **out

This is a pointer to a char * variable that is set to point
to the record. gawk makes its own copy of the data and does not take
ownership of the buffer, so the extension is responsible for allocating
and releasing this storage.

struct awk_input *iobuf

This is the awk_input_buf_t for the file. The fields should be
used for reading data (fd) and for managing private state
(opaque), if any.

int *errcode

If an error occurs, *errcode should be set to an appropriate
code from <errno.h>.

char **rt_start

size_t *rt_len

If the concept of a “record terminator” makes sense, then
*rt_start should be set to point to the data to be used for
RT, and *rt_len should be set to the length of the
data. Otherwise, *rt_len should be set to zero.
gawk makes its own copy of this data, so the
extension must manage this storage.

The return value is the length of the buffer pointed to by
*out, or EOF if end-of-file was reached or an
error occurred.

It is guaranteed that errcode is a valid pointer, so there is no
need to test for a NULL value. gawk sets *errcode
to zero, so there is no need to set it unless an error occurs.

If an error does occur, the function should return EOF and set
*errcode to a value greater than zero. When the function returns EOF
and *errcode is not zero, gawk automatically updates
the ERRNO variable based on the value of *errcode.
(In general, setting ‘*errcode = errno’ should do the right thing.)
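
To make the calling convention concrete, here is a minimal sketch of a
get-record function that treats each newline-terminated line as a
record and reports the newline as the record terminator. The prefix
line, the line_state_t type, and the fixed 4096-byte buffer are
assumptions made for brevity, and the awk_input_buf_t declaration is a
trimmed stand-in for the one in gawkapi.h. The sketch assumes that
opaque already points to a zero-initialized line_state_t.

```c
#include <errno.h>
#include <stdio.h>      /* EOF */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for the gawkapi.h structure; only the fields used here. */
typedef struct awk_input {
    int fd;
    void *opaque;
} awk_input_buf_t;

/* Hypothetical private state: unread bytes live in buf[start..end). */
typedef struct {
    char buf[4096];     /* a line longer than this is split in two */
    size_t start;
    size_t end;
} line_state_t;

int
line_get_record(char **out, struct awk_input *iobuf, int *errcode,
                char **rt_start, size_t *rt_len)
{
    line_state_t *st = iobuf->opaque;
    char *nl;
    ssize_t n;
    int len;

    *rt_start = NULL;
    *rt_len = 0;

    for (;;) {
        nl = memchr(st->buf + st->start, '\n', st->end - st->start);
        if (nl != NULL) {               /* complete record available */
            *out = st->buf + st->start;
            *rt_start = nl;
            *rt_len = 1;
            len = (int) (nl - *out);
            st->start = (size_t) (nl - st->buf) + 1;
            return len;
        }
        if (st->start > 0) {            /* slide unread bytes to the front */
            memmove(st->buf, st->buf + st->start, st->end - st->start);
            st->end -= st->start;
            st->start = 0;
        }
        if (st->end == sizeof st->buf) {    /* full buffer, no newline */
            *out = st->buf;
            st->start = st->end;
            return (int) st->end;
        }
        n = read(iobuf->fd, st->buf + st->end, sizeof st->buf - st->end);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            *errcode = errno;           /* gawk derives ERRNO from this */
            return EOF;
        }
        if (n == 0) {                   /* end of file */
            if (st->end == 0)
                return EOF;
            *out = st->buf;             /* final record, no terminator */
            len = (int) st->end;
            st->start = st->end;
            return len;
        }
        st->end += (size_t) n;
    }
}
```

The returned pointers refer to the parser's own buffer and stay valid
only until the next call, which is sufficient because gawk copies both
the record and the terminator.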

As an alternative to supplying a function that returns an input record,
you may instead supply a function that simply reads bytes, and let
gawk parse the data into records. If you do so, the data
should be returned in the multibyte encoding of the current locale.
Such a function should follow the same behavior as the read()
system call, and you fill in the read_func pointer with its
address in the awk_input_buf_t structure.

By default, gawk sets the read_func pointer to
point to the read() system call. So your extension need not
set this field explicitly.
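
For comparison, a replacement for the raw read function can be very
small. The following sketch, with the hypothetical name retry_read,
keeps the contract of read() but retries when the call is interrupted
by a signal. An extension using it would store its address in the
read_func field from within XXX_take_control_of().

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical read_func replacement: same arguments and return value
 * as read(), except that an EINTR failure is retried instead of being
 * reported to the caller. */
ssize_t
retry_read(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);

    return n;   /* bytes read, 0 at end of file, or -1 with errno set */
}
```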

NOTE: You must choose one method or the other: either a function that
returns a record, or one that returns raw data. In particular,
if you supply a function to get a record, gawk will
call it, and will never call the raw read function.

gawk ships with a sample extension that reads directories,
returning records for each entry in a directory (see Extension
Sample Readdir). You may wish to use that code as a guide for writing
your own input parser.

When writing an input parser, you should think about (and document)
how it is expected to interact with awk code. You may want
it to always be called, and to take effect as appropriate (as the
readdir extension does). Or you may want it to take effect
based upon the value of an awk variable, as the XML extension
from the gawkextlib project does (see gawkextlib).
In the latter case, code in a BEGINFILE rule
can look at FILENAME and ERRNO to decide whether or
not to activate an input parser (see BEGINFILE/ENDFILE).