1. Overview of combine

1.1 Introduction

combine is primarily a program for merging files on a common key.
While such things as table joins on keys are common in database systems
like MySQL, there seems to be a marked failure in the availability for
such processing on text files. combine is intended to fill that
gap.

Another way of looking at combine is as the kernel of a database
system without all the overhead (and safeguards) associated with such
systems. The missing baggage that appeals most to me is the requirement
to load data into somebody else’s format before working with it.
combine’s design is intended to allow it to work with most data
directly from the place where it already is.

In looking around for existing software that wanted to do what I wanted
to do, the closest I came was the join utility in GNU and other
operating systems. join has some limitations that I needed
to overcome. In particular, in matching it works on only one field each
in only two files. For such a match, on files whose fields are
separated by a delimiter, I’m sure that join is a more efficient choice.
Someday I’ll test that assumption.

Once I started writing the program, I had to come up with a name. Given
that one of the earliest applications that I imagined for such a program
would be to prepare data for storage in a data warehouse, I thought to
where the things that are stored in physical warehouses come from. At
that point, I came up with the name DataFactory. Unfortunately, just as
I got ready to release it, I noticed that someone else has that as a
registered trademark.

As a result, I have come up with the name combine. Like the farm
implement of the same name, this program can be used to separate the wheat
from the chaff. I also like it because it has a similarity to join
reminiscent of the similarity of function.

1.1.1.1 Encrypted Data

Suppose you have a file that you would normally want to keep encrypted
(say, your customer information, which includes their credit card
numbers). Suppose further that you occasionally receive updates of some
less sensitive information about the customers.

To update the sensitive file, you could unencrypt the file, load all the
data from both files into a database, merge the data, extract the result
from the database (making sure to wipe the disk storage that was used),
and re-encrypt. That leaves the information open to DBAs and the like
during the process, and it sure seems like a lot of steps.

Using combine and your favorite encryption program, the data never
has to touch the disk unencrypted (except in a swap file) so casual
observers are unlikely to run across sensitive information. In addition
an updated encrypted file can be created in one command line, piping the
input and output of the sensitive data through your encryption program.

1.1.1.2 Data Validation

Another example that is often important when receiving data from the
outside world is data validation.

Perhaps you receive product sales data from a number of retailers, which
you need to combine with product information you have elsewhere. To
confirm that the file you received is valid for reporting, you might need
to check the product codes, ZIP codes, and retailer codes against your
known valid values.

All the comparisons can be done in a single command, which will result in
a status which will flag us down if anything did not match to our lists
of expected values.

That’s probably enough if we are pretty sure that the incoming data is
usually clean. If something does not match up, we can investigate by
hand. On the other hand, if we expect to find differences a little more
frequently, we can make some small changes.

The following command makes the match optional, but puts a constant ’$’
at the end of the record for each match among the three keys to be
validated. When there isn’t a match, combine puts a space into the
record in place of the ’$’. We can then search for something other
than ’$$$’ at the end of the record to know which records didn’t match.

1.2 Reporting Bugs

When you find a bug, please be so kind to report it in a way that will
help the maintainer to understand why it is a bug and how to make it
happen (so that the correction can be tested). The information
necessary to make such a determination will include such things as a
description of the symptoms (not just your impression of the cause), the
command line you used to run the program, the options that were used
when building the program, and anything else that you are not absolutely
sure is irrelevant. You can also include the things that you are
absolutely sure are irrelevant, as one is often sure without good
reason.

Thank you for taking advantage of this free software. Your information
on bugs in your use of the software are very valuable in keeping it as
useful as possible for all users. Please report bugs to
<bug-combine@gnu.org>

2.1 General Options

The general options are not specific to any particular file and apply to
the entire current run of the program. They are listed below.

‘-R’

‘--check-all-reference’

‘--no-check-all-reference’

For each data record, check for a match with each of the reference
files. This option is set by default. When turned off with
‘--no-check-all-reference’, processing of that data file record
will be stopped as soon as a required match is not fulfilled. This is
a shortcut for speed, but it will change the meaning of any flags or
counters set in reference-file-based input.

‘-f’

‘--flag’

In any output based on a reference file, print a flag with the value
‘1’ if the record was matched by a data record or ‘0’
otherwise.

‘-n’

‘--count’

In any output based on a reference file, print a new field which is a
count of the number of data records that matched the current reference
record.

‘-z number’

‘--counter-size=number’

If counters or sums of data fields are being written into
reference-based output, make them number bytes long.

‘--statistics’

‘--no-statistics’

At the end of the processing, write to stderr counts of the
number of records read and matched from each file processed. The option
is enabled by default. When turned off with ‘--no-statistics’
print no statistics.

‘--verbose’

Write all the information that is possible about the processing to
stderr. This includes a summary of the options that have been
set and running counters of the number of records processed.

2.2 File Options

There are some attributes that apply to any file. They include such
things as the way to tell one record from another, the way to identify
fields within a record, and the fields in a file that you want to write
out.

‘-d string’

‘--output-field-delimiter=string’

Print string between each pair of fields written to output files.
This can be a comma, a tab character, your mother’s maiden name, or the
full text of the GNU General Public License. It is your choice. Of course,
it is most useful for it to be something that would not normally occur in
your data fields and something that your target system will understand.
Because combine doesn’t care about such things, you can even use the
null character (\0) or the newline character (\n).

‘--no-output-field-delimiter’

This means that there is to be no delimiter between fields in the output file
or files. It will prevent the input field delimiter from being taken as the
output field delimiter. In cases like this, it might behoove you either to
ensure that the fields have a fixed length or to use Guile to insert your own
delimiters before they are written. When specified along with
‘--output-field-delimiter’, the option that appears later will override
the prior one.

‘-b string’

‘--output-record-delimiter=string’

Similarly here string is whatever string you need to use to tell
the records apart in output files. If you don’t specify it, combine
will assume that the usual text format applies, the newline on GNU
operating systems. (As of this writing, no port has been made to systems
where a newline is not the standard end-of-line character.)

‘-D string’

‘--input-field-delimiter=string’

Look for string in the associated input file to determine where
fields start and end. It can be any string, but it should be the string
that really separates fields in your file. Because combine doesn’t
care about such things, you can even use the null character (\0)
or the newline character (\n). At present, there is no interpretation
of quoted strings. As a result, you may get unexpected results if the delimiter
shows up in the middle of one of your fields.

When specified for the data file, the value specified is used as the input
field delimiter for all reference files and as the output field delimiter for
any output files, unless other action is taken. The options
‘--no-input-field-delimiter’ and ‘--no-output-field-delimiter’
will tell combine that a given file should not have a delimiter. The
options ‘--input-field-delimiter’ and ‘--output-field-delimiter’
will use whatever delimiter you specify rather than the data file input field
delimiter.

‘--no-input-field-delimiter’

This means that there is no delimiter between fields in the specified input file
or files. It will prevent the data file input field delimiter from being taken
as the input field delimiter for a reference file. When specified along with
‘--output-field-delimiter’, the option that appears later will override
the prior one.

‘-B string’

‘--input-record-delimiter=string’

Similarly here string is whatever string you need to use to tell
the records apart in input files. If you don’t specify it, combine
will assume that the usual text format applies, the newline on GNU
operating systems.

‘-L number’

‘--input-record-length=number’

If the records are all of the same length (whether of not there is a
linefeed or a ‘W’ between them), you can specify a record length.
combine will then split the file into records all of exactly
NUMBER bytes long. If there are linefeeds or carriage returns
between records, be sure to add them to the count. If the length is not
specified, combine will assume the records have a delimiter between
them.

‘-p’

‘--match-optional’

Make it unnecessary to find a match to this file for inclusion in data
file based output and for the further processing of a data record. All
matches are required by default.

When applied to a reference file, this means that records based on a data
record may be written without a match to a key in this reference file,
and any output from fields in this file will be left blank (or substituted
if specified with the ‘-O’ option).

When applied to a data file, this means that in the file based on the data
file records, records will be written at the end of processing for reference
file records that were not matched by a data file record. Any output
specified from fields in the data file will be left blank (or substituted
if specified with the ‘-O’ option). If there is more than one reference
file, care should be taken, because the unmatched records from each reference
file will each become a record in the output file. This has no effect on
reference-file-based output.

‘-P’

‘--exclude-match’

Require no match to this file for inclusion in data file based output and
for the further processing of a data record. All matches are required and
included by default. This differs from ‘--match-optional’ in that
records that do have a match are excluded from further consideration.

When applied to a reference file, this means that records based on a data
record may only be written when there is no match to a key in this reference
file, and any output from fields in this file will be left blank (or
substituted if specified with the ‘-O’ option).

When applied to a data file, this means that in the file based on the data
file records, records will only be written at the end of processing for
reference file records that were not matched by a data file record. Any output
specified from fields in the data file will be left blank (or substituted
if specified with the ‘-O’ option). If there is more than one reference
file, care should be taken, because the unmatched records from each reference
file will each become a record in the output file. This has no effect on
reference-file-based output.

2.3 Data Files

A data file is either stdin, one or more files, or both. If
there is more than one source, they should all share a format that fits
the specifications provided for output fields, fields holding numeric
values to be summed, and fields to be used as keys for matching.

The following are the commands that specifically affect the processing
of the data file. Some share the same name with options that can also
apply to reference files. Those must be on your command line before you
present your first reference file. The others could appear anywhere on
the command line; however, convention dictates that all the data-file
options should be together before the first reference file is named.

‘-w’

‘--write-output’

Signals the program that output records should be written every time a
data record satisfies the matching criteria with the reference files.
The record written will combine all specified output fields from the
data file record, the matching reference file record(s), and any specified
constant values. This option is a positional option for all files, and
must appear before the first reference file is named.

‘-t filename’

‘--output-file=filename’

If provided, write the data-based output to filename. Otherwise
the output will go to stdout. This option only makes sense if
you plan to write data-file-based output. This is a positional option
for all files and must appear before the first reference file is named.

‘-o range_string’

‘--output-fields=range_string’

Write the fields specified by range_string as part of the record
in any data-based output. The range specifications share a common
format with all field specifications for combine. This option only
makes sense if you plan to write data-file-based output. This is a
positional option and to apply to the data file it must appear before
the first reference file is named.

‘-K string’

‘--output-constant=string’

Write string to the data-file-based output. This option only
makes sense if you plan to write data-file-based output. This is a
positional option and to apply to the data file it must appear before
the first reference file is named.

‘-s range_string’

‘--sum-fields=range_string’

In any reference-file-based output write a sum of the value of the
fields specified by range_string. The sum for a given reference
file record will come from all data file records that matched it. This
option only makes sense if you have requested reference-file-based
output.

If a precision is provided for a sum field, that precision is honored,
and decimal fractions up to that many digits will be included in the sum.
Any further precision on the input fields will be ignored (without
rounding) in the sum. The resulting sum will be written with a decimal
point and the number of fractional digits implied by the precision.

2.4 Reference Files

A reference file record is expected to match on a set of key fields to a
data file record. The parts of a reference file that are necessary for
processing are read entirely into memory. You can specify as many
reference files as you want, depending only on the amount of memory your
system can spare to hold them. For any reference file, it is minimally
required that you specify a file name, a specification of the key fields
in the reference file, and a specification of the matching key fields in
the data file.

The following are the options that are related to reference files. They
are all positional, and they apply to the processing of the previously
named reference file. (Except of course for the reference file name
itself, which applies to itself.)

‘-r filename’

‘--reference-file=filename’

Use filename as a reference file to match to the data file in
processing. This option introduces a block of positional options that
relate to this reference file’s processing in combine.

‘-k range_string’

‘--key-fields=range_string’

Use the fields specified by range_string as a key to match to a
corresponding key in the data file.

‘-m range_string’

‘--data-key-fields=range_string’

Use the fields specified by range_string as the corresponding key
to a key taken from a reference file.

‘-a range_string’

‘--hierarchy-key-fields=range_string’

Use the fields specified by range_string as a key to perform a
recursive hierarchical match within the reference file. This key will
be matched against values specified in the regular key on the reference
file.

‘-u’

‘--unique’

Keep only one record for the reference file in memory for each distinct
key. By default combine maintains all the records from the
reference file in memory for processing. This default allows for
cartesian products when a key exists multiple times in both the
reference and data files.

‘-h number’

‘--hash-size=number’

Use the number provided as a base size for allocating a hash table
to store the records from this reference file. If this number is too
small, combine will fail when it tries to record a record it has no
room for. If it is only a little bit too small, it will cause
inefficiency as searching for open space in the hash table will be
difficult.

‘-H keyword’

‘--hash-movement=keyword’

One of the keywords binary, number, beginning,
or end, indicating how to turn the key into a number with
the best variability and least overlap. The wise choice of this option
can cut processing time significantly. The binary option is the default,
and treats the last few bytes (8 on most computers) of the key string(s)
as a big number. The number option converts the entire key to a number
assuming it is a numeric string. The other two take the least significant
3 bits from each of the first or last few (21 where a 64 bit integer is
available) bytes in the key strings and turns them into a number.

‘-w’

‘--write-output’

Signals the program that output records should be written for every
record stored for this reference file. This will either be one record
for every record in the reference file or one record for every distinct
set of keys in the reference file, depending on the setting of the
option ‘--unique’. The record written will include all specified
output fields from the reference file record, any specified constant
value for this reference file, and any flag, counter, or sums requested.

‘-t filename’

‘--output-file=filename’

If provided, write the output based on this reference file to
filename. Otherwise the output will go to stdout. This
option only makes sense if you plan to write output based on this
reference file.

‘-o range_string’

‘--output-fields=range_string’

Write the fields specified by range_string as part of the record
in any reference-file- or data-file-based output. The range
specifications share a common format with all field specifications for
combine.

‘-K string’

‘--output-constant=string’

Write string to the reference- or data-file-based output.

‘-U’

‘--up-hierarchy’

When traversing the hierarchy from a given reference-file record, use
the values on that record in the ‘--hierarchy-key-fields’ fields
to connect to the ‘--key-fields’ fields of other records from the
reference file. For most purposes, the presence of the connection on the
first record suggests a single parent in a standard hierarchy. The hierarchy
traversal stops when the ‘--hierarchy-key-fields’ fields are empty.

If this option is not set, the ‘--key-fields’ fields are used to search
for the same values in the ‘--hierarchy-key-fields’ fields of other
records in the same file. This allows multiple children of an initial record,
and suggests going down in the hierarchy. The hierarchy traversal stops when
no further connection can be made. The traversal is depth-first.

‘-l’

‘--hierarchy-leaf-only’

When traversing a hierarchy, treat only the endpoints as matching records.
Nodes that have onward connections are ignored except for navigating to the
leaf nodes.

‘-F number’

‘--flatten-hierarchy=number’

When traversing a hierarchy, act as the ‘hierarchy-leaf-only’, except
save information about the intervening nodes. Repeat the ‘output-fields’
fields number times (leaving them blank if there were fewer levels),
starting from the first reference record matched.

2.5 Output Files

There are two basic kinds of output files: one based on the data records
and reference records that match them, the other based on a full set of
records from one reference file with flags, counts, or sums based on the
aggregate of the matching data records.

The output file based on the data file consists of information from the
data file records and any matching reference file records. The records
that go into data-based output files can be figured out as follows:

no reference file

If there is no reference file, there will be one record for every record
in the data file, with the exception of any records that were elimitated
through an extension filter. (see section Extending combine.)

reference files with ‘--unique’ and ‘--match-optional’ options

If all reference files are specified with the ‘--unique’ and
‘--match-optional’ options, then the records selected for insertion
into the data-based output file will be the same as those that would be
selected without a reference file.

reference files without the ‘--unique’ option

If a reference file is not given the ‘--unique’ option and there
is more than one reference record that matches a given data record, then
the data record will be represented more than once in the output file,
each time combined with information from a different matching reference
record. If there is more than one reference file where this is the
case, the result will be multiplicative (e.g. 2 matches in each of 2
reference files will produce 4 records). This is the default setting
for a reference file.

reference files without the ‘--match-optional’ option

If a reference file is not given the ‘--match-optional’ option, then
any data record that does not have a match in the reference file will
not be represented in the output file. This is the default setting.

The fields that can appear in data=file-based output can come from the
data-file record and any matching reference file records.

Reference=file-based output files are simpler. Depending on the
existence or not of the ‘--unique’ option, the file will have an
entry for each of the unique keys or for each of the records in the
reference file, respectively.

The fields in the reference=file-based output are exclusively from the
reference file, except for optional fields that can be summarized from
fields on matching data-file records.

The order of the fields in an output record can either be according to
the default or it can be explicitly specified by the user.

In data=file-based output, the standard field order is as follows. All the
fields listed are printed in this order if they are specified. If an
output field delimiter is specified, it is put between every pair of
adjacent fields. If there is no match for a given reference file (and
the ‘--match-optional’ option is set for the file), all the fields that
would normally be provided by that file are filled with spaces for
fixed-width fields or zero-length for delimited output.

All the data-file output fields (in order)

The constant string set for the data file

For each reference file

- The constant string set for the reference file

- All the reference-file output fields

In reference=file-based output, the standard field order is as follows.
All the fields listed are printed in this order if they are specified.
If an output field delimiter is specified, it is put between every pair
of adjacent fields.

All the reference-file output fields OR the key fields if no output fields are given

A 1/0 flag indicating whether there was any match

A counter of the number of data records matched

A sum of each of the data-file sum fields from each matching data-file record

The order of the fields in any output file can be customized using the
‘--field-order’ (or ‘-O’) option. The argument for the
option is a comma-separated list of field identifiers. Each field
identifier has 2 parts, a source and a type, separated by a period (.).

The sources are composed of an ‘r’ for reference file or ‘d’
for data file followed by an optional number. The number indicates
which reference file the field comes from and is ignored for data files.
Without a number, the first of each is taken.

A third source ‘s’ represents a substitution in the event that the
preceding reference file field could not be provided because there was
no match between that reference file and the data file. The number
following it, if blank or zero, tells combine to take the field from
the data file. Any other number means the corresponding reference file.
This allows the conditional update of fields from the data file, or a
prioritization of selections from a variety of reference files. If you
are working with fixed-width fields, you should ensure that the lengths
of the various fields in the substitution chain are the same.

The types are composed similarly. The identifiers are listed below.
The number is ignored for identifiers of string constants, flags, and
counters. For output fields, a hyphen-separated range of fields can be
written to avoid having to write a long list. Any number provided is
the number of the field in the order it was specified in the ‘-o’
or ‘-s’ option on the command line. In delimited-field files
this may differ from the field number used in those options.

In this case, the first three fields from the data file are followed by
the constant string from the data file. Then, if there was a match to
reference file 1, the first field from that file is taken, otherwise if
there was a match to reference file 2, the first field from that file is
taken. If neither file matched the data record, the fourth field from
the data record is taken instead.

The second line is equivalent, using a range of fields for convenience.

2.6 Field Specifiers

There are a number of options that require a list of fields from one of
the files. All take an argument consisting of a comma-separated list of
individual field specifications.

The individual field specifications are of the form
‘s-e.p(instruction)’. As a standard (with no field delimiter
given for the file), ‘s’ is a number indicating the starting
position of the field within the records in th given file, and ‘e’
is a number pointing out the ending position. If there is no hyphen, the
single number ‘s’ represents a one-byte field at position ‘s’.
If the first number and a hyphen exist without the number ‘e’, then
the field is assumed to start at ‘s’ and extend to the end of the
record. The numbering of the bytes in the record starts at 1.

If you do provide a delimiter for identifying the borders between fields
in your file, then combine does not need to count bytes, and you
just need to provide the field number (starting at 1) of the fields you
want to declare. Currently, giving a range of fields results in
combine acting as though you listed every field from the lower end
to the upper end of the range individually. Any precision or
extensibility information you provide will be applied to each field in
the range.

The optional ‘.p’ gives the number of digits of precision with
which the field should be considered. At present its only practical use
is in fields to be summed from data files, where it is used in converting
between string and number formats.

The optional ‘(instruction)’ is intended for extensibility. For
output fields and key fields it is a scheme command that should be run
when the field is read into the system. If the associated field
specification is a key for matching reference and data records, this can
affect the way the records are matched. If it is a field to be written
to an output file, it will affect what the user sees.

To refer to the field actually read from the source data within the
scheme command, use the following sceme variable in the appropriate place
in the command: ‘input-field’

Scheme commands attached to individual fields need to return a character
string and have available only the current text of the field as
described above(1). In delimited
output, feel free to alter the length of the return string at will. In
fixed-width output, it will be your responsibility to ensure that you
return a string of a useful length if you want to maintain the fixed
structure.

In the following string,

1-10,15-20.2,30(HiMom input-field),40-

The fist field runs from position 1 to 10 in the record. The second from
15 to 20, with a stated (but possibly meaningless) precision of 2. The
third is the single byte at position 30, to be adjusted by replacing it
with whatever the scheme function HiMom does to it. The fourth
field starts at position 40 and extends to the end of the record.

In a delimited example, the following two sets of options are equivalent.

-D ',' -o 4,5,6,7,8
-D ',' -o 4-8

In both cases the fourth, fifth, sixth, seventh, and eighth
comma-delimited fields are each treated as an individual field for
processing. Note that if you also provide a field order (with the
‘-O’ option), that order would refer to these fields as 1, 2, 3,
4, and 5 because of the order in which you specified them, not caring
about the order in the original file.

2.7 Emulation

Because of its simularity with join, combine has an emulation
mode that allows you to use the syntax of the join command to start
combine.

The emulation can be started by using ‘--emulate join’ option as the
first option after the command name. After that, you can use the same
options you get with join.

For details of the join command, see See (coreutils.info)join invocation section ‘Join Invocation’ in GNU Coreutils Manual.
When tested against the test cases packaged with GNU Coreutils, results are
identical to join with a couple of exceptions:

The sort order can be different. combine produces records in the order
of the second input file, with any unmatched records from the first input file
following in an arbitrary order.

I change the arguments to the ‘-o’ option in the test script to have
quotes around the field order list when it is separated by spaces. combine
can handle the space-delimited list, but the standard argument handler
getopt_long does not interpret them as a single argument. I don’t see the
need to overcome that.

There is not a specific test, but I have not yet implemented case-insensitive
matching. It would have failed if tested.

One more option that is not implemented in the emulation are the ‘-j1’
and ‘-j2’ methods of specifying separate keys for the two files. Use
‘-1’ and ‘-2’ for that. The two obsolete options would be interpreted
as speciying field 1 or field 2 as the common join key, respectively.

There are also a number of features of combine that come through in the
emulation. The main features relate to the keys: the sort order of the records
in relation to the keys does not matter to combine. combine also
allows you to specify a list of key fields (comma-delimited) rather than just
one as arguments to ‘-1’, ‘-2’, and ‘-j’. You should make
sure that the number of key fields is the same.

Another feature is that the second input file can actually be as many files as
you want. That way you can avoid putting the records from several files together
if not otherwise necessary.

3. How combine Processes Files

The base of combine reads records from a data file (or a series of
them in a row) and if there is an output request for data records, it
writes the requested fields out to a file or to stdout. Here is
an example of this most simple version of events.

combine --write-output --output-fields=1-

This is essentially an expensive pipe. It reads from stdin and
writes the entire record back to stdout.

Introducing a reference file gives more options. Now combine reads
the reference file into memory before reading the data file. For every
data record, combine then checks to see if it has a match. The
following example limits the simple pipe above by restricting the output
to those records from stdin that share the first 10 bytes in
common with a record in the reference file.

Note that the option ‘--unique’ is used here to prevent more than
one copy of a key from being stored by combine. Without it,
duplicate keys in the reference file, when matched, would result in
more than one copy of the matching data record.

The other option with a reference file is to have output based on the
records in that file, with indicators of how the data file records were
able to match to them. In the next example, the same match as above is
done, but this time we write out a record for every unique key, with a
flag set to ‘1’ if it was matched by a data record or ‘0’
otherwise. It still reads the data records from stdin and writes
the output records to stdout.

combine -r -f reference_file.txt -k 1-10 -m 1-10 -u -w -o 1-10

Of course, you might want both sets of output at the same time: the list
of data records that matched the keys in the reference file and a list
of keys in the reference file with an indication of which ones were
matched. In the prior two examples the two different kinds of output
were written to stdout. You can still do that if you like, and
then do a little post-processing to determine where the data-based
records leave off and the reference-based records begins. A simpler
way, however, is to let combine write the information to separate
files.

In the following example we combine the output specifications from the
prior two examples and give them each a filename. Note that the first
one has a spelled-out ‘--output-file’ while the second one uses the
shorter 1-letter option ‘-t’.

4. Extending combine

If combine was built with Guile (GNU’s Ubiquitous Intelligent
Language for Extensibility), you can do anything you want (within reason) to
extend combine. This would have been set up when combine was
compiled and installed on your computer. In a number of places, there
are built-in opportunities to call Guile with the data that is currently in
process. Using these options, you can use your favorite modules or write your
own functions in scheme to manipulate the data and to adjust how combine
operates on it.

The most common method (in my current usage) of extending combine is
to alter the values of fields from the input files before they are used
for matching or for output. This is done inside the field list by
adding the scheme statement after the range and precision. This is
covered in the section on field specifications. See section Field-specific extensions, for details.

Another useful option is the ability to initialize Guile with your own
program. To do this, you can use the ‘--extension-init-file’ (or
‘-X’) followed by a file name. combine will load that
scheme file into Guile before any processing. In that way your
functions will be available when you need them in the running of the
program. It certainly beats writing something complicated on the
command line.

In addition, there are Guile modules included in the distribution, which
can be used in extension scripts.

4.1 Extension Options

The remaining extensibility options are called at various points in the
program: when it starts, when a file is started, when a match is found,
when a record is read, when a record is written, when a file is closed,
and at the very end of the program. The options are listed below along
with the way to get access to the relevant data.

The various non-field=specific options are as follows. They all occur
as arguments to the option ‘--extension’ (or ‘-x’).

‘lscheme-command’

Filter records from the current file using the scheme command provided.
The scheme command must return ‘#t’ (to keep processing the record)
or ‘#f’ (to ignore this record and move on to the next). The
variables ‘reference-field-n’ or ‘data-field-n’ will be
available to the scheme command, depending on whether the record to be
filtered is from the data file or a reference file. In the variable
names ‘n’ represents the number of the specified output field,
numbered from 1.

‘mscheme-command’

Validate a proposed match using the scheme command provided. The scheme
command must return ‘#t’ (to confirm that this is a good match) or
‘#f’ (to tell combine that this is not a match). The variables
‘reference-field-n’ and ‘data-field-n’ will be available to the
scheme command from the reference and data records involved in a
particular match. In the variable names ‘n’ represents the number
of the specified output field, numbered from 1. The extension
specification affects the match between the data file and the last named
reference file.

‘hscheme-command’

Validate a proposed match between two records in the same hierarchy using
the scheme command provided. The scheme command must return ‘#t’
(to confirm that this is a good match) or ‘#f’ (to tell combine
that this is not a match). The variables ‘reference-field-n’ and
‘prior-reference-field-n’ will be available to the scheme command
from the prior and current reference records involved in a particular
match. In the variable names ‘n’ represents the number of the
specified output field, numbered from 1. The extension specification
affects the match while traversing the hierarchs in the last named
reference file.

‘rscheme-command’

Modify a record that has just been read using the scheme command
provided. The scheme command must return a string, which will become
the new value of the input record to be processed. The input record
iteself can be referred to in the scheme command by using the variable
‘input-record’ in the scheme command at the right place. The
records affected by this option are the records from the most recently
named reference file, or from the data file if no reference file has
yet been named.

As an example, consider that you may have received a file from someone
who strips all the trailing spaces from the end of a record, but you
need to treat it with a fixed-width record layout. Assuming that you
have defined a scheme function rpad in the initialization file
‘util.scm’, you can use the following command to get at the field
in positions 200-219, with spaces in place of the missing rest of the
record.

4.2.1 Calendar Functions

Included in the combine package are two Guile modules to work with dates
from a number of calendars, both obscure and common. The basis for them is
the set of calendar functions that are shipped with Emacs.

The reason that these functions deserve special notice here is that date
comparisons are a common type of comparison that often cannot be made
directly on a character string. For example I might have trouble knowing
if "20030922" is the same date as "22 September 2003" if I compared strings;
however, comparing them as dates allows me to find a match. We can even
compare between calendars, ensuring that "1 Tishri 5764" is recognized as the
same date as "20030927".

The calendar module can be invoked as (use-modules (combine_scm calendar)).
It provides functions for converting from a variety of calendars to and from
and absolute date count, whose 0-day is the imaginary date 31 December 1 B.C.
In the functions, the absolute date is treated as a single number, and the
dates are lists of numbers in (month day year) format unless otherwise
specified.

4.2.2 Calendar Reference

Here are some variables that can be used as references to get names
associated with the numbers that the date conversion functions produce
for months.

‘gregorian-day-name-alist’

An associative list giving the weekdays in the Gregorian calendar
in a variety of languages. Each element of this list is a list
composed of a 2-letter language code (lowercase) and a list of 7
day names.

‘gregorian-month-name-alist’

An associative list giving the months in the Gregorian calendar
in a variety of languages. Each element of this list is a list
composed of a 2-letter language code (lowercase) and a list of 12
month names.

‘calendar-islamic-month-name-array’

A list of the months in the Islamic calendar.

‘calendar-hebrew-month-name-array-common-year’

A list of the months in the standard Hebrew calendar.

‘calendar-hebrew-month-name-array-leap-year’

A list of the months in the leap year Hebrew calendar.

‘chinese-calendar-celestial-stem’

‘chinese-calendar-terrestrial-branch’

‘lunar-phase-name-alist’

‘solar-n-hemi-seasons-alist’

‘solar-s-hemi-seasons-alist’

‘french-calendar-month-name-array’

A list of the months in the French Revolutionary calendar.

‘french-calendar-multibyte-month-name-array’

A list of the months in the French Revolutionary calendar, using
multibyte codes to represent the accented characters.

‘french-calendar-day-name-array’

A list of the days in the French Revolutionary calendar.

‘french-calendar-multibyte-special-days-array’

A list of the special days (non weekdays) in the French
Revolutionary calendar, using multibyte codes to represent the accented
characters.

‘french-calendar-special-days-array’

A list of the special days (non weekdays) in the French Revolutionary calendar.

4.2.3 Calendar Parsing

The calendar parsing module can be invoked as (use-modules (combine_scm parse)).

The most useful function in the module is parse-date. It takes as arguments
a date string and an output format. The date string is parsed as well as possible
in descending order of preference for format in case of ambiguity. The function
returns the date triplet (or other such representation) suggested by the format
string.

The supported format strings are the words in the function names of the form
calendar-xxxx-from-absolute that would take the place of the xxxx.
See section Calendar Functions, for more information.

The parsing of the date string depends on the setting of a couple of variables.
Look inside the file ‘parse.scm’ for details. The list parse-date-expected-order
lists the order in which the parser should look for the year, month, and day in case
of ambiguity. The list parse-date-method-preference give more general format
preferences, such as 8-digit, delimited, or a word for the month and the expected incoming
calendar.

Here are a few examples of passing a date and putting it out in some formats:

The 13 day difference in the calendars is the reason that the Orthodox Christmas
is 2 weeks after the Roman Catholic Christmas.

guile> (parse-date "27 September 2003" "hebrew")
(7 1 5764)

Note that the Hebrew date is Rosh HaShannah, the first day of the year 5764.
The reason that the month is listed as 7 rather than 1 is inherited from the
Emacs calendar implementation. Using the month list in
calendar-hebrew-month-name-array-common-year or
calendar-hebrew-month-name-array-leap-year correctly gives "Tishri", but
since the extra month (in years that have it) comes mid-year, the programming
choice that I carried forward was to cycle the months around so that the extra
month would come at the end of the list.

I won’t get into the detail, but the five numbers reflect the date in
the Mayan calendar as currently understood.

Generally, I’d recommend using the more specific functions if you are sure
of the date format you expect. For comparing dates, I would further recommend
comparing the absolute day count rather than any more formatted format.

5. Applying combine

There are lots of ways to apply combine to problems that will pop up
in the course of day-to-day life. (Of course this fits into your day-to-day
life a bit more if you have a job or hobby that regularly puts you into
contact with data files.)

5.1 Rearranging Fields

When you do not use any reference files, combine still gives you the
opportunity to create a new record layout based on the records you read.

This is an advantage over the cut utility because while cut
only allows you to omit portions of the record, combine also allows
you to reorder those fields you keep and to add a constant field somewhere in the
order. In addition, combine gives you the chance to convert
between fixed-width and delimited formats, where cut keeps the
format you started with (although the GNU version does let you change
delimiters).

Clearly, flexible tools like awk or sed or any programming
language will also make this kind of thing (and with a little work anything
combine can do) possible. It may be that they are a more efficient choice,
but I have never tested it.

As an example, here is a fixed width file, which contains in its record layout
some address book information. If I need to make a tab-delimited file of
names and phone numbers to upload into my mobile phone, I can use the command
that follows to get the output I want.

Here is a command that grabs the first and last name and the phone number
and tacks the word "Home" on the end so that my phone marks the number with
a little house.(2)

Note that the statistics and the output all show up on the screen if you do
not say otherwise. The statistics are on stderr and the output on
stdout, so you can redirect them differently. You can also use the
option ‘--output-file’ (or ‘-t’) to provide an output file,
and you can suppress the statistics if you want with ‘--no-statistics’.

5.2 Aggregation

One feature of combine is aggregation. In other words, based on
a set of keys in a data file, create a new file containing a summary record
for each unique combination of values that appears in those keys.

This is a special case of reference-file based output in that the reference
file and the data file are the same file. We are using it both to build
the list of keys and to provide the values to summarize.

Here is a simple example. Suppose my 4-year-old were flipping
Sacajawea(3)
dollar coins and recording the results, along with the time and the date.
A portion of the resulting file might look like this:

Then if I wanted to measure her daily performance, I could count the number
of coin flips per day, with the following command.

The option ‘--data-is-reference’ tells the program to read the file
only once, bulding the list of keys as it goes. This is useful for
aggregations, and it requires that the key fields be the same and that
we only keep one copy of each key (with the ‘--unique’ option).

The records come out in an unspecified order. If you require a specific
order, you can pipe the result through the sort command.

An equivalent SQL statement, assuming that the table has been created
would be.

SELECT Date, COUNT (Result)
FROM Coinflip
GROUP BY Date;

If she wanted to count the number of heads and tails, and also the
amount she bet on each, she could do the following. (I have shortened
the options you have already seen to their 1-byte equivalents.)

The specification of the field to sum says to keep 2 decimal places.
combine works with integers if you don’t say otherwise. The
option ‘--counter-size’ tells the program to use at least that
many bytes to present counters and sums.

5.3 Finding Changes

If you get the same file again and again, and you want to see what’s
different, the canonical method is to use the diff command.
It will list for you those records that are unique to each file, and
those that appear to have changed. One drawback is that it requires
the records in the two files to be in the same order. Another is that
it does not pay attention to the key fields that may be important to
you.

When I want to find the changes in a file, including additions,
deletions, and any changes for the information about a key I normally
use 4 applications of combine. One command is suffucuent to
find the new keys. One is enough to find those that are no longer there.
A third finds everything that is different, and the fourth finds the
before and after image of keys whose information has changed.

As an example, suppose my bridge club’s roster has a name and a phone
number, separated by commas, for each member. The following four
commands will identify the new members, the departed members, and
those whose phone number has changed.

5.4 Hierarchies

A hierarchy tends to be an organization where there is a directional
one-to-many relationship between parent records and their associated
children. combine works with hierarchies within reference files
when the file has a record for each node and each record points to its
parent in the hierarchy.

Because the hierarchy is assumed to be stored in a reference file, it
is accessed by matching to a data record. Once an individual reference
record has been matched to the data record, its relationship to other
records within the hierarchy is followed through the hierarchy until
there is no further to go.

The standard process is to assume that the key that matched to the
data file key is at the top of the hierarchy. When traversing the
hierarchy, combine looks for the key on the current record in
the hierarchy key of other reference records. This repeats until
there are no further linkages from one record to the next. For each
record that is linked to the hierarchy, that record is treated as a
reference record that matched the data record.

In this section, we’ll use the following hierarchy file. It is a simple
hierarchy tree with ‘Grandfather’ as the top node and 2 levels of
entries below.

If my data file consisted only of a record with the key ‘Grandfather’,
then the following command would result in the records listed after it.
Each record written includes the entry itself and its parent.

We can arrive at the same number of records, each containing the entire hierarchy
traversed to get to the leaf nodes, by using the option ‘--flatten-hierarchy’
(‘-F’). This option takes a number as an argument, and then includes
information from that many records found in traversing the hierarchy, starting from
the record that matched the data record. This example tells combine to report
three levels from the matching ‘Grandfather’ record.

As with other areas within combine, the hierarchy manipulation is extensible
through Guile. The key fields can be modified as with any other fields. See section Field-specific extensions, for details. The matches within the hierarchy
can be further filtered, using the ‘h’ suboption of the option ‘-x’.
(see section Extending combine.) As with matches between reference records and data
this filtering can allow you to perform fuzzy comparisons, to do more complex
calculations to filter the match, or to decide when you have gone far enough and would
like to stop traversing the hierarchy.

6. Future Work

Here is a short list of things that we think will be useful in the near
future:

Indexed files. Add the ability to work with a reference file through an
index rather than loading the whole thing into a hash table. Open
questions are the format of the index and whether to put the entire
index into memory, or work with it from disk.

The purpose of this License is to make a manual, textbook, or other
functional and useful document free in the sense of freedom: to
assure everyone the effective freedom to copy and redistribute it,
with or without modifying it, either commercially or noncommercially.
Secondarily, this License preserves for the author and publisher a way
to get credit for their work, while not being considered responsible
for modifications made by others.

This License is a kind of “copyleft”, which means that derivative
works of the document must themselves be free in the same sense. It
complements the GNU General Public License, which is a copyleft
license designed for free software.

We have designed this License in order to use it for manuals for free
software, because free software needs free documentation: a free
program should come with manuals providing the same freedoms that the
software does. But this License is not limited to software manuals;
it can be used for any textual work, regardless of subject matter or
whether it is published as a printed book. We recommend this License
principally for works whose purpose is instruction or reference.

APPLICABILITY AND DEFINITIONS

This License applies to any manual or other work, in any medium, that
contains a notice placed by the copyright holder saying it can be
distributed under the terms of this License. Such a notice grants a
world-wide, royalty-free license, unlimited in duration, to use that
work under the conditions stated herein. The “Document”, below,
refers to any such manual or work. Any member of the public is a
licensee, and is addressed as “you”. You accept the license if you
copy, modify or distribute the work in a way requiring permission
under copyright law.

A “Modified Version” of the Document means any work containing the
Document or a portion of it, either copied verbatim, or with
modifications and/or translated into another language.

A “Secondary Section” is a named appendix or a front-matter section
of the Document that deals exclusively with the relationship of the
publishers or authors of the Document to the Document’s overall
subject (or to related matters) and contains nothing that could fall
directly within that overall subject. (Thus, if the Document is in
part a textbook of mathematics, a Secondary Section may not explain
any mathematics.) The relationship could be a matter of historical
connection with the subject or with related matters, or of legal,
commercial, philosophical, ethical or political position regarding
them.

The “Invariant Sections” are certain Secondary Sections whose titles
are designated, as being those of Invariant Sections, in the notice
that says that the Document is released under this License. If a
section does not fit the above definition of Secondary then it is not
allowed to be designated as Invariant. The Document may contain zero
Invariant Sections. If the Document does not identify any Invariant
Sections then there are none.

The “Cover Texts” are certain short passages of text that are listed,
as Front-Cover Texts or Back-Cover Texts, in the notice that says that
the Document is released under this License. A Front-Cover Text may
be at most 5 words, and a Back-Cover Text may be at most 25 words.

A “Transparent” copy of the Document means a machine-readable copy,
represented in a format whose specification is available to the
general public, that is suitable for revising the document
straightforwardly with generic text editors or (for images composed of
pixels) generic paint programs or (for drawings) some widely available
drawing editor, and that is suitable for input to text formatters or
for automatic translation to a variety of formats suitable for input
to text formatters. A copy made in an otherwise Transparent file
format whose markup, or absence of markup, has been arranged to thwart
or discourage subsequent modification by readers is not Transparent.
An image format is not Transparent if used for any substantial amount
of text. A copy that is not “Transparent” is called “Opaque”.

Examples of suitable formats for Transparent copies include plain
ASCII without markup, Texinfo input format, LaTeX input
format, SGML or XML using a publicly available
DTD, and standard-conforming simple HTML,
PostScript or PDF designed for human modification. Examples
of transparent image formats include PNG, XCF and
JPG. Opaque formats include proprietary formats that can be
read and edited only by proprietary word processors, SGML or
XML for which the DTD and/or processing tools are
not generally available, and the machine-generated HTML,
PostScript or PDF produced by some word processors for
output purposes only.

The “Title Page” means, for a printed book, the title page itself,
plus such following pages as are needed to hold, legibly, the material
this License requires to appear in the title page. For works in
formats which do not have any title page as such, “Title Page” means
the text near the most prominent appearance of the work’s title,
preceding the beginning of the body of the text.

The “publisher” means any person or entity that distributes copies
of the Document to the public.

A section “Entitled XYZ” means a named subunit of the Document whose
title either is precisely XYZ or contains XYZ in parentheses following
text that translates XYZ in another language. (Here XYZ stands for a
specific section name mentioned below, such as “Acknowledgements”,
“Dedications”, “Endorsements”, or “History”.) To “Preserve the Title”
of such a section when you modify the Document means that it remains a
section “Entitled XYZ” according to this definition.

The Document may include Warranty Disclaimers next to the notice which
states that this License applies to the Document. These Warranty
Disclaimers are considered to be included by reference in this
License, but only as regards disclaiming warranties: any other
implication that these Warranty Disclaimers may have is void and has
no effect on the meaning of this License.

VERBATIM COPYING

You may copy and distribute the Document in any medium, either
commercially or noncommercially, provided that this License, the
copyright notices, and the license notice saying this License applies
to the Document are reproduced in all copies, and that you add no other
conditions whatsoever to those of this License. You may not use
technical measures to obstruct or control the reading or further
copying of the copies you make or distribute. However, you may accept
compensation in exchange for copies. If you distribute a large enough
number of copies you must also follow the conditions in section 3.

You may also lend copies, under the same conditions stated above, and
you may publicly display copies.

COPYING IN QUANTITY

If you publish printed copies (or copies in media that commonly have
printed covers) of the Document, numbering more than 100, and the
Document’s license notice requires Cover Texts, you must enclose the
copies in covers that carry, clearly and legibly, all these Cover
Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
the back cover. Both covers must also clearly and legibly identify
you as the publisher of these copies. The front cover must present
the full title with all words of the title equally prominent and
visible. You may add other material on the covers in addition.
Copying with changes limited to the covers, as long as they preserve
the title of the Document and satisfy these conditions, can be treated
as verbatim copying in other respects.

If the required texts for either cover are too voluminous to fit
legibly, you should put the first ones listed (as many as fit
reasonably) on the actual cover, and continue the rest onto adjacent
pages.

If you publish or distribute Opaque copies of the Document numbering
more than 100, you must either include a machine-readable Transparent
copy along with each Opaque copy, or state in or with each Opaque copy
a computer-network location from which the general network-using
public has access to download using public-standard network protocols
a complete Transparent copy of the Document, free of added material.
If you use the latter option, you must take reasonably prudent steps,
when you begin distribution of Opaque copies in quantity, to ensure
that this Transparent copy will remain thus accessible at the stated
location until at least one year after the last time you distribute an
Opaque copy (directly or through your agents or retailers) of that
edition to the public.

It is requested, but not required, that you contact the authors of the
Document well before redistributing any large number of copies, to give
them a chance to provide you with an updated version of the Document.

MODIFICATIONS

You may copy and distribute a Modified Version of the Document under
the conditions of sections 2 and 3 above, provided that you release
the Modified Version under precisely this License, with the Modified
Version filling the role of the Document, thus licensing distribution
and modification of the Modified Version to whoever possesses a copy
of it. In addition, you must do these things in the Modified Version:

Use in the Title Page (and on the covers, if any) a title distinct
from that of the Document, and from those of previous versions
(which should, if there were any, be listed in the History section
of the Document). You may use the same title as a previous version
if the original publisher of that version gives permission.

List on the Title Page, as authors, one or more persons or entities
responsible for authorship of the modifications in the Modified
Version, together with at least five of the principal authors of the
Document (all of its principal authors, if it has fewer than five),
unless they release you from this requirement.

State on the Title page the name of the publisher of the
Modified Version, as the publisher.

Preserve all the copyright notices of the Document.

Add an appropriate copyright notice for your modifications
adjacent to the other copyright notices.

Include, immediately after the copyright notices, a license notice
giving the public permission to use the Modified Version under the
terms of this License, in the form shown in the Addendum below.

Preserve in that license notice the full lists of Invariant Sections
and required Cover Texts given in the Document’s license notice.

Include an unaltered copy of this License.

Preserve the section Entitled “History”, Preserve its Title, and add
to it an item stating at least the title, year, new authors, and
publisher of the Modified Version as given on the Title Page. If
there is no section Entitled “History” in the Document, create one
stating the title, year, authors, and publisher of the Document as
given on its Title Page, then add an item describing the Modified
Version as stated in the previous sentence.

Preserve the network location, if any, given in the Document for
public access to a Transparent copy of the Document, and likewise
the network locations given in the Document for previous versions
it was based on. These may be placed in the “History” section.
You may omit a network location for a work that was published at
least four years before the Document itself, or if the original
publisher of the version it refers to gives permission.

For any section Entitled “Acknowledgements” or “Dedications”, Preserve
the Title of the section, and preserve in the section all the
substance and tone of each of the contributor acknowledgements and/or
dedications given therein.

Preserve all the Invariant Sections of the Document,
unaltered in their text and in their titles. Section numbers
or the equivalent are not considered part of the section titles.

Delete any section Entitled “Endorsements”. Such a section
may not be included in the Modified Version.

Do not retitle any existing section to be Entitled “Endorsements” or
to conflict in title with any Invariant Section.

Preserve any Warranty Disclaimers.

If the Modified Version includes new front-matter sections or
appendices that qualify as Secondary Sections and contain no material
copied from the Document, you may at your option designate some or all
of these sections as invariant. To do this, add their titles to the
list of Invariant Sections in the Modified Version’s license notice.
These titles must be distinct from any other section titles.

You may add a section Entitled “Endorsements”, provided it contains
nothing but endorsements of your Modified Version by various
parties—for example, statements of peer review or that the text has
been approved by an organization as the authoritative definition of a
standard.

You may add a passage of up to five words as a Front-Cover Text, and a
passage of up to 25 words as a Back-Cover Text, to the end of the list
of Cover Texts in the Modified Version. Only one passage of
Front-Cover Text and one of Back-Cover Text may be added by (or
through arrangements made by) any one entity. If the Document already
includes a cover text for the same cover, previously added by you or
by arrangement made by the same entity you are acting on behalf of,
you may not add another; but you may replace the old one, on explicit
permission from the previous publisher that added the old one.

The author(s) and publisher(s) of the Document do not by this License
give permission to use their names for publicity for or to assert or
imply endorsement of any Modified Version.

COMBINING DOCUMENTS

You may combine the Document with other documents released under this
License, under the terms defined in section 4 above for modified
versions, provided that you include in the combination all of the
Invariant Sections of all of the original documents, unmodified, and
list them all as Invariant Sections of your combined work in its
license notice, and that you preserve all their Warranty Disclaimers.

The combined work need only contain one copy of this License, and
multiple identical Invariant Sections may be replaced with a single
copy. If there are multiple Invariant Sections with the same name but
different contents, make the title of each such section unique by
adding at the end of it, in parentheses, the name of the original
author or publisher of that section if known, or else a unique number.
Make the same adjustment to the section titles in the list of
Invariant Sections in the license notice of the combined work.

In the combination, you must combine any sections Entitled “History”
in the various original documents, forming one section Entitled
“History”; likewise combine any sections Entitled “Acknowledgements”,
and any sections Entitled “Dedications”. You must delete all
sections Entitled “Endorsements.”

COLLECTIONS OF DOCUMENTS

You may make a collection consisting of the Document and other documents
released under this License, and replace the individual copies of this
License in the various documents with a single copy that is included in
the collection, provided that you follow the rules of this License for
verbatim copying of each of the documents in all other respects.

You may extract a single document from such a collection, and distribute
it individually under this License, provided you insert a copy of this
License into the extracted document, and follow this License in all
other respects regarding verbatim copying of that document.

AGGREGATION WITH INDEPENDENT WORKS

A compilation of the Document or its derivatives with other separate
and independent documents or works, in or on a volume of a storage or
distribution medium, is called an “aggregate” if the copyright
resulting from the compilation is not used to limit the legal rights
of the compilation’s users beyond what the individual works permit.
When the Document is included in an aggregate, this License does not
apply to the other works in the aggregate which are not themselves
derivative works of the Document.

If the Cover Text requirement of section 3 is applicable to these
copies of the Document, then if the Document is less than one half of
the entire aggregate, the Document’s Cover Texts may be placed on
covers that bracket the Document within the aggregate, or the
electronic equivalent of covers if the Document is in electronic form.
Otherwise they must appear on printed covers that bracket the whole
aggregate.

TRANSLATION

Translation is considered a kind of modification, so you may
distribute translations of the Document under the terms of section 4.
Replacing Invariant Sections with translations requires special
permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the
original versions of these Invariant Sections. You may include a
translation of this License, and all the license notices in the
Document, and any Warranty Disclaimers, provided that you also include
the original English version of this License and the original versions
of those notices and disclaimers. In case of a disagreement between
the translation and the original version of this License or a notice
or disclaimer, the original version will prevail.

If a section in the Document is Entitled “Acknowledgements”,
“Dedications”, or “History”, the requirement (section 4) to Preserve
its Title (section 1) will typically require changing the actual
title.

TERMINATION

You may not copy, modify, sublicense, or distribute the Document
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense, or distribute it is void, and
will automatically terminate your rights under this License.

However, if you cease all violation of this License, then your license
from a particular copyright holder is reinstated (a) provisionally,
unless and until the copyright holder explicitly and finally
terminates your license, and (b) permanently, if the copyright holder
fails to notify you of the violation by some reasonable means prior to
60 days after the cessation.

Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, receipt of a copy of some or all of the same material does
not give you any rights to use it.

FUTURE REVISIONS OF THIS LICENSE

The Free Software Foundation may publish new, revised versions
of the GNU Free Documentation License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns. See
http://www.gnu.org/copyleft/.

Each version of the License is given a distinguishing version number.
If the Document specifies that a particular numbered version of this
License “or any later version” applies to it, you have the option of
following the terms and conditions either of that specified version or
of any later version that has been published (not as a draft) by the
Free Software Foundation. If the Document does not specify a version
number of this License, you may choose any version ever published (not
as a draft) by the Free Software Foundation. If the Document
specifies that a proxy can decide which future versions of this
License can be used, that proxy’s public statement of acceptance of a
version permanently authorizes you to choose that version for the
Document.

RELICENSING

“Massive Multiauthor Collaboration Site” (or “MMC Site”) means any
World Wide Web server that publishes copyrightable works and also
provides prominent facilities for anybody to edit those works. A
public wiki that anybody can edit is an example of such a server. A
“Massive Multiauthor Collaboration” (or “MMC”) contained in the
site means any set of copyrightable works thus published on the MMC
site.

“CC-BY-SA” means the Creative Commons Attribution-Share Alike 3.0
license published by Creative Commons Corporation, a not-for-profit
corporation with a principal place of business in San Francisco,
California, as well as future copyleft versions of that license
published by that same organization.

“Incorporate” means to publish or republish a Document, in whole or
in part, as part of another Document.

An MMC is “eligible for relicensing” if it is licensed under this
License, and if all works that were first published under this License
somewhere other than this MMC, and subsequently incorporated in whole
or in part into the MMC, (1) had no cover texts or invariant sections,
and (2) were thus incorporated prior to November 1, 2008.

The operator of an MMC Site may republish an MMC contained in the site
under CC-BY-SA on the same site at any time before August 1, 2009,
provided the MMC is eligible for relicensing.

ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of
the License in the document and put the following copyright and
license notices just after the title page:

Copyright (C) yearyour name.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts. A copy of the license is included in the section entitled ``GNU
Free Documentation License''.

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
replace the “with…Texts.” line with this:

with the Invariant Sections being list their titles, with
the Front-Cover Texts being list, and with the Back-Cover Texts
being list.

If you have Invariant Sections without Cover Texts, or some other
combination of the three, merge those two alternatives to suit the
situation.

If your document contains nontrivial examples of program code, we
recommend releasing these examples in parallel under your choice of
free software license, such as the GNU General Public License,
to permit their use in free software.

Footnotes

That’s not strictly true. I discovered the
other day that I can create variables in some extension commands and
then use them in a subsequent command. That requires a little
documentation of the order of processing so you get the values you want,
but I haven’t done that yet. For now, you can know that a set of field
specifiers of the same type (like output fields for reference file 1 or
sum fields from the data file) are processed in order.