3. How combine Processes Files

At its simplest, combine reads records from a data file (or from a
series of data files, one after another) and, if output of data records
has been requested, writes the requested fields to a file or to stdout.
Here is an example of this simplest case.

combine --write-output --output-fields=1-

This is essentially an expensive pipe. It reads from stdin and
writes the entire record back to stdout.

Introducing a reference file gives more options. Now combine reads
the reference file into memory before reading the data file. For every
data record, combine then checks whether it matches a record from the
reference file. The following example limits the simple pipe above by
restricting the output to those records from stdin that share their
first 10 bytes with a record in the reference file.

Note that the option ‘--unique’ is used here to prevent more than
one copy of a key from being stored by combine. Without it,
duplicate keys in the reference file, when matched, would result in
more than one copy of the matching data record.
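The effect of ‘--unique’ can be pictured with a small Python model.
This is only a sketch of the behavior described above, not combine's
actual implementation; the function names, the fixed 10-byte key, and
the use of in-memory lists are assumptions made for illustration.

```python
def load_reference_keys(reference_lines, key_len=10, unique=True):
    """Collect the leading key_len bytes of every reference record."""
    keys = [line[:key_len] for line in reference_lines]
    # With unique=True each key is stored only once (first-seen order),
    # which models the effect the text attributes to --unique.
    return list(dict.fromkeys(keys)) if unique else keys


def data_based_output(data_lines, reference_keys, key_len=10):
    """Yield a data record once for every stored key that it matches."""
    for record in data_lines:
        for key in reference_keys:
            if record[:key_len] == key:
                yield record
```

With two reference records that share a key, the model emits a
matching data record once when unique is true and twice when it is
false, which is the duplication the note above warns about.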

The other option with a reference file is to base the output on the
records in that file, with an indicator of whether any data record
matched each of them. In the next example, the same match as above is
done, but this time we write out a record for every unique key, with a
flag set to ‘1’ if it was matched by a data record or ‘0’
otherwise. combine still reads the data records from stdin and writes
the output records to stdout.

combine -r -f reference_file.txt -k 1-10 -m 1-10 -u -w -o 1-10
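The reference-based output can be modeled in the same spirit. The
sketch below is an illustration under assumptions, not combine's own
code: the function name is hypothetical, the key is taken to be the
first 10 bytes, and the result is returned as (key, flag) pairs because
the exact layout of the flag in combine's output is not described here.
The first-seen ordering of keys is likewise an assumption.

```python
def reference_based_output(reference_lines, data_lines, key_len=10):
    """Return (key, flag) pairs: flag is 1 if any data record matched."""
    flags = {}
    # Read the whole reference file first, storing each key only once.
    for line in reference_lines:
        flags.setdefault(line[:key_len], 0)
    # Then read the data records and mark every key that gets matched.
    for record in data_lines:
        key = record[:key_len]
        if key in flags:
            flags[key] = 1
    return list(flags.items())
```

Note that nothing can be reported for a key until all data records
have been read, since a match for it may arrive at any point.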

Of course, you might want both sets of output at the same time: the list
of data records that matched the keys in the reference file and a list
of keys in the reference file with an indication of which ones were
matched. In the prior two examples the two different kinds of output
were written to stdout. You can still do that if you like, and
then do a little post-processing to determine where the data-based
records leave off and the reference-based records begin. A simpler
way, however, is to let combine write the information to separate
files.
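As a rough picture of why separate destinations are convenient, the
following Python sketch produces both kinds of output in one pass and
sends each to its own stream. It is a model under assumptions, not a
description of combine internals: the function name is hypothetical,
and the "key, space, flag" line layout is chosen only for illustration.

```python
import io


def write_both_outputs(reference_lines, data_lines, data_out, ref_out,
                       key_len=10):
    """Write matching data records and flagged reference keys separately."""
    matched = {}
    for line in reference_lines:
        matched.setdefault(line[:key_len], False)  # one entry per key
    for record in data_lines:
        key = record[:key_len]
        if key in matched:
            matched[key] = True
            data_out.write(record + b"\n")  # data-based output
    for key, hit in matched.items():
        # Reference-based output; this layout is an illustrative assumption.
        ref_out.write(key + (b" 1" if hit else b" 0") + b"\n")


# Usage with in-memory streams standing in for the two output files.
data_stream, ref_stream = io.BytesIO(), io.BytesIO()
write_both_outputs([b"AAAAAAAAAA one", b"BBBBBBBBBB two"],
                   [b"AAAAAAAAAA data1", b"CCCCCCCCCC data2"],
                   data_stream, ref_stream)
```

Because each kind of record goes to its own stream, no post-processing
is needed to tell them apart.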

In the following example we combine the output specifications from the
prior two examples and give them each a filename. Note that the first
one has a spelled-out ‘--output-file’ while the second one uses the
shorter 1-letter option ‘-t’.