12.49 text.csv - CSV tables

Module: text.csv

Provides functions to parse and generate CSV (comma-separated values)
tables, including the format defined in RFC 4180. You can customize the
separator and quote characters to deal with variations of the CSV format.

A CSV table consists of a series of records, separated by
newlines. Each record contains a number of fields, separated
by a separator character (by default, a comma). A field can contain
the separator or newlines if it is quoted, i.e. surrounded by
double-quote characters. To include a double-quote character in a
quoted field, use two consecutive double-quote characters. Usually,
the whitespace around a field is ignored.

Since use cases of CSV-like files vary, we provide a layered API
whose pieces can be combined flexibly.

Low-level API

The bottom layer of the API converts text into a list of lists and vice versa.

Function: make-csv-reader separator :optional (quote-char #\")

Returns a procedure that takes one optional argument, an input port.
When the procedure is called, it reads one record from the port
(or, if the port is omitted, from the current input port)
and returns a list of fields.
If the input has reached EOF, it returns an EOF object.
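
For instance, the reader can be used like this (a sketch; the CSV
text here is made up for illustration):

```scheme
(use text.csv)

;; A reader for comma-separated records, with the default quote char.
(define read-record (make-csv-reader #\,))

;; Quoted fields may contain commas; a doubled double-quote inside a
;; quoted field becomes a single double-quote.
(call-with-input-string "Name,\"Value, with comma\",\"say \"\"hi\"\"\"\n"
                        read-record)
 ⇒ ("Name" "Value, with comma" "say \"hi\"")
```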

Function: make-csv-writer separator :optional (newline "\n") (special-char-set #[;\s])

Returns a procedure that takes two arguments, an output port and
a list of fields. When the procedure is called, it
writes the fields, separated by the separator and escaped as needed,
to the output port. Each field value must be a string.
The separator argument can be a character or a string.

You can also specify the record delimiter
string with the newline argument; for example, you can pass "\r\n"
to prepare a file to be read by Windows programs.

A field is quoted on output when it contains special characters,
which automatically include the characters
in the separator, quote-char and newline arguments,
plus the characters in the char-set given as special-char-set;
its default is #[;\s].
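
As a sketch of the writer side (the field values are made up):

```scheme
(use text.csv)

(define write-record (make-csv-writer #\,))

;; A field containing the separator (or other special characters)
;; is quoted automatically.
(write-record (current-output-port) '("a" "b" "c,d"))
;; prints: a,b,"c,d"
```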

Middle-level API

Occasionally, CSV files generated from spreadsheets contain
superfluous rows and columns, and we need to make sense of them.
Here are some utilities to help with that.

A typical format of such spreadsheet-generated CSV file has
the following properties:

There’s a “header row” near the top; not necessarily the very
first row, but it certainly comes before any real data. It signifies
the meaning of each column of the data.
There may be superfluous columns inserted just for cosmetics,
and sometimes the order of columns is changed when the original
spreadsheet is edited, so we need some flexibility in interpreting
the input data.

“Record rows” follow the header row; they contain the actual data.
There may be superfluous rows inserted just for cosmetics.
Also, it’s often the case that the end of the data isn’t marked
clearly (you may find a large number of rows of empty strings, for
example).

The main purpose of the middle-level CSV parser is to take the output
of the low-level parser, which is a list of lists of strings,
find the header row, and then convert the subsequent record rows
into tuples according to the header row. A tuple is just a list of
strings, ordered in the same way as the specified header spec.

Function: csv-rows->tuples rows header-specs :key required-slots allow-gap?

Converts input rows (a list of lists of strings)
to a list of tuples.
A tuple is a list of slot values.

First, it looks for a header row that
matches the given header-specs. Once the header row is found,
the subsequent rows are parsed as record rows according to the header
and converted to tuples.
If no header row is found, #f is returned.

Header-specs is a list of header specs, each of which can be either
a string, a regexp, or a predicate on a string. If it’s a string,
a column that exactly matches the string is picked. If it’s a regexp,
a column that matches the regexp is picked. And if it’s a predicate,
as you might have already guessed, a column that satisfies the predicate
is picked.

The order of header-specs determines the order of columns of
output tuples.

Required-slots determines whether an input row is a valid record row.
The structure of required-slots is as follows:

  <required-slots> : (<spec> ...)
  <spec> : <header-spec>
         | (<header-spec> <predicate>)

Each <header-spec> is compared to the elements of header-specs
(by equal?) to figure out which columns to check. A single
<header-spec> as a <spec> means that the column shouldn’t be
empty in a valid record row. If <spec> is a list of a
<header-spec> and a <predicate>, then the value of the
column corresponding to the <header-spec> is passed to the <predicate>
to determine whether the row is a valid record row.

If required-slots is omitted or an empty list, any row with
at least one non-empty column is regarded as a valid record row.

If allow-gap? is #t, it keeps reading rows until the end,
skipping invalid rows. If allow-gap? is #f (default),
it stops reading once it sees an invalid row after headers.

Let’s see an example. Suppose we have the following CSV file as
data.csv. It has extra rows and columns, as is often seen
in spreadsheet-exported files.
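
The original file contents aren’t reproduced here; the following is a
hypothetical stand-in, already parsed into rows, chosen to be
consistent with the entries (“Kingdom of Pascal”, “APL Republic”) and
the Year/Country/GDP columns discussed below:

```scheme
(use text.csv)

;; Hypothetical rows, as the low-level reader would return them
;; from a spreadsheet-exported data.csv.
(define rows
  '(("" "" "" "")                        ; cosmetic empty row
    ("" "Year" "Country" "GDP")          ; header row, not the first row
    ("" "1958" "Land of Lisp" "477")
    ("" "1959" "Kingdom of Pascal" "")   ; GDP missing
    ("" "" "" "")                        ; gap
    ("" "1970" "APL Republic" "259")))

(csv-rows->tuples rows '("Year" "Country" "GDP"))
```

With the default options, processing stops at the empty row, so only
the rows before the gap become tuples.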

Note that irrelevant rows are skipped, and columns in the results
are ordered as specified in the header-specs.

Since there’s a gap (empty row) after the “Kingdom of Pascal” entry,
csv-rows->tuples stops processing there by default. If you want
to include “APL Republic”, you have to pass :allow-gap? #t
to csv-rows->tuples.

The next example gives the :required-slots option to eliminate
rows missing any of Year, Country or GDP; thus “Kingdom of Pascal”
is omitted from the result, while “APL Republic” is included
because of the :allow-gap? argument.
(It also checks that Year has exactly 4 digits.)
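
Assuming rows is bound to the parsed contents of data.csv, such a
call might look like the following sketch (the option values are
illustrative; a Gauche regexp object is applicable, so it can serve
as the predicate):

```scheme
;; "Year" must match a 4-digit pattern; "Country" and "GDP" must be
;; non-empty. Gaps between valid rows are skipped, not fatal.
(csv-rows->tuples rows '("Year" "Country" "GDP")
                  :required-slots '(("Year" #/^\d{4}$/) "Country" "GDP")
                  :allow-gap? #t)
```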

Function: make-csv-header-parser header-specs

Creates a procedure that takes a row (a list of strings) and checks
whether it matches the criteria specified by header-specs.
(See csv-rows->tuples above about header-specs.)
If the row satisfies the spec, the procedure returns a permuter vector
that maps the tuple positions to the input column numbers.
Otherwise, it returns #f.

The permuter vector is a vector of integers,
in which the K-th element being I means that the K-th item of the
tuple should be taken from the I-th column of the input row.

Let’s see an example. Suppose we know that the input contains
the following row as the header row:
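
The actual header row isn’t reproduced here; the following
hypothetical header-specs and input row are chosen to be consistent
with the result shown below (the Population column and its position
are made up):

```scheme
(use text.csv)

;; Hypothetical header-specs: tuple order is Country, Year,
;; Population, GDP.
(define header-parser
  (make-csv-header-parser '("Country" "Year" "Population" "GDP")))

;; Hypothetical header row with cosmetic empty columns.
(define *input-row*
  '("" "" "Year" "Country" "" "GDP" "Population" ""))
```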

Applying this header parser to the input data returns the permuter vector:

(header-parser *input-row*)
⇒ #(3 2 6 5)

It means the first item of the tuple (Country) is taken from the
column at index 3 of the input, the second item (Year) from the
column at index 2, and so on (column indices are zero-based). This
permuter vector can be used to parse record rows to get tuples.
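
To make the mapping concrete, here is a hypothetical helper (not part
of the module) that applies the permuter vector #(3 2 6 5) from the
example above by hand, using a made-up record row:

```scheme
;; apply-permuter is a hypothetical helper, not part of text.csv.
;; For each tuple position K, pick the (vector-ref permuter K)-th
;; column of the record row.
(define (apply-permuter permuter row)
  (map (cut list-ref row <>) (vector->list permuter)))

(apply-permuter #(3 2 6 5)
                '("" "" "1958" "Land of Lisp" "" "477" "60" ""))
 ⇒ ("Land of Lisp" "1958" "60" "477")
```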