Notes about UNIX/Linux coding pragmatics

Pragmatics is, with semantics and syntax, one of the central
aspects of a program. While syntax is about language and semantics
about effects, pragmatics is about quality, that is, usefulness.

Under UNIX/Linux there are some established programming
conventions that amount to good pragmatics, and are inspired by
some important aspects of the UNIX/Linux architecture:

Programs can be connected using pipes.

The major coarse level abstraction mechanism of UNIX/Linux is
the pipe, by which the output of one program immediately becomes
the input of another program.

Programs can be invoked within scripts.

Not only can commands be invoked on the command line, but they
can also be invoked by scripts, as scripts are a good way to
combine programs to provide new functionality.

Programs have a very long useful life, and get
modified a lot because of source availability.

Many UNIX/Linux programs have been around for thirty years,
and have been ported to many platforms and have spawned many
derivatives.

Files/memory are plain flat byte streams/arrays.

Both the ephemeral (memory) and the persistent (files) storage
abstractions are untyped, boundaryless byte streams, including
most devices, and the API for accessing all types of files is
(mostly) the same.

Programs can have multiple output channels, and
these can be independently (re)directed to different output
media.

This is supported by the OS with file descriptors, the
stdio library with FILE pointers
and by the shell with file descriptor redirection.

Programs can be decomposed into libraries, and use
libraries.

UNIX has two program decomposition techniques: in the large,
pipes and scripts; in the small, libraries. Both are very
frequently used, as C in particular is a language suitable (with
some important limitations) for writing standalone runtime
libraries.

There are powerful search/replace and sorting tools
and libraries.

This means that reprocessing large amounts of data output by a
program is easy, and it is useful to do so in a surprisingly
large number of cases.

Many popular tools generate source code as output or
process source code as input.

This means that source code is not always, or even often,
authored by humans; for example the C compiler almost never
processes source code authored by humans, but almost always the
output of the C preprocessor. Also, humans must always
eventually use a program, such as an editor (or even
cat) to actually record a program text, and it is a
good idea to make it easy for that program to process, and in
the case of an editor to reprocess, that text.

These aspects are radically different from those that pertain to
many other popular operating systems.

There is also a general principle of programming, that program
texts should speak for themselves, as their purpose is to
communicate precisely and clearly a program (both to humans and
to other programs or hardware that read them).

The ostensible purpose of a program is to achieve an intended
effect, but programmers (as well as compilers, CPUs, etc.) cannot
write or read programs, only program texts.

The quality of a program is then a consequence of the quality of
the program text, as the program text is much more important to
the lifetime of a software project.

Many of the conventions listed below apply to any platform, and
they will be marked as such.

To make it possible to redirect or pipe just the actual
output of the program, for further processing.

Every program message should contain the name of
the program as the first thing.
(any platform)

If several programs are used in a script, message texts
should make it easy to figure out which program emitted them.

Error messages should contain a direct report of
the operation that failed and its operands, not a periphrase.
(any platform)

A periphrase does not identify what is needed to fix the
problem. For example, a message like Configuration
unavailable does not help anywhere near as much as
Cannot use configuration '/etc/prog.conf',
which is in turn rather inferior to Cannot open for reading
file '/etc/prog.conf', if that is what
actually was attempted.
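
As an illustration only, a message of this kind might be built as
in the sketch below. The helper name format_open_error and the
program name prog are hypothetical; the only library calls used
are snprintf() and strerror().

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds a message that names the program,
   the operation that failed and its operand, followed by the
   system's reason for the failure. Returns what snprintf()
   returns. */
int format_open_error(char *const buffer, const size_t size,
    const char *const program, const char *const file,
    const int error)
{
    return snprintf(buffer, size,
        "%s: cannot open for reading file '%s': %s",
        program, file, strerror(error));
}
```

A caller would typically pass the value of errno saved right after
the failed fopen(), and write the resulting text to stderr.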

Programs that have verbose output by default don't fit well
within scripts or pipes, especially as conditions in
if or while.

Both program source and program output should
fit in less than 80 columns, ideally less than 72 columns

This principle comes from the days when punched cards had 80
columns, of which often the last 8 contained a card number.
However it embodies a profound wisdom: that the human eye
tracks pages vertically more easily than horizontally, beyond
a certain narrow horizontal limit, which seems to be around 65
characters.
Around 70 characters is also, not by mere coincidence,
what fits on a line on a letter/A4 page when printed in a
decent point size.
Also, it is much easier to indicate grouping and
structure vertically, for example by using blocks and empty
lines between blocks.
Finally, a definite, traditional and low limit on line
length means that there is a chance that nobody will have to
scroll horizontally to read the source or output text of a
program, and scrolling horizontally usually works a lot less
smoothly and cleanly than vertical scrolling.

Embedded in each program's source and binaries
there should be a string that identifies the program and its
version, and there should be an option that prints it.
(any platform)

When asking for support or reporting bugs there must be a
way to identify exactly the version being used. Software often
lives for a long time, and many variants of any successful
software then coexist.
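
A minimal sketch of one way to do this in C follows. The program
name, the version number, the option spellings -V and --version,
and the function names are all assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical program name and version, kept in one string so
   that it is present in both the source and the binary. */
static const char program_version[] = "prog version 1.2.3";

/* Returns 1 if the (hypothetical) option asks for the version. */
int is_version_option(const char *const argument)
{
    return strcmp(argument, "-V") == 0
        || strcmp(argument, "--version") == 0;
}

/* Prints the embedded identification string on the stream. */
int print_version(FILE *const stream)
{
    return fprintf(stream, "%s\n", program_version);
}
```

Since the string is a plain array of characters in the binary, it
can usually also be found without running the program, for example
with a tool that lists printable strings.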

It is important that the exit code of a program be
properly set to zero for success or to non zero for failure.

To make it possible to do error checking and recovery in
scripts.
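
A small sketch, assuming the program counts the operations that
failed; the helper name exit_status_for is hypothetical, while
EXIT_SUCCESS and EXIT_FAILURE come from stdlib.h.

```c
#include <stdlib.h>

/* Hypothetical helper: turns a count of failed operations into
   the value that main() should return. */
int exit_status_for(const unsigned failures)
{
    return failures == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

If main() ends with return exit_status_for(failures); then the
program can be used directly as the condition of an if or while in
a shell script.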

Programs should be written keeping in mind that
they can be killed at any time.
(any platform)

Because, at the very least, killing a script or a process group
may kill that program as a side effect. Fortunately many
if not most programs are idempotent.
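
One common way to keep a file update safe against being killed is
to write a temporary file and then rename it over the target. The
sketch below assumes POSIX rename() semantics, where the
replacement is atomic, and that both names are on the same
filesystem; the function name replace_file is hypothetical, and
flushing to stable storage is left out.

```c
#include <stdio.h>

/* Hypothetical helper: writes text to a temporary file, then
   renames it over the target. If the process is killed before
   rename(), the old target is still intact. Returns 0 on
   success, -1 on failure. */
int replace_file(const char *const target,
    const char *const temporary, const char *const text)
{
    FILE *const stream = fopen(temporary, "w");
    if (stream == NULL)
        return -1;

    const int written = fputs(text, stream) != EOF;
    const int closed = fclose(stream) == 0;

    if (!written || !closed || rename(temporary, target) != 0)
    {
        remove(temporary);
        return -1;
    }
    return 0;
}
```

Running the same update again after an interruption gives the same
result, which is the idempotence mentioned above.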

All parameters and other local names should be
declared as much const as possible.
(any platform)

Accidental errors are prevented.

The compiler can often perform much better optimization.

On reading the source, one can be sure that certain
entities will not change, without having to check for the
rest of the scope in which they are defined, making it
much quicker to understand a bit of code.
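
A short sketch of what this looks like in C; the function sum_of
is hypothetical and exists only to show the declarations.

```c
#include <stddef.h>

/* Both the pointer and the data it points to are const, as is
   the count: the body can only read them. Only total and index
   are allowed to change. */
long sum_of(const int *const values, const size_t count)
{
    long total = 0;

    for (size_t index = 0; index < count; index++)
        total += values[index];
    return total;
}
```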

Function parameters should be in left to right
order most specific to most generic.
(any platform)

Allows a natural sorting order of functions that is both
nicer for reading and easier for error checking.

Makes it much clearer in which order partial application
should happen.

Helps with understanding the goals of the function.

File sections should usually be most specific to
most generic top to bottom.
(any platform)

This is based on the idea that one should see first, when
looking at a file, the most specific information. However in
some cases (usually configuration files) if there are multiple
otherwise somewhat equivalent data items, the first one is
taken, in others the last one; then the specific order of file
sections should respect the particular override logic of the
applications reading the file.

Where applicable, when a program attempts to open a
file whose name is not absolute, it should attempt to
open by using as prefix the elements of a list of
directories, as executables are searched for in the
list contained in the PATH environment
variable. Especially if the file is a configuration
file. As a rule the program should have a default
directory list and this should be overridable with an
environment variable.

When a directory path is used for searching for
files with a relative name, the default should include
the current and home directory. The default directory
path should have directories in most specific to most
generic order, and prefixed with the current directory
(as such), the home directory, /usr/local,
the root directory and /usr in this
order. For example configuration files should be
searched for in a directory path like:

.:$HOME/etc:/usr/local/etc:/etc:/usr/etc

just like executables should be searched for in a
directory path like:

.:$HOME/bin:/usr/local/bin:/bin:/usr/bin
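
A sketch of such a search, under the assumption that the directory
list is a colon separated string as in PATH. The function name
open_in_path is hypothetical, and expanding $HOME or reading an
overriding environment variable (for example with getenv()) is
left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: tries to open name for reading under each
   directory of a colon separated list, in order; the full name
   that was opened is left in found. Returns NULL if none could
   be opened. */
FILE *open_in_path(const char *const name, const char *const path,
    char *const found, const size_t size)
{
    const char *start = path;

    while (*start != '\0')
    {
        const size_t length = strcspn(start, ":");

        /* An empty element stands for the current directory. */
        const int printed = length == 0
            ? snprintf(found, size, "./%s", name)
            : snprintf(found, size, "%.*s/%s",
                (int) length, start, name);

        /* Skip elements whose full name does not fit in found. */
        if (printed >= 0 && (size_t) printed < size)
        {
            FILE *const stream = fopen(found, "r");
            if (stream != NULL)
                return stream;
        }

        start += length;
        if (*start == ':')
            start++;
    }
    return NULL;
}
```

Because the list is searched in order, a file in a more specific
directory hides one of the same name in a more generic directory.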

Program input should not have arbitrary and small
size limits.
(any platform)

With pipes, a program, rather than a human user, can be the
author of the input, and programs may have far fewer
limitations than humans as to the size of the things they can
output.
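
For instance, instead of reading lines into a fixed size array, a
program can grow its buffer as needed. The following is a sketch
only; the function name read_whole_line and the initial capacity
are assumptions, and on POSIX systems getline() offers a similar
service.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one line of any length from stream
   into a buffer that grows as needed. Returns a malloc()ed
   string without the trailing newline, or NULL at end of input
   or if memory runs out. */
char *read_whole_line(FILE *const stream)
{
    size_t capacity = 64, length = 0;
    char *line = malloc(capacity);
    int character;

    if (line == NULL)
        return NULL;

    while ((character = getc(stream)) != EOF && character != '\n')
    {
        /* Keep one byte spare for the terminating NUL. */
        if (length + 1 == capacity)
        {
            char *const larger = realloc(line, capacity * 2);
            if (larger == NULL)
            {
                free(line);
                return NULL;
            }
            line = larger;
            capacity *= 2;
        }
        line[length++] = (char) character;
    }

    if (character == EOF && length == 0)
    {
        free(line);
        return NULL;
    }
    line[length] = '\0';
    return line;
}
```

The only limit left is the memory available, which is not an
arbitrary one.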

Output and input files should be in text format in
almost all cases.

For easy piping into text-based processing tools like
sort or perl, and for easy reading
and writing by humans.

Input expected by programs should be terse.

To make it easier for programs and humans to generate it.

White space should be allowed in input.
(any platform)

Where possible, arbitrary white space should be allowed in
textual program input (as a separator usually); usually
“newline” should be considered as white space too.

The default output or input column separator
should be a sequence of spaces and tabs, or else a colon or
other punctuation character if the data can contain spaces or
tabs

This matches tradition, and makes splitting each line into
fields easy, for the benefit of columnar oriented scripting
languages like AWK or Perl, or utilities like sort.
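
The same splitting is easy in C too. In this sketch the function
name split_fields is hypothetical; it relies on strtok(), which
treats any run of separator characters as a single separator.

```c
#include <string.h>

/* Hypothetical helper: splits line in place into fields
   separated by runs of the given separator characters; stores
   up to max pointers in fields and returns how many it stored. */
size_t split_fields(char *const line, const char *const separators,
    char *fields[], const size_t max)
{
    size_t count = 0;
    char *field = strtok(line, separators);

    while (field != NULL && count < max)
    {
        fields[count++] = field;
        field = strtok(NULL, separators);
    }
    return count;
}
```

With " \t\n" as separators this accepts arbitrary white space
between columns. Note that with a punctuation separator such as
":" strtok() skips empty fields, so data that can contain empty
columns needs a different splitting function.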

GUI based programs should have an equivalently
featured command line mode.
(any platform)

For use in scripts and pipes.

All variable parts of a program message should be
enclosed in some kind of delimiter so that it be obvious when
they are the empty string.
(any platform)

To avoid idiocies like Cannot open file %s.,
which becomes Cannot open file . if the argument
is the empty string.

Identifiers should be built with most generic to most
specific subparts in left to right order.
(any platform)

This gives the natural sorting order for sorting in
languages with left to right writing order. Too bad that email
addresses, domains, numbers and many date formats don't
respect this principle.

In each source file, whether it is a header or
code does not matter, there should be a list of all and only the
header files that contain definitions of entities used by the
program.
(any platform)

This is the only way to ensure correct dependencies among
headers and among sources and headers.

Header file includes should be listed
in most generic to most specific top to bottom, first thing in a
file.
(any platform)

This is to prevent more specific definitions overriding more
generic ones. Such an override is particularly awful if a
definition in a system header is overridden.

Both to prevent interfile namespace pollution and mysterious
problems because of accidental name coincidence. It is
particularly awful if a definition in one file has the same
name as a definition in a system library used by many other
files.

Header files should always be protected in their entirety
by a multiple inclusion guard.
(any platform)

This is the only clean way to prevent multiple redefinitions
with multiple inclusions, and as a rule also speeds up
compilation. The only possible exception is a header file that
intentionally defines an entity differently for each inclusion,
but these should almost never be written, or even imagined.
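
The usual form of such a guard is sketched below. The macro name
PROG_CONFIG_H and the declarations inside it are hypothetical;
what matters is that nothing in the header lies outside the guard.

```c
#ifndef PROG_CONFIG_H
#define PROG_CONFIG_H

/* Hypothetical header contents; everything sits inside the
   guard, so a second inclusion expands to nothing. */
struct prog_config
{
    int verbose;
};

int prog_config_is_verbose(const struct prog_config *config);

#endif /* PROG_CONFIG_H */
```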

As a rule program options should be processed with
the getopt() library function, and preferably
with the getopt_long() GNU variant.

There should be an option that prints a brief help
message with the command invocation syntax.
(any platform)

This means that the program is to a limited but useful
extent self documenting at runtime, and that documentation is
easy to keep up to date as it is small and within the program
text itself.

Programs and libraries should have suitable
man pages.

Documentation usually is either reference or task oriented,
and man pages are the summary of reference
documentation. They are very useful for avoiding too
much help material inside the program itself; programs should
not be documentation processors, man does that.

Reference documentation should be terse. Task
oriented documentation, like HOWTOs and user guides, doesn't
need to be terse.
(any platform)

Reference documentation's most important property is that it
should be accurate, and the second most important is ease of
finding the relevant bit. Verbosity interferes with both.
Reference documentation is not meant to explain what/how to do
things, but it may contain examples to clarify meanings.

Declarations and definitions should be shared in
header files in a single copy, not repeated in several places.
(any platform)

This sounds obvious, but some people still don't do it.

Programs should be written as collections of
libraries glued together by fairly small data navigation code.
(any platform)

So that libraries be reusable and behaviour and even
representation be sharable, which helps in scripts and
similar. For example, if many programs use the same hash
database library, one can embed that library in a
scripting language like Perl, and then scripts in that language
can be used to access and manipulate the data maintained by
many other applications.
The principle to keep most data files in text form is in
effect a special case of this principle.

There should not be magic numbers in the code, but
almost all constants, even if used only once, should be given
descriptive names.
(any platform)

This is because usually such numbers embody assumptions, the
assumptions are pragmatic, and such pragmatics ought to be
made explicit; many of these pragmatics for example involve
units of measure.
A number does not speak for itself; a named constant does,
and also allows easy consistency checking between its value and
its name. For example fragments like
if (length > MIN_WEIGHT)
or #define MAX_VOLUME -10
tend to suggest something is amiss.
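
A small sketch of a named constant that carries its unit of
measure; the constant LINE_WIDTH_MAX_COLUMNS and the function
line_fits are hypothetical names chosen for the example.

```c
/* Hypothetical limit; the name says what is counted and in
   which unit. */
enum { LINE_WIDTH_MAX_COLUMNS = 80 };

/* Returns 1 if a line of the given width is under the limit. */
int line_fits(const int width_columns)
{
    return width_columns < LINE_WIDTH_MAX_COLUMNS;
}
```

A reader can check at a glance that columns are compared with
columns, which a bare 80 would not allow.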

There should be common use of
assert() to document expected invariants.
(any platform)

assert() can be a debugging aid, but it is
mostly a code reading aid, as it documents what the author
expects at that point.

Defensive coding in libraries is not appropriate.
(any platform)

Defensive coding is appropriate for input, which must be
checked. Internal and library
functions should instead use assert() to document
assumptions about their parameters.
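
The difference can be sketched as follows. Both functions and the
constant COUNT_MAX are hypothetical: the first is an internal
function that documents its assumptions with assert(), the second
sits at the input boundary and checks what it is given.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical upper bound on counts accepted as input. */
enum { COUNT_MAX = 1000000 };

/* Internal function: the assertions document what the author
   expects of the parameters; they are not error handling. */
int percentage_of(const int part, const int whole)
{
    assert(whole > 0);
    assert(part >= 0 && part <= whole);
    return (int) ((long long) part * 100 / whole);
}

/* Input boundary: text from outside the program is checked, and
   a failure is reported to the caller. Returns 0 on success. */
int parse_count(const char *const text, int *const count)
{
    char *end;
    const long value = strtol(text, &end, 10);

    if (end == text || *end != '\0'
        || value < 0 || value > COUNT_MAX)
        return -1;
    *count = (int) value;
    return 0;
}
```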

These debugging traces are usually far more useful than
a debugger session because they make the program text speak
for itself, not only dynamically, but statically too, as they
document assumptions of the author as to what is relevant
and/or hairy.

Comments should be used to elucidate non trivial
assumptions and design decisions that are not obvious from the
code.
(any platform)

Other than that the program text should speak for itself.
But the program text usually cannot express well its intent
or the possible alternatives that have not been written and
whose consideration might elucidate it.

Some careful consideration should be given to naming.
(any platform)

The most important aim of program text writing is not that
the program it describes works, but that it communicates
clearly what it does. Working correctly is a consequence of
that. Naming has a large impact on the reading of the program
text by a human.

The traditional UNIX naming convention for functions is
object then verb, not verb then object.

As in fopen for file open. This
respects the principle that names should be in most generic
to most specific part order.

For easy postprocessing by simple columnar tools like
sort, which in particular is extremely important.
Stupid things like /proc/meminfo are hard to
easily split and process.

Variables should be defined in the narrowest scope
for which they are used; in particular, global variables should
be avoided.
(any platform)

This aids program comprehension and debugging considerably;
if a variable is only used in a small range of code, that range
should become a block and the variable defined in it, so that
it be clear that it cannot be used or modified anywhere else,
which makes understanding its role significantly quicker and
easier, as one needs only to comprehend a small scope.
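
A sketch of a variable confined to the block that needs it; the
function swap_ends is hypothetical.

```c
#include <stddef.h>

/* Swaps the first and last elements of values, if it has more
   than one element. */
void swap_ends(int *const values, const size_t count)
{
    if (count > 1)
    {
        /* saved is needed only for the swap, so it is defined
           only in this block. */
        const int saved = values[0];

        values[0] = values[count - 1];
        values[count - 1] = saved;
    }
}
```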

Identifiers should be longer the wider their
scope.
(any platform)

In part to reduce the chance of name ambiguities, in part
to communicate implicitly by that length the width of the
scope in which an identifier is defined, in part because
identifiers with a wide scope usually are mentioned less
frequently than identifiers with a narrow scope.

Common subexpressions or paragraphs of code should
not be repeated but given names.
(any platform)

If the same subexpression or paragraph of code occurs
identically or similarly in a section of program text, usually it
expresses a particular concept relevant to that section; writing
it down once and naming it explicitly with a suggestive name
helps the program text speak for itself. It also helps ensure
that the various uses of the same concept are indeed the
same.
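
As a sketch, a test that would otherwise be repeated in two places
is written once and given a name; the function names are
hypothetical.

```c
/* The repeated test is written once and given a name. */
int is_leap_year(const int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Both callers now visibly use the same concept. */
int days_in_year(const int year)
{
    return is_leap_year(year) ? 366 : 365;
}

int days_in_february(const int year)
{
    return is_leap_year(year) ? 29 : 28;
}
```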

Comments should be in traditional parenthetical
form, with no boxing, and preferably without the left asterisk
margin either.
(any platform)

For ease of justification and other processing by source
code tools, including editors.

Code should be disabled with #if 0
or if (0), not with commenting.
(any platform)

Code is not text, and any tools that process source files,
for example beautifiers will handle code differently from
comments.
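
A sketch of a disabled statement; the function report is
hypothetical.

```c
#include <stdio.h>

/* Prints value on stream; returns what fprintf() returns. */
int report(FILE *const stream, const int value)
{
#if 0
    /* Disabled, not deleted: this is still laid out as code. */
    fprintf(stream, "debug: value %d\n", value);
#endif
    return fprintf(stream, "%d\n", value);
}
```

Unlike a comment, an #if 0 section can itself contain comments,
and can be enabled again by changing a single character.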

Code should be written and indented in a regular
way with systematic layout and naming conventions.
(any platform)

This helps the program text speak for itself, and the
regularity as a rule helps make structure evident, as a
change of shape of the text then only reflects a change of
shape of the structure of the program, not shifts in the
layout or naming conventions. It also helps catch mistakes,
which often assume the textual form of irregularities.

Program output should be in the same syntax as
program input

In order to make it easy to pipe back to a program its
own output, it should be in the same syntax as its input or
command line arguments. For example, the contents of the lines
of /etc/fstab correspond to the syntax for the
arguments to mount.

Processes and system modules should publish
extensive state as real or virtual files

When a process or a system module keeps state, this should be
available as a summary in a file, either a plain or device
file, and such a file can be a real file or a virtual one, like
those under /proc in some versions of the
system.