Making Packager-Friendly Software

A package maintainer, or packager, is a person who creates packages
for software projects. Over time, packagers run into the same problems in
many of these projects; these problems complicate the packaging process and
produce packages that are a nightmare to maintain. These flaws exist because
in most cases the original developers are not packagers, so they are not
aware of them. In other words, if you do not know something is wrong, you
cannot fix it.

This article describes some of these common problems and possible solutions. Consequently, it is of most value to software developers who make
their creations publicly available. Keep in mind that any published project
will eventually catch a packager's attention; the easier it is to create the
package, the sooner someone can package it.

This document can also help package maintainers by showing them problems
they may not be aware of. Remember that one task of a good packager is to send
bug reports--with appropriate fixes, if possible--to the upstream
developers about any problems found. That way, future versions of
the program will be easier to maintain. By doing this, packagers help not
only themselves, but also all other packagers who handle the same piece of
software on other operating systems or platforms.

In case you're wondering whether I know what I'm talking about, let
me introduce myself. I have worked on The
NetBSD Packages Collection (pkgsrc) since November 2002. During that time,
I have done more than 1,600 package updates and created around 200 packages,
most of which are related to GNOME; I am
the main maintainer of its packages. While doing this, I have repeatedly
encountered and fixed the problems described in this article, so I would like
to see them solved at their root, by the original software developers. I hope
this gives you a bit of confidence.

When presenting solutions for the problems described, I have focused on the
most popular build infrastructure in the free software world: GNU Autoconf, GNU Automake, and GNU Libtool. However, the
ideas outlined here apply to any build infrastructure you can think
of.

I would like to thank Ben Collver, Thomas Klausner, and Todd Vierling, all of
them pkgsrc developers, for their suggestions, and all other
developers of this system for continuously improving its quality.

Terminology

It's a good idea to be familiar with the following basic terms, which will be used in this article:

Distribution file (distfile, for short)--A file that contains the pristine sources of a program, as published by the
original authors. It usually comes in the form of a tarball, such as
tar.gz or tar.bz2.

Packaging system--The infrastructure used to build and/or install packages in a system in
their preferred form. This includes the utilities used to generate binary
packages (see below) and to handle them on a running system.

Source package--The set of files used to build a binary package from a distribution file.
This concept is very clear in, for example, NetBSD's pkgsrc, FreeBSD's ports, or
Gentoo's Portage, because it refers to a single directory in the centralized
tree holding all packages.

However, this term also applies to other packaging systems that always use
binary packages. For example, when talking about Debian packages, it refers to the
debian subdirectory included in some distribution files. When
talking about RPMs, this alludes to the Source RPM files (SRPMs).

Binary package--A file that provides a program in a ready-to-install manner, usually
including prebuilt binaries and possibly providing some scripts to finish its
configuration. This is the most common form of packages in Linux distributions,
as .deb and .rpm files are exactly this.

Package (n.)--Refers to either a binary package or a source package,
without distinction.

Package (v.)--To create a source package from scratch, based on a
published distribution file.

Broken package--A package that, for an unexpected reason, fails to work properly. This
can happen because its build fails, because it does not install some
expected files, because it cannot be fetched, and so on.

Packager--The person who creates a package.

The Distribution File

The first problems in packaging come from the way that project maintainers
create or handle the distfiles. These issues are uncommon, but once you start
maintaining an affected package, you are likely to suffer its problems forever
(unless you persuade the author to fix them). Here's how you can avert trouble:

Avoid modifying published distfiles. Once you have made a
distfile available, never modify it. Even if it includes a stupid bug, don't
touch it; instead, publish a new version.

Rationale: Many packaging systems store cryptographic digests of the
distfiles they use in the source packages. This helps verify that no third
party has modified the package since its creation. If you change a distfile,
you will break the package because the digest test will fail. The maintainer
has to check why the test fails, to ensure that there are no
malicious changes--not an easy task.
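
The digest check described here can be sketched in portable shell. This is
only an illustration, not how any particular packaging system implements it:
the function names digest_of and verify_distfile are made up for this example,
and the sketch assumes that one of sha256sum, shasum, or openssl is installed.

```shell
# digest_of FILE: print the SHA-256 digest of FILE with whichever tool exists.
digest_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1
    elif command -v shasum >/dev/null 2>&1; then
        shasum -a 256 "$1" | cut -d ' ' -f 1
    else
        openssl dgst -sha256 -r "$1" | cut -d ' ' -f 1
    fi
}

# verify_distfile FILE EXPECTED: succeed only if FILE still has the digest
# that was recorded when the source package was created.
verify_distfile() {
    actual=$(digest_of "$1")
    if [ "$actual" = "$2" ]; then
        echo "ok: $1"
    else
        echo "digest mismatch for $1: expected $2, got $actual" >&2
        return 1
    fi
}
```

A packaging system would run a check of this kind before unpacking the
distfile; republishing the file under the same name with different contents
makes the check fail for every existing source package.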

Avoid moving published distfiles. Once you have published a
distfile and distributed its URL, don't remove it from the server or
move it around. If you must do so, please contact all
known package maintainers to let them know about the change.

Rationale: Many source packages download distfiles from their original
sites; if the file is moved or removed, the fetch process will fail and the
package will be broken. This isn't difficult to fix, but it opens a time window
during which people cannot download the package.

Always use versioned distfiles. The distfile's name must always
include a version string identifying it, whether a version number or a timestamp.
If you want a static name that refers to the latest version, use a symbolic
link on your server pointing to the full name.

Rationale: This is very similar to the modification of published distfiles
described above. If you replace a distfile with one containing a new version,
you implicitly break the cryptographic digests stored in source
packages.
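
Keeping a static name next to the versioned files might look like the
following sketch. The function name publish_latest and the file names in the
usage note are hypothetical; the only assumption is that the download area
lives on a file system that supports symbolic links.

```shell
# publish_latest DIR DISTFILE ALIAS: make DIR/ALIAS a symbolic link to the
# versioned DISTFILE in the same directory. Older versioned files stay
# untouched, so digests recorded for them remain valid.
publish_latest() {
    ( cd "$1" && [ -f "$2" ] && ln -sf "$2" "$3" )
}
```

For example, publish_latest /srv/ftp/foo foo-1.1.tar.gz foo-latest.tar.gz
repoints the static name while foo-1.0.tar.gz remains available under its
own name.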

Do not include prebuilt files in your distfile. Be sure that
your distfile does not contain prebuilt files that are OS- or
architecture-specific. For example, it is erroneous to include a prebuilt
object file, but correct to include a Lex-generated C source file.

Rationale: When building on operating systems and/or architectures different from yours, those files will not be built again because the rebuild rules will
not fire. They will cause strange errors later, as their format will be
incorrect.

Documentation Files

Several build tools force developers to include documentation files in
their distfiles. For example, GNU Automake checks for the existence of
README, NEWS, COPYING, and other files, although it
does not check the contents. Unfortunately, many developers create those files
to shut up errors but forget to fill them in. Although it's hard to believe, I have
found several distfiles without any kind of information, many of which are
GNOME core libraries.

Why are these files important? They provide very valuable information to the
packager. At the very least, he needs:

Description of the program: Two or three paragraphs are enough.
Ideally, this goes at the very beginning of the README file.

Rationale: Source packages usually provide a file with the description of
the package. If the packager has to write it without any reference, he may
write something inaccurate or forget to say something important.

License: Make clear the license terms under which you
have distributed your work. This often manifests itself as a COPYING
file in the top-level directory of the source tree, containing a summary of the
license that affects all the files in it.

Rationale: It's important to know which restrictions apply to your work
when creating a package. A common example is the Sun Java Virtual Machines: we can create a
package for them for personal use, but cannot redistribute it later. Plus the
source package cannot download them automatically, so the packager has to tell
the user how to do it manually.

Changes between versions: You should provide a list of
major changes between all the versions you have published. Ideally,
this goes in the NEWS file as an enumeration. Note that
ChangeLogs are conceptually different, as they detail every change
in every source file. Those are useful too, but not as much as a digest of
changes between versions.

Rationale: When updating a source package to the latest version, the
packager must know which changes happened. Guessing them is very difficult and
inaccurate, which will result in updates lacking information (something other packagers dislike). Also keep in mind that this information is very
valuable when tracking down bugs in a software project.

If you are using GNU Automake, you can tweak it to bomb out when doing a
make dist if the NEWS file is not up to date. Do this by
adding the check-news flag to the call to
AM_INIT_AUTOMAKE. You might change your configure.ac
file to include the following line:

AM_INIT_AUTOMAKE([1.9 check-news])

Note that keeping all this information in a web page is not as useful as
including it in the package. Web pages are by nature volatile, so they may
become unavailable after some time, especially if the project is abandoned or
moved from the original server.

Additionally, please be careful when writing these files. Lots of projects
include incomplete notes that are full of typos and incorrect spacing,
which suggests that the author does not care about them. These files are
usually the first thing the occasional user of your program will examine; if
they look sloppy, he will have a bad impression of your project, even if it is
coded perfectly.

Configuration Techniques

Before you can build a program from its sources, you have to tune several
details to adapt it to your system. Other times, you have to change some
default settings so that it fits your expectations. This process is known as source
configuration. Believe me when I say that all software packages
have some configurable aspect at this stage and that somebody, somewhere, will
need to change it; there are very, very few exceptions. To understand why
this is so important, consider the following scenarios:

The installation prefix must be changeable. You cannot force a
user to install a program in a specific directory. He must be able to choose
where the program will end up, because your preferred directory may not meet
his administration policies. When discussing packaging systems, consider that
the package must follow some layout policies. That is, do not assume that
/usr and /usr/local are the only possible
locations.

There must be no hardcoded paths in the sources. This includes
paths to data files, configuration files, extra libraries, and devices. All of
these are good candidates for configuration.

Suppose your program is a simple Perl
script; you have to offer the user an easy way to tell it where the interpreter
is. Using #!/usr/bin/perl won't work on many systems, as people
can install Perl in many other places (for example, /usr/pkg/bin/perl
on a default NetBSD setup).

Perhaps you think, "But... #!/usr/bin/env perl will do the
trick, won't it?" Yes, it will--sometimes. Consider multiple
versions of Perl installed on the system: Perl 5.6's binary is
/usr/bin/perl, and Perl 5.8's binary is /usr/local/bin/perl.
Now assume that you have a program that requires Perl 5.8, but you used the
line mentioned before. What happens? The script will pick up the first Perl
binary it finds in the PATH, which may not match the version your
program expects.

Remember, relying on the PATH is, generally, bad. This
is why in pkgsrc we always replace such lines with a full path to the
real Perl binary. Obviously, you can extrapolate this to any other situation
affected by absolute filenames.
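
Replacing the interpreter line with a path chosen at configuration time can
be sketched as follows. This is not the code pkgsrc uses; set_perl_path is a
hypothetical helper, and it assumes the chosen path contains neither '|' nor
'&', which sed would treat specially.

```shell
# set_perl_path SCRIPT PERL: rewrite the "#!" line of SCRIPT so that it names
# PERL, an absolute path chosen at configuration time. Interpreter arguments
# such as -w are kept.
set_perl_path() {
    tmp="$1.tmp.$$"
    sed -e "1s|^#![[:space:]]*/usr/bin/env[[:space:]]\{1,\}perl|#!$2|" \
        -e "1s|^#![[:space:]]*/[^[:space:]]*/perl|#!$2|" "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&   # overwrite in place to keep the file mode
    rm -f "$tmp"
}
```

A build step could then call set_perl_path myscript.pl /usr/pkg/bin/perl
with whatever path the configuration stage recorded.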

Besides, some programs try to cover all "known"
possibilities to locate the file they are looking for by using paths like
/usr/local/somewhere, /usr/somewhere, and
/opt/package/somewhere. Simply put, you cannot know where the
user has his stuff installed, so you need to let him specify where it is. For
example, pkgsrc places all its files under /usr/pkg by default, but this
location is configurable: a program that only looks in /usr/pkg will work on
a system using the default settings, but not on one where the location has
been changed.

There has to be an easy way to choose optional features. If
your program includes optional functionality--such as a GTK front end--there has to be an easy way for
the user to enable or disable it. This can occur automatically, but see Automatic decisions below.

Given these reasons, I hope you see the need for a configuration
framework in almost all scenarios. Without it, your program is neither
portable nor usable, because it will be very problematic to
make it work on any system different from yours.

Assuming that this has convinced you, you now have to choose which
configuration framework to use. The most common alternatives are:

A Makefile with easy-to-change variables. This is an old way to
configure software. It consists of a Makefile placed in your source
tree with a section where the user can modify some variables to specify paths,
system features, and more. This Makefile can either be the same as the top-level one, or one specially designed for configuration.

This approach works quite well if the number of customizable features is
small and you expect people to install the package manually. Note
that many novice users will find this frightening and will probably make
mistakes.

Packaging systems work in an unattended manner, so this framework is
difficult to manage. The packager has to patch the configuration
Makefile to mark lines to customize; then, the package must run
sed(1) over it to replace the previous marks with real values.

Consider a simple example: if the original Makefile includes a line saying
PREFIX=/usr/local, the packager has to change it to
PREFIX=@PREFIX@ and then use a regular expression such as
s|@PREFIX@|${PREFIX}|g to put the correct value in there.
Remember, the installation prefix must be configurable, hence the need for
dynamic replacement.

As you can imagine, these patches easily fall out of sync and must be remade
in every update of the package. Using this approach will only discourage
packagers from creating a package for your program.
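
The mark-and-replace step described above can be sketched as follows. The
helper name substitute_prefix and the file name Makefile.in in the usage note
are assumptions made for this example, and the prefix must not contain '|' or
'&', which sed would treat specially.

```shell
# substitute_prefix TEMPLATE PREFIX: print TEMPLATE with every @PREFIX@ mark
# replaced by PREFIX.
substitute_prefix() {
    sed -e "s|@PREFIX@|$2|g" "$1"
}
```

A source package would run something like
substitute_prefix Makefile.in /usr/pkg > Makefile after applying the patch
that introduces the @PREFIX@ marks.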

A configuration script. This is a very common way to configure
software and works very well if the script is smart enough. A script gathers
all the required information, either automatically or through flags given by
the user, to create the Makefiles and other files accordingly.

These scripts often use GNU Autoconf, which is usually a safe choice because
it integrates well with several packaging systems. Of course there are several
other frameworks that you can use, and if you have enough energy, you can even
create your homegrown script. Be very careful if you do this, though, as it
may not be as portable and useful as you may think.

The configuration script can do much more than the previous approach (a
static Makefile): it can check whether required dependencies are present,
whether specific functionality exists in libraries, and more. This is definitely
the best way to go, even if it requires a bit of extra work on your part.
It will not only simplify packaging, but also make your package nicer to
the end user.

From now on I will assume that your program includes a configuration script.
If it does not, well, read the reasons again. Keep reading, even if you
still resist the idea, as the concepts explained below should apply to
whichever method you use.

Configuration Script Tips

As explained earlier, a configuration script adapts the source code
of a program to build and work properly on the build host. (I will not
consider cross-compilation here, but that is often a source of problems, too.) What
kinds of details must a developer care about to make his creations
package-friendly?

The script should be noninteractive. Of course, it may require
information from the user, but this should be optional and should come
from command-line options or environment variables (see the next point).

Rationale: Passing options to a script from a packaging system is trivial;
append them to the call to the configuration script and everything will be
automatic. However, if the script requires interaction, the packaging system
must simulate it, which may be "easy" if it is command-line oriented--redirecting stdin from a previously stored file--or almost
impossible if using a utility such as dialog(1). Other solutions
include hand-patching the script, which is equally problematic.
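
For a homegrown script, noninteractive configuration can be as simple as the
following sketch. The variable names are assumptions made for this example;
the --enable-gtk-fe flag is the hypothetical option used later in this
article.

```shell
# Defaults; each one can be overridden without any prompting.
prefix=/usr/local
enable_gtk_fe=auto
: "${CC:=cc}"          # the environment may preselect the compiler

# parse_args ARGS...: read settings from command-line options only.
parse_args() {
    for arg in "$@"; do
        case $arg in
            --prefix=*)
                prefix=${arg#--prefix=} ;;
            --enable-gtk-fe|--enable-gtk-fe=yes)
                enable_gtk_fe=yes ;;
            --disable-gtk-fe|--enable-gtk-fe=no)
                enable_gtk_fe=no ;;
            *)
                echo "unknown option: $arg" >&2
                return 1 ;;
        esac
    done
}
```

A packaging system can then drive the script with a single unattended call
such as parse_args --prefix=/usr/pkg --disable-gtk-fe.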

Do not hardcode paths or other values in the sources. If you
have to put a specific path or value such as a user name or group name in
your program, do not hardcode it in the sources. This is a very good candidate
to customize from the configuration script, either through automatic detection
or through a user-specified flag.

Rationale: The paths or values you hardcoded may not be acceptable for every
system. Remember that not everything is GNU/Linux running on i386.

Be careful with hardcoded paths in the configuration script. If
you are looking for some file in a running system, you might try some common
paths. Nevertheless, let the user override these defaults if necessary.

Suppose you need to locate the xmlcatmgr utility. An incorrect
approach could be to search for it in /usr/bin, then try
/usr/local/bin, and finally abort the operation if it's not found. This is
incorrect because the application may be present in an unexpected path.

A better solution is to give the user a way to override the search path
so that he can explicitly tell the configuration script where to look; for
example, in ${HOME}/local/bin:/opt/xmlcatmgr/bin. In the case where the
user has not specified a path, falling back to your favorite built-in
directories is still a valid option.

The best solution is to let the user explicitly specify which utility to
use. In this example, that could be through an XMLCATMGR variable,
which includes an absolute path to the binary.
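
Both overrides can be combined in one lookup, as in the following sketch.
The helper find_tool and the XMLCATMGR_PATH variable in the usage note are
hypothetical names; only XMLCATMGR comes from the example above.

```shell
# find_tool OVERRIDE NAME SEARCHPATH: print the absolute path of a tool.
# OVERRIDE is the user's explicit choice (for example, the value of
# XMLCATMGR) and wins when set; SEARCHPATH is a colon-separated directory
# list that is consulted only when OVERRIDE is empty.
find_tool() {
    if [ -n "$1" ]; then
        [ -x "$1" ] || return 1
        echo "$1"
        return 0
    fi
    old_ifs=$IFS
    IFS=:
    set -- "$2" $3        # split SEARCHPATH into positional parameters
    IFS=$old_ifs
    name=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

A configuration script could call it as
find_tool "${XMLCATMGR:-}" xmlcatmgr "${XMLCATMGR_PATH:-/usr/bin:/usr/local/bin}"
and abort with a clear message when it fails. Note that an explicit but
invalid XMLCATMGR is treated as an error rather than silently ignored.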

Do not use the == operator when calling
test(1). This is a GNU extension, and it breaks on more
conservative systems such as NetBSD. Use the standard = operator instead.
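
As a small illustration, a portable string comparison uses a single equals
sign. The helper name is_yes is made up for this example.

```shell
# is_yes VALUE: succeed when VALUE is the string "yes".
# The single "=" is the string comparison operator documented for test(1);
# "==" is accepted by some implementations only.
is_yes() {
    [ "$1" = yes ]
}
```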

Automatic Decisions

An automatic decision is one made based on the software available on the
system at configuration time, without user intervention. Such decisions are
very harmful, as they make maintenance harder and often lead to incorrect
dependency tracking, which is a very serious problem in a package.

As an example, consider the following scenario: your program comes with an
optional GTK front end, and your configuration script provides an
--enable-gtk-fe={yes,no} flag to specify whether to build it. The
default action, however, is to take an automatic decision based on the presence
of GTK in the system; that is, if GTK is available, build and install the GTK
front end. (To make this more credible, this is what xchat and other programs do.)

This behavior is acceptable, and often very good, if the user is installing
your program by hand. Unfortunately, it makes things (very) difficult for
package maintainers, especially when the number of optional features is
large (gst-plugins is one such beast).

When a maintainer creates a package for a software program, he must choose a
known set of default build options for it. He does this to create the
same--or almost equal--binary packages no matter which machine they are
built on. The goal behind this is to keep a fixed dependency tree that is
easy to track properly. The common procedure to do this is as follows:

Manually analyze the available configure-time options (as given
by ./configure --help or as seen in the README file) and
the output of the configuration script.

Check which features are optional and decide whether to enable
them for the actual package.

Adapt the source package to use only the chosen dependencies, either by
giving extra flags to the configuration script or by patching it manually.
Doing the latter is often quite difficult (because configuration scripts are
pregenerated and unreadable shell code).

As you can imagine, this task is prone to error: it is easy to miss a required
dependency, especially if it is unclear (which unfortunately is the case 90 percent of the time). Think, for example, about the yacc and
lex utilities: if the packager forgets to add a dependency on
them, the end user will probably have trouble building the package. It's even
worse if the package finds an extra library and uses it but does not record
this fact anywhere. Any mistake here will surely cause trouble to end users,
who may experience build failures, extra files being installed, and so on.

Another problem appears when it is time to update the package. The packager
has to repeat the same procedure to verify that the new version has
introduced no new dependencies. If all optional features were off (or on!) by
default, this could mitigate the pain, but due to the automatic decisions
explained above, this causes a lot of headaches.

Consider gst-plugins, which I mentioned earlier. It can build a
huge number of plugins depending on the libraries and codecs available on the
system. In pkgsrc, we explicitly disable them all through configuration
arguments and select them one by one in individual packages (see its Makefile.common).
New versions of gst-plugins often come with new modules, so the set of
arguments to pass to the configuration script needs manual adjustment on every
update.

Now imagine that the packager misses the --disable-arts
argument. The aRts plugin
(libgstartsdsink.so) will build on some systems but not others due
to the automatic detection. If the packager does not have aRts in his system,
he will not add a dependency on aRts because he will not notice it. When
another user builds it on his aRts-enabled system, aRts will become a
dependency; however, this fact will go unrecorded. aRts has become a
hidden dependency of gst-plugins. Removing aRts later
will mysteriously break gst-plugins. This kind of situation is a very
serious problem that comes up over and over again.

What are some possible solutions to this dilemma?

Make the configuration script abort its process when it cannot
activate a feature because of missing dependencies. For example, if the
default behavior of xchat is to build the GTK front end, abort the
configuration process if GTK is not available. (The word default is important
here; if the default is to not enable the GTK front end, the script should not
care at all about GTK presence.) I know; this solution is too drastic because
it makes things difficult for people building by hand (though, if they are
building by hand, they should accept all the consequences ...).

Add an --enable-packager-mode (or similar) flag. Passing
this flag to the configuration script should disable all automatic decisions,
as explained in the previous solution. However, if the flag is absent, the
script should behave as usual, taking automatic decisions.

In my opinion, you should use the second solution, as it does not intrude
and is more flexible. Is it too complex? Not really. The following code
snippet adds the --enable-packager-mode flag to your own GNU Autoconf
scripts:
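
A minimal sketch of such a snippet follows; it is not necessarily the code
the author had in mind. AC_ARG_ENABLE, AS_HELP_STRING, AS_IF, AC_MSG_ERROR,
and AC_MSG_WARN are standard Autoconf macros, while PKG_PROG_PKG_CONFIG and
PKG_CHECK_MODULES come from pkg-config's pkg.m4. The variable names
packager_mode, want_gtk_fe, and have_gtk, the choice of gtk+-2.0 as the
module to look for, and the decision to build the front end by default are
all assumptions made for this example.

```m4
dnl When this flag is given, never drop a default feature silently.
AC_ARG_ENABLE([packager-mode],
    [AS_HELP_STRING([--enable-packager-mode],
                    [abort instead of silently disabling features])],
    [packager_mode=$enableval],
    [packager_mode=no])

dnl The GTK front end is built by default in this sketch.
AC_ARG_ENABLE([gtk-fe],
    [AS_HELP_STRING([--enable-gtk-fe], [build the GTK front end])],
    [want_gtk_fe=$enableval],
    [want_gtk_fe=yes])

PKG_PROG_PKG_CONFIG
AS_IF([test "x$want_gtk_fe" = xyes], [
    PKG_CHECK_MODULES([GTK], [gtk+-2.0], [have_gtk=yes], [have_gtk=no])
    AS_IF([test "x$have_gtk" = xno], [
        AS_IF([test "x$packager_mode" = xyes],
            [AC_MSG_ERROR([GTK not found; install it or pass --disable-gtk-fe])],
            [AC_MSG_WARN([GTK not found; not building the GTK front end])
             want_gtk_fe=no])
    ])
])
```

Without the flag, a missing GTK only produces a warning and the front end is
skipped, as before. With --enable-packager-mode, the same situation stops the
configuration, so the packager notices the dependency and has to either
record it or pass --disable-gtk-fe explicitly.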

What's Next

This article has introduced the problems of software packaging and why
developers should be aware of them. It discussed several issues
that are commonly found in distribution files and in documentation.
Finally, it analyzed in detail the need for configuration scripts, techniques
to implement them, and several problems that arise during their creation.

The next article will focus on the build infrastructure used by third-party
packages, as well as some code portability issues. Until then, if you are the
maintainer of a specific software project, you have enough time to apply all
the tips explained. Time to work!