Configuration Management With Autoconf, Pkgsrc, and Cfengine
============================================================
March 18, 2004
Hal Snyder
0. Introduction.
The purpose of this presentation is to
A. Describe the configuration problem.
B. Propose some solutions.
1. An example.
Some details of this example are made up, but the issues presented
are typical of what we face with any new subsystem on our VoIP platform,
such as call detail record (CDR) collection.
Suppose Jack is a programmer coding a SIP application MYAPP. When he
finishes, all we will need to do is copy the program to the production
server and start it up ...
Jack works with:
a work area: ~jack/MYAPP/...
on a server: sipsrv02
with an operating system: x86 SunOS 5.8 SP 02/2003
compiler: gcc-3.0.1 and libraries /usr/local/lib/libgcc*
freeware third party support libraries: freetds, libwww
proprietary third party speech recognition libraries
SIP stack: /net/sipdev/home/build/sip/releases/20030211-03/SunOS/5.8/i386/
When the new application is finished, i.e. programming is done
on server sipsrv02, the result must somehow be installed on the
platform. Jack has compiled the program in his home directory.
If you look at what the OS sees when he tests it, the dependencies
will look something like this:
- shared library files used when myapp runs -
libvccxml.so => /export/home/jack/MYAPP/sandbox/libvccxml.so
libvjsdbc.so => /export/home/jack/MYAPP/sandbox/libvjsdbc.so
libvvml.so => /export/home/jack/MYAPP/sandbox/libvvml.so
libmyapp.so => /export/home/jack/MYAPP/sandbox/libmyapp.so
libmozjs.so => /usr/local/mozilla/lib/libmozjs.so
libnspr4.so => /usr/local/mozilla/lib/libnspr4.so
libplc4.so => /usr/local/mozilla/lib/libplc4.so
libplds4.so => /usr/local/mozilla/lib/libplds4.so
libwwwapp.so.0 => /usr/local/lib/libwwwapp.so.0
libwwwcache.so.0 => /usr/local/lib/libwwwcache.so.0
libwwwhtml.so.0 => /usr/local/lib/libwwwhtml.so.0
libwwwutils.so.0 => /usr/local/lib/libwwwutils.so.0
libwwwcore.so.0 => /usr/local/lib/libwwwcore.so.0
libwwwinit.so.0 => /usr/local/lib/libwwwinit.so.0
libwwwmime.so.0 => /usr/local/lib/libwwwmime.so.0
libwwwhttp.so.0 => /usr/local/lib/libwwwhttp.so.0
libwwwfile.so.0 => /usr/local/lib/libwwwfile.so.0
libwwwstream.so.0 => /usr/local/lib/libwwwstream.so.0
libwwwftp.so.0 => /usr/local/lib/libwwwftp.so.0
libwwwnews.so.0 => /usr/local/lib/libwwwnews.so.0
libwwwdir.so.0 => /usr/local/lib/libwwwdir.so.0
libwwwtelnet.so.0 => /usr/local/lib/libwwwtelnet.so.0
libwwwtrans.so.0 => /usr/local/lib/libwwwtrans.so.0
libwwwgopher.so.0 => /usr/local/lib/libwwwgopher.so.0
libmd5.so.1 => /usr/lib/libmd5.so.1
libnsl.so.1 => /usr/lib/libnsl.so.1
libsocket.so.1 => /usr/lib/libsocket.so.1
libpthread.so.1 => /usr/lib/libpthread.so.1
librt.so.1 => /usr/lib/librt.so.1
libsybdb.so.3 => /usr/local/freetds-0.61-1/lib/libsybdb.so.3
libstdc++.so.3 => /usr/local/lib/libstdc++.so.3
libm.so.1 => /usr/lib/libm.so.1
libc.so.1 => /usr/lib/libc.so.1
libstdc++.so.2.10.0 => /opt/sfw/lib/libstdc++.so.2.10.0
libdl.so.1 => /usr/lib/libdl.so.1
libgcc_s.so.1 => /usr/local/lib/libgcc_s.so.1
libthread.so.1 => /usr/lib/libthread.so.1
libmp.so.2 => /usr/lib/libmp.so.2
libaio.so.1 => /usr/lib/libaio.so.1
On the development server Jack was using, the program actually was
pulling in libraries from 3 different versions of gcc at the same
time.
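A listing like the one above can be produced with ldd(1). A quick
sketch, using /bin/sh as a stand-in for Jack's myapp binary (the grep
pattern for sandbox paths is illustrative):

```shell
# Show where the runtime linker resolves each shared library the
# binary depends on -- one "=>" line per library.
ldd /bin/sh

# Flag any dependency resolved from a developer home directory, a quick
# check that a build is not coupled to someone's sandbox. grep exits
# non-zero when nothing matches, which is the "clean" result here.
ldd /bin/sh | grep "/export/home" || echo "no sandbox libraries"
```

Running the same check against myapp on sipsrv02 would immediately
surface the sandbox-coupled libraries shown above.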
When the program runs, it answers calls to a test number which is
routed through a test gateway. It registers as a test app with a
test SIP proxy. It accesses the test database authenticating with
test user and password entries. The app appends to log files which
are written into Jack's home directory. It runs as user "jack" and
accesses files like
/export/home/jack/MYAPP/sandbox/2
It is started manually and exits after each call. Each instance
requires a separate directory writable by the application, into
which log files are placed without limit.
Note this story is simplified because it makes no mention of
monitoring, call detail, failover, load balancing, resource management,
etc.
Meanwhile, out on the platform:
SunOS 5.8 refuses to install on any of the new servers we buy,
but we are able to install SunOS 5.9.
A customer on another part of the platform reports a bug that
they claim keeps them from going into production. According to
vendor release notes, the bug is fixed in a new release of the
speech recognition libraries.
When it is time to install myapp, we have to set
location on the server from which the file runs
user and group id under which myapp executes
DNIS
ANI to display to called party
initial page of XML to be executed
location of prompts used
unique identifier for each instance to run
UDP SIP port for each instance
telnet port for each instance
proxies with which to register
database: name, user, password
log directories: location, ownership, permissions, rotation scheme
In addition, there are often two or more versions of freetds,
libwww, and the C and C++ run-time libraries on the production servers.
This case is not the most complex we have dealt with. There exist
single Unix processes with over 100 configurable parameters.
2. Problems.
a. How do we make sure that a program coupled to development resources
(libraries, compilers, databases, SIP proxies, etc) will work as
desired in production?
b. How do we make sure that programs developed in a manual, prototype
setting function properly in a 24x7 shared environment? (i.e. that
they can write data where it is needed, but don't fill up all disk
space with logfiles, use up all memory, file descriptors, process
table entries, etc)
c. How do we make software dependencies manageable during development?
(How can Jack easily select one version vs. another of freetds and
track what is done?)
d. How do we set up all the needed interactions when moving into
production?
3. What does not work.
a. Copy the files and edit things manually on production servers.
b. Ghost the hard drives.
c. Jumpstart.
d. Write some scripts.
e. Keep a log of all the manual operations done.
f. Write lots of in-house how-to pages.
g. Whenever something is installed on a server, make an entry in a database.
h. Run a program that scans every server to find out what is out there.
We have tried all of these.
The next three sections discuss three technologies that will help deal
with the problems mentioned in #2.
4. Cfengine.
Cfengine is a tool for deploying and configuring software on large
numbers of servers. It originated at the University of Oslo in 1993
and has been actively maintained ever since.
Users of cfengine include Cisco, Hewitt, NASA, Nokia, Nortel,
Motorola, RedHat, and Sun.
How it works.
Each production server is a cfengine client and belongs to various
cfengine classes.
Policy host: server with files that describe what happens to each
class of client.
File masters: servers with content to be distributed to clients.
Servers may be configured to poll the policy host (we do this hourly)
or, if being maintained, to poll only when manually triggered to do
so.
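The hourly poll is just a cron job on each client that runs the
cfengine execution agent; a sketch (the install path is hypothetical
and varies by prefix):

```
# client crontab entry: once an hour, run cfexecd in the foreground,
# which fetches current policy from the policy host and runs cfagent
0 * * * * /var/cfengine/sbin/cfexecd -F
```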
We use cfengine to do the following sort of thing:
if a server is going to be a myapp server, then
install the software needed by myapp
create startup entries (inittab) for myapp
configure myapp on target host for sip domain, sip proxy, etc.
Other things cfengine handles:
locally replicated prompts and grammars
filtering of log messages for email to hosting customers
setup of rsync and ftp servers
initialization of filesystem databases
scan for appearance of new core files
creation of crontab entries
associating rec clients with the right rec servers
placement of SNMP MIBs
localhost replication of initial page routing table
maintaining users, groups, and sudo authorization
Cfengine uses pull rather than push (configuration is done when the
client requests it). This feature makes it easy to defer updates until
they are needed, and to catch up on changes if a server happens to be
offline for awhile.
Cfengine will typically overwrite files that it is supposed to manage,
but we have several areas on each server where manual edits, if
required, will not be undone.
Our policy is not to stop or start processes with cfengine - for
example installing myapp creates inittab entries, but leaves them
"off"; they have to be set to "respawn" before the interpreter is
operational.
Potentially destructive operations such as recompiling a recognition
package (with vs. without speaker verification) or generating a new
speech engine configuration file are scripted in cfengine but
only happen if a special flag is added when the utility is invoked -
"cfagent -Dspeech_engine_config" for example.
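In cfagent.conf terms, a guarded action like this is just a shell
command gated on a class that is only defined when the flag is given;
a sketch (the script path is hypothetical):

```
shellcommands:
  # runs only under "cfagent -Dspeech_engine_config"
  speech_engine_config::
    "/usr/pkg/site/bin/gen_speech_engine_conf"
```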
Advantages.
Cfengine is better than jumpstart alone because it lets us update
servers any time after installation. Note that jumpstart or an
equivalent still has a role in bootstrapping the OS and the initial
cfengine bits.
Cfengine gives us a record in the config files of what configuration
is done to which servers. That record is accurate because it was
actually used to perform the configuration. The record is maintained
under version control giving us a history of the platform. There is a
single, well-known area in CVS - netadmin/cfengine/conf - where
configuration rules for every server and every service can be found
and executed.
Problems.
Cfengine has root access on all client servers. That means it could
do immense damage if configured to do so. However, we have used it
for over a year on our VoiceXML hosting platform (about 30 servers
for most of that time) with only one instance in which production
services failed en masse - and that was before making the policy
decision not to start or stop processes from cfengine. In practice,
when a mistake occurs, the scope is limited and it can almost always
be remedied without taking services offline. There is nothing
cfengine can do that a human user could not do with admin privileges;
it's just that cfengine can do it to a lot of computers very fast.
Cfengine configuration file syntax is limited. It is difficult to
specify the ordering of certain kinds of actions and difficult to do
complex modifications of shared configuration files like inittab.
Cfengine is slow. In one recent test it was about 100 times slower
than rsync, taking almost a minute to replicate about 500 files.
Cfengine has bugs. Most of these cause it either to crash (in which
case replication succeeds the next time) or to perform a replication
when none is needed. I keep a list of defects I've found in cfengine -
at present it has over 30 entries.
On balance.
Cfengine has saved us a huge amount of work in the past year. While
we are still learning the best way to organize the policy files,
we should continue to use it. We are learning to work around its
limitations.
Simplified example.
When creating a new type of server, edit cfagent.conf:
# myapp hosts
myapp_server::
/var/cfengine/inputs/cf.myapp
Create cf.myapp and put things in it like this:
groups:
has_myapp = ( ReturnsZero(/usr/pkg/sbin/pkg_info -q -e myapp-0.8.1) )
which says a server is in a certain group if it has the myapp package.
shellcommands:
!has_myapp.ipv4_192_168_32::
"/bin/true;PATH=$(cf_path2) SIP_DOMAIN=site_a.local
MYAPP_HOME=/usr/pkg/site/myapp /usr/pkg/sbin/pkg_add
ftp://cfesrv01.local/pub/pkgbin/i386sol-8/myapp-0.8.1.tgz" umask=022
the above line says if the server is on the Chicago production network
and does not have the myapp package, then install myapp and all packages
required by myapp. Create any users, groups, and directories needed and
make the necessary edits on the end server.
Now suppose you want to configure server sipsrv01 to run myapp. Add the
following to cfagent.conf
myapp_server = ( sipsrv01 ... )
and either wait for regularly scheduled replication or log onto
sipsrv01 and do
sudo cfagent
To put the server into production, log on, edit occurrences of "off"
to "respawn" in /etc/inittab, and do
sudo kill -1 1
and you have added another server to the platform.
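The "off" to "respawn" edit is mechanical and can be scripted; a
sketch run against a scratch copy of inittab (the my1/my2 entry ids
and the myapp paths are hypothetical):

```shell
# Build a scratch inittab with two myapp entries left "off", as the
# cfengine install step would leave them.
cat > /tmp/inittab.demo <<'EOF'
my1:3:off:/usr/pkg/site/myapp/bin/myapp -i 1
my2:3:off:/usr/pkg/site/myapp/bin/myapp -i 2
EOF

# Flip myapp entries from "off" to "respawn". On the real server the
# result would be moved into place as /etc/inittab before "kill -1 1".
sed 's/^\(my[0-9]*:[0-9]*:\)off:/\1respawn:/' /tmp/inittab.demo
```

The anchored pattern touches only the myapp entries, so unrelated
inittab lines are left alone.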
5. Pkgsrc.
What it is.
Pkgsrc is a suite of tools developed by the NetBSD core team for
packaging software prior to installation, similar to RPMs on Linux
and Solaris packages on SunOS.
The focus of the NetBSD project is portability. The OS runs on 54
different architectures. Writing portable software enforces a
discipline that makes you pay attention to architectural issues you
wouldn't notice as quickly otherwise. The pkgsrc system, unlike any of
the other major packaging systems, runs on every modern Unix-family
OS. A port to Windows Interix is in early development.
Pkgsrc is derived from the FreeBSD "ports" system. FreeBSD ports have
been in use since 1993 and today allow users of FreeBSD access to over
10,000 software packages. The OpenBSD packaging system is also based
on FreeBSD ports.
Like other packaging systems, pkgsrc allows you to specify a list of
files with destination directory, ownership, and permissions. A
database is kept of files which are installed, so that packages can be
uninstalled. Dependencies among packages are tracked: installation of
a package will not succeed if prerequisites are not present or
installable; a package will not be uninstalled under default settings
if it is required by another installed package.
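The file list lives in the package's PLIST, one installed path per
line relative to the installation prefix; a hypothetical sketch for
myapp:

```
@comment $NetBSD$
bin/myapp
lib/libmyapp.so
lib/libvccxml.so
share/myapp/README
```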
Pkgsrc can create users and groups with specified id numbers. It can
apply patches when files are built on a staging server as well as when
they are installed on the target server. It represents a far more
comprehensive interface than RPM's, because RPM relegates much of the
detail, beyond the copying of files, to ad hoc installation scripts.
Pkgsrc allows us to keep a depot of version-tagged packages on a
master server. Cfengine can then install those packages as needed. The
pkgsrc settings allow finer control of configuration settings than
cfengine does.
Here's a listing of some packages installed with pkgsrc on the current
production myapp servers:
sipsrv05.local>pkg_info |tail| sort
erlang-9.2nb1 Concurrent functional programming language
freetds-0.61.2 LGPL'd implementation of Sybase's db-lib/ct-lib/ODBC libs
gcc-2.95.3nb4 GNU Compiler Collection, version 2
libwww-5.4.0 The W3C Reference Library
moz-lib-1.0 mozilla libs needed for myapp
openssl-0.9.6l Secure Socket Layer and cryptographic library
p5-DBD-Sybase-0.94nb2 Perl DBI/DBD driver for Sybase/MS-SQL databases
thttpd-2.23.0.1nb1 Tiny/turbo/throttling HTTP server
myapp-0.8.1 myapp main program and dedicated shlibs
myapp-wav-1.0 generic prompts for myapp
The freetds package includes several local adaptations made to the library.
Pkgsrc can be used to deploy binary only packages from vendors and to
build packages from source for any target architecture.
Pkgsrc and its relatives on the other BSDs have been used for nearly a
decade by thousands of developers to solve many of the deployment and
configuration problems facing us today. It represents a colossal
investment of labor by a large number of advanced programmers. We
would be foolish to ignore it.
Using pkgsrc.
Create a distfile, a version-labeled tar archive of files to be
deployed, like
myapp-0.8.1.tar.gz
and put the distfile into the distfile depot
cfesrv04 /u1/ftp/pub/pkgsrc/distfiles>sudo scp .../myapp-0.8.1.tar.gz .
Create a directory in your work area
~/work/pkgsrc/site/myapp
Create a stub package
url2pkg ftp://cfesrv04:/pub/pkgsrc/distfiles/myapp-0.8.1.tar.gz
Edit Makefile and DESCR. Special functions such as creating
directories and users on the target host are configured here, as are
dependencies on other packages.
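A minimal package Makefile might look like the following sketch.
Everything here is hypothetical for myapp; DEPENDS, PKG_USERS, and
OWN_DIRS are the standard pkgsrc hooks for prerequisites, target-host
users, and writable directories:

```
# Makefile (sketch)
DISTNAME=       myapp-0.8.1
CATEGORIES=     site
MASTER_SITES=   ftp://cfesrv04/pub/pkgsrc/distfiles/

MAINTAINER=     ops@site_a.local
COMMENT=        myapp main program and dedicated shlibs

DEPENDS+=       freetds>=0.61:../../databases/freetds
DEPENDS+=       libwww>=5.4.0:../../www/libwww

GNU_CONFIGURE=  yes

PKG_USERS=      myapp:myapp
OWN_DIRS=       ${PREFIX}/site/myapp/log

.include "../../mk/bsd.pkg.mk"
```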
Record a checksum for the distfile.
bmake makesum
Test install and deinstall.
bmake install
bmake deinstall
Make the package.
bmake package
Put the binary package into the pkgbin depot.
cfesrv04 /u1/ftp/pub/pkgbin/i386sol>sudo scp .../myapp-0.8.1.tgz .
The package may now be installed manually or via cfengine with
sudo pkg_add ftp://cfesrv04:/pub/pkgbin/i386sol/myapp-0.8.1.tgz
and may be deinstalled with
sudo pkg_delete myapp
Deleting a package will not remove files that were edited or added
after installation.
Cost.
Most of the cost in creating a pkgsrc package will be in getting
the Makefile right. For someone familiar with the process, it takes
from an hour to a day, depending on the complexity of the project.
This week, in a couple evenings, one programmer created the following
pkgsrc packages from existing code:
cloud_mon-1.0 Erlang Cluster Monitor
cdr_client-2.0 Call Detail Record (CDR) System Client
cdr_mapper-1.0 Call Detail Record (CDR) System Mapper
cdr_call_state-2.0 Call Detail Record (CDR) System Server
cdr_subscriber-1.0 Call Detail Record (CDR) System Subscriber
cdr_spool-2.0 Call Detail Record (CDR) System Spooler
www_tools-1.0.1 WWW Tools
resource_manager-1.0 Resource Manager
6. Autoconf.
GNU Autoconf has been at the underpinnings of open source since 1991.
It is used to adapt software to variations in Unix-like operating
systems. Generally, when you download a package, you get something
like foo-1.13.tar.gz. You extract files from that archive, then type
sh configure
make
and the program compiles. The package downloaded is a collection of
source files, traditionally C or C++. The invocation of the "sh
configure" script makes the adjustments needed for the particular
environment you use when building the program.
Thus, a programmer can write programs on Solaris, and use autoconf to
make it easier to build those programs on Linux or FreeBSD, or even a
different release of Solaris. There is considerable support for
autoconf on Windows.
Nearly every open source program we use was built with the aid of
autoconf. The list includes
gcc
apache
tomcat
erlang
perl
openssl
freetds
libwww
net-snmp
tcpdump
zebra
and hundreds more. With none of these do we worry about whether we are
building on RedHat Linux or Solaris or FreeBSD. In fact this is one
area where the proprietary vendors are decades behind the times.
The main advantages of autoconf for us today come not from portability,
but the ability to standardize on compile-time dependencies. Autoconf
macros are written to find needed libraries in their default
locations, but these locations may be overridden at configuration
time, for example
sh configure --prefix=/usr/pkg --with-freetds=/usr/pkg/freetds-0.61-1
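Supporting such an override takes only a few lines in configure.ac.
A sketch: AC_ARG_WITH, AC_CHECK_LIB, and AS_HELP_STRING are standard
autoconf macros; the flag handling and error text are ours:

```
AC_ARG_WITH([freetds],
  [AS_HELP_STRING([--with-freetds=DIR], [root of the freetds installation])],
  [CPPFLAGS="$CPPFLAGS -I$withval/include"
   LDFLAGS="$LDFLAGS -L$withval/lib"])

AC_CHECK_LIB([sybdb], [dbinit], [],
  [AC_MSG_ERROR([freetds db-lib (libsybdb) not found])])
```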
The autoconf toolset includes libtool, which we can use to our
advantage: native checking for required versions of shared libraries
(built into modern Unix operating systems) can replace the various
ad hoc "magic number" techniques used in our legacy software.
There is a high degree of synergy between pkgsrc and autoconf, because
large parts of pkgsrc were developed to assist in installation of
software products developed under autoconf.
Like pkgsrc, autoconf is the result of a huge amount of development,
testing and use. Think of it as $1M in free software consulting time.
We should not again undertake the creation of a build system without
at least taking a long, hard look at autoconf.
Example.
Suppose you have a program working - it is in CVS with files that look
like this:
Makefile
main.c
header.h
xyz.c
then
autoscan (creates configure.scan)
edit configure.scan -> configure.ac
edit Makefile -> Makefile.am
this step may require creating aclocal macros
autoreconf -fiv
creates config.h.in, configure, aclocal.m4
code may be added here to deal with variations in OS, etc.
sh configure --prefix=/usr/pkg
gmake
gmake dist (creates foo-1.6.tar.gz or such, the source distfile)
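For the four-file project above, the two hand-written inputs can be
as small as the following sketch (package name and version are
placeholders matching the foo-1.6 example):

```
# configure.ac
AC_INIT([foo], [1.6])
AM_INIT_AUTOMAKE([foreign])
AC_PROG_CC
AC_CONFIG_HEADERS([config.h])
AC_CONFIG_FILES([Makefile])
AC_OUTPUT

# Makefile.am
bin_PROGRAMS = foo
foo_SOURCES  = main.c header.h xyz.c
```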
Cost.
The work in making a project autoconf-compatible lies in creation
of configure.ac, Makefile.am, and supporting aclocal.m4 files.
Complex projects such as myapp could take a week or more. Simpler
projects such as most Erlang modules or our modifications to freetds
take less than an afternoon for someone familiar with the process.
Making the code truly portable will require extra effort - autoconf
can only help find parts of the program that will need to support
variations in the OS and make it easier to do so.
7. Migration.
One advantage common to each of the three systems is that they can
be deployed incrementally. Each can coexist with legacy processes
so that conversion can proceed at whatever rate is feasible.
All three tools, but especially autoconf, have evolved a set of coding
guidelines that make programming more amenable to portability and
configurability. One of the best reasons for using these tools is that
they will help us to write better software.
8. Other issues.
a. Windows. Many autoconf projects, including the Erlang platform,
offer Windows compatibility. But we have not explored this area.
Also, we know we must support some MS deployment tools such as WMI.
b. Delegation of responsibility. We would like configuration for
types of servers to be managed by teams specific to those projects.
One approach is to permit modification of selected cfengine files
to the team in question.
c. Data areas. Some collections of files are too large for regular
replication with cfengine, but change too rapidly for repackaging
to be practical. One example is the collection of prompts used by
an application. Probably we should use another replication tool for
such things, such as rsync. Network services such as SFS (I would
not use bare NFS in mission critical applications) or custom servers
could also be used.
d. Backing out configuration. Sometimes we want to change a server,
say from call push to vapp. Some dependencies are not obvious, such as
incompatible sets of recognition packages. Probably it will never be
cost effective to automate all aspects of deinstallation - someone
will just have to know that placing a server into server class A means
some special intervention is needed if we ever want it in class B
instead.
e. Updates. Updating something like the OS (Solaris 8 to Solaris 9)
or the speech recognition libraries still presents challenges. About the
best we can do now is peel off a server into a new cfengine class,
test proposed changes to it, and gradually migrate other servers
into the new class. Sometimes you have to do several servers at
once because clients and servers need to be upgraded at the same time.
9. Suitability for our customers.
Are we locking ourselves into a system of using software which will
cause problems with customers?
Probably if you go to some customers and try to tell them all about
cfengine or pkgsrc, you will scare them off.
On the other hand, I know of no Unix development shop where autoconf
tools do not play some role.
I think that if we present a working system of large-scale
configuration and deployment, it will add great value to our products.
The tools presented also make production of releases and tracking of
dependencies a routine task, so that it will be much easier to
productize our software for external use.
There is nothing requiring a customer to adopt cfengine if they don't
want it. They can always copy and edit the same files by hand or
substitute another deployment tool of their choosing. Similarly with
pkgsrc - they don't have to use packages - we can install them and
just tar up the installed files.
10. Licensing.
IANAL. Autoconf and cfengine are GNU projects. I believe that means
that if you ship them, you need to make source for autoconf or
cfengine available (not the rest of your company software), and
that if you modify autoconf or cfengine and ship those modifications,
then those modifications need to be made available.
Pkgsrc is a BSD project. I believe you can do anything you want with
it, as long as you include appropriate attribution in your product.