Release Notes and Instructions for updated EDTK and Berkeley DB driver.
By Chris Newcombe
All of the work on the new version of EDTK was motivated by the need
for a complete 'production quality' driver for Berkeley DB. Both were
developed together. This document describes the changes.
[Obligatory disclaimer: all opinions and statements in the
documentation and code are my personal opinions and statements --
i.e. are not necessarily those of my employer.]
Please also read these:
TODO (some important details)
doc/
EDTK_BerkeleyDB.ppt (some rationale and detail)
examples/berkeley_db/
berkeley_db_api_support_status.txt
First, many thanks indeed to Scott Lystig Fritchie for writing the
original EDTK and berkeley_db driver -- it has saved me a huge
amount of time.
Now a ***SERIOUS WARNING*** (ignore at your peril)
If you are not familiar with using Berkeley DB then you should read
its documentation very carefully before using the berkeley_db driver.
Product home: http://www.oracle.com/database/berkeley-db/index.html
Docs: http://www.oracle.com/technology/documentation/berkeley-db/db/index.html
Also, the various public Berkeley DB discussion forums are an
excellent place to ask questions about API usage and application
design:
Main forum: http://forums.oracle.com/forums/forum.jspa?forumID=271
HA (replication) forum: http://forums.oracle.com/forums/forum.jspa?forumID=272
comp.databases.berkeley-db: http://groups.google.com/groups/dir?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=comp.databases.berkeley-db
Berkeley DB is a flexible toolkit for building fast, robust
datastores. It has a very convenient and powerful API. But that API
is exposed as procedural calls to library functions (i.e. not as a
declarative language like SQL). And almost all of those library
functions have critical pre-conditions which must be enforced by the
application.
When correctly used, Berkeley DB is *extremely* robust and safe (its
high code quality makes it safe to use the berkeley_db driver in
'linked-in' mode -- i.e. loaded into the Erlang VM's own OS process --
which is very important for performance). And Berkeley DB does in
fact catch & report many types of erroneous use (i.e. application
bugs).
But to maximize performance (as performance is one of its main
features), Berkeley DB does NOT check or prevent *all* types of
erroneous use.
Certain classes of application bug (e.g. closing a database or a
cursor while it is in use by a transaction) can lead to undefined
behaviour such as bus-errors (segv's) **and even irrevocable
data-corruption** (even when using replication -- i.e. the corruption
may be propagated to all replicas).
[Note: running the Erlang/BDB driver in "spawned program" mode -- aka
"pipe-mode" -- will only protect the Erlang VM from crashing if
Berkeley DB crashes. But it will NOT protect your data (stored in
Berkeley DB) from being corrupted due to incorrect use of the Berkeley
DB API. The only way to avoid that is to use Berkeley DB correctly --
follow the documentation carefully, and ask for clarification if it is
not clear, and make backups of your data (Berkeley DB supports hot
backups of live databases).]
Of course, this lack of resilience to some kinds of application bug is
clearly at odds with an important motivation for using Erlang --
building reliable systems in the presence of bugs.
However, the above warnings most often apply to explicit, direct,
incorrect use of the Berkeley DB API -- i.e. sins of commission that are
easy to avoid once you have thoroughly read the documentation (and
appreciate the danger).
Most importantly, the driver (and helper layers, mentioned below)
*does support* the standard (and vital) Erlang practice of 'crashing'
(exiting) by default if something unexpected occurs. Great pains have
been taken in both EDTK and the driver code to ensure that it is safe
to use Erlang as it was intended when using Berkeley DB (i.e. that
appropriate cleanup happens after a process crash, and that all
Berkeley DB preconditions and invariants remain satisfied during that
cleanup).
Also, the driver now includes several 'helper layers' (e.g. the
port_coordinator and replication_group server) which add significant
extra layers of convenience and protection (e.g. the port_coordinator
coordinates startup and shutdown of BDB, and owns and manages database
handles, ensuring that those handles will not be closed if a
transaction is running).
But even these layers don't guarantee to catch all potential mis-use
(e.g. explicitly committing or aborting a transaction that still has open
cursors). You still need to know what you are doing when using the
API.
BDB also has some particularly complex features (e.g. distributed
transactions and replication) that can have very unpleasant failure
modes if you are not careful (inc. deadlocked applications, loss of
data, or unrolling of previously committed transactions). So before
using those features it is essential that you read the relevant docs.
If you are unsure about anything, there is usually a working example
in the test suite
examples/berkeley_db/berkeley_db_test.erl
[There are instructions for running the tests at the top of that
file.]
Status/Quality Of This Release
This code is pretty much functionally complete -- all major Berkeley
DB APIs are now supported; see
examples/berkeley_db/berkeley_db_api_support_status.txt
Some APIs have been adjusted (e.g. berkeley_db_replication_group_helpers.erl)
and various bugs have been fixed (including in Berkeley DB itself
-- all via official Oracle patches).
The goal is to make this code rock-solid and sufficiently performant
for use in high-end mission-critical systems.
There is an extensive functional/regression-test suite, and the BDB
driver is now used in some real applications, some of which have
had several months of intensive testing. So the driver appears to be stable.
But as always, Caveat Emptor. Do your own testing, with your own use-cases,
and please report any bugs that you see.
Important: BDB exposes a LOT of tuning parameters, many of which have
such dramatic effects on latency and throughput that they can be
'destabilizing' from an application's point of view (e.g. cause timeouts)
if they are set inappropriately.
The parameters often need to be tuned to achieve particular application
goals (levels of concurrency, latency, throughput) on specific hardware.
Tuning the parameters takes familiarity with the BDB documentation and
often quite a lot of experimentation. To make experimentation as easy
and painless as possible, the driver exposes all BDB tuning parameters
in a logical and convenient way. I strongly recommend that you become
familiar with these parameters.
Sidebar: EDTK now has a lot of mostly-independent features, so the
combinatorics of testing are pretty heavy.
For example:
- The berkeley_db driver requires use of private threadpools (to
avoid various types of deadlocks). So it has been quite a while
since I tried compiling a driver with "default_async_calls=0", or
with "default_async_calls=1" but "private_async_threadpool=false".
Given the problems with the native Erlang
threadpool (e.g. just the danger of getting in the way of standard
drivers like efile_drv), I'd recommend that most drivers use the
private threadpool feature.
- The berkeley_db driver uses 'shared' (thread-safe) valmaps for
DB_ENV and DB, and 'nested' (parent/child relationship) valmaps
for DB_TXN. Both of those EDTK features can be combined, but not
with BDB, so that combination has not been tested (even compiled)
-- although I believe I did write all of the necessary code for
that combination.
Note on the other 'example' drivers.
Only the Berkeley DB driver has been compiled and tested with this
new release of EDTK.
The other 'example' drivers that shipped with EDTK v1.1 have **NOT**
been upgraded to this version. Indeed, I have never even compiled
those drivers. Therefore those other drivers may now be slightly
broken. It shouldn't take too much work to get them working again
because almost all of the new features in EDTK can be turned off,
and the internal APIs are largely unchanged. If the other
drivers are upgraded, then they can immediately take advantage of
new EDTK features such as private threadpools, shared (thread-safe)
valmaps, parent/child valmaps etc.
If the target library has similar requirements to Berkeley DB
(e.g. high cost to creating a port instance, need for coordinated
startup/shutdown, advantage in sharing expensive but thread-safe
resources) then it would be appropriate to write a port-coordinator
for that library -- adapting the BDB port-coordinator as a starting
point. (There is a case for making the port-coordinator into a
generated file -- i.e. a gslgen template -- because the code to
forward 'pipe mode' messages to the original sender of the request
is really part of EDTK infrastructure (essentially part of the
implementation of the receive_reply function in erl_template.gsl)
and should be managed as such.) Almost all of the rest of the code
in berkeley_db_port_coordinator.erl is BDB-specific.
Requirements and Supported Platforms
- Operating System
I have only tested this on Red Hat Enterprise Linux v3 (the only
unix I have access to).
I believe it should work unchanged on more recent versions of linux,
and other unix platforms that support posix threads.
The original EDTK v1.1 code was probably not compatible with Windows
without some changes (due to use of pthreads), and that remains true
for this version -- in fact probably more so, due to the way
threadpools are implemented (under Windows the argument to
driver_select must be an event handle, not a pipe handle). I think
that Windows support should be fairly straightforward, but I won't
have time (or need) to do it myself in the foreseeable future.
- Erlang Version
EDTK now requires Erlang R11B-3 or later, as the ErlDrvBinary
structure changed due to support for SMP. EDTK supports both smp
mode and non-smp mode. The BDB driver automatically uses some
optimizations in smp mode as some erl_driver functions are now
thread-safe. However, in testing the non-smp driver seems to
slightly out-perform the smp driver, presumably due to lock
contention.
Intensive testing has been done with R11B-3 in non-smp mode.
Significant testing was done with R11B-3 in smp mode. No problems
were found.
Some testing (e.g. regression tests and application tests but not
stress/performance testing) has been done with R11B-4 in both
non-smp and smp modes. No problems were found.
**IMPORTANT MAINTENANCE ISSUE**
edtk/erl_driver_pipelib.h now contains some structures and macros copy/pasted
from private Erlang VM code. These are required to make 'pipe mode' work:
they allow pipe_main.c to pretend (to the generated driver
shared-library) that it is the Erlang VM.
If these structures change in future Erlang releases, then edtk/erl_driver_pipelib.h
must be updated.
See also the long comments about driver_*_binary() APIs in edtk/erl_driver_pipelib.c
and this mailing-list thread:
http://www.erlang.org/pipermail/erlang-questions/2006-October/023500.html
- Template language
The driver still requires GSLgen from iMatix.
It uses the same version that EDTK v1.1 uses.
If you don't already have it, then the README file (Scott's original)
contains instructions.
I seem to remember that it was a little fiddly, so here are my notes
that I recorded at the time.
cd /home/$USER/erlang/downloads/edtk
Download http://www.imatix.com/pub/sfl/src/sflsrc21.tgz
mkdir sfl; cd sfl
gunzip -c ../sflsrc21.tgz | tar -xvf -
chmod a+rx c build
export PATH=$PATH:/home/$USER/erlang/downloads/edtk/sfl
./build
Download http://www.imatix.com/pub/tools/gslsrc20.zip
mkdir gslgen; cd gslgen
unzip -a ../gslsrc20.zip
cd src
chmod a+rx c build
cp ../../sfl/*.h .
cp ../../sfl/*.a .
./build
If you choose a different directory, edit the GSLGEN_EXE variable
in examples/berkeley_db/Makefile accordingly.
- The BDB driver now requires the following support libraries.
[If you install them anywhere other than /usr/local you will need
to alter paths in examples/berkeley_db/Makefile and
examples/berkeley_db/releases/make-release.sh]
cd ~/erlang/downloads/edtk
Download from http://www.pcre.org/
tar -zvxf pcre-6.7.tar.gz
cd pcre-6.7
./configure
make 2>&1 | tee make.out
sudo make install 2>&1 | tee make-install.out
and
cd ~/erlang/downloads/edtk
Download from https://sourceforge.net/projects/goog-coredumper/
tar -zvxf coredumper-0.2.tar.gz
cd coredumper-0.2
./configure
make 2>&1 | tee make.out
sudo make install 2>&1 | tee make-install.out
- The BDB driver now requires BDB v4.5.20, which can be downloaded from:
http://www.oracle.com/technology/software/products/berkeley-db/index.html
IMPORTANT: this release of the driver also requires all of the patches in
examples/berkeley_db/patches-to-berkeley-db/db-4.5.20
There is a script to apply these in the correct order -- see below.
Most of the patches are 'official' patches provided by
Sleepycat/Oracle, and will be incorporated into later public
releases of BDB.
Patches should be applied in the root of the unpacked BDB tree,
before configuring and building BDB or the EDTK berkeley_db driver.
The patches should all apply cleanly; i.e. no 'hunk offset' warnings
from the patch utility.
e.g. (adjust paths as necessary)
cd /home/$USER/erlang/downloads/edtk
tar -zxvf db-4.5.20.tar.gz
cd db-4.5.20
~/erlang/downloads/edtk/edtk-1.5/examples/berkeley_db/patches-to-berkeley-db/db-4.5.20/apply-patches.sh
cd build_unix
../dist/configure --enable-debug --prefix=/home/$USER/erlang/downloads/edtk/BerkeleyDB.4.5
make
make install
**IMPORTANT**
Your operating system very likely ships with an older version of Berkeley DB.
Mixing different 'default' installations of Berkeley DB on the same host
can be very confusing, as the contents of db.h change between versions, and
that header may be compiled into existing programs.
Therefore I use the --prefix argument to 'configure' to pick a custom
installation directory. (The BDB library will be packaged as part of the
berkeley_db application anyway.)
If you do this but pick a different directory then you may need to
edit berkeley_db/Makefile to set BDB_INSTALL_DIR to the directory
that you chose.
Then build EDTK and the berkeley_db driver, and run the tests:
cd /home/$USER/erlang/downloads/edtk/edtk-1.5/examples/berkeley_db
pushd ../../edtk; make clean; make; popd; make clean; make
rm regression.out; make regression 2>&1 | tee regression.out
**IMPORTANT MAINTENANCE ISSUE**
Future releases of BDB will require that constants in berkeley_db.xml be adjusted.
Diff the generated (installed) db.h file from BDB 4.5 with the db.h file from
the target BDB release to see what must be changed.
Architecture Overview
First, please read Scott's original README file which has pointers to
the original EDTK documentation, including his excellent paper.
The following overview begins with a brief, very high-level ('manager
level') summary of how Erlang applications are structured, and how we
want them to interface to BDB. It repeats/summarizes a little of the
overview in Scott's paper.
The later part of the overview explains the new 'helper layers' of the
BDB driver that were not present in EDTK v1.1. Everyone using the BDB
driver should read that part.
Quick Introduction:
The Erlang VM runs its own scheduler for Erlang processes. An Erlang
process is similar to a user-level thread but it is totally isolated
-- it can only communicate with other processes via message-passing;
i.e. no shared memory or mutexes. An Erlang application typically
consists of hundreds (perhaps thousands or even millions) of these
lightweight processes. The idea is to have one process for every
concurrent thing in the domain of your application (the 'active
object' model). The Erlang VM uses either a single OS thread, or a
single OS thread per cpu, to run schedulers for Erlang processes.
Erlang processes are scheduled preemptively. They can 'crash' without
disturbing other processes (due to no use of mutexes/shared memory).
Processes can also monitor each other and get messages or asynchronous
signals if another process crashes or exits.
Erlang interfaces to the outside world via 'port' objects which are
simply message-channels. To use an external library you have to write
code to make the library look like Erlang processes -- it can only
communicate by sending and receiving messages, and must not block the
Erlang VM process/thread. Once a library has such a message-passing
interface, it can be used by 'linking it in' to the Erlang VM as a
shared object, or spawning another program (another OS process) and
communicating with it over a pipe (over the spawned process' stdin and
stdout).
However, it's not quite as simple as just creating a message-passing
interface. Another important part of Erlang is the high degree of
isolation between Erlang processes -- e.g. lack of
locking/mutexes. But some libraries expose APIs that may block for a
long time -- even indefinitely. e.g. Berkeley DB implements a
page-oriented database, and includes a locking subsystem to control
access to those pages. Therefore, if multiple Erlang processes use
BDB concurrently (which we definitely want), those Erlang processes
can acquire locks, and block against locks held by other Erlang
processes. This applies to a Berkeley DB application written in any
language (it's not a property of EDTK or the driver.)
Also, Berkeley DB obviously does a lot of disk IO, but it does not
support asynchronous IO because it is a highly portable library and
there is no standard async IO API across operating systems. For the
same reason, Berkeley DB does not even create any OS threads itself
(sidebar: one exception to the latter rule is the new 'replication
manager' convenience layer).
Therefore, Berkeley DB will often block the application's own OS
threads -- when blocking against a logical lock, or when doing IO (the
latter applies to many kinds of libraries of course, and the new
features in EDTK will help when writing drivers for such libraries).
Of course it is critical that the Erlang VM's scheduler threads are
never blocked by BDB. The standard, supported way to achieve this is
using the Erlang 'async thread pool' ('erl +A' and the driver_async()
API). Unfortunately that mechanism is not flexible enough for the
Berkeley DB driver (and using it would risk interfering with other
important drivers like efile_drv). So EDTK now implements private
threadpools, and multiplexes commands across those pools. Each driver
instance (port) has its own set of threads, and the pools are
resizeable at runtime. (This support only requires the standard Erlang
VM and APIs -- it does not require any modifications to the Erlang
VM.) The BDB driver currently uses 4 threadpools; see comments at the
top of berkeley_db.xml for details.
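As a rough illustration of the pattern (this is NOT EDTK's actual code -- all names below are invented, and the doubling 'work' stands in for a blocking BDB call), a private threadpool boils down to a mutex-protected command FIFO plus a pipe used to wake the VM:

```c
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>

struct job { int arg; int result; struct job *next; };

struct pool {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    struct job *req_head, *req_tail;    /* pending commands (FIFO) */
    struct job *done_head, *done_tail;  /* replies awaiting collection */
    int fd[2];                          /* fd[0] is what driver_select() would watch */
    pthread_t worker;
    int stop;
};

/* Worker: dequeue a command, run the (potentially blocking) call,
 * queue the reply, and write one byte to the pipe to wake the VM. */
static void *worker_main(void *arg)
{
    struct pool *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (!p->req_head && !p->stop)
            pthread_cond_wait(&p->nonempty, &p->lock);
        if (!p->req_head) {             /* stop requested, queue drained */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        struct job *j = p->req_head;
        p->req_head = j->next;
        if (!p->req_head) p->req_tail = NULL;
        pthread_mutex_unlock(&p->lock);

        j->result = j->arg * 2;         /* stand-in for the blocking BDB call */

        pthread_mutex_lock(&p->lock);
        j->next = NULL;
        if (p->done_tail) p->done_tail->next = j; else p->done_head = j;
        p->done_tail = j;
        pthread_mutex_unlock(&p->lock);

        char c = 1;
        (void)write(p->fd[1], &c, 1);   /* triggers ready_input in a real driver */
    }
}

void pool_start(struct pool *p)
{
    p->req_head = p->req_tail = p->done_head = p->done_tail = NULL;
    p->stop = 0;
    (void)pipe(p->fd);
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->nonempty, NULL);
    pthread_create(&p->worker, NULL, worker_main, p);
}

void pool_enqueue(struct pool *p, struct job *j)
{
    pthread_mutex_lock(&p->lock);
    j->next = NULL;
    if (p->req_tail) p->req_tail->next = j; else p->req_head = j;
    p->req_tail = j;
    pthread_cond_signal(&p->nonempty);
    pthread_mutex_unlock(&p->lock);
}

/* What the driver's ready_input callback would do: consume one wakeup
 * byte, then collect one completed command. */
struct job *pool_collect(struct pool *p)
{
    char c;
    (void)read(p->fd[0], &c, 1);
    pthread_mutex_lock(&p->lock);
    struct job *j = p->done_head;
    p->done_head = j->next;
    if (!p->done_head) p->done_tail = NULL;
    pthread_mutex_unlock(&p->lock);
    return j;
}

/* Enqueue njobs commands, collect all replies; returns how many came
 * back with the expected result. */
int pool_demo(int njobs)
{
    struct pool p;
    struct job *jobs = calloc(njobs, sizeof *jobs);
    int i, ok = 0;
    pool_start(&p);
    for (i = 0; i < njobs; i++) { jobs[i].arg = i; pool_enqueue(&p, &jobs[i]); }
    for (i = 0; i < njobs; i++) {
        struct job *j = pool_collect(&p);
        if (j->result == j->arg * 2) ok++;
    }
    pthread_mutex_lock(&p.lock);
    p.stop = 1;
    pthread_cond_signal(&p.nonempty);
    pthread_mutex_unlock(&p.lock);
    pthread_join(p.worker, NULL);
    free(jobs);
    return ok;
}
```

In a real driver the read end of the pipe is handed to driver_select(), replies are collected from the ready_input callback, and resizing a pool at runtime amounts to starting or stopping worker threads.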
Note that when an Erlang process 'blocks' against Berkeley DB, it is
actually sitting in a receive statement (with a timeout), waiting for
a reply from the driver. So to an application, blocking against a BDB
lock looks much the same as attempting to access a file on disk, or a
connection to a client/server database like MySQL or PostgreSQL, or
simply waiting for a reply from a gen_server:call. But to get decent
performance (avoid unnecessary blocking), applications using BDB do
need to consider access patterns and lock contention. Note that there
is one difference between waiting for the result of a BDB command and
waiting for the result of a normal Erlang gen_server:call -- BDB
commands are processed in threadpools of finite size, which therefore
may become backlogged or saturated. There are various mechanisms in
place to handle that; e.g. the threadpools can be resized at runtime
(while in use), and applications can query the current length of the
queues (to decide whether to add more threads). See also the TODO
file for details of some subtle issues that can arise with threadpool
saturation.
The role of EDTK:
Much of the code in port drivers is tedious boilerplate. EDTK is a
code-generation tool that takes a declaration of the target library's
API (e.g. see examples/berkeley_db/berkeley_db.xml) and some template
files containing boilerplate, and produces the Erlang and C code that
implements the message-passing interface for the library. It
generates the following code:
Erlang code
- Provides Erlang 'stub' functions that map to library APIs.
- The stub functions marshal arguments into a message. (A new
EDTK feature is that we now also pass a tag that contains a
unique identifier that can be used to associate a reply with its
request, and the sender of the request.)
- Sends the message to the port.
- Waits for the reply. (It is now possible to generate non-blocking
variants of APIs -- these return the request 'tag', which can be
used to collect the reply later.)
- Unmarshals the reply message and returns it to the caller.
- The previous release of EDTK would always return results to the
caller as {ok, Value} or {error, Value}. This release gives the
option (on a per-driver basis) of throwing errors as exceptions
by default (and returning just the Result, rather than {ok,
Result}, in the success case). See this paper for the rationale
of using throw vs {error, Reason}:
www.erlang.se/workshop/2004/exception.pdf
www.erlang.se/euc/04/carlsson_slides.pdf
Note: the BDB driver now does this (and this is a significant API
change from the berkeley_db driver in EDTK v1.1).
See usage examples in:
examples/berkeley_db/berkeley_db_test.erl
C code
- Unmarshals command arguments from the message
- Creates 'command' objects and enqueues them on the
producer/consumer FIFO queue for the appropriate threadpool.
- Implements the threadpools -- i.e. consume items from the queues,
call the BDB API with the correct arguments, and enqueue the
response in an internal queue for the Erlang VM process/thread
to collect (and then tell the Erlang VM that the internal queue
contains something, by writing to a pipe fd).
- Marshals the reply into a message
- Sends the reply back to the Erlang VM. The message is sent
directly to the process that sent the original request. This is
a significant difference between this release of EDTK and
previous releases, because it enables multiple Erlang processes
to share a single port instance (previous releases of EDTK
required that each Erlang process opened its own port instance).
However, this feature greatly complicates the implementation
details of EDTK, because of the need to do proper cleanup if an
Erlang process crashes while owning BDB resources like
transactions and cursors. In previous releases that process'
port instance would have been closed automatically, and the
driver's C stop() function would have done the cleanup. But in
this release the single shared port instance cannot be closed
just because one of the processes using it happened to crash.
See the next point.
- Tracks resource usage by the Erlang application; these are
stored in arrays called 'valmaps'. e.g. The BDB driver keeps tables
of open DB_ENV, DB, DBC, and DB_TXN handles. If the port driver
is closed unexpectedly (e.g. the Erlang process that owns it
crashes) then the shutdown C code closes cursors, aborts
transactions, closes database handles, closes the environment
handle, all in the correct order (even if some are currently in
use when the shutdown begins).
Important: In this release, resources (valmap table entries) are
tagged as belonging to a specific Erlang process -- any process
that wants to use BDB. A command is provided to clean up just
the subset of resources owned by a specific Erlang process
(should it crash). But note that this is entirely hidden from
application code in normal use (i.e. you don't have to worry
about it). See the description of the new 'port coordinator'
server later.
- Checks that constraints are met; e.g. transactions can only be
used by one command at a time, a prepared transaction cannot be
automatically aborted if a process crashes (it must be resolved
by a GTM), etc. etc. Basically it implements most (but not all)
of the pre-conditions/constraints described in the BDB
documentation.
- Converts asynchronous events from Berkeley DB (the new
DB_ENV->set_event_notify mechanism in BDB v4.5) to an Erlang
message and sends them to Erlang (to the port-coordinator
mentioned later).
Plus about a zillion other details.
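To make the cleanup ordering concrete, here is a heavily simplified sketch of the order in which the shutdown code must release resources. These are plain stubs, not real BDB calls, and struct valmaps and driver_stop_cleanup are invented names; the real generated code walks actual handle tables rather than counters:

```c
#include <string.h>

/* Append each simulated BDB call to a log so the ordering is visible. */
static char cleanup_log[256];
static void note(const char *call)
{
    strcat(cleanup_log, call);
    strcat(cleanup_log, " ");
}

/* Invented stand-in for the driver's valmap tables: just counts of
 * currently open handles of each kind. */
struct valmaps { int n_cursors, n_txns, n_dbs, have_env; };

/* The order matters: cursors before transactions (a txn with open
 * cursors must not be aborted), transactions before DB handles (a DB
 * must not be closed while a txn that used it is live), and the
 * DB_ENV handle last of all. */
const char *driver_stop_cleanup(const struct valmaps *v)
{
    int i;
    cleanup_log[0] = '\0';
    for (i = 0; i < v->n_cursors; i++) note("DBC->c_close");
    for (i = 0; i < v->n_txns;    i++) note("DB_TXN->abort");
    for (i = 0; i < v->n_dbs;     i++) note("DB->close");
    if (v->have_env)                   note("DB_ENV->close");
    return cleanup_log;
}
```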
Using The BDB Driver:
The BDB driver can be packaged as an Erlang/OTP 'application', using
the (very basic) script:
examples/berkeley_db/releases/make_release.sh
The resulting directory tree is an OTP 'library application'
(including a minimal .app file), like STDLIB. i.e. It is just a
collection of code modules; it doesn't have a start/2 function for use
by application:start().
An Erlang system that uses Berkeley DB will consist of one or more
Erlang nodes, each of which loads the berkeley_db application created
with the above script, and then opens one (or rarely, more) BDB
environments via one of the following two methods:
- If replication is not being used then the application starts an
instance of berkeley_db_port_coordinator per BDB environment.
- If replication is being used, the application starts an instance
of berkeley_db_replication_group (which in turn starts a
port-coordinator process, and supervises it), per BDB environment.
Both of the above processes are standard gen_servers.
Importantly, they both have start_link functions, and can participate
in standard OTP supervisor trees.
***API note*** In general, all of the BDB driver functions conform to
'modern' Erlang thinking and return a plain Result or throw an
exception on error. i.e. They don't return {ok, Result} or {error,
Reason}. However there are exceptions to that rule -- e.g. all
start_link functions do still return {ok, Pid} or {error, Reason} as
OTP supervisors require that.
Please read the API documentation at the top of the respective .erl files.
What follows here is just a summary:
berkeley_db_port_coordinator.erl:
- A port_coordinator owns the single port instance that represents and
communicates with a single BDB environment on the local machine.
A single Erlang VM can run as many port_coordinators (and therefore
as many different BDB environments) as it likes. Most applications
will only need to use one.
IMPORTANT: It is *CRITICAL* that only a single port-coordinator be
configured to talk to a given BDB environment at any given time. If
two port-coordinators open the same BDB environment then irrevocable
data-corruption is almost certain. This may be made safe by use of
BDB's DB_REGISTER flag (see BDB documentation).
The port_coordinator does the following:
- Opens the port driver for an environment (which creates a single
DB_ENV handle for that environment).
- Provides a very flexible & homogeneous configuration facility for
almost all BDB config APIs and parameters, including sensible
defaults. (This is quite an important convenience, as Berkeley DB
has a *lot* of configuration parameters, scattered across a wide
number of APIs. It makes it possible to configure an entire BDB
environment by passing a single Erlang term (lists of tagged
tuples) to the port-coordinator at startup.)
- Calls DB_ENV->open() to open the BDB environment. Typically it
also runs Berkeley DB recovery, to ensure that the database is in
a consistent state after any earlier crash. BDB recovery must be
coordinated (e.g. single-threaded), so the port-coordinator does
this synchronously, before it returns from init().
- If distributed transactions are being used, then it runs
DB_ENV->txn_recover and if necessary contacts a
GlobalTransactionManager to resolve any unresolved txns that it
finds during recovery. [Important: The BDB API makes this a
blocking operation -- the port-coordinator cannot continue unless
the GTM replies. Hence to avoid serious availability issues, the
GTM should be replicated. Work is in progress on that.] The
port-coordinator also does 'incremental recovery' of unresolved
transactions while the application is running (e.g. if a process
calls txn_prepare on its local part of a distributed transaction,
but then crashes before it commits or aborts the transaction, then
the port-coordinator takes ownership of that unresolved
transaction, and asks the GTM for a decision whether to abort or
commit it). Important: distributed transactions should not be
used without knowing what you are doing (e.g. all such
transactions must use lock or txn timeouts to avoid distributed
deadlock). So caveat emptor.
- If replication has been enabled then it calls DB_ENV->repmgr_start,
and repmgr_add_remote_site etc, to join a replication group.
There is a lot more to replication than this; for instance, the
election of a 'master' site (only the master can do write
operations). Before using replication, read the BDB replication
documentation.
- Allows client processes to 'register' to use the BDB environment.
This means that the port_coordinator tracks the life of the client
process, and will clean up after that process if it crashes --
close any open cursors, abort open transactions, etc.
- Hands out shared DB handles to clients on demand (one handle per
database, shared by all clients). The port_coordinator creates
the database on first use (i.e. specifies DB_CREATE to db_open).
- Receives all BDB event notifications via the mechanism described
below, and publishes them to an arbitrary set of interested
(registered) processes via a gen_event handler. This set of
registered processes can be entirely independent from the set of
registered clients that have asked to use the environment.
- Receives custom events from the C code in the driver. e.g. If an
attempt is made to automatically 'clean up' a transaction that has
been prepared but not yet explicitly committed or aborted (e.g. one
process participating in the distributed transaction crashed), we
cannot unilaterally abort that txn because other sites might
already have been told to commit. Instead of aborting the txn,
the C code sends an event (and a reference to the txn handle) to
the port_coordinator, which asks the GTM to resolve the txn -- and
the port_coordinator then commits or aborts the local txn.
berkeley_db_helpers.erl:
This contains convenience functions for using BDB -- e.g. a do_txn()
function that takes an entire transaction (packaged as a {Module,
Function, Args} tuple), and executes it against a given BDB
environment (i.e. it takes a port-coordinator Pid or name).
berkeley_db_sequence_server.erl
This is a (useful) example of a BDB application. It is a gen_server
that implements a classic database 'sequence'. That is, a series of
(in this case 64-bit) integers that do not repeat, even if the
server crashes (i.e. consumption of integers is persistent). The
integers are suitable for use as message-ids, primary keys, etc. Note
that the sequence is NOT guaranteed to be purely sequential -- there
can be gaps (possibly very large gaps), and higher integers may be
returned before lower integers (although this is rare). The only
guarantee is that the same integer will not be returned twice.
Obtaining an integer is typically very fast as the server holds a
'cached range' of integers, and only needs to do a BDB transaction
when that range is exhausted. The size of the range can be set when
the server is started. Specifying a large range reduces the
frequency of transactions, but results in larger gaps in the
sequence (irrevocably wasted integers) when the server is restarted
(either normally or after a crash), as each restart must begin
consuming from the *next* cached range (as it is not known how many
integers from the last range were actually consumed). However, with
a 64-bit total range there is no lack of integers -- ranges of size
1,000 or more are typically fine.
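The cached-range scheme can be sketched as follows. This is a simplification with invented names: stored_next_range is a plain variable standing in for the BDB record that the real server reads and advances inside a transaction, committing before any integer from the new range is handed out.

```c
#include <stdint.h>

/* Stand-in for the persistently stored 'start of the next unreserved
 * range'. The real server updates this in a BDB transaction. */
static uint64_t stored_next_range = 0;

struct seq { uint64_t next, limit, range; };

static void seq_refill(struct seq *s)
{
    s->next  = stored_next_range;      /* reserve [next, next + range) */
    s->limit = s->next + s->range;
    stored_next_range = s->limit;      /* 'committed' durably in real life */
}

void seq_start(struct seq *s, uint64_t range)
{
    s->range = range;
    seq_refill(s);                     /* a restart always burns a fresh range */
}

uint64_t seq_next(struct seq *s)
{
    if (s->next == s->limit)
        seq_refill(s);                 /* cached range exhausted: one 'txn' */
    return s->next++;
}
```

Restarting after consuming only a few integers leaves a gap (the rest of the old range is irrevocably wasted) but can never repeat an integer, which is exactly the guarantee described above.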
Note that a recent version of Berkeley DB added an almost identical
feature (see the documentation for DB_SEQUENCE). However, the BDB
driver does not implement those BDB APIs because all calls to BDB
must incur overhead in marshalling/unmarshalling, enqueuing to
threadpools etc, and most calls to sequence_server only increment
and return an integer. i.e. It's faster to keep the cached-range
on the Erlang side of the port boundary, and do explicit transactions
when a new range needs to be started.
In the near future this application will be changed (or a variant
created) that supports replication, so that if a host/disk dies
permanently, the current value of the sequence is not lost.
berkeley_db_global_transaction_server.erl:
This is part of the support for distributed transactions. It
provides (on demand) the 128-byte GUIDs required for distributed
transactions (see the BDB txn_prepare API). It also acts as an
authoritative repository for the state of all distributed
transactions (i.e. whether they have committed or aborted), which is
queried if a participant in a distributed transaction fails after
preparing its local part of the transaction (when the participant
recovers it asks the GTM for the state of the transaction, in order
to commit or abort it locally).
In the near future this application will be changed (or a variant
created) that supports replication. This is vital for reliable
use of distributed transactions, as applications can block
permanently if the GTM is not available or loses data.
berkeley_db_msg_queue.erl:
This is a sketch of an example use-case for the BDB driver. It
implements an asynchronous transactional message queue with
guaranteed exactly-once delivery. It is designed to give similar
reliability guarantees to distributed transactions but avoid the
availability issues of 2-phase-commit. In particular, the
destination may be unreachable but senders can still transactionally
enqueue messages.
The idea is that some 'sender worker' process (a bit of business
logic) performs a BDB transaction to do some work, and wants to notify
another process (presumably on another node) that the work has been
done (and perhaps pass some arbitrary payload with the message). So
the sender uses functions in this module to enqueue a message in a
Berkeley DB queue database as part of the normal work transaction.
That's all the 'sender' has to do -- i.e. there is no 'send message'
API, just an enqueue API.
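So a sender's work transaction might look like this (a sketch: do_business_logic and the enqueue arity are illustrative assumptions):

```erlang
%% Sketch only. The important point: the enqueue happens inside the
%% caller's own BDB transaction, so the work and the message commit
%% or abort atomically together.
do_work_and_notify(Env, Txn, Dest, Payload) ->
    ok = do_business_logic(Env, Txn),
    ok = berkeley_db_msg_queue:enqueue(Env, Txn, Dest, Payload).
```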
A separate 'send pump' process transactionally consumes messages
from the local message queue, sends them to a destination 'receive
pump' process, waits for an ack, and then commits the consume
transaction. (If a consume transaction is aborted the message
automatically reappears in the queue, ready for a retry.)
The 'receive pump' process (on another host/BDB environment,
otherwise there is no point) receives messages, transactionally
enqueues and commits them to a local queue database, and then
replies with an ack to the 'send pump'. (Note that if this ack is
lost the 'send pump' will abort its consume transaction and try to
send the message again, so to guarantee exactly-once delivery all
messages have a guaranteed-unique id (an instance of sequence_server
is used for this) and the receive pump records the ids of all
messages that it has committed to its local queue -- and simply
discards and acks (again) any resends of messages that it has
already accepted.)
Now a 'receiver worker' process (another piece of business logic) on
the same host as the 'receive pump' process can transactionally
consume messages from the receive queue, process them, and if the
work transaction succeeds, commit the consume transaction.
One implementation detail: Berkeley DB queue databases have
fixed-size records, but we want to support arbitrary payloads. So
if a message won't fit into the configured maximum queue record
length, part of it is stored in a btree database. The two databases
are manipulated in the same (nested) transactions, so they will
always remain consistent.
This code has not yet been 'productized' or even fully tested. In
particular it needs a better routing layer (currently it uses
'global' process registration of the receive pump process, as an
experiment -- see later for why this is not a good idea.) So treat it
as an educational example -- i.e. caveat emptor.
Also, this subsystem is intended to run with replication, so no
messages can ever be lost even if a host or disk fails. But the
implementation does not yet support that.
There is a usage-example/test of this code in berkeley_db_test.erl.
berkeley_db_replication_group.erl:
The replication_group processes are responsible for tracking the
state of all sites in a replication group, including the location of
the master.
A replication_group server does the following (see source file
for more details):
- Spawns and supervises a local port_coordinator and registers as a
subscriber to its event channel.
- Attempts to find all other sites in the replication group (and
retries if they are down). These other sites might be on other
Erlang VMs (on this host or other hosts).
- Optionally it can create the other sites if they are down (at
startup or if they crash for any reason). So an entire
replication-group across multiple nodes/hosts can be started by a
single API call. Note that the sites will all attempt to 'peer
supervise' each other -- i.e. restart each other if they crash.
This is an experimental feature. There is nothing wrong with
starting each site in a replication_group via a local supervision
tree on each node (in fact that is the recommended approach).
- Listens for and tracks status updates from all sites
- Implements a simple state machine that converts Sleepycat events
from all sites to absolute states.
- (Important convenience). All sites are willing to accept
'do_txn_on_master' calls, and (if the local site is not the
master) will transparently forward user transaction functions to
the current master site.
For this API, transactions are 'packaged' as {Module, Function,
Args} tuples. Note that funs are NOT used due to various issues
with code upgrade (we want to be able to upgrade code across a
replication group on a per-site basis, and funs are bound to the
hash of their module; {Module, Function, Args} tuples do not have
that problem.)
- All sites are willing to accept 'do_txn_on_any_site' calls
(i.e. read-operations). If the local site is 'in-sync' with
the master (or if no master is running) then the local site
will perform the transaction. Otherwise the transaction will
be sent to a site that is in-sync.
- Handles adding/removing sites to the group. (This is not yet
fully implemented -- it requires BDB features which will hopefully
be released in Berkeley DB v4.6 in 2007).
- Coordinates clean shutdown of all sites in the group (when the
application shuts down). This involves suspending writes and
waiting for all sites to fall into sync. (This is not yet
implemented -- again it needs features in Berkeley DB v4.6.)
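Calling a site then looks something like this (a sketch: the my_app_logic module and the exact signature of do_txn_on_master are illustrative assumptions):

```erlang
%% Sketch only -- names and arity are illustrative assumptions.
write_via_local_site(SiteName) ->
    %% If the locally-registered site is not the current master, the
    %% {Module, Function, Args} tuple is transparently forwarded to
    %% the master site and the result returned to this caller.
    Txn = {my_app_logic, record_event, [<<"key">>, <<"value">>]},
    berkeley_db_replication_group:do_txn_on_master(SiteName, Txn).
```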
Approach To Distributed System Communication
When configuring a replication_group server, each site needs to
know about all of the others in the group. Sites exchange
replication data (handled by BDB internals), and coordination
messages at the Erlang level (including tracking which site is
master).
Note that Berkeley DB's replication layer makes its own connections
-- replication traffic does *not* pass through Erlang processes.
The BDB APIs for this require TCP/IP addresses, of the form
{HostName, Port} tuples.
But the Erlang-level coordination (tracking of master, forwarding
'transaction functions' to be executed on the master and then
returning the results) use Distributed Erlang.
However, the code restricts itself to a 'safe subset' of Distributed
Erlang features. In particular, BDB replication is designed to work
correctly even if the network is partitioned temporarily. Some
Distributed Erlang features do not work/recover well in the presence
of network partitioning.
The code does use the following features of Distributed Erlang:
- the basic connection setup/heartbeat/teardown/retry stuff
(named nodes, epmd, net_kernel etc)
- 'remote pids', including link/monitor of pids on remote nodes
In particular, the code makes frequent use of *local* process
registration and the
{ProcName, NodeName}
form of addressing remote-processes (e.g. to gen_server:call() et al).
We also use remote-pids as 'cached' forms of these addresses.
The code does *not* use the following features of Distributed Erlang
as they are rumoured or known to have problems under network
partitioning:
- dist_ac
- global
- pg, pg2
- mnesia
Another potential issue with using Distributed Erlang is that it
only uses direct routing, and by default it attempts to maintain a
fully-connected network (N^2 connections). This imposes an ultimate
limit on scalability. net_kernel can be set to do lazy on-demand
connections rather than transitive proactive connections, but unless
application communication patterns are constrained, it will
eventually create a fully-connected network. The largest
Distributed Erlang system I've heard of (from a mailing list post)
is 80 nodes. It seems quite feasible to have low-hundreds, but
possibly not thousands until the epoll patch is made official.
The replication_group code is designed to run with relatively small
groups (3 to maybe 20 or 30 nodes). It has only been tested with up
to 5 nodes.
One option is to use Distributed Erlang within each replication
group/cluster, and some other protocol between clusters.
berkeley_db_replication_group_helpers.erl:
The fact that replication groups don't use any kind of global
process/site registration service invites the question, how does
application code find a site in order to run a transaction? This is
a particularly relevant question given that the whole point of
replication groups is to be highly available even if some percentage
of their sites are down or unreachable.
First note that some applications don't have this problem because
they use only a single replication_group, which has a site on
every node in the system. Therefore the application just attempts
to use the site on its local node, as if the application is
running then it's highly likely that the replication site will
be running (or will be soon, if it is running recovery).
The local site will forward any write transactions to the
master, and will perform any read transactions locally, so
long as the local site is 'in sync' with the master (processing
live updates, not doing bulk-recovery, aka 'bootstrapping').
But if a system does not have a replication site on every node,
(e.g. if the system is a set of replication-clusters), then we
need to be able to find and use 'remote' replication sites.
It is assumed that the entire user application (i.e. every node)
*does* know the names and nodes of all the replication sites (this
information is required to start each replication_group server, so
it must be globally available somehow).
So all we really need to access 'remote' replication sites are some
helper functions:
- to remember the latest 'master location hint' (to optimize
routing of write transactions)
- to load-balance read transactions across all sites that are
believed to be reachable
- to remember which sites, if any, we have found to be unreachable
(so we don't try to send requests to them for a while)
- to periodically try to contact any unreachable sites, to see if
they have recovered yet
As replication groups normally exist in a steady state without
failures (most/all sites running, and a stable master), we simply
need to discover that state, remember what it is (as an
optimization), and re-discover it fairly quickly after a change
(e.g. if the master becomes unreachable we need to stop trying to
connect to that site for a while, and also find the location of the
new master).
The functions in this module accomplish this. They manipulate an
opaque object called a caller_state (record). The definition of
this record is available in a header file in case it is useful, but
most of the time it should be possible to treat it as an opaque
token. The idea is that each node stores the caller_state for each
replication group somewhere -- an ETS table is perfect. Then when a
process wants to do a transaction on a replication group, it simply
looks up the caller_state for that group, and passes it to one of
the helper functions in this module (do_txn_on_master or
do_txn_on_any_site), along with the {Module, Function, Args} tuple
that it wants to execute.
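The pattern looks something like this (a sketch: the ETS table name and the {Result, NewCallerState} return shape are illustrative assumptions):

```erlang
%% Sketch only -- rep_group_states and the helper's return shape
%% are illustrative assumptions.
do_group_txn(GroupId, MFA) ->
    [{GroupId, CallerState0}] = ets:lookup(rep_group_states, GroupId),
    {Result, CallerState1} =
        berkeley_db_replication_group_helpers:do_txn_on_master(
            CallerState0, MFA),
    %% Store the updated caller_state so later calls reuse the cached
    %% master-location hint and unreachable-site information.
    true = ets:insert(rep_group_states, {GroupId, CallerState1}),
    Result.
```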
In the candidate 1.5 releases this module had not been fully tested.
It has now been tested, and the API should be stable.
Also, support has been added for distributed transactions against
the masters of separate replication groups (see
do_distributed_txn_unordered).
List of Changes since EDTK v1.1 (might not be totally complete --
there were a lot of changes).
Please see doc/EDTK_BerkeleyDB.ppt for some design rationale and
explanation.
The changes can be categorized as follows:
- Enhancements to EDTK
- Enhancements to BDB driver
- Bug fixes to EDTK
- Bug fixes to BDB driver
Enhancements to EDTK framework (not driver-specific)
- The previous release of EDTK used the Erlang VM's 'async IO
threadpool' for Berkeley DB operations, but that has several
critical problems.
- Long-running BDB operations would block IO by other important
drivers (like efile_drv which provides access to the file-system).
e.g. Some BDB operations may take a very long time
(e.g. txn_checkpoint, memp_trickle, db_compact), or may block a
thread indefinitely (e.g. db_get with the DB_CONSUME_WAIT option).
- The native Erlang VM's threadpool schedules jobs on a round-robin
basis, or by using a hash of an application-specified key. The
round-robin mechanism would violate BDB preconditions on thread
use (if it did not deadlock first), and the hash mechanism would
require a dedicated thread for each *possible* concurrent BDB
transaction (so hundreds of threads, most of which would be idle),
and would still not be safe from deadlock (as some BDB operations
use private internal transactions, so the application does not
have a 'key' to hash to a thread id).
- The native Erlang threadpool is sized at runtime (with 'erl +A'),
but we want more flexibility than that.
This is fixed in this release by allowing each port instance to
create multiple private worker thread pools, dedicated to specific classes of
operations. Each threadpool is fed by a producer-consumer queue
(unlike the native Erlang VM threadpool in which each thread has its own
queue, so some threads may be busy and have work stuck in their queues
while other threads sit idle).
The threadpools may be resized at runtime (threads added/removed),
the stack size for threads may be specified (to reduce virtual memory
usage with large numbers of threads -- Berkeley DB only needs about
32KB -- 64KB stack), and the application can place limits on the
lengths of the threadpool queues, to reject new commands under
saturation. (The length limits can be changed at runtime.)
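A port's threadpools might therefore be configured along these lines (purely illustrative: the real configuration terms are defined by the generated driver, not by this sketch):

```erlang
%% Sketch only -- every name below is an illustrative assumption.
example_threadpool_config() ->
    [%% Short transactional operations (get/put/cursor ops).
     {general_ops,  [{num_threads, 4}, {stack_size_kb, 64}, {max_queue_len, 128}]},
     %% Long-running housekeeping (txn_checkpoint, memp_trickle, db_compact).
     {housekeeping, [{num_threads, 1}, {stack_size_kb, 64}, {max_queue_len, 8}]},
     %% Operations that may block indefinitely (db_get with DB_CONSUME_WAIT).
     {blocking_ops, [{num_threads, 8}, {stack_size_kb, 64}, {max_queue_len, 64}]}].
```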
- Added support for shared-access valmaps.
Some libraries may produce resources (valmaps) that are thread-safe,
and can safely be used by multiple Erlang processes at once.
(e.g. Berkeley DB DB_ENV and DB handles can be made thread-safe).
So a valmap type may now be declared as 'shared', and appropriate
reference-counting will be done to manage automatic cleanup. Also,
some operations on shared valmaps may need exclusive access
(e.g. any 'stop' operation like txn commit/abort, or cursor close).
Such operations test for and obtain exclusive access. This
feature is an important enabler for the BDB driver, as DB_ENV and DB
handles are documented as being slow to open (and expensive in terms
of memory and other resources). So for large numbers of Erlang
processes to use BDB together, it is essential to share these
handles. Also many BDB features (e.g. replication) are more
difficult to use if more than a single DB_ENV handle is open on an
environment at a time.
- Added support for parent/child relationships between valmap slots
(*of the same type*). The relationship forces correct ordering of
cleanup operations. i.e. If a valmap with any children is 'stopped'
then all of its children (recursively) are automatically 'stopped'
first. This feature was necessary to support BDB nested
transactions correctly.
- Added EDTK support for converting numeric library parameters and
return codes to/from atoms.
Berkeley DB has a _lot_ of public #define constants of the form
DB_xxx which are used as flags, enums, and/or return codes. These
constants can and frequently do change numeric value between releases
of BDB. That's fine for a C application (it just needs to recompile
to pick up the new constants), but a distributed Erlang application
must deal with different sites running different versions of BDB
(during a site-by-site upgrade), and it is critical that the
meaning of the parameter/return code values not be
misinterpreted. So now the Erlang stubs accept atoms or lists of
atoms for these parameters, and convert them to the correct numeric
constants at the last possible moment before calling the C code.
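So instead of numeric DB_xxx values, the stubs take atoms, e.g. (a sketch: the stub module/function names and atom spellings are illustrative assumptions):

```erlang
%% Sketch only -- names are illustrative assumptions. The atoms are
%% converted to the correct numeric DB_xxx constants for whatever
%% BDB version the local C driver was compiled against.
open_env(Port, Dir) ->
    berkeley_db:dbenv_open(Port, Dir,
                           [db_create, db_init_txn, db_init_mpool,
                            db_init_lock, db_init_log, db_recover]).
```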
- Added EDTK support for pattern matching on selected return values
from the C driver, and taking arbitrary actions (e.g. throwing,
exiting, transforming the result before returning it).
- Added EDTK support for throwing all {error, Reason} return values as
Erlang exceptions by default, and pattern-matching selected return
values to be returned as {error, Reason} tuples. The Berkeley DB
API now uses this form of interface, which makes application code
*much* more convenient. The previous EDTK release probably did not
have this feature because Ericsson only recently added greatly
improved exception handling to Erlang (try/catch/after, rather than
the old catch construct).
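In application code this means most calls need no error handling at all, and specific errors can still be caught, e.g. (a sketch: the function names and the thrown term's shape are illustrative assumptions):

```erlang
%% Sketch only -- db_get's signature and the {error, db_notfound}
%% term are illustrative assumptions.
lookup(Db, Txn, Key) ->
    try
        berkeley_db:db_get(Db, Txn, Key, [])
    catch
        %% Handle only the error we care about; anything else
        %% propagates and aborts the enclosing transaction.
        throw:{error, db_notfound} -> not_found
    end.
```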
- Added EDTK support for returning large complex structures from the
driver using the 'Erlang external term' format (the 'ei' and
'erl_interface' C libraries provided by Ericsson). This is used in
supporting the BDB 'statistics' APIs.
- Improved logging when debugging EDTK and/or driver code. e.g. Each
port instance may now have a textual 'label', to distinguish log
messages from different instances running on the same node. (The
BDB port-coordinator sets this label to the registered name of the
port-coordinator process.)
Enhancements to BDB driver
- EDTK v1.1 did not support some essential parts of the Berkeley DB
API, largely due to safety/correctness issues arising from the lack
of flexible threadpools.
Support for the following has been added:
- transactions, including nested transactions
- distributed transactions, and recovery (including a GlobalTransactionManager)
- important 'housekeeping' functions such as txn_checkpoint, log_archive, memp_trickle etc.
- all 'statistics' APIs : these return large structures full of useful internal data
- replication : a large, complex topic on its own
- db_compact : 'defragmentation' of live btree databases
- db_get(DB_CONSUME_WAIT) : blocking consume-from-DB_QUEUE-database
See examples/berkeley_db/berkeley_db_api_support_status.txt for a summary by API.
- One of Erlang's main strengths is that it is reasonable/normal to
have thousands of Erlang processes running on a single node. And of
course we want a lot of Erlang processes to be able to
simultaneously use the C libraries that are wrapped by EDTK.
The previous release of EDTK had an implicit (but totally
reasonable) assumption: that concurrent access to a C library from
Erlang would be achieved by opening a separate port (a separate
instance of the driver/library) from each Erlang process that wants
to use that library.
It's hard to argue with that model -- it's clean and simple, and most
importantly, the Erlang VM guarantees that all ports opened by a
process will be automatically closed if that process exits. So
(with appropriate C code in the driver's stop() function), cleanup
is automatic.
Unfortunately that model is not a good fit for Berkeley DB, for
important practical reasons.
- BDB requires coordinated (single-threaded) startup and shutdown.
Also there are housekeeping jobs like txn_checkpoint, log_archive
etc. that should be run per environment (and must be coordinated).
- Opening BDB environment and database handles is slow. Fortunately
these handles can support shared-access (they are thread-safe),
and the BDB docs strongly advise applications to share them to
achieve good performance. But transient worker processes would be
opening and closing DB_ENV and DB handles per transaction -- i.e.
performance would be dire.
- Even if opening a separate DB_ENV handle per port was fast, many
BDB features work best with only a single DB_ENV handle
(e.g. replication).
- The Erlang VM places a hard limit on the total number of ports (of
any kind) that can be opened simultaneously.
- We probably want to limit the number of processes using BDB (to
avoid thrash), but in the one-process-one-port model there is no
way to limit the degree of concurrent access (number of open port
instances) other than running into the VM's hard-limit (which
would starve the system of other kinds of ports).
All of those problems are fixed in this release by allowing multiple
Erlang processes to share one instance of a port.
The sharing of a port is accomplished by a few changes:
- A new module called 'berkeley_db_port_coordinator.erl' that
coordinates startup, shutdown, sharing of DB handles, incremental
resolution of distributed transactions, and other things.
- A small change to linked-in mode to use driver_caller and
driver_send_term instead of driver_output_term, to allow the reply
to be sent to the process that sent the command, rather than
always to the port's connected process.
- All commands to an EDTK-generated driver now carry a unique tag
(currently term_to_binary({Ref, SendPid})) to precisely associate
replies with their commands. In pipe mode (when all replies
arrive at the port's 'connected process' -- i.e. the port
coordinator), we have enough information to forward the reply to
the process that sent the command.
The tag mechanism also allows asynchronous (non-blocking) variants
of commands to be generated. i.e. It is possible to generate a
function to send the command (and return the generated tag), and
another function to do a blocking receive for the reply to that
command (the function requires a tag of course). This can be
useful in some circumstances -- e.g. servers that want to use
Berkeley DB but don't want to spawn worker processes for each
operation. The port-coordinator uses this mechanism for some
internal operations (e.g. during shutdown). Currently only a few
BDB API functions have non-blocking variants -- simply to avoid
code-bloat (most applications won't need them).
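The split send/receive pattern looks something like this (a sketch: the generated async_* function names are illustrative assumptions):

```erlang
%% Sketch only -- the async_* names are illustrative assumptions.
pipelined_get(Db, Txn, Key) ->
    %% Send the command without blocking; keep the generated tag.
    Tag = berkeley_db:async_db_get(Db, Txn, Key, []),
    %% ... do other work while BDB processes the command ...
    %% Then do a blocking receive for that specific tagged reply.
    berkeley_db:async_db_get_reply(Tag).
```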
- dbenv_close and dbenv_remove are now flagged as global-exclusive
operations. These are not allowed if any valmaps are in-use at all.
Bug fixes to EDTK
- EDTK's 'valmap' system for managing handles had a known, dangerous
weakness; Erlang applications may accidentally use expired valmap
entries (i.e. references to transactions that have been
committed/aborted), which will crash the Erlang VM. This has been
fixed with per-slot generation-ids. Each valmap slot has a 32-bit
counter. Each 'valmap record' returned to Erlang contains the value
of that counter. Any valmap 'stop' ('free') function increments the
value of the counter. So if a valmap record is used after it is
'stopped' (e.g. a db_txn handle is used after it is committed or
aborted), the counter value shows that the valmap record has
expired, and an error is thrown.
- The value returned by the wrapped library was leaked if no valmap
slots were free after the call. The solution was to reserve the slot
when the command is received.
- Driver shutdown didn't acquire a critical mutex, but operations may
still be in-flight (race condition/undefined behavior)
- Driver shutdown didn't wait for in-progress operations involving
the first valmap type to finish before starting to close valmaps of
the second type (and so on)
- Driver shutdown did not call cleanup_.._index() for any valmap
entries that are INUSE, so the code in cleanup_.._index() that sets
DELAYED_CLEANUP is never activated, so some valmaps (that are in-use
when stop is called) are never cleaned up. (DELAYED_CLEANUP would
self-deadlock if it was ever used, as cleanup_index() was called
while desc->mutex was held, and cleanup_index tried to acquire the
(non-recursive) mutex again).
- Driver shutdown could consume infinite stack if it was not safe to
return (valmaps were still in use)
- Cleanup_index held mutex while calling the valmap cleanup function
(e.g. txn_abort, for BDB) - which may take a long time.
- Several bugs in 'pipe mode' prevented Berkeley DB from being shut down cleanly.
- The DELAYED_CLEANUP feature would self-deadlock if used, as it re-acquired a non-recursive mutex.
- If a successful valmap 'start' (allocation) operation was performed
after driver shutdown had begun then the object would be leaked (the
async_free call is currently just sys_free (to free the callstate
object), so nothing currently calls the valmap cleanup_func on the
result of the library call made by invoke -- i.e. an object returned
from the library (bdb) is leaked)
Bug fixes to BDB driver
- The EDTK v1.1 examples/berkeley_db driver installed allocation
functions that called driver_alloc_binary in async workers (from
within BDB code). This resulted in memory corruption and segfaults
under concurrent load.
Amusingly, this practice actually became legal in Erlang R11B
but only when SMP support is enabled. I have a TODO to add this
back into the driver (it is a potentially important optimization,
as it avoids otherwise redundant allocations and copying of data).
See this thread on the Erlang mailing list for related details:
http://www.erlang.org/pipermail/erlang-questions/2006-October/023500.html
- db_rename, db_remove, db_upgrade were not tagged as valmap "stop"
operations, so the Db handle was left in the valmap array, and would
crash during _stop().
- Removed support for env_set_errpfx as it made BDB refer to memory
that had already been released (undefined behaviour)
- Data returned by db_get, c_get etc would be leaked if driver
shutdown was started before they completed.
- To avoid potential deadlocks of the Erlang VM, BDB's auto-deadlock
detection feature is now explicitly enabled during driver shutdown,
as the application may have been calling the explicit deadlock
detector frequently, but it can no longer do so once shutdown has
begun, and operations might still be in progress that could deadlock.
Support
I intend to support and enhance this library as time permits.
Please send questions, suggestions, bug reports, and patches
to chris.newcombe@gmail.com
End of file.