
1. Slony-I FAQ: Building and Installing Slony-I

1.1. I am using Frotznik Freenix
4.5, with its FFPM (Frotznik Freenix
Package Manager) package management system. It comes with
FFPM packages for PostgreSQL 7.4.7, which are what
I am using for my databases, but they don't include Slony-I in the
packaging. How do I add Slony-I to this?

Frotznik Freenix is new to
me, so it's a bit dangerous to give really hard-and-fast definitive
answers.

The answers differ somewhat between the various combinations of
PostgreSQL and Slony-I versions; newer versions are generally
somewhat easier to cope with than older ones. In general,
you almost certainly need to compile Slony-I from sources; depending
on versioning of both Slony-I and PostgreSQL, you
may need to compile PostgreSQL from scratch.
(Whether you need to use the PostgreSQL compile
is another matter; you probably don't...)

Slony-I version 1.0.5 and earlier require having a
fully configured copy of PostgreSQL sources available when you compile
Slony-I.

Hopefully you can make this configuration
closely match the configuration in use by the packaged
version of PostgreSQL; check that configuration using the command
pg_config --configure.

Slony-I version 1.1 simplifies this considerably;
it does not require the full copy of PostgreSQL sources, but can,
instead, refer to the various locations where PostgreSQL libraries,
binaries, configuration, and #include files are
located.

PostgreSQL 8.0 and higher is generally easier to deal
with in that a "default" installation includes all of the
#include files.

If you are using an earlier version of PostgreSQL, you may find
it necessary to resort to a source installation if the packaged
version did not install the "server
#include" files, which are installed by the
command make install-all-headers.

In effect, the "worst case" scenario takes place
if you are using a version of Slony-I earlier than 1.1 with an
"elderly" version of PostgreSQL, in which case you can
expect to need to compile PostgreSQL from scratch in order to have
everything that the Slony-I compile needs even though you are using a
"packaged" version of PostgreSQL.

If you are running a recent PostgreSQL and a recent Slony-I,
then the codependencies can be fairly small, and you may not need
extra PostgreSQL sources. These improvements should ease the
production of Slony-I packages so that you might soon even be able to
hope to avoid compiling Slony-I.

1.2. I tried building Slony-I 1.1 and got the following
error message:

configure: error: Headers for libpqserver are not found in the includeserverdir.
This is the path to postgres.h. Please specify the includeserverdir with
--with-pgincludeserverdir=<dir>

You are almost certainly running PostgreSQL version 7.4
or earlier, where server headers are not installed by default if you
just do a make install of PostgreSQL.

You need to install server headers when you install PostgreSQL
via the command make install-all-headers.

1.3. Slony-I seemed to compile fine; now, when I run a
slon, some events are moving around, but no
replication is taking place.

On AIX and Solaris (and possibly elsewhere), both
Slony-I and PostgreSQL must be compiled with the
--enable-thread-safety option. The above symptom results
when PostgreSQL isn't so compiled.

What breaks here is that the threadsafe libc and the
non-threadsafe libpq use different memory locations for errno,
thereby leading to the request failing.

Problems like this crop up with disadmirable regularity on AIX
and Solaris; it may take something of an "object code audit" to
make sure that ALL of the necessary components have been
compiled and linked with --enable-thread-safety.

For instance, I once ran into the problem where
LD_LIBRARY_PATH had been set, on Solaris, to point to
libraries from an old PostgreSQL compile. That meant that even though
the database had been compiled with
--enable-thread-safety, and
slon had been compiled against that,
slon was being dynamically linked to the
"bad old thread-unsafe version," so slon didn't work. It
wasn't clear that this was the case until I ran ldd
against slon.

Note that with libpq version 7.4.2, on Solaris, a
further thread patch was
required; similar is also required for PostgreSQL version 8.0.

1.4. I'm trying to upgrade to a newer version of Slony-I
and am running into a problem with SLONIK UPDATE FUNCTIONS. When I run SLONIK UPDATE FUNCTIONS, my
postmaster falls over with a Signal 11.
There aren't any apparent errors in the log files, aside from the
PostgreSQL logs indicating that, yes indeed, the postmaster fell
over.

I connected a debugger to the core file, and it indicates that
it was trying to commit a transaction at the time of the
failure.

By the way, I'm on PostgreSQL 8.1.[0-3].

Unfortunately, early releases of PostgreSQL 8.1 had a
problem where if you redefined a function (such as, say,
upgradeSchema(text)), and then, in the same
transaction, ran that function, the
postmaster would fall over, and the
transaction would fail to commit.

Unfortunately, on PostgreSQL 8.1.0, 8.1.1, 8.1.2, and 8.1.3,
SLONIK UPDATE FUNCTIONS runs straight into this bug, since it both
modifies and uses plpgsql functions in the same transaction, which
leads to a crash.

Several workarounds are available.

The preferred answer would be to upgrade PostgreSQL to
8.1.4 or some later version. Changes between minor versions do not
require rebuilding databases; it should merely require copying a
suitable 8.1.x build into place, and restarting the
postmaster with the new version.

If that is unsuitable, it would be possible to perform
the upgrade via a series of transactions, performing the equivalent of
what slonik does "by hand":

Take slony1_funcs.sql and do three replacements within it:

Replace "@CLUSTERNAME@" with the name of the cluster

Replace "@MODULEVERSION@" with the Slony-I version string, such as "1.2.10"

Replace "@NAMESPACE@" with the "double-quoted" name of the cluster namespace, such as "_MyCluster"

Load that "remapped" set of functions into the database.

Run the stored function via select upgradeSchema('1.2.7');, assuming that the previous version of Slony-I in use was version 1.2.7.

Restarting all slon processes would probably be a wise move with this sort of "surgery."

1.5. Problem building on Fedora/x86-64

When trying to configure Slony-I on a Fedora x86-64 system,
where yum was used to install the package
postgresql-libs.x86_64, the following complaint
comes up:

configure: error: Your version of libpq doesn't have PQunescapeBytea
this means that your version of PostgreSQL is lower than 7.3
and thus not supported by Slony-I.

This happened with PostgreSQL 8.2.5, which is certainly rather
newer than 7.3.

configure is looking for
that symbol by compiling a little program that calls for it, and
checking if the compile succeeds. On the gcc
command line it uses -lpq to search for the
library.

Unfortunately, that package is missing a symlink, from
/usr/lib64/libpq.so to
libpq.so.5.0; that is why it fails to link to
libpq. The true problem is that the compiler failed to
find a library to link to, not that libpq lacked the function call.

Eventually, this should be addressed by those that manage the
postgresql-libs.x86_64 package.

Note that this same symptom can be the indication of
similar classes of system configuration problems. Bad symlinks, bad
permissions, bad behaviour on the part of your C compiler, all may
potentially lead to this same error message.

Thus, if you see this error, you need to look in the log file
that is generated, config.log. Search down to
near the end, and see what the actual complaint
was. That will be helpful in tracking down the true root cause of the
problem.

2. Slony-I FAQ: How Do I?

2.1. How can I get a pg_dump of my replicated database that
excludes the Slony-I-specific parts of the schema?

Up to version 1.2, this is fairly nontrivial,
requiring careful choice of nodes, and some moderately heavy
"procedure". One methodology is as follows:

First, dump the schema from the node that has the
"master" role. That is the only place, pre-2.0, where
you can readily dump the schema using
pg_dump and have a consistent schema. You
may use the Slony-I tool Section 21.5 to do
this.

Take the resulting schema, which will not
include the Slony-I-specific bits, and split it into two pieces:

Firstly, the portion comprising all of the creations
of tables in the schema.

Secondly, the portion consisting of creations of indices, constraints, and triggers.

Pull a data dump, using pg_dump --data-only, of some node of your choice. It doesn't need to be for the "master" node. This dump will include the contents of the Slony-I-specific tables; you can discard that, or ignore it. Since the schema dump didn't contain table definitions for the Slony-I tables, they won't be loaded.

Finally, load the three components in proper order:

Schema (tables)

Data dump

Remainder of the schema

In Slony-I 2.0, the answer becomes simpler: Just take
a pg_dump --exclude-schema=_Cluster against
any node. In 2.0, the schemas are no longer
"clobbered" on subscribers, so a straight
pg_dump will do what you want.

2.2. I'd like to renumber the node numbers in my cluster.
How can I renumber nodes?

The first answer is "you can't do that" -
Slony-I node numbers are quite "immutable." Node numbers
are deeply woven into the fibres of the schema, by virtue of being
written into virtually every table in the system, but much more
importantly by virtue of being used as the basis for event
propagation. The only time that it might be "OK" to
modify a node number is at some time where we know that it is not in
use, and we would need to do updates against each node in the cluster
in an organized fashion.

To do this in an automated fashion seems like a
huge challenge, as it changes the structure of
the very event propagation system that already needs to be working in
order for such a change to propagate.

If it is enormously necessary to
renumber nodes, this might be accomplished by dropping and re-adding
nodes to get rid of the node formerly using the node ID that needs to
be held by another node.

3. Slony-I FAQ: Impossible Things People Try

3.1. Can I use Slony-I to replicate changes back and forth on my database between my two offices?

At one level, it is theoretically
possible to do something like that, if you design your
application so that each office has its own distinct set of tables,
and you then have some system for consolidating the data to give them
some common view. However, this requires a great deal of design work
to create an application that performs this consolidation.

In practice, the term for that is "multimaster
replication," and Slony-I does not support "multimaster
replication."

3.2. I want to replicate all of the databases for a shared-database system I am managing. There are multiple databases, being used by my customers.

For this purpose, something like PostgreSQL PITR (Point
In Time Recovery) is likely to be much more suitable. Slony-I
requires a slon process (and multiple connections) for each
identifiable database, and if you have a PostgreSQL cluster hosting 50
or 100 databases, this will require hundreds of database connections.
Typically, in "shared hosting" situations, DML is being
managed by customers, who can change anything they like whenever
they want. Slony-I does not work out well when
not used in a disciplined manner.

3.3. I want to be able to make DDL changes, and have them replicated automatically.

Slony-I requires that Section 17 be planned for explicitly and carefully. Slony-I captures changes using triggers, and PostgreSQL does not provide a way to use triggers to capture DDL changes.

Note: There has been quite a bit of discussion, off and on, about how
PostgreSQL might capture DDL changes in a way that would make triggers
useful; nothing concrete has emerged after several years of
discussion.

3.4. I want to split my cluster into disjoint partitions that are not aware of one another. Slony-I keeps generating Section 9 that link those partitions together.

The notion that all nodes are aware of one another is
deeply embedded in the design of Slony-I. For instance, its handling
of cleanup of obsolete data depends on being aware of whether any of
the nodes are behind, and thus might still depend on older data.

3.5. I want to change some of my node numbers. How do I "rename" a node to have a different node number?

You don't. The node number is used to coordinate inter-node communications, and changing the node ID number "on the fly" would make it essentially impossible to keep node configuration coordinated.

3.6. I have tables that use oids as a regular table attribute. Can
Slony-I replicate the oid values?

It is worth noting that oids, as a regular table
attribute, have been deprecated since PostgreSQL version 8.1, back in
2005. Slony-I has never collected oids to
replicate them, and, with that functionality being deprecated, the
developers do not intend to add this functionality.

PostgreSQL implemented oids as a way to link its internal
system tables together; to use them with application tables is
considered poor practice, and it is recommended
that you use sequences to populate your own ID column on application
tables.

Of course, nothing prevents you from creating a table
without oids, and then adding in your own
application column called oid, preferably with type
information SERIAL NOT NULL UNIQUE, which
can be replicated, and which is likely to be
suitable as a candidate primary key for the table.
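
A minimal sketch of such a table (the table and column names here are
hypothetical):

create table app_record (
    oid serial not null unique,  -- application-managed replacement for oids
    payload text
) without oids;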

4. Slony-I FAQ: Connection Issues

4.1. I looked for the _clustername namespace, and
it wasn't there.

If the DSNs are wrong, then slon
instances can't connect to the nodes.

This will generally lead to nodes remaining entirely untouched.

Recheck the connection configuration. By the way, since slon links to libpq, password information may
also be drawn from $HOME/.pgpass, which can silently fill in
right (or wrong) authentication information.

4.2. I created a "superuser" account,
slony, to run replication activities. As
suggested, I set it up as a superuser, via the following query:
update pg_shadow set usesuper = 't' where usename in ('slony',
'molly', 'dumpy');
(that command also deals with other users I set up to run vacuums and
backups).

Unfortunately, I ran into a problem the next time I subscribed
to a new set.

See? 127314921 is indeed older than 127314958, and it's still
running.

A long running G/L report, a runaway
RT3 query, a
pg_dump, all will open up transactions that
may run for substantial periods of time. Until they complete, or are
interrupted, you will continue to see the message " data copy
for set 1 failed - sleep 60 seconds ".

By the way, if there is more than one database on the PostgreSQL
cluster, and activity is taking place on the OTHER database, that will
lead to there being "transactions earlier than XID
whatever" being found to be still in progress. The fact that
it's a separate database on the cluster is irrelevant; Slony-I will
wait until those old transactions terminate.
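
A sketch of a query to spot such old transactions across all databases
in the instance (this uses the pre-8.2 pg_stat_activity column names,
and requires stats_command_string to be enabled):

select procpid, datname, usename, query_start, current_query
  from pg_stat_activity
 order by query_start;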

4.4. Same as the above. What I forgot to mention, as well,
was that I was trying to add TWO subscribers,
concurrently.

That doesn't work out: Slony-I can't work on the
COPY commands concurrently. See
src/slon/remote_worker.c, function
copy_set().

This happens to be a COPY transaction
involved in setting up the subscription for one of the nodes. All is
well; the system is busy setting up the first subscriber; it won't
start on the second one until the first one has completed subscribing.
That represents one possible cause.

This has the (perhaps unfortunate) implication that you cannot
populate two slaves concurrently from a single provider. You have to
subscribe one to the set, and only once it has completed setting up
the subscription (copying table contents and such) can the second
subscriber start setting up the subscription.

4.5. We got bitten by
something we didn't foresee when completely uninstalling a slony
replication cluster from the master and slave...

Warning

MAKE SURE YOU STOP YOUR APPLICATION RUNNING
AGAINST YOUR MASTER DATABASE WHEN REMOVING THE WHOLE SLONY
CLUSTER, or at least re-cycle all your open connections
after the event!

The connections "remember" or refer to OIDs which
are removed by the uninstall node script. And you will get lots of
errors as a result...

There are two notable areas of
PostgreSQL that cache query plans and OIDs:

Prepared statements

pl/pgSQL functions

The problem isn't particularly a Slony-I one; it would occur
any time such significant changes are made to the database schema. It
shouldn't be expected to lead to data loss, but you'll see a wide
range of OID-related errors.

The problem occurs when you are using some sort of
"connection pool" that keeps recycling old connections.
If you restart the application after this, the new connections will
create new query plans, and the errors will go away. The same goes for
a connection pool that drops the old connections and creates new ones:
the new connections will have new query plans, and the errors will go
away.

In our code we drop the connection on any error we
cannot map to an expected condition. This will eventually recycle all
connections on such unexpected problems, after just one error per
connection. Of course, if the error surfaces as a constraint violation,
which is a recognized condition, this won't help; and if the
problem is persistent, the connections will keep recycling, which
defeats the benefit of the pooling. In the latter case, the pooling
code could also notify an admin to take a look...

4.6. I upgraded my cluster to Slony-I version
1.2. I'm now getting the following notice in the logs:

NOTICE: Slony-I: log switch to sl_log_2 still in progress - sl_log_1 not truncated

Both sl_log_1 and sl_log_2 are
continuing to grow, and sl_log_1 is never getting
truncated. What's wrong?

This is symptomatic of the same issue as above with
dropping replication: if there are still old connections lingering
that are using old query plans that reference the old stored
functions, inserts continue to go to sl_log_1, so the
log switch can never finish and sl_log_1 is never truncated.

Closing those connections and opening new ones will resolve the
issue.

In the longer term, there is an item on the PostgreSQL
TODO list to implement dependency checking that would flush cached
query plans when dependent objects change.

4.7. I pointed a subscribing node to a different provider
and it stopped replicating

We noticed this happening when we wanted to re-initialize a node,
where we had configuration thus:

The problem was that there was not a suitable set of
"listener paths" in sl_listen to allow the events from
node 1 to propagate to node 3. The events were going through node 2,
and blocking behind the SLONIK SUBSCRIBE SET event that
node 2 was working on.

The following slonik script dropped out the listen paths where
node 3 had to go through node 2, and added in direct listens between
nodes 1 and 3.

Immediately after this script was run, SYNC
events started propagating again to node 3.
This points out two principles:

If you have multiple nodes, and cascaded subscribers,
you need to be quite careful in populating the SLONIK STORE LISTEN entries, and in modifying them if the
structure of the replication "tree"
changes.

4.8. I tried starting up a slon, and it failed with an error
involving the table sl_nodelock. What's going on?

The table sl_nodelock is used as an
"interlock" to prevent two slon processes from trying
to manage the same node at the same time. The slon tries inserting
a record into the table; it can only succeed if it is the only node
manager.

This error message is typically a sign that you have
started up a second slon process for a given node. The slon asks
the obvious question: "Do you already have a slon running
against this node?"

Supposing you experience some sort of network outage,
the connection between slon and database may fail, and the slon
may figure this out long before the PostgreSQL instance it was
connected to does. The result is that there will be some number of
idle connections left on the database server, which won't be closed
out until TCP/IP timeouts complete, which seems to normally take about
two hours. For that two hour period, the slon will try to connect,
over and over, and will get the above fatal message, over and
over.

An administrator may clean this out by logging onto the server
and issuing kill -2 to any of the offending
connections. Unfortunately, since the problem took place within the
networking layer, neither PostgreSQL nor Slony-I has a direct way of
detecting this.

You can mostly avoid this by making sure
that slon processes always run somewhere nearby the server that
each one manages. If the slon runs on the same server as the
database it manages, any "networking failure" that could
interrupt local connections would be likely to be serious enough to
threaten the entire server.

4.9. What happens if I shut down a slon process for a while?

Generally, it's no big deal to shut down a slon
process. Each one is "merely" a PostgreSQL client,
managing one node, which spawns threads to manage receiving events
from other nodes.

The "event listening" threads are no big deal; they
are doing nothing fancier than periodically checking remote nodes to
see if they have work to be done on this node. If you kill off the
slon these threads will be closed, which should have little or no
impact on much of anything. Events generated while the slon is
down will be picked up when it is restarted.

The "node managing" thread is a bit more
interesting; most of the time, you can expect, on a subscriber, for
this thread to be processing SYNC events. If you
shut off the slon during an event, the transaction
will fail, and be rolled back, so that when the slon restarts, it
will have to go back and reprocess the event.

The only situation where this will
cause particular "heartburn" is if
the event being processed was one which takes a long time to process,
such as COPY_SET for a large replication
set.

The other thing that might cause trouble
is if the slon runs fairly distant from nodes that it connects to;
you could discover that database connections are left idle in
transaction. This would normally only occur if the network
connection is destroyed without either slon or database being made
aware of it. In that case, you may discover
that "zombied" connections are left around for as long as
two hours if you don't go in by hand and kill off the PostgreSQL
backends.

There is one other case that could cause trouble; when the
slon managing the origin node is not running,
no SYNC events run against that node. If the
slon stays down for an extended period of time, and something
like Section 6.3 isn't running, you could be left
with one big SYNC to process
when it comes back up. But that is only a concern if that slon is
down for an extended period of time; shutting it down for a few
seconds shouldn't cause any great problem.

4.10. Are there risks to doing so? How about
benefits?

In short, if you don't have something like an 18
hour COPY_SET under way, it's normally not at all a
big deal to take a slon down for a little while, or perhaps even
cycle all the slons.

5. Slony-I FAQ: Configuration Issues

5.1. When I run the sample setup script, I get an error message similar
to:

stdin:64: PGRES_FATAL_ERROR load '$libdir/xxid'; - ERROR: LOAD:
could not open file '$libdir/xxid': No such file or directory

Evidently, you haven't got the
xxid.so library in the $libdir
directory that the PostgreSQL instance is
using. Note that the Slony-I components
need to be installed in the PostgreSQL
software installation for each and every one of
the nodes, not just on the origin node.
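
A quick check you can run via psql against each node's database is to
attempt the load by hand; if the library is missing from that
installation's $libdir, this fails with the same error:

load '$libdir/xxid';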

This may also point to there being some other mismatch between
the PostgreSQL binary instance and the Slony-I instance. If you
compiled Slony-I yourself, on a machine that may have multiple
PostgreSQL builds "lying around," it's possible that the
slon or slonik binaries are asking to load something that isn't
actually in the library directory for the PostgreSQL database cluster
that it's hitting.

Long and short: This points to a need to "audit"
what installations of PostgreSQL and Slony-I you have in place on the
machine(s). Unfortunately, just about any mismatch will cause things
not to link up quite right. See also thread safety concerning threading issues on Solaris
...

Life is simplest if you only have one set of PostgreSQL
binaries on a given server; in that case, there isn't a "wrong
place" in which Slony-I components might get installed. If
you have several software installs, you'll have to verify that the
right versions of Slony-I components are associated with the right
PostgreSQL binaries.

5.2. I tried creating a CLUSTER NAME with a "-" in it.
That didn't work.

Slony-I uses the same rules for unquoted identifiers
as the PostgreSQL main parser, so no, you probably shouldn't put a "-"
in your identifier name.

You may be able to defeat this by putting "quotes" around
identifier names, but it's still liable to bite you some, so this is
something that is probably not worth working around.

5.3. ps finds passwords on command line

If I run a ps command, I, and everyone else,
can see passwords on the command line.

Take the passwords out of the Slony configuration, and
put them into $HOME/.pgpass.

5.5. Replication has fallen behind, and it appears that
the queries to draw data from sl_log_1/sl_log_2 are taking a long time
to pull just a few
SYNCs.

Until version 1.1.1, there was only one index on
sl_log_1/sl_log_2, and if there were multiple replication sets, some
of the columns on the index would not provide meaningful selectivity.
If there is no index on column log_xid, consider
adding it. See slony1_base.sql for an example of
how to create the index.
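
A sketch of such an index creation (assuming a cluster namespace
"_MyCluster" and the xxid operator class that Slony-I defines; check
slony1_base.sql for the authoritative form):

create index sl_log_1_idx_xid
    on "_MyCluster".sl_log_1 (log_xid "_MyCluster".xxid_ops);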

5.6. I need to rename a column that is in the
primary key for one of my replicated tables. That seems pretty
dangerous, doesn't it? I have to drop the table out of replication
and recreate it, right?

Actually, this is a scenario which works out remarkably
cleanly. Slony-I does indeed make intense use of the primary key
columns, but actually does so in a manner that allows this sort of
change to be made very nearly transparently.

Suppose you revise a column name, as with the SQL DDL alter
table accounts rename column aid to cid; This renames the column in
the table, and simultaneously renames the column in the primary key
index. The result is that, in the normal course of things, altering a
column name affects both aspects simultaneously on a given node.

The ideal and proper handling of this
change would involve using SLONIK EXECUTE SCRIPT to deploy
the alteration, which ensures it is applied at exactly the right point
in the transaction stream on each node.

Interestingly, that isn't forcibly necessary. As long as the
alteration is applied on the replication set's origin before
application on subscribers, things won't break irreparably. Some
SYNC events that do not include changes to the
altered table can make it through without any difficulty... At the
point that the first update to the table is drawn in by a subscriber,
that is the point at which
SYNC events will start to fail, as the provider
will indicate the "new" set of columns whilst the
subscriber still has the "old" ones. If you then apply
the alteration to the subscriber, it can retry the
SYNC, at which point it will, finding the
"new" column names, work just fine.

5.7. I have a PostgreSQL 7.2-based system that I
really, really want to use Slony-I to help me
upgrade it to 8.0. What is involved in getting Slony-I to work for
that?

Rod Taylor has reported the following...

This is approximately what you need to do:

Take the 7.3 templates and copy them to 7.2 -- or otherwise
hardcode the version you're using to pick up the 7.3 templates

Remove all traces of schemas from the code and sql templates. I
basically changed the "." to an "_".

Bunch of work related to the XID datatype and functions. For
example, Slony creates CASTs for the xid to xxid and back -- but
7.2 cannot create new casts that way so you need to edit system
tables by hand. I recall creating an Operator Class and editing
several functions as well.

sl_log_1 will have severe performance problems with any kind of
data volume. This required a number of index and query changes
to optimize for 7.2. 7.3 and above are quite a bit smarter in
terms of optimizations they can apply.

Don't bother trying to make sequences work. Do them by hand
after the upgrade using pg_dump and grep.

Of course, now that you have done all of the above, it's not compatible
with standard Slony now. So you either need to implement 7.2 in a less
hackish way, or you can also hack up slony to work without schemas on
newer versions of PostgreSQL so they can talk to each other.

Almost immediately after getting the DB upgraded from 7.2 to 7.4, we
deinstalled the hacked up Slony (by hand for the most part), and started
a migration from 7.4 to 7.4 on a different machine using the regular
Slony. This was primarily to ensure we didn't keep our system catalogues
which had been manually fiddled with.

All that said, we upgraded a few hundred GB from 7.2 to 7.4
with about 30 minutes actual downtime (versus 48 hours for a dump /
restore cycle) and no data loss.

That represents a sufficiently ugly set of
"hackery" that the developers are exceedingly reluctant
to let it anywhere near to the production code. If someone were
interested in "productionizing" this, it would probably
make sense to do so based on the Slony-I 1.0 branch, with the express
plan of not trying to keep much in the way of
forwards compatibility or long term maintainability of replicas.

You should only head down this road if you are sufficiently
comfortable with PostgreSQL and Slony-I that you are prepared to hack
pretty heavily with the code.

5.8. I had a network "glitch" that led to my
using SLONIK FAILOVER to fail over to an alternate node.
The failure wasn't a disk problem that would corrupt databases; why do
I need to rebuild the failed node from scratch?

The action of SLONIK FAILOVER is to
abandon the failed node so that no more Slony-I
activity goes to or from that node. As soon as that takes place, the
failed node will progressively fall further and further out of sync.

The big problem with trying to
recover the failed node is that it may contain updates that never made
it out of the origin. If they get retried, on the new origin, you may
find that you have conflicting updates. In any case, you do have a
sort of "logical" corruption of the data even if there
never was a disk failure making it "physical."

As discussed in Section 8, using SLONIK FAILOVER should be considered a last
resort as it implies that you are abandoning the origin
node as being corrupted.

5.9. After notification of a subscription on
another node, replication falls over on one of
the subscribers, with the following error message:

If you see a slon shutting down with
ignore new events due to shutdown log entries,
you typically need to step back in the log to
before they started failing to see indication of
the root cause of the problem.

This demonstrates yet another example of the need to not do
things in a rush; you need to be sure things are working right
before making further configuration changes.

5.10. I just used SLONIK MOVE SET to move the
origin to a new node. Unfortunately, some subscribers are still
pointing to the former origin node, so I can't take it out of service
for maintenance without stopping them from getting updates. What do I
do?

You need to use SLONIK SUBSCRIBE SET to
alter the subscriptions for those nodes to have them subscribe to a
provider that will be sticking around during the
maintenance.

Warning

What you don't do is to SLONIK UNSUBSCRIBE SET; that would require reloading all data
for the nodes from scratch later.

5.11. After notification of a subscription on
another node, replication falls over, starting
with the following error message:

If you see a slon shutting down with
ignore new events due to shutdown log entries,
you'll typically have to step back to before they
started failing to see indication of the root cause of the problem.

This is yet another example of the need to not do things too
terribly quickly; you need to be sure things are working right
before making further configuration changes.

5.12. Is the ordering of tables in a set significant?

Most of the time, it isn't. You might imagine it to be of
some value to order the tables in some particular way so that
"parent" entries would make it in before their "children"
in some foreign key relationship; that isn't needed, since
foreign key constraint triggers are turned off on subscriber nodes.

(Jan Wieck comments:) The order of table ID's is only
significant during a SLONIK LOCK SET in preparation of
switchover. If that order is different from the order in which an
application is acquiring its locks, it can lead to deadlocks that
abort either the application or slon.

(David Parker) I ran into one other case where the
ordering of tables in the set was significant: in the presence of
inherited tables. If a child table appears before its parent in a set,
then the initial subscription will end up deleting that child table
after it has possibly already received data, because the
copy_set logic does a delete,
not a delete only, so the delete of the parent will
delete the new rows in the child as well.

5.13. If you have a slonik script
something like this, it will hang on you and never complete, because
you can't have wait for event inside a
try block. A try block is
executed as one transaction, and the event that you are waiting for
can never arrive inside the scope of the transaction.

The table IDs used in SLONIK SET ADD TABLE
are required to be unique ACROSS ALL SETS. Thus,
you can't restart numbering at 1 for a second set; if you are
numbering them consecutively, a subsequent set has to start with IDs
after where the previous set(s) left off.
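
To see which table IDs are already taken across all sets, a query like
the following can help (assuming a cluster named "_MyCluster"):

select tab_id, tab_set, tab_relname
  from "_MyCluster".sl_table
 order by tab_id;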

5.16. One of my nodes fell over (slon / postmaster was
down) and nobody noticed for several days. Now, when the slon for
that node starts up, it runs for about five minutes, then terminates,
with the error message: ERROR: remoteListenThread_%d: timeout
for event selection What's wrong, and what do I do?

The problem is that the listener thread (in
src/slon/remote_listener.c) timed out when trying
to determine what events were outstanding for that node. By default,
the query will run for five minutes; if there were many days worth of
outstanding events, this might take too long.

On versions of Slony-I before 1.1.7, 1.2.7, and 1.3, one answer would be to increase the timeout in
src/slon/remote_listener.c, recompile slon, and retry.

Another would be to treat the node as having failed,
and use the slonik command SLONIK DROP NODE to drop the
node, and recreate it. If the database is heavily updated, it may
well be cheaper to do this than it is to find a way to let it catch
up.

In newer versions of Slony-I, there is a new
configuration parameter called slon_conf_remote_listen_timeout; you'd alter the config
file to increase the timeout, and try again. Of course, as mentioned
above, it could be faster to drop the node and recreate it than to let
it catch up across a week's worth of updates...

6. Slony-I FAQ: Performance Issues

6.1. Replication has been slowing down, I'm seeing
FETCH 100 FROM LOG queries running for a long
time, sl_log_1/sl_log_2 is growing, and performance is, well,
generally getting steadily worse.

One common " proximate cause " for this growth is for
there to be a connection connected to the node that sits IDLE IN TRANSACTION for a very long time.

That open transaction will have multiple negative effects, all
of which will adversely affect performance:

Vacuums on all tables, including pg_listener, will
not clear out dead tuples from before the start of the idle
transaction.

The cleanup thread will be unable to clean out
entries in sl_log_1, sl_log_2, and sl_seqlog, with the result that
these tables will grow, ceaselessly, until the transaction is
closed.

You can monitor for this condition inside the database
only if the PostgreSQL postgresql.conf
parameter stats_command_string is set to true. If that
is set, then you may submit the following query, which will find
relevant activity:

select * from pg_stat_activity where current_query like '%IDLE% in transaction';

You should also be able to search for " idle in
transaction " in the process table to find processes that are
thus holding on to an ancient transaction.

It is also possible (though rarer) for the problem to
be a transaction that is, for some other reason, being held open for a
very long time. The query_start time in pg_stat_activity may show you some query that has been
running way too long.

There are plans for PostgreSQL to have a timeout
parameter, open_idle_transaction_timeout , which would
cause old transactions to time out after some period of disuse. Buggy
connection pool logic is a common culprit for this sort of thing.
There are plans for pgpool to provide a better alternative, eventually,
where connections would be shared inside a connection pool implemented
in C. You may have some more or less buggy connection pool in your
Java or PHP application; if the small set of real
connections is held in pgpool, the database will be
insulated from the fact that the application imagines that
numerous connections are left idle in transaction for hours at a time.

6.2. I dropped a node from my cluster, and now sl_log_1 isn't
getting purged out. What's wrong?

This is a common scenario in versions before 1.0.5, as
the "clean up" that takes place when purging the node
does not include purging out old entries from the Slony-I table,
sl_confirm, for the recently departed
node.

The node is no longer around to update confirmations of what
syncs have been applied on it, and therefore the cleanup thread that
purges log entries thinks that it can't safely delete entries newer
than the final sl_confirm entry, which rather
curtails the ability to purge out old logs.

Diagnosis: Run the following query to see if there are any
"phantom/obsolete/blocking" sl_confirm entries:

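(A sketch of such a diagnostic query, using the same example cluster
schema _oxrsbar as below:)

select * from _oxrsbar.sl_confirm
 where con_origin not in (select no_id from _oxrsbar.sl_node)
    or con_received not in (select no_id from _oxrsbar.sl_node);
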
In version 1.0.5, the SLONIK DROP NODE function
purges out entries in sl_confirm for the
departing node. In earlier versions, this needs to be done manually.
Supposing the node number is 3, then the query would be:

oxrsbar=# delete from _oxrsbar.sl_confirm where con_origin not in (select no_id from _oxrsbar.sl_node) or con_received not in (select no_id from _oxrsbar.sl_node);
DELETE 6

General "due diligence" dictates starting with a
BEGIN, looking at the contents of
sl_confirm before, ensuring that only the expected
records are purged, and then, only after that, confirming the change
with a COMMIT. If you delete confirm entries for
the wrong node, that could ruin your whole day.
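
A minimal sketch of that due-diligence pattern, using the same example
cluster schema:

begin;
select * from _oxrsbar.sl_confirm
 where con_origin not in (select no_id from _oxrsbar.sl_node)
    or con_received not in (select no_id from _oxrsbar.sl_node);
-- inspect the rows listed above; only if they are all expected:
delete from _oxrsbar.sl_confirm
 where con_origin not in (select no_id from _oxrsbar.sl_node)
    or con_received not in (select no_id from _oxrsbar.sl_node);
commit;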

You'll need to run this on each node that remains...

Note that as of 1.0.5, this is no longer an issue at all, as it
purges unneeded entries from sl_confirm in two
places:

6.3. The slon spent the weekend out of
commission [for some reason], and it's taking a long time to get a
sync through.

You might want to take a look at the tables sl_log_1
and sl_log_2 and do a summary to see if there are any really enormous
Slony-I transactions in there. Up until at least 1.0.2, there needs
to be a slon connected to the origin in order for
SYNC events to be generated.

Note: As of 1.0.2,
function generate_sync_event() provides an
alternative as backup...

If none are being generated, then all of the updates until the
next one is generated will collect into one rather enormous Slony-I
transaction.

Conclusion: Even if there is not going to be a subscriber
around, you really want to have a
slon running to service the origin
node.

Slony-I 1.1 provides a stored procedure that allows
SYNC counts to be updated on the origin based on a
cron job even if there is no slon daemon running.

6.4. Some nodes start consistently falling behind

I have been running Slony-I on a node for a while, and am
seeing system performance suffering.

I'm seeing long running queries of the form:

fetch 100 from LOG;

This can be characteristic of pg_listener (which is
the table containing NOTIFY data) having plenty of
dead tuples in it. That makes NOTIFY events take a
long time, and causes the affected node to gradually fall further and
further behind.

You quite likely need to do a VACUUM FULL on
pg_listener, to vigorously clean it out, and need to vacuum
pg_listener really frequently. Once every five minutes would likely
be AOK.
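
For instance (pg_listener is a system catalog, so no schema
qualification is needed):

vacuum full verbose pg_listener;
-- thereafter, on a frequent schedule (e.g. every five minutes):
vacuum pg_listener;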

Slon daemons already vacuum a bunch of tables, and
cleanup_thread.c contains a list of tables that
are frequently vacuumed automatically. In Slony-I 1.0.2,
pg_listener is not included. In 1.0.5 and later, it is regularly
vacuumed, so this should cease to be a direct issue. In version 1.2,
pg_listener will only be used when a node is only receiving events
periodically, which means that the issue should mostly go away even in
the presence of evil long running transactions...

There is, however, still a scenario where this will still
"bite." Under MVCC, vacuums cannot delete tuples that
were made "obsolete" at any time after the start time of
the eldest transaction that is still open. Long running transactions
will cause trouble, and should be avoided, even on subscriber
nodes.

Is it possible that you are running
pg_autovacuum, and it has taken out locks
on some tables in the replication set? That would somewhat-invisibly
block Slony-I from performing operations that require acquisition of exclusive locks.

You might check for these sorts of locks using the following
query:

select l.*, c.relname from pg_locks l, pg_class c where c.oid = l.relation;

A ShareUpdateExclusiveLock lock will block the Slony-I
operations that need their own exclusive locks, which are likely
queued up, marked as not being granted.

6.6. I'm noticing in the logs that a slon is frequently
switching in and out of "polling" mode as it is
frequently reporting "LISTEN - switch from polling mode to use
LISTEN" and "UNLISTEN - switch into polling
mode".

The thresholds for switching between these modes are
controlled by the configuration parameters slon_conf_sync_interval and slon_conf_sync_interval_timeout; if the timeout value
(which defaults to 10000, implying 10s) is kept low, that makes it
easy for the slon to decide to return to "listening"
mode. You may want to increase the value of the timeout
parameter.

7. Slony-I FAQ: Slony-I Bugs in Elder Versions

7.1. The slon processes servicing my
subscribers are growing to enormous size, challenging system resources
both in terms of swap space as well as moving towards breaking past
the 2GB maximum process size on my system.

By the way, the data that I am replicating includes some rather
large records. We have records that are tens of megabytes in size.
Perhaps that is somehow relevant?

Yes, those very large records are at the root of the
problem. The problem is that slon normally draws in
about 100 records at a time when a subscriber is processing the query
which loads data from the provider. Thus, if the average record size
is 10MB, this will draw in 1000MB of data which is then transformed
into INSERT or UPDATE
statements, in the slon process' memory.

If you are experiencing this problem, you might modify the
definition of SLON_DATA_FETCH_SIZE, perhaps reducing it
by a factor of 10, and recompile slon. There are two
definitions, as SLON_CHECK_CMDTUPLES allows doing some
extra monitoring to ensure that subscribers have not fallen out of
SYNC with the provider. By default, this option is turned off, so the
default modification to make is to change the second definition of
SLON_DATA_FETCH_SIZE from 10 to 1.

In version 1.2, configuration values sync_max_rowsize and sync_max_largemem are associated with a new
algorithm that changes the logic as follows. Rather than fetching 100
rows worth of data at a time:

The fetch from LOG query will draw
in 500 rows at a time where the size of the attributes does not exceed
sync_max_rowsize. With default values, this
restricts this aspect of memory consumption to about 8MB.

Tuples with larger attributes are loaded until
aggregate size exceeds the parameter sync_max_largemem. By default, this restricts
consumption of this sort to about 5MB. This value is not a strict
upper bound; if you have a tuple with attributes 50MB in size, it
forcibly must be loaded into memory. There is no
way around that. But slon at least won't be trying
to load in 100 such records at a time, chewing up 10GB of memory by
the time it's done.

This should alleviate problems people have been experiencing
when they sporadically have series of very large tuples.

7.2. I am trying to use Slony-I to update a Unicode database from
PostgreSQL 8.0 to 8.1. Is there anything to watch out for?

PostgreSQL 8.1 is quite a lot more strict about what
UTF-8 mappings of Unicode characters it accepts as compared to version
8.0.

If you intend to use Slony-I to update an older database to 8.1, and
might have invalid UTF-8 values, you may be in for an unpleasant
surprise.

Let us suppose we have a database running 8.0, encoding in UTF-8.
That database will accept the sequence '\060\242' as UTF-8 compliant,
even though it is really not.

If you replicate into a PostgreSQL 8.1 instance, it will complain
about this, either at subscribe time, where Slony-I will complain
about detecting an invalid Unicode sequence during the COPY of the
data, which will prevent the subscription from proceeding, or, upon
adding data, later, where this will hang up replication fairly much
irretrievably. (You could hack on the contents of sl_log_1, but
that quickly gets really unattractive...)

There have been discussions as to what might be done about this. No
compelling strategy has yet emerged, as all are unattractive.

If you are using Unicode with PostgreSQL 8.0, you run a
considerable risk of corrupting data.

If you use replication for a one-time conversion, there is a risk of
failure due to the issues mentioned earlier; if that happens, it
appears likely that the best answer is to fix the data on the 8.0
system, and retry.

In view of the risks, replicating between these versions is
something you should not keep running any longer than is necessary to
migrate to 8.1.

7.3. I am running Slony-I 1.1 and have a 4+ node setup
where there are two subscription sets, 1 and 2, that do not share any
nodes. I am discovering that confirmations for set 1 never get to the
nodes subscribing to set 2, and that confirmations for set 2 never get
to nodes subscribing to set 1. As a result, sl_log_1/sl_log_2 grow
and grow, and are never purged. This was reported as
Slony-I bug 1485.

Apparently the code for
RebuildListenEntries() does not suffice for this
case.

RebuildListenEntries() will be replaced
in Slony-I version 1.2 with an algorithm that covers this case.

In the interim, you'll want to manually add some sl_listen entries using SLONIK STORE LISTEN or storeListen(),
based on the (apparently not as obsolete as we thought) principles
described in Section 9.

7.4. I am finding some multibyte columns (Unicode, Big5)
are being truncated a bit, clipping off the last character. Why?

This was a bug present until a little after Slony-I
version 1.1.0; the way in which columns were being captured by the
logtrigger() function could clip off the last
byte of a column represented in a multibyte format. Check to see that
your version of src/backend/slony1_funcs.c is
1.34 or better; the patch was introduced in CVS version 1.34 of that
file.

7.5. Bug #1226 indicates an error condition that can come up if
you have a replication set that consists solely of sequences.

The short answer is that having a replication set
consisting only of sequences is not a best practice.

The problem with a sequence-only set comes up only if you have
a case where the only subscriptions that are active for a particular
subscriber to a particular provider are for
"sequence-only" sets. If a node gets into that state,
replication will fail, as the query that looks for data from
sl_log_1/sl_log_2 has no tables to find, and the query will be
malformed, and fail. If a replication set with
tables is added back to the mix, everything will work out fine; it
just seems scary.

This problem should be resolved some time after Slony-I
1.1.0.

7.6. I need to drop a table from a replication set

This can be accomplished several ways, not all equally desirable ;-).

You could drop the whole replication set, and
recreate it with just the tables that you need. Alas, that means
recopying a whole lot of data, and kills the usability of the cluster
on the rest of the set while that's happening.

If you are running 1.0.5 or later, there is the
command SLONIK SET DROP TABLE, which will "do the trick."

If you are still using 1.0.1 or 1.0.2, the
essential functionality of SLONIK SET DROP TABLE
involves the functionality in droptable_int().
You can fiddle this by hand by finding the table ID for the table you
want to get rid of, which you can find in sl_table, and then run the following three queries,
on each host:
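
(A sketch of those three queries, assuming a cluster schema
_slonyschema and table ID 40; verify the function names against your
Slony-I version:)

select _slonyschema.alterTableRestore(40);
select _slonyschema.tableDropKey(40);
delete from _slonyschema.sl_table where tab_id = 40;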

The schema will obviously depend on how you defined the Slony-I
cluster. The table ID, in this case, 40, will need to change to the
ID of the table you want to have go away.

You'll have to run these three queries on all of the nodes,
preferably firstly on the origin node, so that the dropping of this
propagates properly. Implementing this via a slonik
statement with a new Slony-I event would do that. Submitting the
three queries using SLONIK EXECUTE SCRIPT could do that.
Also possible would be to connect to each database and submit the
queries by hand.

7.8. I set up my cluster using pgAdminIII, with cluster
name "MY-CLUSTER". Time has passed, and I tried using
Slonik to make a configuration change, and this is failing with the
following error message:

ERROR: syntax error at or near -

The problem here is that Slony-I expects cluster names
to be valid SQL Identifiers, and slonik enforces this. Unfortunately,
pgAdminIII did not do so, and allowed using
a cluster name that now causes a problem.

If you have gotten into this spot, it's a problem that
we can't help resolve very much.

On the other hand, when the problem has been experienced, users
have found they needed to drop replication and rebuild the
cluster.

A change in version 2.0.2 is that a function runs as
part of loading functions into the database which checks the validity
of the cluster name. If you try to use an invalid cluster name,
loading the functions will fail, with a suitable error message, which
should prevent things from going wrong even if you're using tools
other than slonik to manage setting up the cluster.

8. Slony-I FAQ: Hopefully Obsolete Issues

8.1. After an immediate stop of PostgreSQL (simulation of a system
crash), a tuple with relname='_${cluster_name}_Restart' remains in
pg_listener. slon doesn't start because it thinks another process is
serving the cluster on this node. What can I do? The tuples can't be
dropped from this relation.

The logs claim that

Another slon daemon is
serving this node already

The problem is that the system table pg_listener, used
by PostgreSQL to manage event notifications, contains some entries
that are pointing to backends that no longer exist. The new slon instance connects to the database, and is convinced,
by the presence of these entries, that an old
slon is still servicing this Slony-I
node.

The "trash" in that table needs to be thrown
away.

It's handy to keep a slonik script similar to the following to
run in such cases:

As of version 1.0.5, the startup process of slon looks for this
condition, and automatically cleans it up.

As of version 8.1 of PostgreSQL, the functions that manipulate
pg_listener do not support this usage, so for Slony-I versions after
1.1.2 (e.g. - 1.1.5), this
"interlock" behaviour is handled via a new table, and the
issue should be transparently "gone."

8.2. It appears the Slony-I xxid functions are
claiming to be capable of hashing, but cannot actually do so.

What's up?

Slony-I defined an XXID data type and operators on
that type in order to allow manipulation of transaction IDs that are
used to group together updates that are associated with the same
transaction.

Operators were not available for PostgreSQL 7.3 and earlier
versions; in order to support version 7.3, custom functions had to be
added. The = operator was marked as supporting
hashing, but for that to work properly, the join operator must appear
in a hash index operator class. That was not defined, and as a
result, queries (like the one above) that decide to use hash joins
will fail.

This has not been considered a
"release-critical" bug, as Slony-I does not internally
generate queries likely to use hash joins. This problem shouldn't
injure Slony-I's ability to continue replicating.

Future releases of Slony-I (e.g.
1.0.6, 1.1) will omit the HASHES indicator, so that the
operator no longer claims hash support.

Supposing you wish to repair an existing instance, so
that your own queries will not run afoul of this problem, you may do
so as follows:
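
(A sketch, assuming the xxid = operator is the one incorrectly marked
as hashable:)

update pg_operator
   set oprcanhash = 'f'
 where oprname = '='
   and oprleft = (select oid from pg_type where typname = 'xxid')
   and oprright = (select oid from pg_type where typname = 'xxid');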

8.3. I can do a pg_dump
and load the data back in much faster than the SUBSCRIBE
SET runs. Why is that?

Slony-I depends on there being an already existent
index on the primary key, and leaves all indexes alone whilst using
the PostgreSQL COPY command to load the data.
Further hurting performance, the COPY SET event (an
event that the subscription process generates) starts by deleting the
contents of tables, which leaves the table full of dead tuples.

When you use pg_dump to dump the contents of
a database, and then load that, creation of indexes is deferred until
the very end. It is much more efficient to
create indexes against the entire table, at the end, than it is to
build up the index incrementally as each row is added to the
table.

If you can drop unnecessary indices while the
COPY takes place, that will improve performance
quite a bit. If you can TRUNCATE tables that
contain data that is about to be eliminated, that will improve
performance a lot.

Slony-I version 1.1.5 and later versions should handle
this automatically; it "thumps" on the indexes in the
PostgreSQL catalog to hide them, in much the same way triggers are
hidden, and then "fixes" the index pointers and reindexes
the table.

8.4. Replication Fails - Unique Constraint Violation

Replication has been running for a while, successfully, when a
node encounters a "glitch," and replication logs are filled with
repetitions of the following:

The transaction rolls back, and
Slony-I tries again, and again, and again.
The problem is with one of the last SQL
statements, the one with log_cmdtype = 'I'. That
isn't quite obvious; what takes place is that
Slony-I groups 10 update queries together
to diminish the number of network round trips.

A certain cause for this has been
difficult to arrive at.

By the time we notice that there is a problem, the seemingly
missed delete transaction has been cleaned out of sl_log_1, so there
appears to be no recovery possible. What has seemed necessary, at
this point, is to drop the replication set (or even the node), and
restart replication from scratch on that node.

In Slony-I 1.0.5, the handling of purges of sl_log_1 became
more conservative, refusing to purge entries that haven't been
successfully synced for at least 10 minutes on all nodes. It was not
certain that that would prevent the "glitch" from taking
place, but it seemed plausible that it might leave enough sl_log_1
data to be able to do something about recovering from the condition or
at least diagnosing it more exactly. And perhaps the problem was that
sl_log_1 was being purged too aggressively, and this would resolve the
issue completely.

It is a shame to have to reconstruct a large replication node
for this; if you discover that this problem recurs, it may be an idea
to break replication down into multiple sets in order to diminish the
work involved in restarting replication. If only one set has broken,
you may only need to unsubscribe/drop and resubscribe the one set.

In one case we found two lines in the SQL error message in the
log file that contained identical insertions
into sl_log_1. This ought to be impossible, as
there is a primary key on sl_log_1. The latest (somewhat punctured)
theory that comes from that was that perhaps this PK
index had been corrupted (representing a PostgreSQL bug), and that
perhaps the problem might be alleviated by running the query:

# reindex table _slonyschema.sl_log_1;

On at least one occasion, this has resolved the problem, so it
is worth trying this.

This problem has been found to represent a PostgreSQL
bug as opposed to one in Slony-I. Version 7.4.8 was released with
two resolutions to race conditions that should resolve the issue.
Thus, if you are running a version of PostgreSQL earlier than 7.4.8,
you should consider upgrading to resolve this.

8.5. I started doing a backup using
pg_dump, and suddenly Slony
stops

Ouch. What happens here is a conflict between:

pg_dump, which has taken
out an AccessShareLock on all of the tables in the
database, including the Slony-I ones, and

A Slony-I sync event, which wants to grab a
AccessExclusiveLock on the table sl_event.

The initial query that will be blocked is thus:

select "_slonyschema".createEvent('_slonyschema', 'SYNC', NULL);

(You can see this in pg_stat_activity, if you
have query display turned on in
postgresql.conf)
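
A sketch of a query to observe the conflicting locks on sl_event:

select l.pid, l.mode, l.granted, c.relname
  from pg_locks l, pg_class c
 where c.oid = l.relation
   and c.relname = 'sl_event';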

The actual query combination that is causing the lock is from
the function Slony_I_ClusterStatus(), found in
slony1_funcs.c, and is localized in the code that
does:

That trigger initiates the action of logging all updates to the
table to the Slony-I sl_log_1/sl_log_2 tables.

On a subscriber node, this involves disabling
triggers and rules, then adding in the trigger that denies write
access using the denyAccess() function to
replicated tables.

Up until 1.1 (and perhaps onwards), the
"disabling" is done by modifying the
pg_trigger or pg_rewrite entries to point to the OID of the
"primary key" index on the table rather than to the table
itself.

A somewhat unfortunate side-effect is that this handling of the
rules and triggers somewhat "tramples" on them. The
rules and triggers are still there, but are no longer properly tied to
their tables. If you do a pg_dump on the
"subscriber" node, it won't find the rules and triggers
because it does not expect them to be associated with an index.

The trouble would seem to be that you have added
triggers on tables whose names conflict with triggers that were hidden
by Slony-I.

Slony-I hides triggers (save for those "unhidden"
via SLONIK STORE TRIGGER) by repointing them to the
primary key of the table. In the case of foreign key triggers, or
other triggers used to do data validation, it should be quite
unnecessary to run them on a subscriber, as equivalent triggers should
have been invoked on the origin node. In contrast, triggers that do
some form of "cache invalidation" are ones you might want
to have run on a subscriber.

The Right Way to handle such triggers is
normally to use SLONIK STORE TRIGGER, which tells
Slony-I that a trigger should not get deactivated.

But some intrepid DBA might take matters into their
own hands and install a trigger by hand on a subscriber, and the above
condition generally has that as the cause. What to do? What to do?

The answer is normally fairly simple: Drop out the
"extra" trigger on the subscriber before the event that
tries to restore them runs. Ideally, if the DBA is particularly
intrepid, and aware of this issue, that should take place
before there is ever a chance for the error
message to appear.

If the DBA is not that intrepid, the answer is to connect to
the offending node and drop the "visible" version of the
trigger using the SQLDROP
TRIGGER command. That should allow the event to proceed.
If the event was SLONIK EXECUTE SCRIPT, then the
"not-so-intrepid" DBA may need to add the trigger back,
by hand, or, if they are wise, they should consider activating it
using SLONIK STORE TRIGGER.

9.3. Behaviour - all the subscriber nodes start to fall
behind the origin, and all the logs on the subscriber nodes have the
following error message repeating in them (when I encountered it,
there was a nice long SQL statement above each entry):

You then need to find the rows in sl_log_1/sl_log_2 that have
bad entries and fix them. You may want to take down the slon daemons
for all nodes except the master; that way, if you make a mistake, it
won't immediately propagate through to the subscribers.

9.4. I dropped a node from my cluster, and now one of the remaining
nodes keeps failing on a STORE_LISTEN event that refers to the
dropped node. What now?

This points to a case where you'll
need to do "event surgery" on one or more of the nodes.
A STORE_LISTEN event remains outstanding that wants
to add a listen path that cannot be created
because node 1 and all paths pointing to node 1 have gone away.

Let's assume, for exposition purposes, that the remaining nodes
are #2 and #3, and that the above error is being reported on node
#3.

That implies that the event is stored on node #2, as it
wouldn't be on node #3 if it had not already been processed
successfully. The easiest way to cope with this situation is to
delete the offending sl_event entry on node #2.
You'll connect to node #2's database, and search for the
STORE_LISTEN event:
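
(A sketch, assuming a cluster schema "_MyCluster"; the seqno used in
the delete is hypothetical, and should be the one found by the
select:)

select ev_origin, ev_seqno, ev_type, ev_data1, ev_data2, ev_data3
  from "_MyCluster".sl_event
 where ev_type = 'STORE_LISTEN';

delete from "_MyCluster".sl_event
 where ev_type = 'STORE_LISTEN'
   and ev_origin = 2 and ev_seqno = 12345;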

The next time the slon for node 3
starts up, it will no longer find the "offensive" STORE_LISTEN
events, and replication can continue.
(You may then run into some other problem where an old stored event is
referring to no-longer-existent configuration...)

9.5. I have a database where we have been encountering
the following error message in our application:

permission denied for sequence sl_action_seq

When we traced it back, it was due to the application calling
lastval() to capture the most recent sequence
update, which happened to catch the last update to a Slony-I internal
sequence.

Slony-I uses sequences to provide primary key values for log
entries, and therefore this kind of behaviour may (perhaps
regrettably!) be expected.

Calling lastval(), to
"anonymously" get "the most recently updated
sequence value", rather than using
currval('sequence_name') is an unsafe thing to do
in general, as anything you might add in that uses DBMS features for
logging, archiving, or replication can throw in an extra sequence
update that you weren't expecting.

In general, use of lastval() doesn't seem
terribly safe; using it when Slony-I (or any similar trigger-based
replication system such as Londiste or
Bucardo) is in use can lead to capturing unexpected
sequence updates.
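
A sketch of the safer pattern (the table and sequence names here are
hypothetical):

insert into accounts (balance) values (0);
select currval('accounts_aid_seq');  -- names the sequence explicitly
-- select lastval();  -- unsafe: may return a Slony-I internal sequence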