Saturday, February 26, 2011

The Case For Logical Replication

Replication, as it exists in PostgreSQL today, is physical replication. That is, the disk files as they exist on the standby are essentially byte-for-byte identical to the ones on the master (with the exception that the same hint bits may not be set). When a change is made on the master, the write-ahead log is streamed to the standby, which makes the same change to the corresponding disk block on the standby. The alternative is logical replication, in which what is transferred is not an instruction to write a certain sequence of bytes at a certain location, but the information that a certain tuple was inserted, or that a table with a given schema was created.
The big advantage of physical replication is that it has very, very low overhead. Most of the write-ahead log records that are needed to make physical replication work are needed anyway for crash recovery. There is certainly some overhead, but it is quite low, making it the best form of replication for disaster recovery scenarios. Having said that, physical replication has a number of serious disadvantages:

1. You can't replication to a different major version of PostgreSQL.
2. You certainly can't replicate to a database other than PostgreSQL.
3. You can't replicate part of the database.
4. You can't write any data at all on the standby.
5. You certainly can't do multi-master replication.
6. MVCC bloat on the master propagates to the standby.
7. Anything that would bloat the standby causes query cancellations instead, or delays in recovery (or in 9.1, you'll be able to avoid the query cancellation by bloating the master).

Logical replication, in theory, can work around all of these problems, and in fact there are already a number of existing projects which aim to provide logical replication for PostgreSQL, including Slony, Bucardo, and Londiste. All of these projects, however, have been hindered by lack of core support (as discussed at the last developer meeting, which I blogged a little bit about), and I think also by the fact that replication is just a really hard problem. There are also forks of PostgreSQL that aim to add replication functionality, such as Postgres-R and the now-defunct Mammoth Replicator, but it seems we don't yet have a solution that everyone is 100% happy with. Rather than a whole series of separate projects, I think we need one high-quality solution in core, or at the very least a set of powerful tools in core upon which solutions can be built more easily than what our current architecture allows. I think that trigger-based solutions are always going to carry an uncomfortably high performance cost.

Given the low overhead of physical replication and the fact that we already have it, one might ask whether it would make more sense to continue refining that technology, rather than invent something new. Perhaps, but I'm skeptical. Logical replication would be a very hard project, but it would also cut a pretty big hole through the solution space. Even the most ambitious plans for improving our current hot-standby-based replication system don't involve relaxing more than one or two of the seven restrictions listed above, and the difficulties in getting even that far seem quite formidable. I'm glad we have a really high-quality physical replication solution for high availability and disaster recovery, but I fear that it's never going to really meet the full range of needs in this area.

19 comments:

IMHO, core could start by providing SQL support for replication. For example, adding ALTER TABLE tablename FOR REPLICATION, which could, say, set a boolean in pg_class. Altering a database for replication would then set in motion the propagation of changes or make them available for external processes.

To allow writes on a standby or for multi-master, you need smart policies for dealing with conflicts. For example, there used to be a product (whose name escapes me) that allowed a change to a customer name in one master to be merged properly with a change to the same customer's phone from another master.

You seem to have missed entirely the existence of the pg-cluster-hackers list and the clustering summit (at pgCon this year). The purpose of that group is to get tools into core which make building replication solutions easy/possible. Progress has been slow.

Maybe you should get involved? You got invited to the clustering summit, you could show up.

Several of the fundamental pieces needed here, the ones that always pop up when you poll people about how to actually start building this feature, are already listed at ClusterFeatures. "API into the Parser", "DDL Triggers", and "Modification trigger into core / Generalized Data Queue" are the three most relevant here. The more detailed page for General Modification Trigger and Generalized Data Queue is probably the biggest and most important of those.

Josh, I know that there is a pgsql-cluster-hackers mailing list, but it seems like there's very little activity - 8 messages in the last 9 months. There have been no significant patches in this area during the 9.1 development cycle.

As for the cluster-hackers summit, I saw an invitation on pgsql-hackers for people actively doing work in this area, but since this is more of a wish than a current project I'm not sure I qualify. I may also be giving a tutorial that day.

We could learn much from Sybase's Replication Server. It was really fast (at France Telecom, we actually used it as a message queue), supported function replication, filtering, routing and other advanced capabilities, many of which have yet to be replicated by other commercial vendors.

I'm pointing out that the idea of having features and hooks to enable a variety of replication systems for PostgreSQL is not a new idea, even if it's new to you. There's even a wiki page:

http://wiki.postgresql.org/wiki/ClusterFeatures

What's been lacking ... for the last 6 years, in fact ... is sufficient concentrated work on any of these features. Or, for that matter, even enough urgency around the features that the various replication teams are willing to compromise on exactly how they work. Sometimes it seems like -hackers is the only place where people are willing to compromise on specs.

Josh, I'm not sure if hooks are enough, or if we really need to have the whole thing in core. But certainly hooks would be a good place to start. I know there has been work done on this in the past, but as you say I think we need to step up our efforts in that area.

Logical replication is a useful project and does not need to be deeply enmeshed in the PostgreSQL core. It does need some changes to logging to work well. We are pushing Tungsten Replicator for MySQL into open source the first half of this year. I hope that later in the year we can start to look for companies in the PostgreSQL space willing to sponsor reading the PG log. Maybe we'll even have a chance to work on this together. :)

@Fazal, interesting you bring up Rep Server. A couple of us working on Tungsten used to be in the Sybase Rep Server group. It's still a great product. We have been working for the last 3 years getting similar features (and better) into Tungsten Replicator.

I'm not convinced ALTER TABLE FOR REPLICATION hooks that call into a replication system living outside of core will make deploying any of the logical replication systems that much easier.

The big issues with logical replication are i)none of the projects do everything and knowing the limitations which one for a particular job is difficultii) Many of them are difficult to use and difficult to troubleshoot when things go wrongiii) The performance overhead of the trigger based ones can be significant for a lot of workloadsiv) Lack of DDL triggers means DDL changes requires special intervention

I think to some extent people on the cluster-hackers mailing list knows how hard these problems are so they ignore them.

@Steve Singer: I agree with all of your comments. I'd like to see us tackle some of those hard problems, particularly performance and ease of use. What do you think the right forum for that discussion is?

After reading the post, i think PostgreSQL can use method used by oracle to read from xlogs. Oracle uses "SQL apply" method for Logical standby database. Oracle reads transactions from the transactions logs and convert it to SQL. Oracle uses primary-key or unique-constraint/index supplemental logging to logically identify a modified row in the logical standby database.

When database-wide primary-key and unique-constraint/index supplemental logging is enabled, each UPDATE statement also writes the column values necessary in the redo log to uniquely identify the modified row in the logical standby database.

If a table has a primary key defined, then the primary key is logged along with the modified columns as part of the UPDATE statement to identify the modified row.

In the absence of a primary key, the shortest nonnull unique-constraint/index is logged along with the modified columns as part of the UPDATE statement to identify the modified row.

In the absence of both a primary key and a nonnull unique constraint/index, all columns of bounded size are logged as part of the UPDATE statement to identify the modified row.

I agree with what Denish Patel proposed. Oracle like logical standby is a good idea. I don't like Slony because it incurs load on primary and it is very troublesome to maintain database schema changes with slony. Extracting DML or DDL from Xlog and reapply it to standby is a good way to go... I hope those PostgreSQL hackers can seriously consider this.

I don't think anyone wants to extract statements. What would be nice is to extract tuples, from which insert/update/delete SQL statements could be generated. So a statement that deletes 3 tuples would eventually turn into three delete statements, targeted by primary key.

Hey Robert hii,can you tell something about how i can extract tuples and then generate sql statements which you are talking about in your last comment, am also working on similar project means logical replication n was trying to extract sql statements from xlog files of postgresql in pg_xlog directory. thanks in advance :)