Concurrency Controls in Data Replication

Data distribution is commonly used in high-performance computing (HPC). There are two fundamental data distribution topologies: replication and partitioning.
With data partitioning, you can achieve parallel processing of a large amount of data. With data replication, you can achieve load balancing (LB) and high availability (HA). This article focuses on the requirements for data replication.

Even though a data item has several replicas in a data replication environment, some degree of data coherence is required: the replicas should appear to end users as one virtual global item. The biggest challenge in using data replication is making the proper trade-off between data coherence and performance based on your business requirements.

We will use a sample Distributed Airline Ticketing System (DATS hereafter) throughout the discussion. DATS has two databases for HA and LB: one is in NY and the other is in LA. Depending on your replication scheme, data can be updated either at one site only and replicated to the other, or at both sites and replicated to each other.
Further, let’s suppose that the following actions take place in time order:

1. Both local database replicas are synchronized at this time and there is only one ticket left. This last ticket can be taken by either NY or LA;

2. A NY customer bought this ticket. This action was recorded in the local NY database and will be replicated to LA somehow, depending on the replication scheme;

3. Depending on your replication scheme, the LA database can show the same ticket as either still available or already taken by a NY customer. If the same ticket still appears available in the LA database, it will be sold to a LA customer. This creates an oversold situation.

Since the original DATS, whose two sites communicate over a WAN, is only suited for asynchronous replication, the following variant of DATS will be used when we discuss synchronous replication: instead of one LA database, we assume there is a second NY database in the same data center as the first one.

We also implicitly assume the optimistic concurrency control mechanism throughout the discussion. Here is how it works in the DATS:

In order to have good performance, most multi-tiered applications use optimistic concurrency control that can create a lost update problem. For example, if we use optimistic concurrency control in both databases in DATS, the application tier in Step 3 could first read the LA database before Step 2 and then sell the same ticket to a LA customer after Step 2.

The application must use "optimistic concurrency control with version checking" to fix this issue. One version checking scheme can just be a version number that increases whenever there is any corresponding data change.

Suppose the version is 0 at Step 1. Step 2 updates it to 1. The version read by the application tier at Step 3 is still 0. But when the application tier later tries to sell the same ticket, the sale will fail because the version has changed to 1 from the cached value 0.
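
The scheme can be sketched as follows (a minimal in-memory sketch; the `TicketTable` class is hypothetical, and a real implementation would issue a conditional SQL update such as `UPDATE tickets SET owner = ?, version = version + 1 WHERE id = ? AND version = ?`):

```python
class StaleVersionError(Exception):
    """Raised when the row changed between the read and the update."""

class TicketTable:
    def __init__(self):
        # ticket_id -> {"owner": ..., "version": ...}
        self.rows = {1: {"owner": None, "version": 0}}

    def read(self, ticket_id):
        row = self.rows[ticket_id]
        return row["owner"], row["version"]

    def sell(self, ticket_id, customer, expected_version):
        row = self.rows[ticket_id]
        # The update succeeds only if the version is unchanged since the read.
        if row["version"] != expected_version:
            raise StaleVersionError("row changed since it was read")
        row["owner"] = customer
        row["version"] += 1
```

If NY and LA both read the ticket at version 0, NY's sale with expected version 0 succeeds and bumps the version to 1, so LA's later sale with the stale expected version 0 raises `StaleVersionError` instead of overselling.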

1. Synchronous Replication Using Distributed Locks and Local Transactions

Oracle RAC (formerly Oracle Parallel Server (OPS) in version 8i and prior) allows the same physical database to be accessed at multiple instance sites. In order to allow users to read and write any data at any time at any instance site, Oracle RAC uses "Cache Fusion" to ensure data coherence.

"Cache Fusion" basically uses synchronous replication with a distributed locking manager (DLM). DLM acts as a distributed lock (DL) coordinator among other functions so that the same resource - e.g. a table row - can only be granted to one instance site for changing at a time while other sites have to wait.

DLM also acts as a global resource directory. For example, when instance site 1 updates a row, it doesn't need to actively push the new version of the data to all the other instance sites. Instead, the replicas at the other sites are simply invalidated. When instance site 2 later requests the same row, DLM can tell it to get the latest version from instance site 1.
Also, instance site 1 doesn't need to use a distributed transaction thanks to DLM and the fact that there is still only one physical database.
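
This invalidation-plus-directory behavior can be sketched as follows (a toy model, not Oracle's implementation; the class and method names are made up for illustration):

```python
class DirectoryBasedCache:
    """Toy global resource directory: tracks, per row, which site holds
    the latest copy, and invalidates stale replicas on write."""

    def __init__(self, n_sites):
        self.caches = [dict() for _ in range(n_sites)]  # per-site row cache
        self.owner = {}  # row_key -> site holding the latest version

    def write(self, site, key, value):
        self.caches[site][key] = value
        # Invalidate replicas at the other sites instead of pushing the data.
        for s, cache in enumerate(self.caches):
            if s != site:
                cache.pop(key, None)
        self.owner[key] = site

    def read(self, site, key):
        if key not in self.caches[site]:
            # The directory points the reader at the site with the latest copy.
            value = self.caches[self.owner[key]][key]
            self.caches[site][key] = value  # keep a local replica
        return self.caches[site][key]
```

A write at site 0 leaves site 1 with no copy; site 1's next read consults the directory and fetches the latest version from site 0, which is exactly why a write costs only cheap invalidations rather than a full data push.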

The pros include a very high degree of data coherence and load balancing for both reads and writes (synchronous replication usually doesn't balance writes because the same write is replicated across all sites; however, thanks to RAC's lightweight invalidations, writes are relatively well balanced).

The cons include unscalable write performance (even though invalidations are lightweight, too many of them still flood the shared interconnects, so RAC's writes are still not scalable) and the requirement of highly available, high-speed interconnects due to replication synchrony and distributed locks (distributed locks usually consist of quite a few daemon processes and data structures at each site, whose coordination performs poorly or is even impossible over low-speed networks, including LAN and WAN; for Oracle "Cache Fusion", distributed locks are implemented with Global Cache Service (GCS), Global Enqueue Service (GES) and Global Resource Directory (GRD) in a clustered environment).

You should also notice that this scheme is unique to RAC. If you have two or more transactional data sources, using local transactions will sometimes cause data inconsistency. However, synchronous replication that uses distributed transactions along with distributed locks is very expensive, even though distributed transactions can ensure data atomicity. The author hasn't seen such a scheme in any product so far.
Because synchronous replication is impractical or even impossible over a WAN, we will apply this scheme to the DATS variant. Step 3 has to wait for the DL to be released by Step 2. When Step 3 gets the DL, the same ticket will show as already taken by Step 2.

2. Synchronous Replication Using Local Locks and Distributed Transactions

Oracle's multi-master replication (also called peer-to-peer or n-way) has two data coherence protocols, one of which is synchronous replication. The other one will be covered in Section 4.

Synchronous replication applies any DML changes or executes any replicated procedures at all sites participating in the replication environment as part of a single distributed transaction (each site has its own physical database, which is different from there being only one physical database in Oracle RAC). If the DML statement or procedure fails at any site, then the entire transaction rolls back.

The distributed transaction ensures data consistency at all sites in real time. However, it doesn't use any distributed locking; it only uses the local locks acquired by each participating local transaction.

Consider the case when an application performs a synchronous update to a replicated table. Oracle first locks the local row and then uses an after-row trigger to lock the corresponding remote row. Oracle releases the locks when the transaction commits at each site. As you can imagine, deadlocks can happen if multiple sites try to change the same resource at the same time. Other than this, the same resource can only be granted to one site for changing at a time while other sites have to wait.
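
The local-then-remote locking order is what makes deadlocks possible: each transaction holds its local row lock while waiting for the remote one. A toy wait-for-graph cycle check (hypothetical, not Oracle's actual detector) illustrates this:

```python
def detect_deadlock(held, wanted):
    """held: txn -> lock it currently holds.
    wanted: txn -> lock it is waiting for.
    Returns True if the wait-for graph contains a cycle (a deadlock)."""
    owner = {lock: txn for txn, lock in held.items()}
    for start in wanted:
        seen, txn = set(), start
        # Follow the chain: I wait for a lock -> held by some txn -> ...
        while txn in wanted and wanted[txn] in owner:
            txn = owner[wanted[txn]]
            if txn == start:
                return True  # came back to where we started: a cycle
            if txn in seen:
                break
            seen.add(txn)
    return False
```

With a NY transaction holding its local row copy while waiting for the LA copy, and an LA transaction doing the opposite, the wait-for graph forms a cycle; Oracle then rolls one transaction back.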

The pros include a high degree of data coherence, simple implementation and easy administration by avoiding distributed locks.

The cons include possible deadlock due to temporal behavior of local and remote locking, poor write performance, and requirements of high-availability and high-speed networks due to replication synchrony and to two-phase commit (2PC) needed by distributed transactions.

Deadlocks become a major problem under high concurrency. In case of deadlock, Oracle rolls back one offending transaction while keeping the other. The rolled-back transaction returns an error code to the front-end application.

For the same reason mentioned in Section 1, we will apply this scheme to the DATS variant. Step 3 has to wait for Step 2 to release the local and remote locks. When Step 3 gets the locks, the same ticket will appear as already taken by Step 2.

3. Synchronous Replication Using Local Locks and Local Transactions

TimesTen's unidirectional active-standby pair configuration only uses the so-called "return twosafe replication". It provides fully synchronous replication between the master (the active site) and the subscriber (the standby site).

No distributed transaction or distributed lock is involved. Instead only local transactions and local locks are used. Specifically, the local transaction on the subscriber is first committed before the transaction on the master. If the subscriber can't commit, the master will not commit either.
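
The commit ordering can be sketched as follows (a minimal in-memory sketch of the subscriber-commits-first rule; the `Master`/`Subscriber` classes are hypothetical and ignore networking, durability and failure recovery):

```python
class Subscriber:
    """The standby site: commits first and acknowledges."""
    def __init__(self):
        self.data = {}

    def commit(self, changes):
        self.data.update(changes)
        return True  # acknowledgement back to the master

class Master:
    """The active site: commits only after the standby acknowledges."""
    def __init__(self, subscriber):
        self.data = {}
        self.subscriber = subscriber

    def commit(self, changes):
        # Return-twosafe style: standby commit precedes the master commit,
        # using only local transactions on each side - no 2PC.
        if not self.subscriber.commit(changes):
            raise RuntimeError("standby failed; master transaction aborted")
        self.data.update(changes)
```

Note the residual coherence gap discussed below: if the master's own commit fails after the subscriber has already committed, the two sites diverge, precisely because no distributed transaction spans them.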

At any time, only the active site can be updated, which greatly simplifies the data updating complexity (otherwise using a local lock and local transaction will be insufficient) and ensures fast fail-over to the standby site in case of the active site failure.

This scheme has similar benefits and drawbacks to the previous scheme in Section 2.
However, its performance is much better thanks to avoiding the 2PC required by distributed transactions. It also eliminates deadlocks because only the active site allows updates.

Although the standby site seems to be a waste of capacity, you can co-locate it with another active site as shown in Figure 1 (that active site holds a different set of data from the collocated standby site).

This scheme has a lower degree of data coherence due to the inconsistency that results when the master fails to commit even though the subscriber has successfully committed (the root cause is that it doesn't use distributed transactions; but you should also know that temporary data inconsistency can still result from a failure in the second commit-or-rollback phase of a 2PC process).

TimesTen's practice in this scenario is consistent with its configurations for high performance in other areas such as asynchronous logging and data caching with a write-behind policy.

Gigaspaces IMDB has a very similar topology called primary-backup replication. The only difference is that it uses distributed transactions instead of local transactions only, so it has a higher degree of data coherence than TimesTen.
Another advantage of Gigaspaces IMDB is that fail-over happens transparently to end users, while TimesTen customers still need to resort to third-party or custom cluster management software.

For the same reason mentioned in Section 1, we will apply this scheme to the DATS variant. One of the two NY sites will be the active site, and the standby site has to connect to the active site for data updates. The local lock at the active site prevents the oversold situation.

This scheme, along with data partitioning as shown in Figure 1, is strongly recommended over the two previous synchronous schemes based on the following factors:

It greatly simplifies data updating complexity while still providing HA;

Although the two previous synchronous schemes allow updating anywhere, updating the same resource entails costly lock coordination and/or distributed transactions over the network. Scalable updating is usually achieved by data partitioning;

Although the two previous synchronous schemes allow distributed and scalable reads, you can also fine-tune your partitions to allow more concurrent reads.

Figure 1: The Primary-Backup Partition in Gigaspaces IMDB

4. Asynchronous Replication with Update Anywhere

Another data coherence protocol in Oracle's multi-master replication is asynchronous replication that allows users to update data at any participant site. This scheme is also used in Oracle's updatable materialized view replication and TimesTen's bidirectional master-subscriber replication for general distributed workloads.

With this scheme the data changes at one site will be queued for propagation to other sites and committed locally. The queued changes will be propagated in batches in a separate transaction, so it doesn't use any distributed lock or distributed transaction. Instead it only uses local locks required in the corresponding local transaction.
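
The queue-then-propagate flow might be sketched like this (a hypothetical in-memory model; `AsyncReplica` and its methods are illustrative, not any vendor's API):

```python
class AsyncReplica:
    """One site in an asynchronous update-anywhere topology."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.queue = []  # changes committed locally, awaiting propagation

    def update(self, key, value):
        # Commit locally and queue the change; no distributed lock or
        # distributed transaction is taken.
        self.data[key] = value
        self.queue.append((key, value))

    def propagate(self, other):
        # Later, in a separate transaction, push queued changes in a batch.
        for key, value in self.queue:
            other.data[key] = value
        self.queue.clear()
```

Between `update` and `propagate`, the two sites visibly disagree: that window is exactly the limited data coherence, and a conflicting update at the other site during it is a replication conflict.

The pros include great read and write performance, easy implementation, suitability for low-speed networks including both LAN and WAN, and disconnected updating. In particular, WAN deployment makes disaster recovery truly possible across geographically dispersed data centers.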
The pros include great read and write performance, easy implementation, suitability for low-speed networks including both LAN and WAN, and disconnected updating. In particular, WAN deployment makes disaster recovery truly possible across geographically dispersed data centers.

The cons include a limited degree of data coherence depending on data refresh frequency, and possible data change conflicts.

Because there is no distributed lock or distributed transaction involved, a replication conflict occurs if two transactions originating from different sites update the same row at nearly the same time. (When the queued changes are propagated to the other site, the other site will have two versions of data changes. In that case, the application needs to decide which one should take place.)

A conflict resolution method must be provided to resolve the data inconsistency. Both Oracle and TimesTen have a pre-built "latest timestamp" resolution method that makes the change with the latest timestamp the winner. Oracle also allows you to customize a resolution method based on your business requirements.
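
As a sketch, a "latest timestamp" resolver simply keeps the change with the newer timestamp; the dictionary-based change records here are hypothetical, and a real system must apply the same rule deterministically at every site so all replicas converge:

```python
def latest_timestamp_wins(local_change, remote_change):
    """Pre-built 'latest timestamp' resolution: the change with the
    newer timestamp becomes the winner at every site."""
    return max(local_change, remote_change, key=lambda c: c["ts"])

# Two conflicting updates to the same row from different sites:
ny = {"site": "NY", "value": "sold_to_ny", "ts": 100}
la = {"site": "LA", "value": "sold_to_la", "ts": 105}
winner = latest_timestamp_wins(ny, la)  # LA's later change wins
```

A custom resolution method, as Oracle allows, would replace the `max` rule with business logic (e.g. additive merging for counters).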

This scheme can't be applied to DATS if oversold situations are not allowed because the changes at NY and LA sites can be committed independently in two different transactions that result in the same ticket being sold to two customers.

If occasional oversold situations are allowed, the NY and LA sites can sell tickets at different times by taking advantage of the three-hour time zone difference. If a replication conflict does occur, relevant information should be recorded in the database so that your front-end application can take proper actions (in reality, a reservation system like DATS doesn't use this scheme).

5. Asynchronous Replication with Update at the Master Site only

Broadly speaking, this scheme is also employed when you create multiple database sessions using optimistic locking. When you first query data in a session, you are actually returned a copy of the master data from the database. Then, when you want to save your changes, you persist them to the backend database.

Because changes only take place at the master site, distributed locks or distributed transactions simply don't apply. This scheme basically has similar pros and cons to the previous scheme in Section 4. However, because it only allows updates at the master, it eliminates the notorious replication conflicts, which proves to be a very sound design most of the time in an asynchronous replication environment.

If we apply this scheme to DATS and suppose NY is the master site (or a third site is the master), NY has to wait if LA first gets the local lock at the master site. The local lock at the master site prevents the oversold situation.

Similar to the argument near the end of Section 3, this scheme along with data partitioning is also recommended. The DATS can be enhanced by partitioning so that NY is in charge of the east-coast flights while LA is in charge of the west-coast flights.

Life is much easier using Gigaspaces IMDB's master-local topology as shown in Figure 2 because it automatically delegates your local cache updates to the master, which propagates the same updates to the other local caches. Gigaspaces IMDB also supports optimistic locking with versioning.

You must handle both of these yourself if you use Oracle's read-only materialized view replication or TimesTen's unidirectional master-subscriber replication.

6. Conclusion

Data replication can be generally categorized as synchronous and asynchronous. Synchronous replication ensures a high degree of data coherence at the cost of expensive high-availability and high-speed networks. It is usually used to protect mission-critical data such as in financial industries.

Asynchronous replication provides better write scalability at the cost of a low degree of data coherence. It is usually used to balance writes and provides disaster recovery.
For each replication category, there are also several schemes that provide different concurrency controls. Although your specific business requirements may decide which scheme suits you, we recommend the schemes in Sections 3 and 5.

Finally, readers should be aware of two points. One is that there are still other interesting replication approaches not covered here. For example, there is so-called "semi-synchronous replication", as offered by TimesTen's "return receipt replication" and MySQL 5.5, and there are combinations of synchronous and asynchronous replication, as adopted by Gigaspaces' data write-behind feature.

The other is related to the current NoSQL trend. Since most NoSQL alternatives boast a scale-first capability and assume that failure is inevitable, they rely on data replication to ensure read/write LB and HA. This article will only mention three typical NoSQL implementations. CouchDB, built on the Erlang OTP platform, allows distributed and even disconnected document updates through its bidirectional asynchronous incremental replication. Cassandra allows cross-data-center replication and provides different degrees of data coherence.

Finally, Gigaspaces diminishes the role of traditional relational databases by working as an IMDB that relies on replication to implement its HA and write-behind capabilities. The latest 8.0 version even supports a new document interface besides the original key-value map interface.

About the Author

Yongjun Jiao is a technical manager with SunGard Global Services. He has been a professional software developer for the past 10 years. His expertise covers Java SE, Java EE, Oracle, application tuning and high performance and low latency computing.
