Percona XtraDB Cluster: Multi-node writing and Unexpected deadlocks

Percona XtraDB Cluster (PXC) and the technology it uses (Galera) are an exciting alternative to traditional MySQL replication. For those who don’t know, it gives you:

Fully Synchronous replication with a write latency increase equivalent to a ping RTT to the furthest node

Automatic cluster synchronization, both incremental and full restores

The ability to read and write on every node without needing to worry about replication delay

However, good engineers know there is always a trade-off. If someone tries to sell you on a technology that sounds amazing without telling you the tradeoffs, be careful.

One of the tradeoffs in Galera is how multi-node writing is implemented. This is highly simplified, but locking inside of an RDBMS can either be optimistic or pessimistic.

Pessimistic locking is what Innodb does as a transaction executes. All the locks it needs (or it thinks it needs) will be acquired throughout the course of the transaction. If a given statement in the transaction needs a lock, it will wait for that lock before proceeding (this is where deadlocks and lock wait timeouts can happen). By the time the transaction reaches the COMMIT, it should be guaranteed to succeed (unless something weird happens like the disk fails or the server crashes) because it acquired all the locks it needed in advance.

Now, think about pessimistic locking in terms of a cluster. It would be incredibly expensive to go gain every lock on every node you need for all running transactions. How would deadlock detection work? On a single node, a deadlock is fairly easy to see by looking at all transactions waiting for locks and checking for circular dependencies, but that would be much more complicated in a cluster. It’s not to say that pessimistic locking isn’t possible in a cluster, but it is pretty tricky, and all that extra network traffic would really slow down transactions.

You may have already guessed, but Galera does not use pessimistic locking cluster-wide; it uses optimistic locking. Initially a transaction on PXC behaves much like a regular Innodb transaction. It does pessimistic locking on the node it was started from, all the way up to the point of COMMIT. At that point it knows it has gained all the locks necessary on the local node, but it has no idea about the rest of the cluster. So, it optimistically shoots the transaction out to the rest of the nodes in the cluster to see if they will agree to it. If they do, then the commit succeeds at a simple cost of one network roundtrip per node (which is done in parallel to all nodes at once).

This sounds all well and good, but what is the tradeoff?

Not locking resources cluster-wide means that there can be locking conflicts. In short, these take the form of deadlock errors to the clients of affected transactions. Technically these aren’t necessarily deadlocks, but instead a cross-node locking conflict of some kind. These deadlocks can happen in places where you wouldn’t necessarily expect one.

So, enough talking, let’s see what one looks like. Suppose we have a 3 node PXC cluster. Let’s create a table to test on and insert some data:
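The setup listing didn’t survive in this copy; a minimal equivalent session would look something like this (the table name `autoinc` and the row values are taken from the output further down; the exact column types, and the guess that the rows were inserted from node2, are assumptions — the auto-increment values 1, 4, 7 suggest a 3-node auto_increment_increment):

```
node2 mysql> create table autoinc (i int auto_increment primary key, j varchar(10)) engine=innodb;
node2 mysql> insert into autoinc (j) values ('node2'), ('node2'), ('node2');

node1 mysql> begin;
node1 mysql> update autoinc set j='node1' where i=1;
```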

So far, so good. We have an open transaction on node1 updating row 1 in our table. Note we haven’t committed yet. Now, let’s go to node3:

node3 mysql> set autocommit=off;
Query OK, 0 rows affected (0.00 sec)

node3 mysql> select * from autoinc;
+---+-------+
| i | j     |
+---+-------+
| 1 | node2 |
| 4 | node2 |
| 7 | node2 |
+---+-------+
3 rows in set (0.00 sec)

node3 mysql> update autoinc set j="node3" where i=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

node3 mysql> commit;
Query OK, 0 rows affected (0.00 sec)

node3 mysql> select * from autoinc;
+---+-------+
| i | j     |
+---+-------+
| 1 | node3 |
| 4 | node2 |
| 7 | node2 |
+---+-------+
3 rows in set (0.00 sec)

Our commit succeeded! On a single node, this should have blocked waiting for our first transaction, but it does not. Let’s go back to node1 and see what happened:

node1 mysql> commit;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

node1 mysql> select * from autoinc;
+---+-------+
| i | j     |
+---+-------+
| 1 | node3 |
| 4 | node2 |
| 7 | node2 |
+---+-------+
3 rows in set (0.00 sec)

Whoops! We got a deadlock error on commit; that doesn’t happen in normal Innodb.

Galera allows the first transaction to commit to “win”. Any other transactions on other nodes with competing locks are automatically rolled back and will receive a deadlock error no matter what they try to do next (even a SELECT or something else seemingly innocuous).

Dealing with this problem, quite simply, is the tradeoff to multi-node writing in a synchronous replication system with optimistic locking.

So, when is this really a problem? Whenever transactions on multiple nodes can take the same write locks. Whether or not this is a problem for your system is purely application-dependent, but if you’re not sure, I’d say you should assume it will be a problem.

What can you do to work with/around this problem? Here are some ideas from simplest to most complex:

Only write to a single node in the cluster (at a time): all the pessimistic locking will happen as usual on that node only, and there should be no conflicts from other nodes. This is pretty easy to rig up with an HAProxy configuration or similar, so if your main writer node fails, clients fail over gracefully.
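A minimal sketch of such an HAProxy setup, as one possible approach (the IPs, ports, and node names are placeholders; the health check on port 9200 assumes the usual clustercheck script being served via xinetd on each node):

```
# All writes go to node1; node2/node3 are 'backup' servers that only
# receive connections if node1's health check fails.
listen mysql-writes
    bind 0.0.0.0:3307
    mode tcp
    option httpchk
    server node1 10.0.0.1:3306 check port 9200
    server node2 10.0.0.2:3306 check port 9200 backup
    server node3 10.0.0.3:3306 check port 9200 backup
```

Reads can go through a separate listener that balances across all three nodes, since reads don’t trigger these conflicts.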

Update your application DB library to handle sudden deadlocks: if you end up with one of these deadlock errors, simply retry the transaction. Any software using transactions should have some facility to do this anyway (IMO).
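The retry logic can be wrapped once in your DB layer. Here is a minimal sketch; `DeadlockError` and the connection object are stand-ins for whatever your driver actually provides (with MySQLdb, for example, you would catch its operational error and check for error code 1213):

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for the driver's deadlock exception
    (MySQL error 1213 / SQLSTATE 40001)."""

def run_transaction(conn, txn_body, max_retries=3):
    """Run txn_body(conn) and retry it if a deadlock error occurs.

    txn_body should execute its statements and call conn.commit();
    note that in PXC the deadlock can surface at COMMIT time, so the
    whole body (including the commit) must be inside the retry loop.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return txn_body(conn)
        except DeadlockError:
            # The cluster has already rolled the transaction back;
            # rollback() here just resets the client-side state.
            conn.rollback()
            if attempt == max_retries:
                raise
            # Small randomized backoff so competing clients don't
            # immediately collide again.
            time.sleep(random.uniform(0, 0.1 * attempt))
```

The key point, echoing the comments below, is that there is nothing to undo manually: the failed transaction is already rolled back cluster-wide, so retrying from the top is safe.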

Restructure your application/schema to avoid data hot-spots: this is probably easier said than done in most cases, but in principle, the less likely it is for a single piece of data to be written on multiple nodes at the same time, the less often this problem will happen. For some databases this may be perfectly natural, but I believe most conventional relational databases will naturally have these kinds of problems.

About Jay Janssen

Jay joined Percona in 2011 after 7 years at Yahoo working in a variety of fields including High Availability architectures, MySQL training, tool building, global server load balancing, multi-datacenter environments, operationalization, and monitoring. He holds a B.S. of Computer Science from Rochester Institute of Technology.

So for application recovery from a deadlock situation, would the logic be to roll back the transaction and retry again? Or is any other special handling needed, since it seems to be committed on the local node?

The process of commit is to “certify” the transaction before it is able to commit, which is essentially getting permission to commit within the cluster. Once a transaction is certified it will commit, or if a problem happens the node will drop out of the cluster, so it is not possible for a transaction to commit locally and then fail with a deadlock cluster-wide.

Partha, Any transaction that fails in this way is already rolled-back, whether the client knows it yet or not.

The cluster is capable of retrying autocommitted statements automatically (this is a configuration setting), but not regular transactions — I have heard some talk of retrying full transactions, but it seems inherently unsafe with MVCC (in my own limited understanding of the issues involved).

The cluster takes care of the transaction states the way you’d expect — you don’t have to manually rollback local transactions, it all “just works” in that sense.

Hi Jay, thanks for this post. The Galera rollback behavior on conflicts is fairly easy to understand for individual rows, but I think there may be more to the story. It would be interesting to understand the deadlocking behavior a little more fully, including whether you can get rollback due to index page update conflicts or due to locks across tables that are induced by foreign key constraints. For instance, do index page splits count when considering conflicting updates? (Perhaps Seppo or Alexey can comment?)

Meanwhile, it has always seemed to me that the place where synchronous rollback behavior gets interesting is for large transactions. At that point you really have to pick a single node for all operations or face a very large number of rollbacks. As the transactions get larger you are dealing with the sort of math that Jim Gray described in “The Dangers of Replication and a Solution.”

When the application gets a deadlock error, the correct action is just to retry with better luck. To help with this, we have added the variable wsrep_retry_autocommit=, which will automatically retry a deadlocked autocommit transaction inside the server (at most n times). This hides the deadlock error from the application and is the most effective way of retrying. This variable is especially useful when dealing with an application which cannot be modified. We have not enabled retrying for multi-statement transactions, as it is not generally a safe operation. We are also working on further optimizations for cluster-wide conflicts; more news about this later.

Currently, if the application does not tolerate deadlocks and wsrep_retry_autocommit does not help, the remaining option is to limit the number of write nodes or eventually to fall back to a master-slave model. Usually, however, there are only one or two hotspot tables, and they can often be dealt with separately, e.g. by an intelligent proxy; ScaleBase has something to offer here.

And in the end, even in a master-slave topology, Galera replication gives many benefits over MySQL replication. Many projects migrate to the Galera world by first running master-slave and escalate to a multi-master model later on.

hmm.. some markup language rules filtered out part of my post above. Anyway, wsrep_retry_autocommit takes an argument for how many times the statement will be retried before giving up and returning the deadlock error. Small numbers make sense here; with wsrep_retry_autocommit=3, you will try at most three times.

Is there a variable which counts the number of times autocommit statements had to be retried? I’d see this as a good value to watch, to see when you’re wasting a lot of time and resources on deadlocks.

@Seppo, thanks for the answer. I have another question since conflict detection is purely logical.

How does Galera handle cases where tables are linked by constraints? Suppose you have table A with a primary key and table B with a foreign key constraint on A’s primary key. Now suppose a transaction on host 1 deletes a primary key in table A while another on host 2 updates a row in B to depend on that primary key. These would have to be totally ordered to avoid an inconsistency between hosts. Do you include the key from A in host 2’s change set?

In async multi-master this conflict is very hard to resolve. It seems to be a case where synchronous replication helps by avoiding it in the first place.

Currently such a status variable does not exist, but it feels like a good idea; I’ll put it on the wish list (wsrep_retry_count or something).

OTOH we have statistics for conflicts (wsrep_local_cert_failures & wsrep_local_bf_aborts), and each conflict, even if successfully retried later, will add up here. So the total conflict rate is visible already now; wsrep_retry_count would show how effective the wsrep_retry_autocommit configuration was.
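Checking those counters is an ordinary status query, along these lines (the node prompt is illustrative; the output values will of course depend on your cluster):

```
node1 mysql> show global status like 'wsrep_local_cert_failures';
node1 mysql> show global status like 'wsrep_local_bf_aborts';
```

Sampling these periodically and graphing the delta gives the conflict rate over time, which is the number to watch when deciding whether multi-node writing is viable for a given workload.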

Robert, Galera implements pretty much fundamental dependency tracking for shared and exclusive references at the server, database, table, and row level. Foreign key references count as shared row references in this taxonomy, and they are dealt with along with all other possible references there might be.

Hi, where can I find wsrep_local_cert_failures & wsrep_local_bf_aborts explained in detail? I have a Percona Cluster where wsrep_local_cert_failures on each node keeps growing (~1000 at the moment, 3 days uptime) but wsrep_local_bf_aborts = 0. I set wsrep_retry_autocommit=5.

I am having the same problem: when I try to run a huge insert statement into a new table, it fails with ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction. It’s a new table and nobody else is trying to access it. I am running a 2-physical-node cluster with a Galera arbitrator on a third node. Everything looks OK, but I don’t know why the insert is failing. I even set wsrep_retry_autocommit=5, but it didn’t help.