Imprint

Some basics about distributed databases

This is a replay of a much older blog post, which was available in German language in the old blog. It’s from 2012, and neither GTID nor Galera cluster or Group Replication existed back then.

Wonka> The http://www.toppoint.de probably will never have meaningful load, but I would like to know how one would make this highly available. Some kind of Redundant Array of Inexpensive Databases.

Lalufu> MySQL with replication? Or DRBD?

Isotopp> With DRBD. Not with replication.

Wonka> Lalufu: Hm, Master-Master replication is with two hosts. If you want more than that you could build rings, but only singularly linked.

Isotopp> Wonka: Argh! Master-Master does not work with replication, ever.

Wonka> huh?

Isotopp> Thread 1 writes to Master 1:

INSERTINTO t (id, d)VALUES(NULL,'one');

insert into t (id, d) values (NULL, 'one');

At the same time, Thread 2 writes to Master 2:

INSERTINTO t (id, d)VALUES(NULL,'two');

insert into t (id, d) values (NULL, 'two');

Isotopp> What’s the content of the database master1, what’s the content of database master2? If you assume auto_increment_increment and auto_increment_offset to be configured correctly?

Wonka> Isotopp: Ok, Problem. Still, there are many HOWTOs about that.

Isotopp> Wonka: Not a problem, yet. You will have (1, ‘one’) and (2, ‘two). So far it works.

kv_> After UPDATE things will look worse.

Isotopp> Now Thread 1 runs an UPDATE against Master 1.

UPDATE t SET d ='een'WHERE id =1;

update t set d = 'een' where id = 1;

And Thread 2 does an UPDATE against Master 2:

UPDATE t SET d ='eins'WHERE id =1;

update t set d = 'eins' where id = 1;

Isotopp> What’s the content of Master 1 and Master 2 now? The point being that there is no global timeline for the entire ring, so there is no one global serialization of the rings history. Instead each local node has it’s own local order of events. That means on Master 1 you can have an order of events one, een, eins and on master 2 it can be one, eins, een.

Lalufu> Actually the problem is that people would like to have such a global order, but MySQL can’t deliver that.

Isotopp> It’s even more complicated. Let me explain to the end.

Wonka> http://www.howtoforge.com/mysql_master_master_replication “This tutorial describes how to set up MySQL master-master replication. We need to replicate MySQL servers to achieve high-availability (HA). In my case I need two masters that are synchronized with each other so that if one of them drops down, other could take over and no data is lost. Similarly when the first one goes up again, it will still be used as slave for the live one.”

Wonka> So they did not understand this.

Isotopp> Exactly. They think they can win, but you cannot cheat the universe. So you want to enforce such a serialisation. In SQL you enforce an ordering of events with locks on the domain you want to enforce an order on. That means you need a locking that works on the entire domain, all servers in that.

Because your domain is no longer a single box. For a single box you do have local locking. But now you have that cluster, or a ring. And MySQL replication specifically has no locking protocol. Such locking protocols do exist, 2PC, 3PC, Paxos, and a few more. 2PC is the ‘fastest’ in the sense that it uses a minimal number of round trips in the case of no special events happening during the lock synchronisation in the cluster. Paxos is the best with respect to recoverability.

Without a locking protocol you can’t do concurrent writes safely, because you do not create a globally identical serialisation of events. Any setup with more than one writer without a distributed locking protocol is broken.

Now we read the instructions for mmm, MySQL Multi Master. They say: Writes all need to go to one single node. So, Ring, but not actually Multi Master – the name is fake.

Looking elsewhere, does not matter where: EITHER synchronisation by locks, OR one master/write node OR broken.

These are all available choices. There are no others.

kv_> Wonka: And if you ask people why they are using Master-Master or circular replication, they answer is always the wrong one. “High Availability” or “Load Balancing”. For high availability, take shared storage and an operating system level shared storage, NetApp or DRBD. And for load balancing take one way replication. Write Distribution MySQL does not do. People who need this will do application level sharding.

Isotopp> Precisely.

And now coming back to MySQL and delivering: There is a product from MySQL with multiple nodes and 2PC. It’s called MySQL cl-uster. It does not use MySQL replication to achieve this.

And wrt to HA: MySQL 5.1, 5.5 and 5.6 have different increments of Semi Syncronous Replication (SSR). You can have one master, but multiple slaves, which will delay the commit on the master until there is at least one slave that has the same data as the master. That combines the disadvantages of 2PC (slower) with the disadvantages of replication, as they are:

In MySQL replication each slave is dependent on the binlog position, which each slave has locally. That means, you can’t simply move Slave 3 from Master 1 to Slave 2, even if Slave 2 is known to be at the same position as Master 1 thanks to SSR.

That’s because the binlog position of Master 1 is expressed as (mname, mpos), but the same position on Slave 2 would be (s2name, s2pos). You would need a translation mechanism for each of those in order to change between them.

Beginning with MySQL 5.6 you get Global Transaction ID (GTID), which is such a translation mechanism. With SSR and GTID together you can finally use replication as a HA mechanism and could build a stable ring with exactly one active master. You still have a bit of waits because of SSR, but failover is smooth.

You still have no simple way of doing proper multi-master, because MySQL Cluster is not a good replacement for vanilla MySQL as being used by vanilla MySQL applications.

One Comment

It’s anything but trivial, and of course it has a catch too – the system still ensures a global ordering with the help of well synchronized time (no, NTP alone isn’t enough), and transactions may or may not succeed, which you’ll only find out in an asynchronous notification of the commit operation. In fact, the transaction success rate is extremely low compared to typical SQL databases (note: the software I am working on has few places where transactions have to work and retrying logic is necessary, and many places where we fire-and-forget transactions and – if they get lost – just redo them on the next batch run over the table, so all we do is handle failures and graph them).

Having said that, master master replication can be made work too if you design your software for it – but it absolutely won’t “just work” for an off the shelf web forum. It’s hard as you constantly have to think about replication conflicts and explicitly handle them. Way too often you’ll end up with a batch job diffing the tables and resolving conflicts… so just don’t do this unless you really have to.