Making the impossible: 3 nodes intercontinental replication

In this post I want to show new possibilities which open with Percona XtraDB Cluster.
We will create 3 nodes Cluster with nodes on different continents (Europe, USA, Japan) and each node will accept write queries.
Well, you theoretically could create 3 node traditional MySQL ring replication, but this is not what you want to use day-to-day.

To show how it works I will use Amazon m1.xlarge instances, by one in Tokyo, Ireland and North California, running RedHat Entreprise 6.2 64bit.

In fact to create instances is most time consuming task. After that using my script you will have cluster running in 5 min or less.

There however some precautions needed if you run Amazon instances.
First, you need to open ports in the firewall. For the communication the nodes need, by default, 4444, 4567, 4568 ports
(see our FAQ why)

Second, please take into account that the communication is done using open channels, and you may want to establish an encrypted connection, using it in real life.

Now, as we have running instances, you can install Cluster packages from RPM repositories.
You can follow steps from documentation.

Well, one may argue that 0.27 sec for single query is kind of big, but this is the physic law, you cannot
go faster than speed of light, and the round trip between Europe and Japan takes time (or at least until scientists figure out how to attach transaction to Faster than light neutrino).
Also note, that XtraDB Cluster can apply events in parallel and throughput should be less affected by big distance.

Related

Vadim Tkachenko co-founded Percona in 2006 and serves as its Chief Technology Officer. Vadim leads Percona Labs, which focuses on technology research and performance evaluations of Percona’s and third-party products. Percona Labs designs no-gimmick tests of hardware, filesystems, storage engines, and databases that surpass the standard performance and functionality scenario benchmarks.
Vadim’s expertise in LAMP performance and multi-threaded programming help optimize MySQL and InnoDB internals to take full advantage of modern hardware. Oracle Corporation and its predecessors have incorporated Vadim’s source code patches into the mainstream MySQL and InnoDB products.
He also co-authored the book High Performance MySQL: Optimization, Backups, and Replication 3rd Edition.

To continue Harrison’s trail, it now depends on what you are comparing against whether this is truly amazing, ok or a degradation. The question, is what have you done with 3 continents until today?

If the answer is that you have a readonly datacenter with mysql slaves, and all writes are directed to one master data center, then this is a lot faster. You can now write at any datacenter, which is closest to the user. The commit will be slower, but the actual inserts and updates are now fast. On average it is a net gain (if you have users spread around the world).

If the answer is that you already write in multiple data centers, for instance using a “Master of Record” scheme, then this is not faster and might be slower. Otoh this is much simpler to manage as you don’t need to map each user to a “home” datacenter.

Florian: Codership has done a similar EU-US test with throughput and latency numbers published here: http://www.codership.com/content/synchronous-replication-loves-you-again

Ingo, thanks a lot, but while that post claims to compare performance both in a LAN environments and in EC2 instances, all I actually see is single-node vs. EC2 numbers. Am I missing something? Please point me to the single-LAN, multi-node, multi-master replication numbers if that post (or another) has them. Thanks.

I’m actually not so much looking for multi-DC setups at this point (although having something like that for MySQL will come in handy), I’m just trying to get an idea about the performance in comparison to “traditional” MySQL HA solutions.

And http://openlife.cc/blogs/2012/january/re-doing-galera-disk-bound-benchmark looks like it’s just what I’ve been looking for, thanks!

Oh, sorry, I didn’t publish a lot of latency numbers though. As far as I remember (I could dig them up from logs one day) they are like for semi-sync replication: noticeable but not of any practical consequence. Most importantly tps on a parallel workload is equal or better than a single node, so 1 ms of latency up or down is a nice tradeoff to make.

only “sync” mode is supported.
As soon as we do “async” we immediately get into problems inherent for regular MySQL replication,
and it would make it as a product with totally different characteristics.

Max: If you want async replication you can just use MySQL replication alongside Galera. For instance 1 node in a Galera cluster can also be a MySQL replication master, to which some other MySQL servers connect.

Otoh Vadim’s tests here show that the most likely reason you’d want to do that is moot: You can just run a Galera cluster over long distances and it will work and the overhead is still acceptable.

Max: If you want your write to be synchronously written to 2 or more continents, then it’s not going to be faster than that. All of the magical NoSQL solutions (MongoDB, Voldemort, etc…) would be equally slow (or more, considering they don’t support multiple operations within one transaction) too, or you would have to run them in a mode where you don’t really commit to more than one data center after all. Here too you can of course choose to use MySQL replication for Data Center redundancy and only use Galera within a data center (like you would with NDB). But then you’ll have to deal with the mess of eventual consistency some other way. A 0.2 sec latency on commit might be a nice tradeoff to pay for that convenience.

Note that this latency only applies for writes. Reads will not have this latency at all. Also the latency happens at commit time – it will often be possible to design your application (with AJAX or other asynchronous means) such that it is very responsive for any interactive part, and then the commit is more or less silent when everything else is done (only errors would bring something back to the user).

Henrik:Totally agree with you. I was referring to being able to do some operations in asynch mode to speed up the writes. I agree that 0.25 latency for intercontinental transactions is good. It might still be too much in some cases though.

It’d be cool to have benchmarks of a setup that I think a lot of companies looking for more or less HA are looking for that are not operating worldwide: a cluster in 2 datacenters in 2 cities in the same country.

Maarten: Benchmark results are of course the real facts, but based on my experience, a Galera cluster spanning something like 50 km with good fiber connection between will perform really well. Most workloads you might not see any overhead at all.

However, note that the way Galera handles failovers by majority, you’d really want to have 3 separate data centers.

Hi, thanks for your notes and the setup script. I had a quick question: I set up three nodes on three different Availability Zones (US East, US West and Europe) to simulate this cluster, and when I run the script, it gives me this error message:
“mysql_install_db: command not found”

I’m not able to find mysql_install_db anywhere on the file system either.

I tried this with both Amazon’s default as well as RHEL 6.2 stock AMI images (I had to modify the script slightly for the Amazon image) but it didn’t work on either.

Should I install Percona server: https://www.percona.com/doc/percona-server/5.5/installation.html
on these images before running this intercontinental script?

Great stuff but I worry what happens when the link is poor, like we have in China to the outside world. If the link is down/bad for an hour, one server will be thrown out of the cluster, I assume, but what to do with the clients of that DB, and how to redirect them to other DBs, especially if the orphaned server takes a long time to recover. I believe on the LAN Galera will move virtual IPs, but that’s useless for WAN use, I’d think.

Also, it’s nice to talk about re-sync by snapshot or backup but at 100GB+ and global links, this is a many hours/days effort to just move the data, so not realistic. It would have to re-sync automatically as best speed, like a regular MySQL slave would.

1) I guess in WAN case you could use DNS-based failover, which is analogous to VIP moving. However in your case the server in China must be dedicated to Chinese clients probably because it is too slow for the clients to reach the outside. And so if it looses connection to the outside, most likely the clients do too or it is too slow for them anyways.

2) As of version 2.0 Galera now supports missed transactions replay on rejoining the cluster, like regular MySQL slave. Except that it can do it much faster thanks to parallel applying.

Yes, we could, and we do this even in LAN based failover if necessary, via local host files or local/remote DNS. Fast enough as long as not global public DNS.

Bigger problem is the slow reads this creates – the cluster is great because normally only writes are slow, but reads are local and fast, but with the local node out of the cluster, now both reads and writes are slow.

Can we have a local read-only slave from the cluster, i.e. use normal MySQL replication with the cluster ?

1) yes, you can use normal MySQL master-slave replication from/to any node in the cluster.
2) you still can use disconnected node for reads if stale data is acceptable. Just set wsrep_on=0 globally or per session in addition to read_only.