Finding your MySQL High-Availability solution â€“ The questions

After having reviewed the definition my the previous post (The definitions), the next step is to respond to some questions.

Do you need MySQL High-Availability?

That question is quite obvious but some times, it is skipped. It can also be formulated “What is the downtime cost of the service?”. In the cost, you need to include lost revenue from the service and you also need to consider less direct impact like loss of corporate image and other marketing costs. If your downtime cost is under $10/h, you can stop reading this document, you don’t need HA. For the others, let’s move on!

How to determine which MySQL High-Availability solution is best?

What is really tricky with MySQL is the number of possible HA solutions. From the simplest the most complex let’s list the most common ones:

These technologies are by far, not a one size fits all and many deployments use combination of solutions. I will not cover ScaleDB and Continuent because I know almost nothing of these solutions. There are many more questions you need to ask yourself before being able to pick the right one. Below, I listed the most common questions, I might have missed some.

1. What level of HA do you need?

Since all the technologies do not offer the same level of availability, this is a first important sorting factor. Here are estimates of the level of availability offered by the various solutions.

Level of availability

Simple replication

98 to 99.9+%

Master-Master with MMM manager

99%

Heartbeat/SAN (depends on SAN)

99.5% to 99.9%

Heartbeat/DRBD

99.9%

NDB Cluster

99.999%

From the table, if your requirements are for 99.99%, you are restricted to NDB Cluster while if it is only 99% you have more options. I recall that the level of availability is hard to estimate and subject to debate. These are the usually accepted level of availability for these technologies.

2. Can you afford to lose data?

Obviously, if you are concerned about loss of data, you are most likely using the InnoDB storage engine, since MyISAM is not transactional and do not sync data to disk. Similarly, MySQL replication is an asynchronous process and although it is fairly fast at transferring data between the master and the slaves, there is a window of time where data loss is possible.

If you can afford to lose some data, you can consider “MySQL replication” and “Master-Master with MMM manager” otherwise, you can only consider the other three solutions.

Data 100% safe

MySQL replication

no

Master-Master with MMM manager

no

Heartbeat/SAN (depends on SAN)

yes

Heartbeat/DRBD

yes

NDB Cluster

yes

3. Does your application use MyISAM only features?

There are some features like Full text indexes and GIS indexes that are supported only by MyISAM. The HA solutions that work well with MyISAM are “MySQL replication” and “Master-Master with MMM manager”. Depending on the application, the MyISAM Full text indexes might be replaced by another search engine like Sphinx in order to remove the restriction. There is no HA solution other than the ones based on replication that handles GIS indexes.

HA solutions

Need MyISAM Full text or GIS indexes

“MySQL replication” and “Master-Master with MMM manager”

Don’t use any special MyISAM feature

All

Can change MyISAM Full text to Sphinx

All

4. What is the write load?

The HA solutions we present are not equal in term of their write capacity. Due to the way replication is implemented, only one thread on the slave can handle the write operations. If the replication master is multi-cores servers and is heavily writing using multiple threads, the slaves will likely not be able to keep up. Replication is not the only technology that put a strain on the write capacity, DRBD, a shared storage emulator for Linux, also reduce by about 30% (very dependent on hardware) the write capacity of a database server. In term of write capacity here are you choices.

Write capacity

MySQL replication

Fair

Master-Master with MMM manager

Fair

Heartbeat/SAN (depends on SAN)

Excellent

Heartbeat/DRBD

Good

NDB Cluster

Excellent

5. For what level of growth are you planning?

Since NDB Cluster is an integrated sharding environment, if you are planning for a growth that will need sharding (splitting the database over multiple servers), then you might need to take a serious at that solution. If not, then, apart from the write capacity, all the solutions are approximately equal.

6. How qualified is your staff or support company?

There is a quite direct relationship between the level of availability and the complexity of the solution. In order to reach the promised level of availability, the staff maintaining the HA setup, either internal or external, must have the required level of expertise. The required expertise level is summarized in the table below.

7. How deep are your pocket?

The last aspect that needs to be considered is the budget, complexity is expensive. We will consider two types of setup. The first one is a basic proof of concept of the technology with the hardware tested, the data imported and basic testing and documentation. A proof of concept setup is a good way to get used to a technology and experiment with it in a test environment. The other type of setup we will consider is a full production setup that includes extensive testing, fire drills, full documentation, monitoring, alerting, backups, migration to production and post migration monitoring. Of course, it is the safest way to migrate an HA solution to production. All the times here are estimates based on field experience, the values presented here are fairly typical and contains some buffers for unexpected problems. Although an HA solution can be built remotely through a KVM over IP and adequate remote power management, an on site intervention with physical access to the servers is the preferred way, especially for the most complex solutions.

Proof of concept

Migration to Production

MySQL replication

4 hours

12 hours

Master-Master with MMM manager

8 hours

24 hours

Heartbeat/SAN (depends on SAN)

32 hours

120 hours

Heartbeat/DRBD

40 hours

120 hours

NDB Cluster

40 hours

120 hours+

Editor’s Note: We’ve gotten many questions about the time estimates mentioned here. The above estimates shouldn’t be used to compare against any specific situation. Time will vary greatly depending on your project. For example, “setting up replication” can be as simple as CHANGE MASTER TO, and can take as little as a few minutes in some circumstances. Yves’s estimate is for a project to create a replication slave for HA purposes, not for “setting up replication.” There is a big difference between an HA project and a DBA task. — Baron Schwartz

Related

Yves is a Principal Consultant at Percona, specializing in distributed technologies such as MySQL Cluster, Pacemaker and XtraDB cluster. He was previously a senior consultant for MySQL and Sun Microsystems. He holds a Ph.D. in Experimental Physics.

Also the solutions you present have caveats. Example: talking about cluster again, it’s storage is limited by the amount of memory the NDB nodes have. Storing data on disk can seriously degrade performance compared to 100% in memory operation.

Also each of the solutions you present have hardware and, in general, infrastructure costs that have to be brought into the ROI equation as well.

One question not addressed there is how many copies of the data does each method store? For example with Replication and MMM you have more than one distinct copy of the data, where as with DRBD,SAN, and NDB you have a single copy of your data. This is important to the idea of % uptime, depending on what time of failure.

I know from experience that if you somehow wreck your single copy of the data in either of those solutions (for example corrupt ibdata files), you are now offline and need to restore from backups. Whereas with replication your second copy is your failover and can result in much less downtime in these scenarios.

Do the availability numbers for NDB assume that your datacenter and the network connections to it are also 99.999? And if you aren’t willing to assume that can you use NDB + geographic replication to compensate? Problems like having your country removed from the internet sometimes get in the way of uptime.

You can also use disk based tables with only the indexed data in memory using its page buffers and OS file system cache to minimize IO.
In simple terms this is what InnoDB does with the buffer pool and MyISAM with its key_buffer and OS cache.

You’re underestimating replication availability. Do note, that for physical storage based solutions (SAN and DRBD) you have to do crash recovery, whereas replication switchover/failover can be immediate. In proper environments this can mean seven nines or something 🙂

In our production environment using muti-master for 6+ years we now achieve 99.998% uptime across our DB servers. It seems in the article you are assuming a single database / applicaiton. For example for a single database you can achieve higher writes with a SAN but if your already logically separating your customers over hundreds of servers then putting them all on a single SAN with not improve performance in my opinion especially if you talk about putting SSD on the local servers.

when things crash quickly that is the best situation for failover. When a drive fails in a bad way and causes the server to lose 99% of it’s performance this is much more difficult to correct automatically.

Our downtime falls into these categories:

* kernel panics – with hundreds of DB servers we get kernel panics weekly.
* hardware failure – typically losing a disk. Usually the RAID controller does the right thing though when hard drives don’t just die but become unresponsive often they get into a bad state.
* mysql server crash due to a new bug
* mysql server crash due to innodb corruption. Over hundreds of servers this does happen and is a particular nasty situation. There were a few that have been reproducible and crashed the masters and all slaves via replicaiton (past bugs) though typically the data is corrupted on only one server. You can try to mitigate this risk by rebuilding your tables occasionally (on the secondary) and also rebuild them after upgrades. Corrupt data is a reality unless your using something like ZFS to prevent if from ever happening.

With SAN based solutions you are putting all reliance on the SAN and it’s built in redundancy. You can count on anything failing in the worst case leaving you with huge downtimes and restore times.

I have heard of several horror stories with DRBD with the data being corrupted on the master then corrupting the secondary.

With Muti-master the main advantage is the share nothing nature. It solves the problem where data gets corrupted on one node. There is no shared infrastructure (at least they way we are setup) so you have the best chance of having an available master when the worst happens. The main dis-advantage is the replication delays with muti-master. This is where muti-threaded slaves would be a great new feature.

Along the same lines as the requirement to use MyISAM (and the restrictions on HA solutions that come with it) if you need certain MyISAM-only features and can’t use Sphinx or some other workaround, there are some features which you might want in your database that aren’t supported by the NDB and would require ruling it out or potentially some serious application redesign. (No foreign keys, 8kb row limit, etc.)

Gerry: Maintenance and upgrades are done using a rolling-restart so that this does not cause any downtime. One of the key features of cluster is no single point of failure so if maintenance and upgrades went against that it would be a bit pointless.
RAM is cheap and disk operations can still be pretty fast (all the other solutions are disk based anyway).

Mark: Your datacenter internet connection may be down but at least your database will be up 😉 You could also say that datacenter power is not 99.999% reliable.

1. If a organization is using GIS and they want to use a Open Source database is MySQL really the right decision? PostGIS is a mature solution that might well be ideal for a variety of reasons. HA Postgres is beyond the scope of this response but several solutions exist (WAL shipping isn’t brain surgery…).

2. I think the complexity of NDB might be being under rated here:

2.1 The monitoring requirements are significantly greater than the alternatives. Tablespaces are fixed size and several of the memory settings can cause the cluster to go down if they are exhausted.

2.2 With 2.1 in mind significant consideration must be payed to the actual size of the data set after being imported into NDB. Of particular note is that blob/text columns (for several reason) are less than ideal for NDB.

2.4 I think it worth pointing out NDB is less than ideal if one needs to replicate to geographically dispersed location. There are way to get around this (external replication…) but this introduces even more complexity.

2.5 NDB has bugs like all software and the userbase is not that large and the number of experts is tiny, however the LOC count is massive (with admitidly a giant test suite). The list of MySQL Certified Cluster DBA* is ~100 people, many from Sun. This should scare you.

A few months back I saw someone on #mysql-ndb with a down cluster offering 200 USD per hours for help to get the cluster back up. As far as I can tell no one ever responded to him.

2.6 MySQL-NDB is not a complete solution without putting a good bit of thought into what happens when a API node dies. One way or another a complete solution either involves a very smart application (have fun with that) or making use something to automate failing over to an IP to a API node that is up.

2.7 Just adding more data nodes to increase capacity can easily decrease performance.

2.8 Security. Most communication is in the clear.

2.9 The MySQL database does not replicate with cluster. Need to add a user or update a sproc? You will need to hit every API node.

Yes, prior planning is needed when setting up and configuring a cluster (I would imagine the same is true of many database deployments). Most memory settings should not cause the cluster to go down if exhausted, just cause transactions to be aborted.
I would argue that replication with NDB is more reliable that regular MySQL when configured correctly.
Text/Blobs are not ideal in Cluster, that is correct. This is something that is changing, but it wasn’t something Cluster was originally designed for.
We have some very large customers with some very large deployments of cluster. We have customers with > 99.999% uptime.
I think you’ll actually find only about 10% of the certified Cluster DBAs work for us. But this is kind of irrelevant because the cert is massively out of date (there is a new cert on the current versions coming soon).
I did not see the $200 offer, but if they had tried asking on the cluster mailing list or forum me or one of my colleagues would have seen it and we would have most likely been able to help.
If you use MySQL for NDB then yes you have to give some thought into switching API nodes. There are many possible load-balance options for client apps for this and several other HA solutions require the same thing.
Adding data nodes should not decrease performance (unless your network is slow).
Security is not something that will change, due to the way cluster works you could have problems if it was not on its own private network. This is by design and well documented. Security overhead will cause more problems than it solves.
Cluster is not designed for joins (again this can be seen in its roots). That said there is work-in-progress to improve this.

LinuxJedi:
I yield to your obviously superior knowledge on the subject. I had misremembered some distribution awareness material in terms of the performance impact of adding boxes. As for running out of IndexMemory, DataMemory, etc… a read only database seems only slightly better than database that is down.

I agree that NDB has the possibility of being used for very high availability, however my point is that it complicated solution that will require both a good bit of time to setup the cluster correctly, probably change application schema and code (etc), as well as a continuing need to monitor the servers. I have spent a good bit of time in front of the docs and I have a lot to learn still.

I agree with your points there, and the developers are taking steps to help with this including better monitoring in upcoming releases.

It is a complicated soultion, and yes it could well take some time to set up. I am hoping over time this will get easier and cluster will get better as a more general-use database. Luckily there are some good blogs and webinars on it, and Johan Andersson has written an online tool to help you configure cluster. Hopefully we are gradually making it easier to pick-up (its certainly easier than back when I started with cluster in 5.0) but as always feedback is greatly received.

That’s fun to have so many great comments. I also understand that people, including myself, do trust and like solutions they know and are skeptical about solutions they don’t know. I’ll try to be neutral, but be aware that my own background is a lot with Heartbeat/DRBD and NDB cluster.

@Gerry:

I have seen cluster that have been up for more than a years and I maintain it is 5 9s. NDB is designed in a way that nearly all the maintenance task can be handled without downtime with rolling restarts, even software upgrades. The exceptions I see is (was with the latest versions) adding new ndbd nodes and of course doing alter tables. For alter tables, only replication can reduce the downtime.

I agree that for good performance, data needs to be in memory but… that about the same with InnoDB, at least the active dataset needs to be in the Buffer Pool.

You are right that I should consider the cost of the solution in term of hardware. Usually, people think NDB is very expensive compared to other solution but it is not always true, just consider an application that cannot lose data, ruling out replication, and needs HA and sharding. The only solution there is to have multiple Heartbeat/DRBD pairs, that’s a lot of boxes too.

Yes, I might be underestimating replication availability but in the last 2 years I ran into many clients having serious corruption issues with replication. I must admit that those issues were nearly all caused by stored procs and triggers and most likely switching to row based replication would have help.

@Ryan H:

Yes, my discussion is about a single database/application. I am guessing from the figures you are giving that you have implemented sharding. You have probably tune your schema in a way to reduce cross-node interactions. How do you manage things like searching through the whole dataset, cross-nodes transactions? Not all the application can be built like yours and for existing ones, implementing sharding in the application might a complex and costy task. For those, NDB Cluster, a hash based sharding framework, is the easy and cheap solution.

Maybe I have lucky, but the only InnoDB corruption issues that I have ever seen has been while developing WaffleGrid, when we forgot the insert buffer merge code. I have seen way more replication corruption issues than InnoDB corruption ones.

Looking at the other downtime reasons you are giving, most of them would have covered by Heartbeat/DRBD or NDB Cluster.

@Todd:

Replication are two different ways of replicating data, from my experience, DRBD is more reliable than replication. It is also fairly well established that a cluster manager over a shared storage, not accounting the share storage availability, has an availability of 99.9%. Since DRBD has nothing shared, one could argue that it has an even higher availability.

@Rob:

1. Yes, for GIS I would use PostGIS, I completely agree.
2.5 I have done many NDB engagements and… I am not cluster certifified.
2.6 It depends on your architecture. Most of the time, the API node, the mysqld process, sits on the same node as the application server (apache, tomcat, etc). Losing an API node is the same as losing an application server node.
2.7 Only true if the schema is not partitioned correctly. I have worked with NDB having 24 ndbd nodes handling insanely huge loads.
2.8 What the point here, do you use SSL between your application servers and your MySQL servers?
2.9 Good point and yes, this is a pain.
2.10 The BKA algorithm might help but there are ways to improve the joins. If I find some time, I’ll backport BKA to 5.1. Mark C. published the patch to back port BKA to 5.0, it should not be rocket science to adapt it to 5.1. I blogged about NDB joins on www.bigdbahead.com.

Thank you all for clarifying that Cluster now supports rolling updates. I guess I need to read the docs and announcements in more detail.

@Yves: Replication, SAN / NAS, DRBD and Cluster can all achieve 5 9s (and higher) in practice and if you introduce the “Human Factor”, that can easily shave 1 ‘9’ out of the numbers if the DBA doesn’t know the chosen solution like the back of your hand. Working with the solution the DBA knows best can, on the other hand, add 1 ‘9’.

@Yves: It may be a redundant question, but what’s the specific problem that prevents replicating MyISAM tables with DRBD?
I did a few tests recently and it worked fine. Yet it’s not a first time I come across the “you can’t use MyISAM with DRBD” statement.

Funny how different experiences lead to different evaluations of tools. The MySQL HA solutions the MySQL Performanceblog list, are almost listed in the complete opposited order of what my impressions are.

@Michael with DRBD (assuming mode C) you pay a heavy cost to make sure that replication is synchronous, and you don’t loose data. This only works successfully because InnoDB is ACID compliant and makes sure it can recover from every update.

Since MySQL is not ACID compliant, DRBD can not truly be synchronous – there may be some data that has been modified but still in an OS buffer somewhere. Because you can no longer guarantee that the secondary device will have all the data, it doesn’t usually make sense to pay the cost of DRBD.

Great article. I’m wondering how a requirement for geographically distributed servers might impact the choices given. Certainly, having shared disk becomes an issue, so the Heartbeat/SAN method becomes problematic. Simple MySQL replication should still work, though you’d have the caveat of needing sufficient bandwidth to keep up with the relay log (though as you indicate, the problem in keeping up with the log is rarely replicating it, but instead has to do with processing it on the slave). But how do the other options fare with this requirement?

MyISAM does not sync to disk, it only writes to the OS write cache. For the data files, (.MYD) you could probably get around by mounting the partition with the sync flag at the expense of a smaller write throughput. The real problem is the index files (.MYI), if you look at the “show global status” of a busy MyISAM database, you will notice a high number of “Key_blocks_not_flushed”. Those are updates to the .MYI files that has not been written to disk. If the server crash while Key_blocks_not_flushed is not 0, indexes are corrupted and will need to be rebuilt. Rebuilding MyISAM indexes on a large table can take many, many hours.

For geographical redundancy, I think replication is the best solution and it can be applied to all solutions. Losing some data when a whole goes down is usually acceptable. Even NDB cluster now support replication (row based only) and I have even done some experiments with NDB to InnoDB replication and, it works (not heavily tested).

Your write-up is pretty broad but it seems to gloss over some key issues.

1.) Maintenance. It’s the main cause of downtime for most sites. Solutions like DRBD or SAN that implement shared disk semantics don’t support it especially well. The reason is that if you only have a single copy of data you can’t do maintenance that requires both to be online.

2.) Uptime. I’m curious what is the basis for your up-time numbers. Uptime is a factor of your database, your admin procedures (e.g., heartbeat w/ VIP for failover vs. manual procedure with application reconfiguration), your ability to avoid configuration errors, etc. Also as @Ryan H pointed out, you can make unreliable things much more reliable by having many of them.

3.) Failures. Understanding types and consequences of failures is important. If NDB somehow does not work, your data are inaccessible with no easy fall-back solution. SANs fail in some strange and scary ways. They are also SPOFs. Approaches that create walled off copies are very robust across a lot of nasty failures. This is one of the big benefits of database replication, whether using MySQL or Tungsten Replicator.

4.) Backups. People tend to forget this is the first line of defense for data and jump straight to more complicated approaches.

I would like to invite you to have a look at Tungsten whenever you have time. We have a pretty broad view of availability and also take into consideration issues like data protection.

Hello Ewen,
Is it possible to get the NDB Cluster for Datawarehouse Appliances. In term of performance, is this solution can perform as well as Oracle RAC? What are the limits?
Thank you. Best Regards.

NDB cluster does not really make sense for Data warehouse applications. There was a discussion on what is good for cluster on the mailing list, check out the thread that starts here: http://lists.mysql.com/cluster/6960. In terms of performance versus Oracle RAC I think they are very different solutions and really answer different problems. Tom Hanlon’s post in the thread sums up rather well what scales well with NDB.

In response to feedback through other channels, I’ve added a note to the end of the blog post to add some context to the time estimates. These time estimates are for HA projects, not for simple administration tasks. Anyone can follow a WikiHow tutorial and install the tools and technologies mentioned in a relatively short time. Not many people have Yves’s level of expertise in architecting a complete HA solution, which is a whole different ballgame.

Hi LinuxJedi, i agree but what’s a “well configured cluster” ?
In this case I think a “well configured cluster” is a cluster with poor performances !
“Poor performances” means a lot of disk write and so, you lose a part of interest of the cluster…

A cluster running on good hardware with a good configuration, both tuned to the workload. We have customers who get very good performance and no downtime (ie. no total cluster failure).

There are a lot of disk writes with LCPs and redo logs but typically LCPs are quite slow writes (default 10MB/sec). With cluster being real-time the redo data (GCP) does need to be put to disk quickly, so fast disks are important for this aspect, especially on a database with a lot of writes. I would rather a lot of disk IO than the possibility of data loss.

I hope you guys don’t mind. How do you deal with many branches and all accessing the same database and then connection between branches (e.g. lease line and DSL) suddenly becomes offline? Real-time transactions are necessary with our operations. Do you think that any of the mentioned HA solutions would fit our current setup? or is there any other way to achieve the ‘Real-time’ requirement?