The primary objective of MHA is to automate master failover and slave promotion with short downtime (usually 10-30 seconds), without replication consistency problems, without spending money on lots of new servers, without a performance penalty, without complexity (it is easy to install), and without changing existing deployments. MHA also provides a way to do a scheduled online master switch: changing the currently running master to a new master safely, with only a few seconds (normally 0.5-2 seconds) of downtime (blocking writes only).

The difficulty of master failover is one of the biggest issues in MySQL. Many people have been aware of this issue, but in most cases there has been no practical solution. I created MHA to make our (DeNA's) existing 100+ MySQL 5.0/5.1/5.5 applications, and future ones, highly available. I think many people outside DeNA can also use MHA pretty easily.

Overview

Master failover is not as trivial as you might think. Suppose you run a single master and multiple slaves, which is the most common MySQL deployment. If the master crashes, you need to pick one of the latest slaves, promote it to be the new master, and let the other slaves start replication from the new master. This is actually not trivial. Even if you can identify the latest slave, the other slaves might not have received all binary log events. If you let the other slaves connect to the new master and start replication as they are, these slaves lose some transactions, which causes consistency problems. To avoid consistency problems, you need to identify which binlog events have not been sent to each slave, and apply the lost events to each slave before starting replication. This is a very complex approach, and doing the recovery manually and correctly is very difficult. This is illustrated in the slides (especially p.10, shown below) that I presented at the MySQL Conference and Expo 2011.
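As a minimal sketch of the bookkeeping described above (an illustration in Python, not MHA's actual code), suppose we already know the master binlog file and position that each slave has received. The host names, positions, and the function names `find_latest_slave` and `missing_ranges` are all made up for this example. The sketch picks the latest slave and lists, for every other slave, the range of master binlog positions it still lacks.

```python
from typing import Dict, Tuple

# (master binlog file, position) that a slave has received so far.
Position = Tuple[str, int]


def find_latest_slave(received: Dict[str, Position]) -> str:
    """Return the slave that has received the most master binlog events."""
    # Binlog file names share a prefix and a zero-padded numeric suffix
    # (e.g. mysql-bin.000012), so plain tuple comparison orders them.
    # On a tie, the first host in the dict wins.
    return max(received, key=lambda host: received[host])


def missing_ranges(
    received: Dict[str, Position],
) -> Dict[str, Tuple[Position, Position]]:
    """For each slave behind the latest one, return the (from, to) range it lacks."""
    latest = received[find_latest_slave(received)]
    return {host: (pos, latest) for host, pos in received.items() if pos < latest}


# Hypothetical state right after a master crash.
slaves = {
    "slave1": ("mysql-bin.000012", 1500),
    "slave2": ("mysql-bin.000012", 1200),
    "slave3": ("mysql-bin.000012", 1500),
}

latest = find_latest_slave(slaves)
gaps = missing_ranges(slaves)
print(latest)
print(gaps)
```

In this made-up state only slave2 is behind, so it would need the events between position 1200 and position 1500 before it can safely replicate from a new master. Knowing the ranges is the easy part; extracting and applying the corresponding events correctly is where the manual work becomes difficult.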

Fig: Master Failover: What makes it difficult?

Currently most MySQL Replication users are forced to do manual failover when the master crashes, and it is not uncommon for the failover to take more than one hour of downtime. Each slave is unlikely to have received the same relay log events, so you may need to fix consistency problems later. Though master crashes do not happen very often, they are really serious when they do.

MHA was created to fix these issues. MHA provides the following functionality, and can be useful in many deployments that require high availability, data integrity, and almost non-stop master maintenance.

* Automated master monitoring and failover

MHA can monitor the MySQL master in an existing replication environment, detect master failure, and do master failover automatically. Even if some of the slaves have not received the latest relay log events, MHA automatically identifies the differential relay log events from the latest slave and applies them to the other slaves, so all slaves can be consistent. MHA can normally do failover in seconds (9-12 seconds to detect master failure, optionally 7-10 seconds to power off the master machine to avoid split brain, and a few seconds to apply differential relay logs to the new master, so total downtime is normally 10-30 seconds). In addition, you can define a specific slave as a candidate master (setting priorities) in a configuration file. Since MHA fixes inconsistencies between slaves, you can promote any slave to be the new master, and consistency problems (which might cause sudden replication failure) will not happen.

* Interactive (manual) Master Failover

You can also use MHA only for failover, without master monitoring. In this mode you run master failover interactively.

* Non-interactive master failover

Non-interactive master failover (not monitoring the master, but doing failover automatically) is also supported. This feature is especially useful when you already use software that monitors the MySQL master. For example, you can use Pacemaker (Heartbeat) for detecting master failure and virtual IP address takeover, and use MHA for master failover and slave promotion.

* Online switching master to a different host

In many cases, it is necessary to migrate an existing master to a different machine (e.g. the current master has H/W problems with its RAID controller or RAM, or you want to replace it with a faster machine). This is not a master crash, but it does require scheduled master maintenance. Scheduled master maintenance causes downtime (at the very least, you cannot write to the master), so it should be done as quickly as possible. On the other hand, you should block/kill currently running sessions very carefully, because consistency problems between the two masters might happen (e.g. "updating master1, updating master2, committing on master1, getting an error on committing on master2" will result in data inconsistency). Both a fast master switch and graceful blocking of writes are required, and MHA provides a way to do that. You can switch the master gracefully with only 0.5-2 seconds of blocked writes. In many cases 0.5-2 seconds of writer downtime is acceptable, so you can switch the master even without allocating a scheduled maintenance window. This means you can take actions such as upgrading to a higher version or moving to a faster machine much more easily.

Architecture

When a master crashes, MHA recovers the remaining slaves as shown below.

Fig: Steps for recovery

The basic algorithms are described in the slides presented at the MySQL Conference and Expo 2011, especially on pages 13 to 34.

In relay log files on slaves, the master's binary log positions are written in the "end_log_pos" sections (example). By comparing the latest end_log_pos between slaves, we can identify which relay log events have not been sent to each slave. MHA internally recovers slaves (fixes consistency issues) by using this mechanism. In addition to the basic algorithms covered in the slides at the MySQL Conf 2011, MHA includes some optimizations and enhancements, such as generating differential relay logs very quickly (independent of relay log file size) and making recovery work with row based formats.
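The end_log_pos comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how MHA is implemented: the sample text is hand-written to imitate the event header lines printed by mysqlbinlog, the function names are hypothetical, and the sketch assumes all events come from a single master binary log file (a real relay log can span a binlog rotation, which a plain number comparison does not handle).

```python
import re
from typing import List

# Matches the "end_log_pos <number>" field of an event header line.
END_LOG_POS = re.compile(r"end_log_pos (\d+)")


def end_log_positions(decoded_relay_log: str) -> List[int]:
    """Return every end_log_pos value in the order it appears in the text."""
    return [int(m.group(1)) for m in END_LOG_POS.finditer(decoded_relay_log)]


def last_end_log_pos(decoded_relay_log: str) -> int:
    """Return the last end_log_pos in the text, or 0 if there is none."""
    positions = end_log_positions(decoded_relay_log)
    return positions[-1] if positions else 0


def missing_event_positions(latest_text: str, behind_text: str) -> List[int]:
    """Return end positions of events the lagging slave has not received."""
    behind = last_end_log_pos(behind_text)
    return [pos for pos in end_log_positions(latest_text) if pos > behind]


# Hand-written samples imitating mysqlbinlog header lines (not real output).
slave1_dump = """\
#110725 10:00:01 server id 1  end_log_pos 1200  Query  thread_id=5
#110725 10:00:02 server id 1  end_log_pos 1350  Query  thread_id=5
#110725 10:00:03 server id 1  end_log_pos 1500  Xid = 42
"""
slave2_dump = """\
#110725 10:00:01 server id 1  end_log_pos 1200  Query  thread_id=5
"""

missing = missing_event_positions(slave1_dump, slave2_dump)
print(last_end_log_pos(slave1_dump), last_end_log_pos(slave2_dump), missing)
```

Here slave1 is the latest slave, and the two events ending at positions 1350 and 1500 are the differential events that slave2 would need before it could start replicating from a new master.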

MHA Components

MHA Node has failover helper scripts for tasks such as parsing MySQL binary/relay logs, identifying the relay log position from which relay logs should be applied to other slaves, and applying events to the target slave. MHA Node runs on each MySQL server.

Advantages

* Master failover and slave promotion can be done very quickly

MHA can normally do failover in seconds (9-12 seconds to detect master failure, optionally 7-10 seconds to power off the master machine to avoid split brain, and a few seconds or more to apply differential relay logs to the new master, so total downtime is normally 10-30 seconds), unless all slaves are seriously behind in replication. After recovering the new master, MHA recovers the remaining slaves in parallel. Even if you have tens of slaves, this does not affect master recovery time, and you can recover the slaves very quickly.
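The 10-30 second figure is simply the sum of the phases listed above, which a short Python calculation makes explicit. The 1 to 8 second range for applying differential relay logs is an assumption chosen here to stand in for "a few seconds or more"; the other two ranges are the ones quoted in the text.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (minimum seconds, maximum seconds)


def total(phases: List[Range]) -> Range:
    """Add up the best-case and worst-case durations of sequential phases."""
    return (sum(lo for lo, _ in phases), sum(hi for _, hi in phases))


DETECT = (9, 12)      # detecting master failure
POWER_OFF = (7, 10)   # optional: powering off the old master
APPLY_LOGS = (1, 8)   # ASSUMPTION: stands in for "a few seconds or more"

without_power_off = total([DETECT, APPLY_LOGS])
with_power_off = total([DETECT, POWER_OFF, APPLY_LOGS])
overall = (
    min(without_power_off[0], with_power_off[0]),
    max(without_power_off[1], with_power_off[1]),
)
print(without_power_off, with_power_off, overall)
```

Under that assumption the lower bound comes from skipping the power-off step and the upper bound from including it. Slaves that are seriously behind in replication would push the log-applying phase, and therefore the total, beyond this range.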

* Master crash does not result in data inconsistency

When the current master crashes, MHA automatically identifies differential relay log events between slaves, and applies them to each slave. So in the end all slaves can be in sync, as long as all slave servers are alive. When MHA is used together with Semi-Synchronous Replication, (almost) no data loss can also be guaranteed.

* No need to modify current MySQL settings (MHA works with regular MySQL (5.0 or later))

One of the most important design principles of MHA is to make it as easy to use as possible. MHA works with existing traditional MySQL 5.0+ master-slave replication environments. Though many other HA solutions require changing MySQL deployment settings, MHA does not force such tasks on DBAs. MHA works with the most common two-tier environments of a single master and multiple slaves, and with both asynchronous and semi-synchronous MySQL replication. Installing/uninstalling/starting/stopping/upgrading/downgrading MHA can be done without changing (including starting/stopping) MySQL replication. When you need to upgrade MHA to a newer version, you don't need to stop MySQL. Replacing MHA with the newer version and restarting MHA Manager is enough.

MHA works with normal MySQL versions starting from MySQL 5.0. Some HA solutions require special MySQL versions (e.g. MySQL Cluster, MySQL with Global Transaction ID, etc), but you may not want to migrate applications just for master HA. In many cases people have already deployed many legacy MySQL applications, and they don't want to spend too much time migrating to different storage engines or newer bleeding edge distributions just for master HA. MHA works with normal MySQL versions including 5.0/5.1/5.5, so you don't need to migrate.

* No need to add lots of servers

MHA consists of MHA Manager and MHA Node. MHA Node runs on the MySQL servers when failover/recovery happens, so it doesn't require an additional server. MHA Manager normally runs on a dedicated server, so you need to add one server (or two for HA), but MHA Manager can monitor lots of (even 100+) masters from a single server, so the total number of servers does not increase much. Note that it is even possible to run MHA Manager on one of the slave servers. In this case the total number of servers does not increase at all.

* No performance penalty

MHA works with regular asynchronous or semi-synchronous MySQL replication. When monitoring the master server, MHA just sends ping packets to the master every N seconds (default 3) and does not send heavy queries. You can expect performance as fast as regular MySQL replication.
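To show why this kind of monitoring is cheap, here is a minimal Python sketch of a ping-based monitor. It is not MHA's code: the function name `monitor_master`, the injectable `ping` and `sleep` callables, and the rule of declaring failure after four consecutive failed pings are all assumptions made for illustration. Only the 3 second interval comes from the text above.

```python
import time
from typing import Callable


def monitor_master(
    ping: Callable[[], bool],
    interval_sec: float = 3.0,
    max_failures: int = 4,  # ASSUMPTION: the threshold is not stated in the text
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Ping until max_failures consecutive pings fail; return the pings sent."""
    consecutive_failures = 0
    pings_sent = 0
    while True:
        pings_sent += 1
        if ping():
            consecutive_failures = 0  # master answered, start counting again
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                return pings_sent  # the caller would start failover here
        sleep(interval_sec)


# Simulated master: answers two pings, then stops responding.
responses = iter([True, True, False, False, False, False])
slept = []  # record the sleeps instead of really waiting
pings = monitor_master(lambda: next(responses), sleep=slept.append)
print(pings, sum(slept))
```

Between pings the monitor only sleeps, so it puts almost no load on the master or on the manager host. With the assumed threshold of four failures and a 3 second interval, a crash would be declared 9 seconds after the first failed ping. That is in line with the detection time quoted earlier, but the threshold is a guess, not a documented value.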

* Works with any storage engine

MHA works with any storage engine as long as MySQL replication works, and is not limited to InnoDB (a crash-safe, transactional storage engine). Even if you use legacy MyISAM environments that are not easy to migrate, you can use MHA.

Production case study

I'm using MHA in our (DeNA's) production environments. We manage more than 100 MySQL applications (master/slave pairs) from a few old (32bit, 3GB RAM) manager servers (one manager per datacenter), and so far it has been working very well. MHA does not consume resources during the monitoring stage, so managing hundreds of MySQL applications from a single manager running on an old machine is entirely possible (CPU utilization is 0-3% in total). We have been using MHA frequently for online master switch. Some popular social games grow more rapidly than we expect. In many cases scaling out (sharding) is chosen, but scaling up (increasing RAM, replacing HDD with SSD, etc) is sometimes better than scaling out. We switch the master from a slower machine to a faster machine (and vice versa) by using MHA (MHA has a separate online master switch command), and we have been able to switch more than 10 masters with only 0.5-1 second of downtime (not being able to connect to the master) each. 0.5-1 second of downtime is acceptable in our cases. Social game users (especially paying users) tend to be very strict about performance and availability, but we haven't received any inquiries/complaints when switching masters with MHA.

SkySQL provides commercial support for MHA

After I presented MHA at the MySQL Conference in April, many people told me that they were interested in trying it. I'm happy if many people use my software and are satisfied with it. On the other hand, I'm a full time employee at DeNA, and DeNA is not in the software support/consulting business, so I can't provide 24x7 support/consulting by myself. What if you want such services? Fortunately, SkySQL has decided to offer that. You can get 24x7 support for MHA (and of course, MySQL) from SkySQL! I have many ex-MySQL friends at SkySQL, and they have excellent expertise in providing MySQL related support services. If you are interested, go to the SkySQL website and talk with their sales representatives.

I'm attending OSCON and will introduce MHA in my session, so if you are interested and will be at OSCON, I'd like to talk with you.

About Me

I am a database engineer at Facebook.
Before joining Facebook, I was a principal database and infrastructure architect at DeNA. My primary responsibility at DeNA was to make our database infrastructure more reliable, faster and more scalable. Before joining DeNA, I worked at MySQL/Sun/Oracle as a lead MySQL consultant in APAC for four years.
You can contact me at Yoshinori.Matsunobu_at_gmail.com (replace _at_ with @).