WL#4209: Integrate Backup with Replication

Rationale
---------
Replication and backup must work together in a consistent way.
Overview
--------
The MySQL backup system is part of the MySQL data integrity toolset. As such,
it needs to be compatible with other data protection and recovery mechanisms.
Replication is used for a variety of data integrity methods including, but not
limited to, redundancy, recovery, and scale out.
The focus of this worklog shall be how backup can be used to improve
replication. Since this work is the main starting point for replication and
backup integration, it is necessary to take the perspective of replication in
identifying how backup and restore can be used to enhance replication. However,
related worklogs (e.g. WL#4280) may be written from the perspective of the
backup system.
The MySQL backup system is not complete and therefore this work shall be
limited to discussing the enhancement of replication based on what backup and
restore can do now. Where appropriate, future enhancements concerning features
of the backup system and how it can be used with replication shall be noted.
Common Replication Tasks
------------------------
In an effort to narrow the focus of this work, this section explores the more
common replication tasks with suggestions on how backup and restore could be
used to enhance the process.
1) Make a copy of a master to set up a new slave.
This is a very common customer need when replication is initiated or the
replication topology is being expanded to multiple slaves. We create a slave by
making a copy of the data on the master and restoring it to a new MySQL server
then configuring the new slave to replicate from the master. There is a similar
use where a slave has failed and the need is to restore redundancy capabilities
without impacting production service on the master.
There are several possible solutions for this task. The best solution is to
back up the master and restore it on the slave. Another possibility is to start
the slave and tell it to request the data from the master in the form of a
backup on the master piped to a restore on the slave. Note: This is future work
and is described by WL#4273.
2) Make a copy of a slave to another server that will become the master of the
slave as soon as possible.
This is a case where the master has failed or needs to be replaced and the
need is to fail over to a new master. Typically, the process is to keep the
slave in service until the master has been restored, then resume replication
with the restored master. It could also be a case of normal master failover to
a slave, where the slave takes the place of the master.
While there are many ways to perform this task, the slave needs to be
synchronized with the master. Whether this is done by first promoting the slave
(failover) then restoring the master via replication from the promoted slave or
by first restoring the master and reestablishing replication, backup can
provide a method of creating a copy of the data in either case to be restored
on the other server.
3) Make a copy of a slave to another slave so the new slave can join the
replication topology.
This is similar to the first common task above, but in this case, we are
creating an additional slave. This differs in that a backup performed on a
slave has no binary log information to record. Indeed, unless the slave is also
a master to another sub-topology, the binary log is not typically used. In this
case, additional work may be needed to allow the backup to record information
about the master’s binary log in the backup logs for later use.
The same basic backup and restore process can be used in this task. But this
requires metadata about the master’s binary log in order to start the new slave
at the correct point. This information could be recorded automatically when the
backup is performed on the slave, or it could be noted manually by the user at
that time.
4) My master died, I have ten slaves, each at a different position.
The goal of this task is to synchronize the slaves enough so that one can be
used in a failover process to replace the master. This is typically the slave
that has the least amount of data loss as measured by how much of the master’s
binary log events it has executed.
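The selection described above can be sketched as follows. This is only an
illustration: the SlaveStatus record and its field names are hypothetical
(loosely modeled on the master log file and executed position a slave reports),
and the sketch assumes binary log file names end in a numeric sequence suffix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveStatus:
    """Hypothetical snapshot of how far a slave has executed the master's binary log."""
    name: str
    master_log_file: str       # e.g. "master-bin.000007"
    exec_master_log_pos: int   # position within that file


def log_sequence(log_file: str) -> int:
    """Return the numeric suffix of a binary log file name."""
    return int(log_file.rsplit(".", 1)[1])


def most_advanced(slaves):
    """Return the slave that has executed the most of the master's binary log."""
    return max(
        slaves,
        key=lambda s: (log_sequence(s.master_log_file), s.exec_master_log_pos),
    )


slaves = [
    SlaveStatus("slave1", "master-bin.000007", 4120),
    SlaveStatus("slave2", "master-bin.000008", 310),
    SlaveStatus("slave3", "master-bin.000007", 98000),
]
candidate = most_advanced(slaves)
print(candidate.name)  # slave2
```

The file sequence is compared before the position, so a slave that has moved on
to a later binary log file is preferred even if its position within that file
is small.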
Using backup and restore in this task requires that one first apply the
master's binary log to the chosen slave (for point-in-time recovery), then
promote that server to master and point the other slaves to it. While this is a
normal replication recovery effort, backup could also be used to make a copy of
the promoted slave, restore it on the old master, and then fail over from the
new master to the old one, thereby restoring the original topology.
Note: If a topology contains slaves set up to replicate a portion of the data
on the master and these segments are not divided by database, backup and
restore may not be possible until selective restore is available.
Using Backup and Restore in Replication
---------------------------------------
The normal use of backup and restore in replication is to recover from errors
due to machine or data failures. The goal of using backup is to ensure a
consistent copy of the data and the goal of using restore is to ensure the
lossless recovery of that data.
There are two basic scenarios where replication can benefit from backup and
restore. The first is making a backup of the master to recover from failures
(complete server recovery) and the second is the restoration of either a master
or a slave to a specific point (point-in-time recovery). There are also
scenarios where backup and restore can be used on slaves but these are less
likely scenarios.
The following scenarios illustrate the use of backup and restore in a basic
replication topology of a single master and one or more slaves.
1) When a Master Fails
This is when the master has been taken offline either under an intentional
shutdown (taking offline for preventive maintenance) or as the result of a
crash. There are many ways backup can be used in this scenario.
* Backup as a preventive measure by making regular copies of data.
* Backup from a known good slave and restore to a new master.
* Restore data to repair the master.
* Restore data to create a new master with the backup data.
A sample process for this scenario follows.
1. Failover to slave is performed.
2. A backup is taken on new master (former slave).
3. Backup image is restored to make the old master a slave.
Another variant of this process follows.
1. After failure of master, master is restored from backup image.
2. A new backup is performed on master.
3. The resulting backup image is restored on slave.
4. Replication is restarted.
2) Point-in-time Recovery
When a situation exists where some operations (transactions, queries, etc.)
in the binary log cause a problem (crash or loss of data) and one wishes to
restore the master (or slave, but that is a special case) to a certain point
prior to the problem, it is possible to use backup to enhance this process.
The most common way to use backup and restore in this scenario is to restore a
known good backup image on both the master and the slave, then replay the
binary logs from the master on the master, allowing the changes to replicate.
For additional information on point-in-time recovery, see the MySQL Users
Guide. This approach requires the following.
* Backups are being made on the master as a preventive measure.
* Copies of the binary logs are saved.
* Copies of the backup logs are saved (optional).
A sample process for this scenario follows.
1. Replication is stopped manually (if not already interrupted).
2. A restore is performed on both master and slave.
3. Replication is resumed.
4. The binary log information as stored in the backup image file is
examined and a starting and stopping point are determined. Note:
this can be obtained using the backup_history log or via the MySQL
backup client (future release).
5. The binary log is played on the master and the slave is allowed to
catch up (process all events in its relay log).
Note: Some users will use grep to filter out the catastrophic operation
from the binary log. In that case, the grep on the binlog must
include removing the restore event (see below).
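The filtering mentioned in the note can be sketched as follows. This is a
simplified model, not the format of a real binary log: the BinlogEvent record,
its "kind" values, and the events_to_replay() helper are all hypothetical. It
shows the two things the filter must do: skip the problem operation and drop
any restore incident event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinlogEvent:
    """Hypothetical, simplified view of one binary log event."""
    position: int
    kind: str            # "query" or "incident" in this sketch
    statement: str = ""


def events_to_replay(events, start_pos, skip_positions=frozenset()):
    """Keep events at or after start_pos, minus skipped and incident events."""
    return [
        e for e in events
        if e.position >= start_pos
        and e.position not in skip_positions
        and e.kind != "incident"
    ]


log = [
    BinlogEvent(100, "query", "INSERT INTO t1 VALUES (1)"),  # already in the backup image
    BinlogEvent(200, "query", "INSERT INTO t1 VALUES (2)"),
    BinlogEvent(300, "query", "DROP TABLE t1"),              # the operation to leave out
    BinlogEvent(400, "query", "INSERT INTO t2 VALUES (1)"),
    BinlogEvent(500, "incident", "RESTORE"),                 # written by the restore
]

# 200 stands for the binary log position recorded with the backup image.
replay = events_to_replay(log, start_pos=200, skip_positions={300})
print([e.position for e in replay])  # [200, 400]
```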
Use Cases
---------
A brief discussion of how backup and restore can be used in a replication
scenario has been presented above. This section examines the details of the
expected behavior and enhancements to the backup system under a variety of use
cases.
Given the base replication pair of a master and a slave, the following use
cases explain how backup and restore will enhance replication. For those
scenarios where a slave can be a master to another slave, both use cases shall
apply. For example, use cases 1 and 3 apply when backup is run on a slave that
is also a master.
Use Case 1 - Backup performed on a master.
When a backup is performed on a master, the master shall not log the backup
event nor shall the master replicate any data produced (logged) by the backup.
Therefore, the slave remains unaffected by a backup run on the master.
The backup shall record the binary log position, time, and file name and store
the information in the backup_history log as well as in the backup image file.
Use Case 2 - Restore performed on a master.
When a restore is run on the master, a signal must be generated to notify
the slave that a significant change that cannot be replicated has occurred on
the master. This shall be accomplished by issuing a restore incident event
which will cause the slave to stop.
Note: Once WL#4273 is complete, it may be possible for the slave upon reading
the restore incident event to request a backup from the master and thereby
allow for automatic propagation of the restore on the master to the slave.
Until such time, the slave shall stop when a restore incident event is
encountered.
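The slave-side behavior described in this use case can be sketched as follows.
Events are modeled as plain strings and "RESTORE_INCIDENT" is a hypothetical
marker; the real event format is not specified in this worklog.

```python
def apply_replicated_events(events):
    """Apply events in order until a restore incident event is read.

    Returns (applied, stopped): the events that were applied and whether
    the slave stopped because of an incident.
    """
    applied = []
    for event in events:
        if event == "RESTORE_INCIDENT":
            return applied, True   # slave stops; later events stay unapplied
        applied.append(event)
    return applied, False


applied, stopped = apply_replicated_events(
    ["INSERT 1", "INSERT 2", "RESTORE_INCIDENT", "INSERT 3"]
)
print(applied, stopped)  # ['INSERT 1', 'INSERT 2'] True
```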
Use Case 3 - Backup performed on a slave
When backup is performed on a slave, the slave shall not log the backup
event in its binary log (if enabled). Since data is only read, there is no
effect on replication. Also, when a backup is run on a slave, no binary log
information is stored in the backup history log because the slave's relay log
is not the same as the binary log on the master.
Note: If the slave is behind the master, care should be taken to ensure the
binary log from the master is preserved and the slave's position with respect
to the master’s binary log is recorded. The information from the slave
concerning the master’s binary log position and the slave's relay log position
shall be stored in the backup logs.
Use Case 4 - Restore performed on a slave.
When restore is run on a slave and no binary log is enabled, the restore is
performed as on a normal server. However, it should be noted this requires that
the slave stop replication with the master during the restore. In this case,
the slave’s current position will need to be recorded and replication
restarted from that point once the restore is complete.
If a slave has its binary log enabled, a signal shall be written to the binary
log in the form of a restore incident event. This shall permit the slave to
act as a master and therefore signal its slaves to stop.
Note: Conflicts with the data from the master are possible if a restore is
performed on a replicated database. In this case, the replicated events from
the master will be blocked until after the restore is complete. While the MySQL
backup system shall not prohibit this use case, the user must consider the side
effects of running a restore on a slave. Also, to protect any connected slaves
(where the slave is also a master) from conflicting changes as a result of the
restore, replication shall be blocked until the restore is complete and no
additional slaves shall be permitted to connect until after the restore is
complete and the restore incident event is written.
Use Case 5 - Backup run with no binary log.
Since there is no logging, the server cannot act as a master and therefore
backup shall run as normal with nothing logged in the binary log.
Use Case 6 - Restore run with binary log turned on but no slaves attached.
When restore is run in this scenario, it is similar to running restore on a
master but there are no slaves. However, slaves could attach at a later time
and request data from the master. Therefore, a restore incident event shall
be written to the binary log.
Requirements
------------
The following additional requirements are considered the minimal set which
describes how MySQL backup and replication shall work together.
R01: BACKUP commands shall not be logged.
R02: BACKUP commands run on a slave shall record the slave's binary log
information (if enabled) in the backup logs.
R03: RESTORE commands shall not be logged when run on a master.
R04: RESTORE commands run on a slave are not permitted unless replication is
turned off.
R05: The effects of the RESTORE shall not be logged. A restore incident
event shall be issued instead to signal any connected slaves to stop.
R06: Changes to the backup logs shall not be logged.
R07: Replication shall not be allowed to execute while doing restore. This
means replication is not allowed to start during restore.
R08: Backup on a master shall not affect its slaves.
R09: Restore shall not record any data events in the binary log unless
specifically requested.
Note: If the slave is also a master then it shall generate a restore incident
event. But if just a slave, then no such event shall be generated.
R10: At no time shall the BACKUP or RESTORE commands be written to the binary
log.
Restrictions
------------
MySQL backup must not disrupt or interfere with any of the uses of replication.
Indeed, it is the purpose of this worklog to ensure that backup is designed and
implemented in a manner that is complementary to replication.
There are many uses for replication and many more possible replication
topologies. This work shall focus on the most basic replication topology: a
single master with a single slave connected. All other replication topologies
can be represented given this base pairing. For example, a star replication
topology is simply a single master with two or more slaves connected.
There are no cases where replication inhibits backup or restore aside from the
normal locking issues common to applicable DML and DDL events. For example, a
long transaction event could block backup or restore commands.
Restore of a subset of the replicated databases is not supported at this time.
A restore incident event triggering the copying of databases from a master to
a slave (WL#4273) is not within the current scope.
Notes
-----
In later versions, the slave position will be saved and synched with the backup
image and thus the restore will set the replication state correctly so that it
may continue replication (see WL#4105).
In later versions, a special option may be provided to override not logging
restore. The override would permit the logging of the writes to the data, not
the actual restore command. This could be useful for situations where the
default backup drivers are used (these are currently the only type of drivers
that can support logging of data during restore).
RESTORE commands run on a master may require blocking of new slaves from
connecting until the restore operation is complete.
Replaying the binary log should normally be with binary logging off.
When a master fails and a new backup image is created from a slave, the
restore incident event shall not be sent to the slave after failover to the
original master.
Users who perform point-in-time recovery should take care to remove the
restore incident event from the master’s binary log prior to replaying the
events.

The backup system shall be modified to support replication services. Under no
circumstances shall the implementation of backup interfere with or disrupt
replication. There must be no loss of data or functionality as a result of a
backup or restore operation.
Note: Temporary suspension of service and destructive restore notwithstanding.
Description of Operations
-------------------------
There are several replication conditions under which backup and restore
operations can be performed. Each of these presents a unique set of challenges.
The following lists all of the possible replication conditions that backup and
restore operations can encounter and the behavior of the command with respect
to replication.
Operation  Condition       MySQL Backup behavior with respect to replication
---------  --------------  -------------------------------------------------
backup     on master       Master does not log anything.
                           Replication is not affected.
restore    on master       A restore incident event will be written
                           to the binary log.
backup     on slave        No binary log information is saved unless
                           slave is also a master.
                           Replication is not affected.
restore    on slave        A restore incident event will be written
                           to the binary log if enabled.
backup     no logging      No effect - nothing is logged.
restore    no replication  A restore incident event will be written
                           to the binary log.
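
The logging behavior listed above reduces to a small rule: the command itself
is never logged, and a restore incident event is written only for a restore on
a server whose binary log is enabled. A minimal sketch of that rule follows;
the Behavior record and replication_behavior() are hypothetical names used only
for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    logs_command: bool             # is the BACKUP/RESTORE command itself logged?
    writes_restore_incident: bool  # is a restore incident event written?


def replication_behavior(operation: str, binlog_enabled: bool) -> Behavior:
    """Return the expected binary log behavior for a backup or restore."""
    if operation not in ("backup", "restore"):
        raise ValueError(f"unknown operation: {operation!r}")
    return Behavior(
        logs_command=False,
        writes_restore_incident=(operation == "restore" and binlog_enabled),
    )


print(replication_behavior("restore", binlog_enabled=True))
print(replication_behavior("backup", binlog_enabled=True))
```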
Server Services Needed
----------------------
The following services are required in order to successfully implement the
requirements of this worklog and behavior of backup and restore under the above
replication conditions.
* Detect when replication is underway (binary log is active).
* Suspend/resume replication (turn binary log on/off).
* Disable/enable connections from slaves.
Note: Additional services may be provided pending completion of the design for
WL#4280.
It was decided that these operations shall be provided via a service interface
which encapsulates the varied mechanisms for executing these operations. The
service interface work is being completed under WL#4280.
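As a rough illustration of what such an interface could look like, the sketch
below groups the three services into one abstract class with an in-memory
stand-in. The class and method names are hypothetical; the actual interface is
defined by WL#4280 and may differ in every detail.

```python
from abc import ABC, abstractmethod


class ReplicationServices(ABC):
    """Hypothetical service interface used by backup code (see WL#4280)."""

    @abstractmethod
    def binlog_active(self) -> bool:
        """Detect whether the binary log is active."""

    @abstractmethod
    def set_binlog_active(self, active: bool) -> None:
        """Suspend or resume binary logging."""

    @abstractmethod
    def set_slave_connections_allowed(self, allowed: bool) -> None:
        """Disable or enable connections from slaves."""


class InMemoryServices(ReplicationServices):
    """Stand-in implementation that only tracks state, for illustration."""

    def __init__(self, binlog_active: bool = True):
        self._binlog_active = binlog_active
        self.slave_connections_allowed = True

    def binlog_active(self) -> bool:
        return self._binlog_active

    def set_binlog_active(self, active: bool) -> None:
        self._binlog_active = active

    def set_slave_connections_allowed(self, allowed: bool) -> None:
        self.slave_connections_allowed = allowed


services = InMemoryServices()
services.set_binlog_active(False)
print(services.binlog_active())  # False
```

Keeping the backup code dependent only on the abstract class matches the stated
goal of encapsulating the varied server mechanisms behind one interface.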
Additional Information
----------------------
* This work shall ensure that the defect in BUG#36533 is solved by
this solution.
* Advanced testing techniques are needed to fully cover all tests
of the requirements. See WL#4612 for details.

The operations described above shall be implemented as calls to the service
interface (WL#4280) from the MySQL backup code.
Overview
--------
Wherever possible, the calls to the service interface methods shall be used
and at no time shall any calls be made to the server facilities directly (data
items, classes, methods, etc.).
There are two sets of operations that are needed to realize this worklog. There
are operations concerned with the control or status of the binary log. There
are also operations concerned with the status and control of slave- and master-
related operations. The design of each of these is described in more detail
below.
Binary Log Operations
---------------------
All operations associated with the binary log shall be made via the server
service interface (see WL#4280). These calls shall be placed as high as
possible in the execution path of MySQL backup code.
Slave and Master Operations
---------------------------
All operations associated with the status and control of a master or slave
shall be made via the server service interface. These calls shall be placed in
the locations where best suited to meet the requirements. Some of these calls
may require modification of the server code. For example, when a slave attempts
to connect, a call to the service interface methods may be required to check
that it is ok to allow the connection (no backup is running on the master).
Fulfillment of Requirements
---------------------------
The following section describes how each of the requirements described above
shall be satisfied by the design and/or implementation of these features.
R01: BACKUP commands are not written to the binary logs. The code to initiate
the write does not exist in the backup source code. No further action is
required.
R02: BACKUP commands run on a slave record the master's binary log information
by writing a note to the backup_progress log. The information shall include the
name of the master's binary log and the position.
R03: RESTORE commands are not written to the binary logs. The code to initiate
the write does not exist in the backup source code. No further action is
required.
R04: RESTORE commands run on slaves shall generate an error and the restore
command shall fail. Code shall be added prior to checking privileges for the
restore command to call the service interface to identify whether the server is
acting as a slave and if so generate an error and abort the command.
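A minimal sketch of the R04 check follows. The exception type and function
names are hypothetical, and the "is acting as a slave" flag stands in for the
value the service interface would return.

```python
class RestoreNotPermitted(Exception):
    """Raised when RESTORE is attempted on a server acting as a slave."""


def check_restore_allowed(acting_as_slave: bool) -> None:
    """Abort the restore before privilege checks if the server is a slave."""
    if acting_as_slave:
        raise RestoreNotPermitted(
            "RESTORE is not permitted while the server is acting as a slave"
        )


def run_restore(acting_as_slave: bool) -> str:
    check_restore_allowed(acting_as_slave)   # R04 check comes first
    # ... privilege checks and the restore itself would follow here ...
    return "restored"


print(run_restore(acting_as_slave=False))  # restored
```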
R05: The effects of the RESTORE shall not be logged. A restore incident event
shall be issued instead to signal any connected slaves to stop. This
requirement calls for two actions. First, the code must be changed to turn off
binlogging while the restore is running and turn it back on after the restore. The
position of these commands shall in no way affect replication or logging of
other commands. Second, the code must be inserted to generate an incident event
if the binlog is engaged.
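The two actions for R05 can be sketched as follows. The Server class is a
hypothetical stand-in for server state, not actual server code. One assumption
to note: this sketch writes the incident event only after a successful restore,
since the worklog does not say what should happen when a restore fails.

```python
class Server:
    """Hypothetical stand-in for the server state touched by RESTORE."""

    def __init__(self, binlog_enabled: bool):
        self.binlog_enabled = binlog_enabled   # server is configured with a binlog
        self.binlog_active = binlog_enabled    # writes are currently being logged
        self.binlog = []
        self.data = []

    def write_rows(self, rows):
        self.data.extend(rows)
        if self.binlog_active:
            self.binlog.append(("rows", list(rows)))


def restore(server: Server, rows) -> None:
    was_active = server.binlog_active
    server.binlog_active = False               # action 1: do not log restored data
    try:
        server.write_rows(rows)
    finally:
        server.binlog_active = was_active      # logging resumes even on failure
    if server.binlog_enabled:
        server.binlog.append(("incident", "RESTORE"))   # action 2


master = Server(binlog_enabled=True)
restore(master, [1, 2, 3])
print(master.binlog)  # [('incident', 'RESTORE')]
```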
R06: When data is written to the backup logs and the log destination includes
the TABLE option, the code shall be modified to turn off binlogging for the
local thread while the write is in progress and turn it back on after the write.
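A sketch of the per-thread suppression described for R06 follows. It uses
thread-local state so that other threads keep logging normally; all names here
(binlog_suppressed, insert_row, and the in-memory lists) are hypothetical.

```python
import threading
from contextlib import contextmanager

_thread_state = threading.local()


def binlog_enabled_for_thread() -> bool:
    """Binary logging is on for a thread unless it has been suppressed."""
    return getattr(_thread_state, "binlog_enabled", True)


@contextmanager
def binlog_suppressed():
    """Turn binlogging off for the current thread, restoring it afterwards."""
    previous = binlog_enabled_for_thread()
    _thread_state.binlog_enabled = False
    try:
        yield
    finally:
        _thread_state.binlog_enabled = previous


binlog = []           # stand-in for the binary log
backup_history = []   # stand-in for a backup log stored as a table


def insert_row(table, row):
    """Write a row and log it to the binlog if logging is on for this thread."""
    table.append(row)
    if binlog_enabled_for_thread():
        binlog.append(row)


def write_backup_log_row(row):
    with binlog_suppressed():
        insert_row(backup_history, row)


write_backup_log_row({"backup_id": 1})
print(len(backup_history), len(binlog))  # 1 0
```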
R07: Replication shall not be allowed to execute while doing restore. This
means replication is not allowed to start during restore. This requirement is
satisfied by adding code to disable slave connections during the restore. This
code shall be inserted prior to restore starting and then slave connections
enabled after restore.
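The bracketing of a restore with disabled slave connections can be sketched as
follows. SlaveConnectionGate is a hypothetical name; in the real design this
state would sit behind the service interface.

```python
class SlaveConnectionGate:
    """Hypothetical switch that decides whether slaves may connect."""

    def __init__(self):
        self.allowed = True
        self.refused = []

    def try_connect(self, slave_name: str) -> bool:
        if not self.allowed:
            self.refused.append(slave_name)   # remember who was turned away
        return self.allowed


def restore_with_gate(gate: SlaveConnectionGate, do_restore) -> None:
    gate.allowed = False        # disable slave connections before restore starts
    try:
        do_restore()
    finally:
        gate.allowed = True     # re-enable once the restore has finished


gate = SlaveConnectionGate()
during = []
restore_with_gate(gate, lambda: during.append(gate.try_connect("slave1")))
print(during, gate.try_connect("slave1"))  # [False] True
```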
R08: Backup on a master shall not affect its slaves. Since neither the backup
command nor the backup log writes are written to the binary log, this
requirement is satisfied by R01 and R06.
R09: Restore shall not record any data events in the binary log unless
specifically requested. Since neither the restore command nor the backup log
writes are written to the binary log, this requirement is satisfied by R03,
R05, and R06.
R10: At no time shall BACKUP or RESTORE commands be written to the binary log.
This is satisfied by R01 and R03.