Wednesday, April 13, 2011

With the addition of the new tables to implement crash-safe replication, we also get access to the replication information through the SQL interface. This might not seem like a big advantage, but it should not be taken lightly. To demonstrate the power of this approach, I will show how to implement the multi-source round-robin replication described elsewhere (including in our book). However, compared to the other implementations, where a client has to parse the output of SHOW SLAVE STATUS, the twist is that this implementation is done entirely in the server, using pure SQL.

If you're familiar with replication, you know that a slave can only replicate from a single master. The trick used to replicate from multiple masters, usually called multi-source replication, is to switch between the masters in a time-share fashion, as illustrated in Figure 1. The scheme used to pick the next master can vary, but it is common to use a round-robin schedule.

The steps necessary to switch master are:

1. Stop reading events from the master and empty the relay log. Before switching to another master, it is necessary to ensure that there are no outstanding events left in the relay log. If this is not done, some events will not be applied and will have to be re-fetched from the master. Emptying the relay log is done by:

   - Stopping the I/O thread.
   - Waiting for the events in the relay log to be applied.
   - Stopping the SQL thread.

2. Save away the replication information.

3. Fetch the saved information about the next master to replicate from.

4. Change master using the new information.

5. Start the slave threads.
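Expressed as SQL, the switch is roughly the following sequence. This is only a sketch: the host name, port, file name, and position are hypothetical placeholders, and the wait step is abbreviated to a comment.

```sql
STOP SLAVE IO_THREAD;    -- stop fetching events from the current master

-- ... wait here until the SQL thread has applied everything
--     remaining in the relay log ...
STOP SLAVE SQL_THREAD;

-- ... save the current position, fetch the next master's information ...

CHANGE MASTER TO
    MASTER_HOST     = 'master2.example.com',   -- hypothetical values
    MASTER_PORT     = 3306,
    MASTER_LOG_FILE = 'master2-bin.000042',
    MASTER_LOG_POS  = 4;
START SLAVE;
```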

Simple, right? So let's make an implementation! What pieces do we need?

To handle the periodic switching, we use an SQL event for executing the above procedure.

We need a table to store the state of each master. The table should contain all the necessary information for configuring the master, including the binlog position.

We need to be able to store what master we're currently replicating from.

We need two tables: a table my_masters to record information about the available masters and a table current_master that keeps track of the current master. The my_masters table will contain information on how to connect to each master as well as its last seen position. We assume that the user name and password are stored in the table and will not save that information away when switching master. To store which master is currently being replicated from, we cannot use a user-defined variable, because each invocation of an event spawns a new session, so we store this information in a table as well.
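As an illustration, the two tables could be defined along these lines (the column names and types are assumptions for this sketch, not taken verbatim from the original implementation):

```sql
-- One row per master, ordered by idx for the round-robin schedule.
CREATE TABLE my_masters (
    idx      INT AUTO_INCREMENT PRIMARY KEY,  -- position in round-robin order
    host     VARCHAR(255) NOT NULL,           -- how to connect to the master
    port     INT NOT NULL DEFAULT 3306,
    log_file VARCHAR(255),                    -- last seen binlog file
    log_pos  BIGINT                           -- last seen binlog position
) ENGINE = InnoDB;

-- A single row holding the index of the master currently in use.
CREATE TABLE current_master (
    idx INT
) ENGINE = InnoDB;
```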

Switching masters

To execute a CHANGE MASTER statement with the information we need, it would be perfect to use a prepared statement, but unfortunately CHANGE MASTER is one of those statements that cannot take parameters in a prepared statement, so we have to build the statement string dynamically. To make this easier, we create a change_master procedure that does the job of building, preparing, executing, and deallocating a prepared statement. We also allow the file name and position passed in to be NULL, in which case we start replication without these parameters, essentially starting from the beginning of the master's binary log.
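A sketch of such a procedure is shown below. The parameter names are my own; the structure (CONCAT the statement, then PREPARE, EXECUTE, and DEALLOCATE) follows the description above.

```sql
DELIMITER $$
CREATE PROCEDURE change_master(
    p_host VARCHAR(255), p_port INT,
    p_file VARCHAR(255), p_pos BIGINT)
BEGIN
  -- Build the CHANGE MASTER statement dynamically, since its clauses
  -- cannot be bound as prepared-statement parameters.
  SET @cmd = CONCAT('CHANGE MASTER TO MASTER_HOST = ', QUOTE(p_host),
                    ', MASTER_PORT = ', p_port);

  -- If no file/position is given, start from the beginning of the
  -- master's binary log by leaving these clauses out entirely.
  IF p_file IS NOT NULL AND p_pos IS NOT NULL THEN
    SET @cmd = CONCAT(@cmd, ', MASTER_LOG_FILE = ', QUOTE(p_file),
                      ', MASTER_LOG_POS = ', p_pos);
  END IF;

  PREPARE change_master_stmt FROM @cmd;
  EXECUTE change_master_stmt;
  DEALLOCATE PREPARE change_master_stmt;
END $$
DELIMITER ;
```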

The last step is to create the event that switches master for us. As an extra feature, we implement the event handling so that rows can be added to and removed from the my_masters table and the event will just pick the next master in order. To solve this, we use one query to pick the next master in order based on the index of the last used master, and an additional query to handle wrap-around when there is no master at index 1.

To allow the table to be changed while the event is executing, we place all the updates of our tables inside a transaction. That way, any changes made to the table while the event is executing will not affect the logic for picking the next master.

There is some extra logic added to handle "holes" in the index numbers: it is possible that there is no master with index 1, and it is possible that the next master does not have the next index in sequence. This would also allow the server ID of the master to be used as the index, but in the current implementation we use a simple index instead.

Find the next master in turn. To handle the case where masters have been removed, we pick the next master index-wise. Wrap-around is handled by relying on the default value of 1 for l_next_idx.

SELECT idx INTO l_next_idx FROM my_masters
WHERE idx > (SELECT idx FROM current_master)
ORDER BY idx LIMIT 1;

If we did a wrap-around, it might be that the master with index 1 (the default for l_next_idx) does not exist, so we then scan for the first index that actually exists and is equal to or greater than l_next_idx.

SELECT idx INTO l_next_idx FROM my_masters
WHERE idx >= l_next_idx
ORDER BY idx LIMIT 1;
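Putting the pieces together, the switching event could be sketched roughly as follows. This is a simplified outline, not the full implementation: the column names of my_masters and current_master are assumptions, error handling is omitted, the wait for the relay log to drain is collapsed into a plain STOP SLAVE, and saving the current position back into my_masters is elided.

```sql
DELIMITER $$
CREATE EVENT multi_source
    ON SCHEDULE EVERY 10 SECOND
DO BEGIN
    DECLARE l_next_idx INT DEFAULT 1;   -- default 1 gives us wrap-around
    DECLARE l_host VARCHAR(255);
    DECLARE l_port INT;
    DECLARE l_file VARCHAR(255);
    DECLARE l_pos  BIGINT;

    STOP SLAVE;   -- simplified: see the switching steps described earlier

    -- Pick the next master inside a transaction so that concurrent
    -- changes to my_masters do not affect the selection logic.
    START TRANSACTION;
    SELECT idx INTO l_next_idx FROM my_masters
     WHERE idx > (SELECT idx FROM current_master)
     ORDER BY idx LIMIT 1;
    SELECT idx INTO l_next_idx FROM my_masters
     WHERE idx >= l_next_idx
     ORDER BY idx LIMIT 1;
    UPDATE current_master SET idx = l_next_idx;
    SELECT host, port, log_file, log_pos
      INTO l_host, l_port, l_file, l_pos
      FROM my_masters WHERE idx = l_next_idx;
    COMMIT;

    CALL change_master(l_host, l_port, l_file, l_pos);
    START SLAVE;
END $$
DELIMITER ;
```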

Tuesday, April 12, 2011

A common request is to make replication crash-safe in the sense that the replication progress information is always in sync with what has actually been applied to the database, even in the event of a crash. Although transactions are not lost if the server crashes, it could require some tweaking to bring the slaves up again.

In the latest MySQL 5.6 milestone development release, the replication team has implemented crash-safety for the slave by adding the ability to commit the replication information together with the transaction (see Figure 1). This means that the replication information will always be consistent with what has been applied to the database, even in the event of a server crash. In addition, some fixes were made on the master to ensure that it recovers correctly.

If you're familiar with replication, you know that the replication information is stored in two files: master.info and relay-log.info.
The updates of these files are arranged so that the files are updated after the transaction has been applied. This means that if you have a crash between the transaction commit and the update of the files, the replication progress information will be wrong.
In other words, a transaction cannot be lost this way, but there is a risk that a transaction could be applied a second time.
The usual way to catch this is to have a primary key on all your tables. In that case, the repeated update would cause the slave to stop with a duplicate-key error, and you would have to use SQL_SLAVE_SKIP_COUNTER to skip the transaction and get the slave up and running again. This is better than losing a transaction, but it is nevertheless a nuisance.
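For reference, skipping the offending transaction is typically done along these lines:

```sql
STOP SLAVE;                              -- if the slave is not already stopped
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;   -- skip one event group (transaction)
START SLAVE;
```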
Removing the primary key to prevent the slave from stopping only solves the problem partially: the transaction would then be applied twice, which both places a burden on the application to handle duplicate entries and requires the tables to be cleaned regularly.
Both of these approaches require either manual intervention or scripting support. This does not affect reliability, but everything is so much easier to handle if the replication information is committed in the same transaction as the data being updated.

Crash-safe masters

Two problems related to crash-safe replication have been fixed in the master, both of which could cause some annoyance when the master recovered.

If the master crashed while the binary log was being rotated, it was possible that some orphan binlog files ended up in the binary log index file. This was fixed in 5.1, but it is also a piece in the puzzle of crash-safe replication.

Writing to the binary log is not an atomic operation, and if a crash occurred while writing to the binary log, there was a possibility of a partial event at the end of the binary log.

Now, the master recovers from this by truncating the binary log to the last known good position, removing the partially written transaction and rolling back the outstanding transactions in the storage engines.

Figure 1. Moving position information update into transaction

Crash-safe slaves

Several different solutions for implementing crash-safety, or transactional replication as it is sometimes called, have been proposed, with Google's TransactionalReplication patch being the best known. That patch stores the replication positions in the InnoDB transaction log, but the MySQL replication team decided to instead implement crash-safety by moving the replication progress information into system tables. This is a more flexible solution and has several advantages compared to storing the positions in the InnoDB transaction log:

If the replication information and data is stored in the same storage engine, it will allow both the data and the replication position to be updated as a single transaction, which means that it is crash-safe.

If the replication information and data is stored in different storage engines, but both support XA, they can still be committed as a single transaction.

The replication information is flushed to disk together with the transaction data. Hence, writing the replication information directly to the InnoDB redo log offers no speed advantage, but it does prevent the user from reading the replication progress information easily.

The tables can be read from a normal session using SQL commands, which also means that it can be incorporated into such things as stored procedures and stored functions.
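For example, once the table-based repositories are in use, the applied position can be read with a plain query (using the column names from Table 2 below):

```sql
SELECT Master_log_name, Master_log_pos
  FROM mysql.slave_relay_log_info;
```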

Table 1. slave_master_info

Field                   Line in file  Slave status column
Master_id               -             -
Number_of_lines         1             -
Master_log_name         2             Master_Log_File
Master_log_pos          3             Read_Master_Log_Pos
Host                    4             Master_Host
User_name               5             Master_User
User_password           6             -
Port                    7             Master_Port
Connect_retry           8             Connect_Retry
Enabled_ssl             9             Master_SSL_Allowed
Ssl_ca                  10            Master_SSL_CA_File
Ssl_capath              11            Master_SSL_CA_Path
Ssl_cert                12            Master_SSL_Cert
Ssl_cipher              13            Master_SSL_Cipher
Ssl_key                 14            Master_SSL_Key
Ssl_verify_server_cert  15            Master_SSL_Verify_Server_Cert
Heartbeat               16            -
Bind                    17            Master_Bind
Ignored_server_ids      18            Replicate_Ignore_Server_Ids
Uuid                    19            Master_UUID
Retry_count             20            Master_Retry_Count

In addition to giving us crash-safe slaves, the last of these advantages should not be taken lightly. Being able to handle replication from pure SQL puts some of the key features in the hands of application developers.

As previously mentioned, the replication information is stored in two files:

master.info

This file contains information about the connection to the master, such as hostname, user, and password, but also information about how much of the binary log has been transferred to the slave.

relay-log.info

This file contains information about the current state of replication, that is, how much of the relay log has been applied.

Options to select replication information repository

In order to make the solution flexible, we introduced a general API for adding replication information repositories. This means that we can support multiple types of repositories for replication information, but currently only the old system, using the files master.info and relay-log.info, and the new system, using the tables slave_master_info and slave_relay_log_info, are supported. To select which type of repository to use, two new options were added: master_info_repository and relay_log_info_repository. These options are also available as server variables.

Both of the variables can be set to either FILE or TABLE. If a variable is set to TABLE, the new table-based system will be used, and if it is set to FILE, the old file-based system will be used. The default is FILE, so make sure to set the value if you want to use the table-based system.
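For example, to have the server use the table-based repositories, the options can be set in the configuration file:

```
[mysqld]
master-info-repository    = TABLE
relay-log-info-repository = TABLE
```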

Table 2. slave_relay_log_info

Field            Line in file  Slave status column
Master_id        -             -
Number_of_lines  1             -
Relay_log_name   2             Relay_Log_File
Relay_log_pos    3             Relay_Log_Pos
Master_log_name  4             Relay_Master_Log_File
Master_log_pos   5             Exec_Master_Log_Pos
Sql_delay        6             SQL_Delay

If you look at Table 1 and Table 2, you can see the column names used for the tables as well as the corresponding line number in the file and the column name in the output of SHOW SLAVE STATUS. When using tables, the column names are used for storing the data; when using files, the column names only serve to identify the correct value, which is stored at the line number given in the table.

The format of the tables has been extended with an additional field that is not present in the files: the Master_id field. The reason we added this is to make it possible to extend the server to track multiple masters. Note that we currently have no definite plans to add multi-source support, but as good engineers we do not want these tables to be a hindrance to adding it later.

Selecting replication repository engine

In contrast with most of the system tables in the server, the replication repositories can be configured to use any storage engine you prefer. The advantage of this is that you can select the same engine for the replication repositories as the data you're managing. If you do that, both the data and the replication information will be committed as a single transaction.

The new tables are created at installation time by the mysql_install_db script, as usual, and the default engine for these tables is the same as for all system tables: MyISAM. As you know, MyISAM is not very transactional, so it is necessary to switch to InnoDB if you really want crash-safety. To change the engine for these tables, you can just use a normal ALTER TABLE.
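For example, switching both repository tables to InnoDB:

```sql
ALTER TABLE mysql.slave_master_info    ENGINE = InnoDB;
ALTER TABLE mysql.slave_relay_log_info ENGINE = InnoDB;
```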

Note that this works for these tables because they were designed to allow any storage engine to be used for them, but it does not mean that you can change the storage engine for other system tables and expect it to work.

Event processing

This implementation of crash-safe slaves works naturally with both statement-based and row-based replication, and there is nothing special that needs to be done in the normal case. However, these table updates interleave with the normal processing in slightly different ways for the SQL thread and the I/O thread.

To understand how transactions are processed by the SQL thread, let us consider the following example transaction:
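As an illustration, assume a simple transaction against an articles table (a hypothetical table; it is the one referred to later in this section):

```sql
BEGIN;
INSERT INTO articles(title, body)
VALUES ('Crash-safe replication', '...');
COMMIT;
```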

This transaction will be written to the binary log, sent over to the slave, and written to the relay log in the usual way. Once it is read from the relay log for execution, it will be executed as if an update statement were added to the end of the transaction, just before the commit:
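Sketched with the position update spliced in, the executed transaction would look roughly like this (the @@-prefixed names are pseudo-variables that do not actually exist; they stand in for the corresponding SHOW SLAVE STATUS fields):

```sql
BEGIN;
INSERT INTO articles(title, body)
VALUES ('Crash-safe replication', '...');
-- Position update added by the SQL thread, committed with the data:
UPDATE mysql.slave_relay_log_info
   SET Relay_log_name  = @@Relay_Log_File,
       Relay_log_pos   = @@Relay_Log_Pos,
       Master_log_name = @@Relay_Master_Log_File,
       Master_log_pos  = @@Exec_Master_Log_Pos;
COMMIT;
```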

In this example, there are a number of pseudo-server variables (that is, they don't exist for real) that have the same names as the corresponding fields in the result set from SHOW SLAVE STATUS. As you can see, the update of the position information is now inside the transaction and will be committed with it, so if both articles and mysql.slave_relay_log_info are in the same transactional engine, they will be committed as a unit.

This works well for the SQL thread, but what about the I/O thread? There are no transactions executed here at all, so when is the information in this table committed?

Since a commit to the table is expensive, in the same way as syncing a file to disk is expensive when using files as the replication information repository, the slave_master_info table is not updated for each processed event. Depending on the value of sync_master_info, there are a few alternatives.

If sync_master_info = 0

In this case, the slave_master_info table is only updated when the slave starts or stops (for any reason, including errors), when the relay log is rotated, or when you execute a CHANGE MASTER command.

If sync_master_info > 0

Then the slave_master_info table will be updated after every sync_master_info events.

This means that while the slave is running, you cannot really see how much data has been read by the slave without stopping it. If it is important to see how the slave progresses in reading events from the master, then you have to set sync_master_info to some non-zero value, but you should be aware that there is a cost associated with doing this.
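For example, to trade some performance for more frequent updates of the read position:

```sql
SET GLOBAL sync_master_info = 100;  -- update the repository every 100 events
```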

This does not usually pose a problem, since the occasions when you need to read the master replication information while replication is running are few and far between. It is much more common to read it when the slave has stopped for some reason: to figure out where the error is or to perform a master fail-over.

Closing remarks

We would be very interested in hearing any comments you have on this feature and how it is implemented. If you want to try this out for yourselves then you can download the MySQL 5.6 Milestone Development Release where all this is implemented from the MySQL Developer Zone (dev.mysql.com).
If you want to find out more details, the section Slave Status Logs in the MySQL 5.6 reference manual will provide you with all the information.
This is one of the features that was presented by Lars Thalmann on April 11, 2011 (yesterday) at 2:30pm in the "MySQL Replication" talk at Collaborate 11, and on April 12, 2011 (today) at 10:50am in the "MySQL Replication Update" talk at the O'Reilly MySQL Conference & Expo.

Monday, April 11, 2011

MySQL replication is fast, easy to use, and reliable, but once it breaks, it can be very hard to figure out what the problem is. One concern often raised is that events become corrupted, whether through failing hardware, network failure, or software bugs. Even though it is possible to protect the transfer over the network using an SSL connection, the network is rarely where the problem lies. It is (relatively) more common for events to be corrupted by a software bug or a hardware error.

To be able to better handle corrupted events, the replication team has added replication event checksums to the MySQL 5.6 Milestone Development Release.
A checksum is added to each event as it is written to the binary log and is used to check that nothing happened to the event on the way to the slave. Since the checksums are added to all events in the binary log on the master, transferred over the network, and written to the relay log on the slave, it is possible to catch events corrupted by hardware problems, network failures, or software bugs.

Figure 1. Master and Slave with Threads

The checksum used is a CRC-32 checksum, more precisely ISO-3309, which
is the one supplied with zlib. This is
an efficient checksum algorithm, but there is of course a penalty
since the checksum needs to be generated. At this time, we don't have
any measurements on the performance impact.

If you look at Figure 1 you can see an illustration of how events
propagate through the replication system. In the figure, the points
where a checksum could be generated or checked are marked
with numbers. In the diagram, you can see the threads that handle the
processing of events, and an outgoing arrow from a thread can generate
a checksum while an arrow going into a thread can validate a
checksum. Note, however, that for pragmatic reasons not all
validations or generations can be done.

To enable validation or generation three new options were introduced:

binlog_checksum

This option is used to control checksum generation. Currently, it can accept two different values: NONE and CRC32, with NONE being the default (for backward compatibility).

Setting binlog_checksum
to NONE means that no checksum is generated, while
setting it to CRC32 means that an ISO-3309 CRC-32
checksum is added to each binary log event.

This means that a checksum will be generated by the session thread
and written to the binary log, that is, at point 1 in
Figure 1.

master_verify_checksum

This option can be set to either 0 or 1 (with the default being 0) and indicates that the master should verify any events read from the binary log on the master, corresponding to point 2 in Figure 1. In addition to being read from the binary log by the dump thread, events are also read when a SHOW BINLOG EVENTS is issued at the master, and a check is done at that time as well.

Setting this flag can be useful to verify that the events written to the binary log are uncorrupted, but it is typically not needed in a replication setting, since the slave should verify each event on reception.

slave_sql_verify_checksum

Similar to master_verify_checksum, this option can be set to either 0 or 1 (but defaults to 1) and indicates that the SQL thread should verify the checksum of each event read from the relay log on the slave.
Note that this means the I/O thread writes a checksum with each event written to the relay log, regardless of whether it received the event with a checksum or not.

This means that this option will enable verification at point 5 in
Figure 1 and also enable generation of a checksum at point
4 in the figure.

If you paid attention, you probably noticed that there is no checking
for point 3 in the figure. This is not necessary since the checksum is
verified when the event is written to the relay log at point 4, and
the I/O thread just does a straight copy of the event (potentially
adding a checksum, as noted above).

So, how does it look when we encounter a checksum error? Let's try it
out and see what happens.
We start by generating a simple binary log with checksums turned on
and see what we get.
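A minimal way to reproduce this experiment (the table name and binlog file name are hypothetical; the value 'Mats' is the one corrupted below):

```sql
SET GLOBAL binlog_checksum = CRC32;   -- turn on checksum generation

CREATE TABLE t (name VARCHAR(20)) ENGINE = InnoDB;
INSERT INTO t VALUES ('Mats');        -- the row we will corrupt later

SHOW BINLOG EVENTS IN 'master-bin.000001';
```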

Here, everything looks as before, so no sign of a checksum here, but
let's edit the binlog file directly and change the 's' in 'Mats' to a
'z' and see what happens. First with
MASTER_VERIFY_CHECKSUM set to 0, and then with it set to
1.

Now, the error message generated is not crystal clear, but there was an I/O error when reading the binary log: the checksum verification failed. You can see this because I could show the contents of the binary log with MASTER_VERIFY_CHECKSUM set to 0, but not when it was set to 1. Since the checksum is checked when events are read from the binary log, we get a checksum failure when using SHOW BINLOG EVENTS.

So, if we restore the event and verify that it is correct by issuing SHOW BINLOG EVENTS again, we can try to send it over to the slave and see what happens. The steps to do this (in case you want to try it yourself) are:

1. Start the I/O thread and let it create the relay log using START SLAVE IO_THREAD.

2. Stop the slave using STOP SLAVE (this is necessary since the slave buffers part of the relay log).

3. Manually edit the relay log to corrupt one event (I replaced the 's' with a 'z').

4. Start the slave using START SLAVE.

The result when doing this is an error, as you can see below. Removing
the corruption and starting the slave again will apply the events as
expected.

Now, this is all very nice, but if you have a corruption, you also want to find out where the corruption is, preferably without having to start the server. To handle this, the mysqlbinlog program was extended to print the CRC checksum (if there is one) and also to verify it if you give it the --verify-binlog-checksum option.

As you can see, an error is emitted for the offending event. You can also see the CRC checksum value (which is 32 bits) in the output above, and it corresponds to the position where the slave stopped for my corrupted binary log.

This is just the beginning: there are many things that can be done
using checksums, and many new things that are now possible to
implement. If you think that this is a useful feature, please let us
know, and if you think that it needs to be enhanced, changed, or
extended, we would also like to hear from you.

Closing remarks

We would be very interested in hearing any comments you have on this
feature and how it is implemented. If you want to try this out for
yourselves then you can download the MySQL 5.6 Milestone Development
Release where all this is implemented from the MySQL Developer Zone
(dev.mysql.com).

If you want to find out the details, the reference documentation for
the replication checksum can be found together with the options
mentioned above:

This is one of the features presented by Lars Thalmann today (April 11, 2011) at 2:30pm in the "MySQL Replication" talk at Collaborate 11, and tomorrow (April 12, 2011) at 10:50am in the "MySQL Replication Update" talk at the O'Reilly MySQL Conference & Expo.