Overview
--------
Parallel replication is the idea of executing multiple transactions (or more
precisely event groups from the binlog) in parallel in multiple threads on the
slave, in order to make the slave applier more scalable to multiple CPU cores
and disk spindles.
The main problem with parallel replication is identifying transactions that
are *independent*, meaning that they can be safely executed in parallel on the
slave without dead-locking each other or producing different results than on
the master.
This worklog describes one way of identifying such independence. MWL#169
describes another. Apart from discovering independent transactions, it is also
necessary to implement a mechanism for actually distributing events to
different execution threads, coordinating commit order, handling errors and
crash recovery, etc. It is expected that different ways of determining
independent transactions, such as this worklog and MWL#169, will share the
same mechanism for actual parallel execution. It should even be possible to
combine different ways to discover independence at the same time.
Idea
----
MWL#116 introduces group commit, where multiple transactions are committed
together and also written to the binlog together. This works by collecting
parallel transactions that are waiting inside COMMIT (or autocommit) for
access to the binary log.
The idea in this worklog is that all transactions participating in a single
group commit must necessarily be independent, and can be safely applied in
parallel on a slave.
Suppose we are about to write N transactions T_1, ..., T_N into the binlog in
group commit. Then all statements for each transaction have completed, but
none of them are committed yet. This means that (1) there can be no
conflicting row locks between any two T_i and T_j, and (2) no transaction T_i
was able to see any changes from T_j (for i<>j and assuming isolation level
greater than read uncommitted).
Another way to see this is to note that on the master, the order of T_1, ...,
T_N is essentially random and depends only on how the kernel decided to
schedule the threads during the COMMIT. Any ordering would have been possible
just by having different thread scheduling delays when the transactions are
put in the ordered list used for group commit; there is no inter-thread
dependencies at that point in the code. So any order of the transactions in
the binlog would have been possible, and thus any order of slave
execution. Therefore, as long as COMMIT cannot change what modifications the
transaction already did, any execution order of the transactions on the slave
is permissible.
Implementation
--------------
I suggest that when this feature is enabled, we introduce two new binlog
events, group_commit_begin and group_commit_end. We write a group_commit_begin
event just before all the events for the transactions in a group commit, and a
group_commit_end event just after.
We can use the "ignorable events" from MySQL 5.5 for these events; this will
allow a slave that does not implement these events to safely ignore them.
If needed, we can further increase compatibility by
- Having a server option to enable/disable the writing of these two events.
- Add a flag that the slave sets to request the sending of these events, and
only send them if requested (similar to MWL#47).
It still needs to be decided how much of this to do.
The slave can then collect all the event groups between the group_commit_begin
and group_commit_end pair, and schedule them in parallel.
Limitations
-----------
The main limitation of this approach is that on the slave, it can only run
transactions in parallel if they happened to commit at roughly the same time
on the master.
This means that there will be little opportunity for parallelism on the slave
if
1. The master has low write load
2. The master is mainly running few but long transactions
3. The master has fast fsync() performance for the commit
In the case (1), there is in any case less need for scaling the event
execution.
In the case (2), another method (for example MWL#169) will be needed to
achieve efficient parallel replication.
For (3), we can improve parallelism by deliberately delaying commits on the
master. By inserting a short sleep just before group commit, we allow more
independent transactions time to participate, giving more opportunity for
parallelism on the slave. At the same time, we reduce fsync() load on the
master, potentially freeing more I/O capacity for other work. For example, we
might add an option to rate-limit COMMIT fsync() to 25 per seconds or
whatever. Care will be needed when using this option, as if the master load
has insufficient parallelism, such rate-limitation will reduce throughput on
the master.
The MWL#163 innodb_release_locks_early option causes transactions to
effectively be committed early in COMMIT during the prepare phase. This allows
another transaction to see and/or modify rows also modified by the first
transaction, and still participate in the same group commit. This means that
when innodb_release_locks_early is set, the transactions cannot be safely used
for parallel replication (eg. in this case we must disable the writing of
group_commit_begin / group_commit_end).
Events that modify non-transactional tables or do DDL causes more
complications, as they are written directly to the binlog and make changes
visible immediately to other transactions (basically they are not
transactional). To simplify things, we will not declare any such events
independent with respect to any other transactions according to this
worklog. There are many useful applications of parallel replications where
such events are rare.