Friday, April 30, 2010

It is with interest that I read Kristian'sthreeblogs
on the binary log group commit. In the article, he mentions InnoDB's
prepare_commit_mutex as the main hindrance to accomplish group commits—which it indeed is—and proposes to remove it with the motivation that FLUSH TABLES WITH READ LOCK can be used to get a good binlog position instead. That is a solution—but not really a good solution—as Kristian points out in the last post.

The prepare_commit_mutex is used to ensure that the
order of transactions in the binary log is the same as the order of
transactions in the InnoDB log—and keeping the same order in the logs is critical for getting a true on-line backup to work, so
removing it is not really an option, which Kristian points out in his third article. In other words, it is necessary to ensure that the InnoDB transaction log and the binary log have the same order of
transactions.

To understand how to solve the problem, it is necessary to take a
closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.

The transaction data is stored in a per-thread transaction
cache and the transaction size is the size of the data
in the transaction cache.
In addition, each transaction will have a transaction binlog
position (or just transaction position) where the
transaction data is written in the binary log.

The procedure can be outlined in the following steps:

Prepare InnoDB [ha_prepare]:

Write prepare record to log buffer

fsync() log file to disk (this can currently do
group commit)

Take prepare_commit_mutex

Log transaction to binary log [TC_LOG_BINLOG::log_xid]:

Lock binary log

Write transaction data to binary log

Sync binary log based on sync_binlog. This forces
the binlog to always fsync() (no group commit) due to
prepare_commit_mutex

Unlock binary log

Commit InnoDB:

Release prepare_commit_mutex

Write commit record to log buffer

Sync log buffer to disk (this can currently do group commit)

InnoDB locks are released

There are mainly two problems with this approach:

The InnoDB row level and table level locks are released very
late in the sequence, which affects concurrency. Ideally, we need
to release the locks very early, preferably as soon as we have
prepared InnoDB.

It is not possible to perform a group commit in step 2

As you can see here, the prepare of the storage engines (in this
case just InnoDB) is done before the binary log mutex is taken, and
that means that if the prepare_commit_mutex is removed it
is possible for another thread to overtake a transaction so that the
prepare and writing to the binary log is done in different order.

To solve this, Mark suggests using a queue or a ticket system to
ensure that transactions are committed in the same order, but we
actually already have such a system that we can use to assign tickets:
namely the binary log.

The idea is to allocate space in the binary log for the transaction to
be written. This gives us a sequence number that we can use to order
the transactions.

In this post, I will outline an approach that Harrison and I have
discussed, which we think will solve the problems mentioned above. In
this post, I will outline the procedure during normal operations, in
the next post I will discuss recovery, and in the third post (but
likely not the last on the subject), I will discuss some optimizations
that can be done.

I want to emphasize that the fact that we have a worklog does not
involve any guarantees or promises of what, when, or even if any
patches will be pushed to any release of MySQL.

In Worklog #4007 an
approach for writing the binary log is suggested where space is
allocated for the transaction in the binary log before actually
starting to write it. In addition to avoiding unnecessary locking of
the binary log, it also allow us to use the binary log to order the
transactions in-place. We will use this idea of reserving space in the
binary log to implement the binary log group commit.

By re-structuring the procedure above slightly, we can ensure that
the transactions are written in the same order in both the InnoDB
transaction log and the binary log.

There are two ways to re-structure the code: one simple and one more
complicated that potentially can render better performance. To
simplify the presentation, it is assumed that pre-allocation is
handled elsewhere, for example using Worklog
#4925. In a real implementation, pre-allocation can either be
handled when a new binlog file is created, or when transaction data is
being written to the binary log.

The sequential write approach

Figure 1. Sequential binary log group commit

In the sequential write approach, the transactions are still written
to the binary log in order and the code is just re-ordered to avoid
keeping mutexes when calling fsync().
To describe the algorithm, three shared variables are introduced to
keep track of the status of replication:

Next_Available

This variable keeps track of where a new transaction can be written

Last_Committed

This variable keeps track of the last committed transaction,
meaning that all transactions preceding this position is actually
on disc. This variable is not necessary in the real implementation,
but it is kept here to simplify the presentation of the algorithm.

Last_Complete

This variable keeps track of the last complete transaction. All
transactions preceding this point is actually written to the binary
log, but are not necessarily flushed to disc yet.

You can see an illustration of how the variables are used with the
binary log in Figure 1 where you can also see three threads each
waiting to write a transaction. Both variables are initially is set
to the beginning of the binary log and it is always true that Last_Committed ≤
Last_Complete ≤ Next_Available.
The procedure can be described in the following steps:

Lock the binary log

Save value of Next_Available in a variable
Trans_Pos and increase
Next_Available with the size of the transaction.

Prepare InnoDB:

Write prepare record to log buffer (but do not
fsync() buffer here)

Release row locks

Unlock binary log

Post prepare InnoDB:

fsync() log file to disk, which can now be done
using group commit since no mutex is held.

Log transaction to binary log:

Wait until Last_Complete =
Trans_Pos. (This can be implemented using a
condition variable and a mutex.)

Write transaction data to binary log using
pwrite. At this point, it is not really necessary to
use pwrite since the transaction data is simply
appended, but it will be used in the second algorithm, so we
introduce it here.

Update Last_Complete to
Trans_Pos + transaction size.

Broadcast the the new position to all waiting threads to wake
them up.

Call fsync() to persist binary log on disk. This
can now be group committed.

Commit InnoDB:

Write commit record to log buffer

Sync log buffer to disk, which currently can be group
committed.

To implement group commit, it is sufficient to have a condition
variable and wait for that for a specified interval. Once the interval
has passed, the transaction data can call fsync(), after
which it broadcasts the fact that data has been flushed to disc to
other waiting threads so that they can skip this. Typically, the code
looks something along these lines (we ignore checking error codes here
to simplify the description):

Step 6a requires a condition variable and a mutex when waiting
for Last_Complete to reach
Trans_Pos. Since there is just a single condition
variable, it is necessary to broadcast a wakeup to all
waiting threads, which each will evaluate the condition just to find
a single thread that should continue, while the other threads go to
sleep again.

This means that the condition will be checked O(N2) times to commit N transactions.
This is a waste of resources, especially if there is a lot of
threads waiting, and if we can avoid this, we can gain
performance.

Since the thread has a good position in the binary log where it
could write, it could just as well start writing instead of
waiting. It will not interfere with any other threads, regardless if
locks are kept or not.

These observations lead us to the second approach, that of writing
transaction data to the binary log in parallel.

A parallel write approach

Figure 2. Parallel binary log group commit

In this approach, each session is allowed to write to the binary log
at the same time using pwrite since the space for the
transaction data has already been allocated when preparing the
engines. Figure 2 illustrates how the binary log is filled in (grey
areas) by multiple threads at the same time. Similar to the
sequential write approach, we still have the
Last_Complete, Last_Committed, and
Next_Available variables.

Each thread does not have to wait for other threads before writing,
but it does have to wait for the other threads to
commit.
This is necessary since we required the order of commits in the InnoDB
log and the binary log to be the same. In reality, this does not pose
a problem since the I/O is buffered, hence the writes are done to
in-memory file buffers.

The algorithms look quite similar to the sequential write approach,
but notice that in step 6, the transaction data is simply written to
the binary log using pwrite.

Lock the binary log

Save value of Next_Available in a local variable
Trans_Pos and increase
Next_Available with the size of the transaction.

Prepare InnoDB:

Write prepare record to log buffer (but do not
fsync() buffer here)

Release row locks

Unlock binary log

Post prepare InnoDB:

fsync() log file to disk, which can now be done
using group commit since no mutex is held.

Log transaction to binary log:

Write transaction data to binary log using
pwrite. There is no need to keep a lock to protect
the binary log here since all threads will write to different
positions.

Wait until Last_Complete =
Trans_Pos.

Update Last_Complete to Trans_Pos +
transaction size.

Broadcast the the new position to all waiting threads to wake
them up.

Call fsync() to persist binary log on disk. This
can now be group committed.

Commit InnoDB:

Write commit record to log

Sync log file to disk

This new algorithm has some advantages, but there are a few things to note:

When a transaction is committed, it is guaranteed that
Trans_Pos ≥ Last_Committed for all
threads (recall that Trans_Pos is a thread-local
variable).

Writes are done in parallel, but when waiting for the condition
in step 6b still requires a broadcast to wake up all waiting
threads, while only one will be allowed to proceed. This means that
we still have the O(N2)
complexity of the sequential algorithm. However, for the parallel
algorithm it is possible to improve the performance significantly,
which we will demonstrate in the third part where we will discuss
optimizations to the algorithms.

Recovery in the sequential algorithm is comparably simple since
there are no partially written transactions. If you consider that a
crash can occur in the situation described in Figure 2, it is
necessary to device a method for correctly recovering. This we will
discuss in the second part of these posts.

Tuesday, April 13, 2010

The MySQL Conference and Expo started with me and Lars Thalmann doing the replication tutorial. Unfortunately, we cannot at this time distribute the slides (please watch the replication tutorial page at the conference site), but there is a replication tutorial package for easy setup of server to play around with—including some sample scripts—and a paper that both explains how the package can be used as well as giving some example setups.