From: Mats Kindahl
Date: February 1 2013 12:15pm
Subject: Re: reducing fsyncs during handlerton->prepare and handlerton->commit in 5.6
List-Archive: http://lists.mysql.com/internals/38714
Message-Id: <510BB1E3.8040509@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
On 02/01/2013 11:26 AM, Zardosht Kasheff wrote:
> inlining
>
> On Fri, Feb 1, 2013 at 2:37 AM, Mats Kindahl wrote:
>> On 01/31/2013 10:28 PM, Zardosht Kasheff wrote:
>>> Thank you for the detailed reply.
>>>
>>> I want to confirm that I understand the contract:
>>> - when flush logs is called, the engine must ensure that any
>>> transaction committed up until that point is recovered as committed
>>> after a crash. No such transaction can come up in the prepared state.
>> Yes, that is correct.
>>
>>> The reason I ask this is that as of now, on flush logs, we flush all
>>> of our data to disk, and this seems like overkill. Instead, if we just
>>> fsync our recovery log, that will satisfy the above contract.
>>>
>>> Can we do this?
>> It looks like it should work. I don't know the details of your
>> implementation (and details can make a big difference), but if, by
>> using the recovery log, you can ensure that no transactions committed
>> before the flush_logs() show up as "prepared" on recovery, you should be safe.
>> (Note: on recovery, the storage engine is asked for all prepared
>> transactions, and this set should not include any transactions that are
>> in any binary logs except the last one.)
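[Editorial note: the contract stated above — after flush_logs(), no transaction committed earlier may resurface as prepared — can be sketched with a toy model. Every type and name below is invented for illustration; this is not the handlerton API.]

```cpp
#include <string>
#include <vector>

// Toy engine: commits append a record to an in-memory log buffer; only
// flush_logs() makes the buffer durable (a stand-in for fsync of the
// recovery log). All names here are illustrative, not MySQL's API.
struct ToyEngine {
    std::vector<std::string> log_buffer;   // commit records, not yet durable
    std::vector<std::string> durable_log;  // what survives a simulated crash

    void commit(const std::string& xid) {  // note: no fsync per commit
        log_buffer.push_back(xid);
    }
    void flush_logs() {                    // fsync only the recovery log
        durable_log.insert(durable_log.end(),
                           log_buffer.begin(), log_buffer.end());
        log_buffer.clear();
    }
    // After a crash, only durable_log remains; a transaction committed
    // before flush_logs() must be found here, i.e. recover as committed.
    bool recovers_as_committed(const std::string& xid) const {
        for (const auto& x : durable_log)
            if (x == xid) return true;
        return false;
    }
};
```

Under this model, fsyncing just the recovery log on flush_logs() satisfies the contract without forcing a full checkpoint.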
>>
>> Just a question: how do you make it possible to flush all data to disk
>> if, e.g., the user wants to take a backup? You don't have to use FLUSH
>> LOGS for this (which would call flush_logs()), but it has to be possible.
> Our engine, TokuDB, has the concept of a checkpoint that flushes all
> data to disk. From what has been said in this thread, I think doing a
> checkpoint on flush logs is overkill, especially when rotating a
> binary log.
Yes, I agree.
> I think you are asking what mechanism we will have for the
> user to induce a checkpoint. I don't have an answer for that yet; we
> need to investigate. But it sounds like a user experience question.
Yes, that's what it is.
/Matz
>
>> /Matz
>>
>>> Thanks
>>> -Zardosht
>>>
>>> On Thu, Jan 31, 2013 at 4:12 PM, Mats Kindahl wrote:
>>>> On 01/31/2013 05:09 PM, Zardosht Kasheff wrote:
>>>>> Thanks a lot Kristian and Mats.
>>>>>
>>>>> I am learning that I know a lot less than I thought I knew. To help my
>>>>> understanding, I will focus in this thread on MySQL 5.6. I will focus
>>>>> on MariaDB in another thread.
>>>>>
>>>>> I want to make sure my understanding is correct. Is what I write below accurate?
>>>>>
>>>>> In MySQL 5.5, we have the following APIs:
>>>>> - handlerton->prepare
>>>>> - handlerton->commit
>>>>> - handlerton->flush_logs.
>>>>> We need to fsync on prepare and commit. But what does flush_logs need
>>>>> to do? According to comments, flush_logs runs a checkpoint on the
>>>>> system, which is pretty expensive. Is this accurate?
>>>> The flush_logs() function is called before rotating the binary log and
>>>> when doing an explicit FLUSH LOGS. It gives the storage engine a chance
>>>> to flush any in-memory buffers, so yes, it does a checkpoint and it is
>>>> expensive. However, it happens just once for each binary log.
>>>>
>>>> This means that each time a binary log is rotated, the system grinds to
>>>> a halt, which is not very nice. We have been discussing ways to avoid this.
>>>>
>>>>> In MySQL 5.6, we have the same APIs, but the contract has changed. We
>>>>> still always fsync on prepare, that remains the same. For commit, if
>>>>> HA_IGNORE_DURABILITY is set, we should not fsync, otherwise we may
>>>>> have poor performance.
>>>> Not poorer than without group commits, but yes, if you sync with every
>>>> commit, you will have poor performance.
>>>>
>>>>> If HA_IGNORE_DURABILITY is not set, then we
>>>>> must fsync on commit.
>>>> Correct. The HA_IGNORE_DURABILITY flag says that the server "handles
>>>> the durability".
>>>>
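[Editorial note: a minimal sketch of a commit path that honours this flag follows. The flag value and struct are invented; only the decision logic reflects the thread.]

```cpp
#include <cstdint>

// Illustrative flag value; in the real server HA_IGNORE_DURABILITY is a
// handler flag with a server-defined value.
constexpr std::uint32_t HA_IGNORE_DURABILITY = 1u << 0;

// Toy commit path: fsync per commit only when the server has NOT taken
// over durability. The counter stands in for actual fsync() calls.
struct ToyCommitPath {
    int fsyncs = 0;

    void commit(std::uint32_t flags) {
        // ... write the commit record to the in-memory log buffer ...
        if (!(flags & HA_IGNORE_DURABILITY))
            ++fsyncs;  // server did not take over durability: sync now
        // else: the server guarantees flush_logs() is called before the
        // durability of this commit matters (e.g. before binlog rotation)
    }
};
```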
>>>>> I do not know what flush_logs needs to do.
>>>>>
>>>>> My last question is the following:
>>>>> - what should flush_logs do?
>>>>> - what is the purpose/contract of flush_logs? Under what scenarios is
>>>>> it meant to be called?
>>>> The flush_logs should create a checkpoint, just as you said above. It is
>>>> called on binary log rotate and on explicit FLUSH LOGS (actually also on
>>>> ALTER TABLE, under some circumstances: see sql_table.cc).
>>>>
>>>>> The comments in MySQL 5.5 and 5.6 imply that a checkpoint is run on
>>>>> the system. This is what we do as well. This sounds expensive, because
>>>>> IIUC, a checkpoint writes all dirty nodes to disk. But looking at the
>>>>> implementation, it seems that flush_logs only ensures that the redo
>>>>> log is synced up to the proper lsn, and if not, syncs it. Essentially,
>>>>> it just fsyncs the log.
>>>> I assume that you have been looking at InnoDB. The reason a checkpoint
>>>> is done (by flushing the log) is that the recovery procedure only looks
>>>> in the last binary log; without the flush you might lose committed
>>>> transactions on a crash, which is not OK.
>>>>
>>>> If you have prepared and committed some transaction but do not flush the
>>>> log on disk before a rotate, it might be that on recovery it is listed
>>>> as prepared (because the commit record was not written to disk) but the
>>>> recovery procedure will not find it in the binary log (because it was
>>>> not in the last one, it was in the preceding one) and hence it will be
>>>> rolled back.
>>>>
>>>> Poof! Transaction gone.
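[Editorial note: the failure Mats describes can be modeled directly. Recovery asks the engine for its prepared transactions and consults only the last binlog; the function and names below are a toy model, not the server's code.]

```cpp
#include <set>
#include <string>

enum class Fate { Committed, RolledBack };

// Toy XA recovery rule from the scenario above: a transaction the engine
// reports as prepared is committed only if its XID appears in the LAST
// binlog; otherwise it is rolled back -- even if an earlier binlog
// recorded it as committed.
Fate recover_prepared(const std::string& xid,
                      const std::set<std::string>& last_binlog_xids) {
    return last_binlog_xids.count(xid) ? Fate::Committed
                                       : Fate::RolledBack;
}
```

So a transaction whose commit record never reached the engine's disk before rotation is reported as prepared, misses the last binlog, and is rolled back: the lost transaction of the paragraph above.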
>>>>
>>>> /Matz
>>>>
>>>>> Is this accurate? If so I think we need to modify our engine to not
>>>>> checkpoint and just fsync our recovery log.
>>>>>
>>>>> Thanks
>>>>> -Zardosht
>>>>>
>>>>> On Thu, Jan 31, 2013 at 4:41 AM, Kristian Nielsen
>>>>> wrote:
>>>>>> Mats Kindahl writes:
>>>>>>
>>>>>>> In MySQL 5.6, there are no new APIs that you *have* to comply with.
>>>>>> But MySQL 5.6 serialises calls to the commit handlerton method, the next one
>>>>>> cannot start before the previous one completes. So if you did fsync() with
>>>>>> group commit before in commit, your group commit will no longer work in 5.6
>>>>>> and you will get a serious performance regression if you do not honour the
>>>>>> HA_IGNORE_DURABILITY flag. I consider that breaking the storage engine API;
>>>>>> as you see, Mats and I disagree a bit on that point :-)
>>>>>>
>>>>>> Anyway, it should be easy to do for you. MySQL 5.6 sets HA_IGNORE_DURABILITY,
>>>>>> this has similar semantics to when MariaDB 5.3+ calls the commit_ordered()
>>>>>> method. So you can probably use the same code for both with a small amount of
>>>>>> #ifdef.
>>>>>>
>>>>>> Note that for MySQL 5.6 you also need to implement the flush_logs() method
>>>>>> to fsync() all prior commits durably to disk (if you did not already
>>>>>> implement it). What happens is basically that MySQL crash recovery looks
>>>>>> only at the last binlog file written. So it calls flush_logs() before
>>>>>> creating a new binlog, and the storage engine must ensure that all commits
>>>>>> become durable at that point. Otherwise commits may be lost if a crash
>>>>>> happens just after binlog rotation.
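[Editorial note: the ordering Kristian describes might be sketched like this. The server internals are invented; only the call order matters.]

```cpp
#include <string>
#include <vector>

// Toy rotation sequence: the server calls the engine's flush_logs()
// BEFORE creating the new binlog, so every commit recorded in the old
// binlog is already durable in the engine when rotation completes.
struct ToyServer {
    std::vector<std::string> events;  // records the order of operations

    void engine_flush_logs() { events.push_back("flush_logs"); }
    void create_new_binlog() { events.push_back("new_binlog"); }

    void rotate_binlog() {
        engine_flush_logs();   // make all prior commits durable first
        create_new_binlog();   // only then start the new binlog file
    }
};
```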
>>>>>>
>>>>>> This is actually the _only_ reason that fsync() was ever needed in commit, to
>>>>>> ensure that it is done when binlog is rotated. So it is rather silly that we
>>>>>> have done it for _every_ commit for so long. Anyway, it will be fixed now.
>>>>>>
>>>>>> - Kristian.
--
Senior Principal Software Developer
Oracle, MySQL Department