Oracle Blog

All About InnoDB

Monday Jul 15, 2013

Introduction

InnoDB is a
general-purpose storage
engine that balances high reliability and high performance. It is
a transactional storage engine and is fully ACID compliant, as would
be expected from any relational database. The durability guarantee
provided by InnoDB is made possible by the redo logs.

This article will provide an overview of the redo log subsystem or
log subsystem of InnoDB. We will look at the following details:

The global log system object, which provides access to important
data structures and information.

The mini-transaction (mtr), using which all redo log records
are created.

The global in-memory log buffer (or just log buffer), into which
the redo logs are written to from the mini transaction buffer.
This log buffer will be periodically flushed to the log file on
disk.

The redo log files on the disk and their high level internal
structure.

We will discuss about the concept of LSN, and how the various
values of LSN are used to implement the write-ahead logging (WAL).

Redo Log Generation

In the article, Data Organization in InnoDB,
we already saw that the user data files of InnoDB (.ibd files) are
considered to be a sequence of equal sized pages. These pages are
uniquely identified within the InnoDB system using the space_id,
page_no combination. If we want to read or modify any page, it needs
to be loaded into memory. So there are two copies of the pages - on
disk and in memory. Here are the high level steps in redo log
generation.

Any changes to a page is first done to the in-memory copy of the
page. The page that is modified in memory and not yet flushed to disk
is marked as the dirty page in memory.

An associated redo log is generated in memory, in the local mini
transaction (mtr) buffer. This will then be transferred to the global
in-memory redo log buffer.

The redo log record is written from the redo log buffer in memory
to the redo log file on disk. This is subsequently flushed. These
two steps are considered separate - writing to the redo log file and
flushing the redo log file to the disk. This is to account for the
file buffering done by the operating system.

The dirty page is flushed from memory to disk at some later point
of time as part of the checkpointing operation.

The order of these steps are important. The redo log record of a
change must be flushed to disk before flushing the corresponding dirty
page to the disk. This is the concept
of write-ahead
logging (WAL).

The generated redo log record will contain information necessary
to repeat the same operation later during a database recovery. So the
redo log record will contain information about the original set of
pages and what has been changed in them. Using the redo log record
the set of dirty pages will be re-created during database
recovery.

Redo Log Files

By default, InnoDB creates two redo log files (or just log
files) ib_logfile0 and ib_logfile1 within the data
directory of MySQL. In MySQL versions 5.6.8 and above, the default
size of each redo log file is 48MB each. This can be configured by
the user by making use
of
innodb_log_file_size server option. The number of log files is
controlled
by innodb_log_files_in_group
server option.

A log group consists of a number of log files, each of same size.
As of MySQL 5.6, InnoDB supports only one log group. So I'll not
discuss this further.

The redo log files are used in a circular fashion. This means
that the redo logs are written from the beginning to end of first redo
log file, then it is continued to be written into the next log file,
and so on till it reaches the last redo log file. Once the last redo
log file has been written, then redo logs are again written from the
first redo log file.

The log files are viewed as a sequence of blocks called "log
blocks" whose size is given by OS_FILE_LOG_BLOCK_SIZE which is equal
to 512 bytes. Each log file has a header whose size is given by
LOG_FILE_HDR_SIZE, which is defined as 4*OS_FILE_LOG_BLOCK_SIZE.

Redo Log File Header

Each redo log file contains a header occupying four log blocks
with the following information:

The first 4 bytes contain the log group number to which the log
file belongs.

The next 8 bytes contain the lsn of the start of data in this log
file.

First checkpoint field located in the beginning of the second log
block.

Second checkpoint field located in the beginning of the fourth
log block.

The checkpoint field mentioned above contains information
necessary to identify a checkpoint - a checkpoint number, the LSN of
the checkpoint, checksum information and more.

Log Blocks

A redo log file can be viewed as a sequence of log blocks. All
log blocks, except the ones belonging to the log file header, contain
a header and a footer. The header is of size LOG_BLOCK_HDR_SIZE bytes
(which is defined as 12 bytes). The log block trailer is of size
LOG_BLOCK_TRL_SIZE (4 bytes). The log block header contains the
following information:

The log block number. This field is of 4 bytes.

Number of bytes of log written to this block. This field is of
2 bytes.

Offset of the first start of an mtr log record group in this
log block or 0 if none

The checkpoint number to which this log block belongs

The log block trailer contains checksum of the log block
contents.

Log Sequence Number (LSN)

The log sequence number (LSN) is an important concept. The LSN is
an offset into the redo log file. Within InnoDB the log sequence
number is represented by the type lsn_t, which is an 8-byte unsigned
integer. There are different LSN values that are of interest. The
following table lists some of the LSN values discussed in this
article. (Note: The log_sys is the global log system object
which is explained in next section.)

LSN

Description

log_sys->lsn

The redo log record that will be generated next will make use of
this lsn

log_sys->flushed_to_disk_lsn

The redo log file is flushed upto this LSN. All redo log
records whose LSN is < flushed_to_disk_lsn is safely on
the disk.

log_sys->write_lsn

There is a currently running write operation which will write
upto this LSN.

log_sys->current_flush_lsn

There is a currently running write + flush operation which will
write upto this LSN.

The LSN is what that links the dirty page, the redo log record,
and the redo log file. Each redo log record when it is copied to the
in-memory log buffer, it gets an associated LSN. When each database
page is modified, redo log records are generated. So each database
page is also associated to an LSN. The page lsn is stored in a header
for each page. The page lsn gives the lsn upto which the redo log
file must be flushed before flushing the page.

Global Log System Object

The global log system object log_sys (of type log_t)
holds important information related to log subsystem of InnoDB. Here
only those information related to redo log buffer and redo log files
are discussed. The global log_sys identifies the "active
area" of the log buffer, which contains redo logs that are yet to be
safely written on the disk. It also identifies the area in the redo
log file, into which the active area of the log buffer will be
written/flushed.

/** Redo log buffer */structlog_t{lsn_tlsn;/*!< log sequence number */ulintbuf_free;/*!< first free offset within the log
buffer */byte*buf_ptr;/* unaligned log buffer */byte*buf;/*!< log buffer */ulintbuf_size;/*!< log buffer size in bytes */ulintmax_buf_free;/*!< recommended maximum value of
buf_free, after which the buffer is
flushed */ulintbuf_next_to_write;/*!< first offset in the log buffer
where the byte content may not exist
written to file, e.g., the start
offset of a log record catenated
later; this is advanced when a flush
operation is completed to all the log
groups */lsn_twrite_lsn;/*!< end lsn for the current running
write */ulintwrite_end_offset;/*!< the data in buffer has
been written up to this offset
when the current write ends:
this field will then be copied
to buf_next_to_write */lsn_tcurrent_flush_lsn;/*!< end lsn for the current running
write + flush operation */lsn_tflushed_to_disk_lsn;/*!< how far we have written the log
AND flushed to disk */};

The member log_sys->buf points to the in-memory redo
log buffer. This is the buffer into which an mtr_commit() writes
the redo log records. The size of this buffer is given
by log_sys->buf_size in bytes.

The member log_sys->buf_free is the offset within the
in-memory redo log buffer, where the next redo log record will be
written. This is the end offset for the next redo log flush to
disk.

The member log_sys->buf_next_to_write is the offset
from where the redo log records are not yet written to the redo log
file. When the redo log buffer is flushed to disk next time, it
will flush from this location. This is the start offset for the next
redo log flush to disk.

The member log_sys->flushed_to_disk_lsn marks the lsn
upto which we have written to the log file on the disk and flushed
it. So upto this lsn, the redo logs are safely on the disk. This
will always be less than or equal to log_sys->write_lsn
and log_sys->lsn.

The member log_sys->lsn represents the current lsn.
This member will be updated whenever we do mtr_commit(). The
function mtr_commit() writes the bunch of redo log records generated
in the mini transaction to the global or system wide redo log buffer
in memory. This will always be greater than or equal
to log_sys->flushed_to_disk_lsn
and log_sys->write_lsn. This will be the LSN of the redo
log record written at log_sys->buf_free.

The member log_sys->write_lsn represents the end lsn
of a currently running redo log buffer write operation. This will
be greater than or equal to log_sys->flushed_to_disk_lsn
and less than or equal to log_sys->lsn.

The member log_sys->current_flush_lsn represents the
end lsn of a currently running redo log buffer write + flush
operation. This will be mostly equal
to log_sys->write_lsn.

The global log_sys object points to various positions in
the in-memory redo log buffer and on-disk redo log files. The
following picture shows the locations pointed to by the
global log_sys object. The picture clearly shows that the
redo log buffer maps to a specific portion of the redo log file.

Global In-memory Redo Log Buffer

The in-memory redo log buffer is global and all redo logs
generated by user transactions will be written into this buffer. The
size of this buffer is configurable and is given by
the innodb_log_buffer_size.
The default size of this redo log buffer is 8MB.

When a transaction is running and if it is modifying the database,
then it will generate redo logs and populate this log buffer. This
log buffer will be written or flushed to the log file, either when the
transaction commits, or when the log buffer gets full.

When the redo log buffer is full, and there is not enough space
for an mtr_commit(), which will transfer a group of redo log
records to log buffer, then a synchronous flush of the log buffer
is done to the redo log file by using the function
log_buffer_flush_to_disk(). This function will write the log buffer
from log_sys->buf_next_to_write
to log_sys->buf_free. In terms of LSN, the function
log_buffer_flush_to_disk() flushes the redo logs
from log_sys->flushed_to_disk_lsn
to log_sys->lsn.

The mini transaction (mtr)

A mini transaction (mtr) must be used to generate all the redo log
records. A mini transaction contains a local buffer (called the mini
transaction buffer) into which the generated redo log records will be
stored. If we need to generate a group of redo log records such that
either all make it to the redo log file or none makes it, then we need
to put them in a single mini transaction. Apart from the redo log
records, the mini transaction also maintains a list of pages that has
been modified (dirty pages).

The normal usage of a mini transaction is as follows:

Create a mini transaction object of type mtr_t

Start the mini transaction with mtr_start() function. This
will initialize the mini transaction buffer.

Generate the redo log records by making use of
mlog_write_ulint() family of functions.

Commit the mini transaction with mtr_commit()
function. This will transfer the redo log records from mini
transaction buffer to the global redo log buffer. The list of
dirty pages is added to the buffer pool flush list.

The definition of a mini transaction is given here for your
reference. The member mtr_t::log contains the mini transaction buffer
which will hold the redo log records, and the member mtr_t::memo
contains a list of pages dirtied by this mini transaction.

/* Mini-transaction handle and buffer */structmtr_t{dyn_array_tmemo;/*!< memo stack for locks etc. */dyn_array_tlog;/*!< mini-transaction log */unsignedinside_ibuf:1;/*!< TRUE if inside ibuf changes */unsignedmodifications:1;/*!< TRUE if the mini-transaction
modified buffer pool pages */unsignedmade_dirty:1;/*!< TRUE if mtr has made at least
one buffer pool page dirty */ulintn_log_recs;/* count of how many page initial log records
have been written to the mtr log */ulintn_freed_pages;/* number of pages that have been freed in
this mini-transaction */ulintlog_mode;/* specifies which operations should be
logged; default value MTR_LOG_ALL */lsn_tstart_lsn;/* start lsn of the possible log entry for
this mtr */lsn_tend_lsn;/* end lsn of the possible log entry for
this mtr */};

Redo Log Record Types

When we modify a database page, a redo log record is generated.
This redo log record either contains what information has been changed
in the page (physical redo log) or how to perform the change again
(logical redo log). InnoDB uses a combination of physical redo logs
and logical redo logs.

To understand this consider an operation like page
re-organization. If we generated a physical redo log record for this
operation, it is likely that the redo log record generated will be
equal to the page size. But instead InnoDB uses logical redo log
record for this operation, which will reduce the redo log record size
significantly. In the case of logical redo log record to represent a
page reorganize, all we need is the information to uniquely identify a
page, and "type" of operation which is page reorganize.

So each redo log record has a type. A redo log record type helps
to identify the function that will be used to apply or execute the
redo log during recovery. The contents of the redo log record must
then contain all the arguments or parameters needed by the
function.

Life cycle of a redo log record

The life cycle of a redo log record is as follows:

The redo log record is first created by a mini transaction and
stored in the mini transaction buffer. It has information necessary
to redo the same operation again in the time of database
recovery.

When mtr_commit() is done, the redo log record is transferred to
the global in-memory redo log buffer. If necessary, the redo log
buffer will be flushed to the redo log file, to make space for the new
redo log record.

The redo log record has a specific lsn associated with it. This
association is established during mtr_commit(), when the redo log
record is transferred from mtr buffer to log buffer. Once the lsn is
established for a redo log record, then its location in the redo log
file is also established.

The redo log record is then transferred from the log buffer to
redo log file when it is written + flushed. This means that the redo
log record is now durably stored on the disk.

Each redo log record has an associated list of dirty pages. This
relationship is established via LSN. A redo log record must be
flushed to disk, before its associated dirty pages. A redo log record
can be discarded only after all the associated dirty pages are flushed
to the disk.

A redo log record can be used to re-create its associated list of
dirty pages. This happens during database recovery.

Conclusion

This article provided an overview of the redo log subsystem of
InnoDB storage engine. The major data structures used in the redo log
subsystem are the mini transaction (mtr), the in-memory redo log
buffer and the on-disk redo log files. The InnoDB storage engine
tracks many LSN values to ensure that the write-ahead logging (WAL)
happens correctly.

Since data loss is unacceptable for an RDBMS, the redo log
subsystem is very critical for a successful RDBMS. And since redo
logs are generated synchronously with DML operations, it is important
to do it efficiently. The size of redo logs must be kept as minimal
as possible.